The latest tenacity release introduced internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep (which was not working previously).
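A minimal sketch of the pattern, with a hand-rolled retry helper standing in for a tenacity-decorated function (the helper and its names are hypothetical): patching time.sleep keeps the test independent of the retry library's internals.

```python
import time
from unittest import mock


def retry_with_backoff(func, attempts=3, delay=10):
    """Hypothetical retrying helper standing in for a tenacity-decorated
    function; it sleeps between attempts via time.sleep."""
    for attempt in range(attempts):
        try:
            return func()
        except ValueError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("transient failure")
    return "ok"


# Patching time.sleep directly means library upgrades that reorganize
# internal sleep helpers cannot break the test's mocking.
with mock.patch("time.sleep") as mocked_sleep:
    result = retry_with_backoff(flaky)

assert result == "ok"
assert mocked_sleep.call_count == 2  # slept between the three attempts
```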
As the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors were reported
related to beautifulsoup4 typing.
As the return type of the bs4 find method is the union
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing, which is not great.
So prefer the select_one method instead: it returns Optional[Tag], so a
simple None check is enough to ensure typing is correct.
In a similar manner, replace uses of the find_all method with the select
method. This also has the advantage of simplifying the code.
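The difference in a minimal sketch (the markup and CSS selectors here are illustrative only, assuming beautifulsoup4 is installed):

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

html = '<div><a class="repo" href="/p/foo.git">foo</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find returns Tag | NavigableString | None per the stubs, forcing
# isinstance checks to satisfy mypy; select_one returns Optional[Tag],
# so a plain None check suffices.
link = soup.select_one("a.repo")
if link is not None:
    url = link["href"]

# Likewise, select returns a list of Tag elements, which keeps the
# list comprehension simple and well-typed.
urls = [a["href"] for a in soup.select("a.repo")]
assert urls == ["/p/foo.git"]
```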
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` attribute is always set, so we need to ensure it is not
`None` before looking at the status code, for example.
This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better model this particular aspect. See:
https://github.com/python/typeshed/pull/10875
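A minimal illustration of the guard (status_of is a hypothetical helper, assuming requests is installed):

```python
from typing import Optional

import requests


def status_of(exc: requests.HTTPError) -> Optional[int]:
    """Return the status code of the failed request, if any.

    HTTPError.response is only set when the error was raised from an
    actual response (e.g. via Response.raise_for_status), so it must be
    checked against None before use.
    """
    if exc.response is not None:
        return exc.response.status_code
    return None


# An HTTPError can be constructed without a response, hence the guard:
assert status_of(requests.HTTPError("boom")) is None
```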
Similar to cgit, there exist cases where git clone URLs for projects
hosted on a gitweb instance cannot be found when scraping project pages
and cannot be easily derived from the gitweb instance root URL.
So add an optional base_git_url parameter enabling the computation of
correct clone URLs by appending project names to it.
Some gitweb instances prepend string prefixes to the displayed git clone
URLs, so make sure to strip them to properly extract the URLs.
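A rough sketch of the intended behavior, under the assumption that base_git_url takes precedence over scraped URLs (compute_clone_url and its signature are hypothetical, not the lister's actual code):

```python
import re


def compute_clone_url(project_name, base_git_url=None, scraped_text=None):
    """Hypothetical sketch: when base_git_url is set, derive the clone
    URL by appending the project name to it; otherwise keep only the URL
    part of the text scraped from the project page, dropping any display
    prefix some gitweb instances put before it (e.g. "git clone ")."""
    if base_git_url is not None:
        return f"{base_git_url.rstrip('/')}/{project_name}"
    match = re.search(r"(?:git|https?)://\S+", scraped_text)
    return match.group(0) if match else None


assert (
    compute_clone_url("foo.git", base_git_url="https://git.example.org")
    == "https://git.example.org/foo.git"
)
assert (
    compute_clone_url(
        "foo.git", scraped_text="git clone https://git.example.org/foo.git"
    )
    == "https://git.example.org/foo.git"
)
```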
Related to swh/infra/sysadm-environment#5051.
rstrip does not remove a string suffix (it strips any trailing
characters from the given set), so use another way to extract the
gitweb project name.
This fixes the computation of some gitweb origin URLs.
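The pitfall in a minimal illustration: rstrip takes a set of characters, not a literal suffix, so it can eat into the project name itself (str.removesuffix requires Python 3.9+).

```python
path = "foo/project.git"

# rstrip strips any trailing characters from the set {'.', 'g', 'i', 't'},
# not the literal suffix ".git", so it removes too much here:
assert path.rstrip(".git") == "foo/projec"

# removesuffix (Python 3.9+) removes the exact suffix, if present:
assert path.removesuffix(".git") == "foo/project"
```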
Related to swh/infra/sysadm-environment#5050.
Some instances require specific heuristics:
- some have summary pages which do not list metadata_url, so extra
  computation happens to list git:// origins which are cloneable;
- some have summary pages which reference metadata_url as multiple
  comma-separated URLs;
- some list relative URLs for repositories, so we need to join them
  with the main instance URL to get complete cloneable origins (or
  summary pages);
- some list "down" http origins (cloning those won't work), so list
  those as cloneable https ones (when the main URL is behind https).
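The heuristics above can be sketched roughly as follows (normalize_origin_urls is a hypothetical helper, not the lister's actual code):

```python
from urllib.parse import urljoin, urlparse


def normalize_origin_urls(instance_url, metadata_url):
    """Hypothetical sketch of the listed heuristics: split comma
    separated metadata URLs, join relative ones with the main instance
    URL, and rewrite http origins to https when the instance itself is
    served over https."""
    urls = []
    for url in (u.strip() for u in metadata_url.split(",")):
        # relative URL: join with the main instance URL
        if not urlparse(url).scheme:
            url = urljoin(instance_url, url)
        # "down" http origin on an https instance: prefer https
        if url.startswith("http://") and instance_url.startswith("https://"):
            url = "https://" + url[len("http://"):]
        urls.append(url)
    return urls


assert normalize_origin_urls(
    "https://git.example.org/",
    "/foo.git, http://git.example.org/bar.git",
) == ["https://git.example.org/foo.git", "https://git.example.org/bar.git"]
```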
Refs. swh/devel/swh-lister#1800