The latest tenacity release introduced internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep (which was not working previously).
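A minimal sketch of the pattern, with a hand-rolled retry helper standing in for a tenacity-decorated function (the helper and its names are hypothetical): patching time.sleep keeps the test independent of the retry library's internals.

```python
import time
from unittest import mock


def retry_with_backoff(func, attempts=3, delay=10):
    """Hypothetical retrying helper standing in for a tenacity-decorated
    function; it sleeps between attempts via time.sleep."""
    for attempt in range(attempts):
        try:
            return func()
        except ValueError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


calls = []


def flaky():
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("transient failure")
    return "ok"


# Patching time.sleep directly means library upgrades that reorganize
# internal sleep helpers cannot break the test's mocking.
with mock.patch("time.sleep") as mocked_sleep:
    result = retry_with_backoff(flaky)

assert result == "ok"
assert mocked_sleep.call_count == 2  # slept between the three attempts
```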
As the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors were reported
related to beautifulsoup4 typing.
As the return type of the bs4 find method is the union
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing, which is not great.
So prefer the select_one method instead: it returns Optional[Tag], so a
simple None check is enough to ensure typing is correct.
In a similar manner, replace uses of the find_all method with the select
method. This also has the advantage of simplifying the code.
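The difference in a minimal sketch (the markup and CSS selectors here are illustrative only, assuming beautifulsoup4 is installed):

```python
from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package

html = '<div><a class="repo" href="/p/foo.git">foo</a></div>'
soup = BeautifulSoup(html, "html.parser")

# find returns Tag | NavigableString | None per the stubs, forcing
# isinstance checks to satisfy mypy; select_one returns Optional[Tag],
# so a plain None check suffices.
link = soup.select_one("a.repo")
if link is not None:
    url = link["href"]

# Likewise, select returns a list of Tag elements, which keeps the
# list comprehension simple and well-typed.
urls = [a["href"] for a in soup.select("a.repo")]
assert urls == ["/p/foo.git"]
```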
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` attribute is always set, so we need to ensure it is not
`None` before looking at the status code, for example.
This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better model this particular aspect. See:
https://github.com/python/typeshed/pull/10875
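A minimal illustration of the guard (status_of is a hypothetical helper, assuming requests is installed):

```python
from typing import Optional

import requests


def status_of(exc: requests.HTTPError) -> Optional[int]:
    """Return the status code of the failed request, if any.

    HTTPError.response is only set when the error was raised from an
    actual response (e.g. via Response.raise_for_status), so it must be
    checked against None before use.
    """
    if exc.response is not None:
        return exc.response.status_code
    return None


# An HTTPError can be constructed without a response, hence the guard:
assert status_of(requests.HTTPError("boom")) is None
```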
Similar to cgit, there exist cases where git clone URLs for projects
hosted on a gitweb instance cannot be found when scraping project pages
and cannot be easily derived from the gitweb instance root URL.
So add an optional base_git_url parameter enabling the computation of
correct clone URLs by appending project names to it.
Some gitweb instances prepend string prefixes to the displayed git clone
URLs, so make sure to strip them to properly extract the URLs.
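A rough sketch of the intended behavior, under the assumption that base_git_url takes precedence over scraped URLs (compute_clone_url and its signature are hypothetical, not the lister's actual code):

```python
import re


def compute_clone_url(project_name, base_git_url=None, scraped_text=None):
    """Hypothetical sketch: when base_git_url is set, derive the clone
    URL by appending the project name to it; otherwise keep only the URL
    part of the text scraped from the project page, dropping any display
    prefix some gitweb instances put before it (e.g. "git clone ")."""
    if base_git_url is not None:
        return f"{base_git_url.rstrip('/')}/{project_name}"
    match = re.search(r"(?:git|https?)://\S+", scraped_text)
    return match.group(0) if match else None


assert (
    compute_clone_url("foo.git", base_git_url="https://git.example.org")
    == "https://git.example.org/foo.git"
)
assert (
    compute_clone_url(
        "foo.git", scraped_text="git clone https://git.example.org/foo.git"
    )
    == "https://git.example.org/foo.git"
)
```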
Related to swh/infra/sysadm-environment#5051.
rstrip does not remove a string suffix (it strips any trailing
characters from the given set), so use another way to extract the
gitweb project name.
This fixes the computation of some gitweb origin URLs.
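The pitfall in a minimal illustration: rstrip takes a set of characters, not a literal suffix, so it can eat into the project name itself (str.removesuffix requires Python 3.9+).

```python
path = "foo/project.git"

# rstrip strips any trailing characters from the set {'.', 'g', 'i', 't'},
# not the literal suffix ".git", so it removes too much here:
assert path.rstrip(".git") == "foo/projec"

# removesuffix (Python 3.9+) removes the exact suffix, if present:
assert path.removesuffix(".git") == "foo/project"
```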
Related to swh/infra/sysadm-environment#5050.
Some instances require specific heuristics:
- some have summary pages which do not list metadata_url, so extra
  computation happens to list git:// origins which are cloneable;
- some have summary pages which reference metadata_url as multiple
  comma-separated URLs;
- some list relative URLs for repositories, so we need to join them
  with the main instance URL to get complete cloneable origins (or
  summary pages);
- some list "down" http origins (cloning those won't work), so list
  those as cloneable https ones (when the main URL is behind https).
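The heuristics above can be sketched roughly as follows (normalize_origin_urls is a hypothetical helper, not the lister's actual code):

```python
from urllib.parse import urljoin, urlparse


def normalize_origin_urls(instance_url, metadata_url):
    """Hypothetical sketch of the listed heuristics: split comma
    separated metadata URLs, join relative ones with the main instance
    URL, and rewrite http origins to https when the instance itself is
    served over https."""
    urls = []
    for url in (u.strip() for u in metadata_url.split(",")):
        # relative URL: join with the main instance URL
        if not urlparse(url).scheme:
            url = urljoin(instance_url, url)
        # "down" http origin on an https instance: prefer https
        if url.startswith("http://") and instance_url.startswith("https://"):
            url = "https://" + url[len("http://"):]
        urls.append(url)
    return urls


assert normalize_origin_urls(
    "https://git.example.org/",
    "/foo.git, http://git.example.org/bar.git",
) == ["https://git.example.org/foo.git", "https://git.example.org/bar.git"]
```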
Refs. swh/devel/swh-lister#1800