swh-lister

Author	SHA1	Message	Date
David Douard	cccb8c21ff	Replace all remaining occurrences of the 'local' cls by 'postgresql' The former has been deprecated for ages...	2024-10-28 14:35:29 +01:00
Antoine Lambert	a7607abcf9	tests: Fix mocking of sleep calls with tenacity 8.4.2 Latest tenacity release adds some internal changes that broke the mocking of sleep calls in tests. Fix it by directly mocking time.sleep (was not working previously).	2024-06-28 18:15:36 +02:00
Antoine Lambert	41407e0eff	Use beautifulsoup4 CSS selectors to simplify code and type checking As the types-beautifulsoup4 package gets installed in the swh virtualenv as it is a swh-scanner test dependency, some mypy errors were reported related to beautifulsoup4 typing. As the returned type for the find method of bs4 is the following union: Tag | NavigableString | None, isinstance calls must be used to ensure proper typing which is not great. So prefer to use the select_one method instead where a simple None check must be done to ensure typing is correct as it is returning Optional[Tag]. In a similar manner, replace use of find_all method by select method. It also has the advantage to simplify the code.	2024-04-16 11:22:51 +02:00
David Douard	714fccc3c7	python: Fix black formatting after bump to 23.1.0 in pre-commit	2023-12-05 10:33:07 +01:00
Jérémy Bobbio (Lunar)	7344d264e7	Ensure HTTPError.response is not None The implementation of `HTTPError` in `requests` does not guarantee that the `response` property will always be set. So we need to ensure it is not `None` before looking for the return code, for example. This also makes mypy checks pass again, as `types-request` was updated in 2.31.0.9 to better match this particular aspect. See: https://github.com/python/typeshed/pull/10875	2023-10-18 10:41:57 +02:00
Antoine R. Dumont (@ardumont)	e91e0bf09c	cgit: Allow url to be optional Some cgit instances are at a domain's root path so we can build their url directly from their 'instance' parameter. This unifies further the cli to register a lister and the cli to schedule the listed origins from a forge. [1] ``` https://git.kernel.org https://source.codeaurora.org https://git.trueelena.org https://dev.sanctum.geek.nz https://git.trueelena.org https://git.dpkg.org https://anongit.mindrot.org https://git.aurel32.net https://gitweb.gentoo.org https://git.joeyh.name https://git.adrian.geek.nz ``` Refs. swh/devel/swh-lister#4693	2023-05-23 11:47:51 +02:00
Antoine R. Dumont (@ardumont)	45bbc29a52	cgit/tasks: Allow passing extra parameters to task This unifies with other lister tasks modules. And this allows the cgit task to be scheduled by the add-forge-now scheduler cli. Refs. swh/infra/sysadm-environment#4813	2023-03-21 12:22:07 +01:00
Nicolas Dandrimont	e785e67315	Hook up recently introduced options to all listers Hopefully one day we'll be able to replace all of this mess with PEP692 TypedDict kwargs, but that's only on track for Python 3.12.	2022-12-05 16:33:45 +01:00
Antoine R. Dumont (@ardumont)	fd1a4244a0	cgit/tests: Rename readme.md to readme With the extension, the readme is included in the swh-docs build and fails. It's not intended for the documentation build so renaming it keeps it out of the doc build loop. This fixes build [1]. [1] https://jenkins.softwareheritage.org/view/all/job/DDOC/job/dev/2395/	2022-09-26 13:22:10 +02:00
Antoine Lambert	d5c30a3ce3	Update value of User-Agent HTTP request header used by listers That HTTP header value will now contain the lister name but also a link to our contact form in order for sysadmins to easily reach us if needed. The following template is used to generate it: "Software Heritage <lister_name> lister v<swh-lister version> (+https://www.softwareheritage.org/contact)"	2022-09-26 10:48:40 +02:00
Antoine Lambert	db6ce12e9e	Refactor and deduplicate HTTP requests code in listers Numerous listers were using the same page_request method or equivalent in their implementation so prefer to deduplicate that code by adding an http_request method in base lister class: swh.lister.pattern.Lister. That method simply wraps a call to requests.Session.request and logs some useful info for debugging and error reporting, also an HTTPError will be raised if a request ends up with an error. All listers using that new method now benefit of requests retry when an HTTP error occurs thanks to the use of the http_retry decorator.	2022-09-26 10:48:40 +02:00
Antoine Lambert	9c55acd286	Use generic HTTP retry policy by default and rename dedicated decorator Instead of retrying HTTP requests only for 429 status code by default, prefer to use the generic retry policy enabling to also retry for status codes >= 500 but also on ConnectionError exceptions. Rename throttling_retry decorator to http_retry to reflect this change.	2022-09-26 10:48:40 +02:00
Vincent SELLIER	9b3e565cf7	cgit: Ensure the clone url is searched on the right tab For some forges, the default tab for a repository detail is not the summary tab so the clone urls are not detected and the repository is ignored Related to T4544	2022-09-20 17:01:49 +02:00
Antoine Lambert	d38e05cff7	python: Reformat code with black 22.3.0 Related to T3922	2022-04-08 15:15:09 +02:00
Antoine Lambert	24bc671679	cgit: Enable to retry throttled HTTP requests Related to T3645	2021-10-22 15:15:05 +02:00
Antoine Lambert	6c12350863	pattern: Use URL network location as instance name when not provided Make the instance parameter of the base pattern lister optional and set lister name to URL network location when not provided. It simplifies lister creation when associated forge type have a lot of instances in the wild (e.g. gitlab or cgit) while giving more details about the listed forge instance. Also process listers for forge with multiple instances (cgit, gitea, gitlab, phabricator and tuleap) to ensure URL network location will be used when instance parameter is not provided. Related to T3403	2021-07-13 12:33:49 +02:00
Antoine R. Dumont (@ardumont)	729e76168f	cgit/lister: Fix error when a missing version is not provided Related to [1] [1] https://sentry.softwareheritage.org/share/issue/afe7279f9f2d4bdc86f4b1b068a281a5/	2021-05-28 12:09:56 +02:00
Vincent SELLIER	8e4dd178f1	cgit: remove the repository urls's trailing / Ensure the behavior is the same when a base url is provided or not Related to T3013#57810	2021-02-01 17:31:08 +01:00
Antoine R. Dumont (@ardumont)	2e22073558	cgit: Compute origin urls out of a base git url when provided. This adds a second behavior to the cgit lister to actually compute origin urls instead of parsing them out of another http request on git detailed page. This new behavior is expected to be the default behavior. The old behavior is kept for now and is expected to be used as fallback if too much false negatives are returned. Related to T2999	2021-01-29 15:33:24 +01:00
Antoine R. Dumont (@ardumont)	ae17b6b9a0	Make stateless lister constructors compatible with credentials In effect, it just allows to add credentials to cgit, cran and pypi listers. This fixes instances of error [1] [1] https://sentry.softwareheritage.org/share/issue/2c35a9f129cf4982a2dd003a232d507a/ Related to T2998	2021-01-28 14:42:56 +01:00
Vincent SELLIER	f6f9f1ca28	cgit: Don't stop the listing when a repository page is not available Related to T2988	2021-01-27 14:52:04 +01:00
Vincent SELLIER	91fcde8341	cgit: Add support for last_update information during listing Related to T2988	2021-01-27 14:17:17 +01:00
Vincent SELLIER	d62e77c1b4	cgit lister: Add missing types on the init method Related to T2984	2021-01-25 18:52:59 +01:00
Antoine Lambert	ea8ecee541	tests: Fix errors after swh-scheduler API update The PaginatedListedOriginList model has been updated in rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins field has been renamed to results.	2021-01-25 17:11:54 +01:00
Vincent SELLIER	e4a590fc7f	Port cgit lister to the new lister api Related to T2984	2021-01-25 14:57:45 +01:00
Antoine R. Dumont (@ardumont)	5d4b38999d	lister.cgit.tests: Clarify lister configuration	2020-10-30 13:30:14 +01:00
Antoine Lambert	22f7181294	python: Reorder imports with isort Related to T2610	2020-09-17 17:48:27 +02:00
Antoine R. Dumont (@ardumont)	5a5b7ef70b	tests: Separate lister instantiations Prior to this commit, all listers were instantiated at the same time even if only one was needed. This commit separates those instantiations. The only drawback to this is the db model initialization which now happens at each lister instantiation. This can be dealt with if needed at another time though.	2020-09-02 12:49:00 +02:00
Antoine R. Dumont (@ardumont)	9437a643ad	pytest: Define plugin and declare it in the root conftest Then drop all unneeded and indirect imports	2020-09-02 12:25:15 +02:00
Nicolas Dandrimont	c9963d4302	Use the new names for the swh.scheduler test fixtures	2020-07-09 17:06:50 +02:00
David Douard	93a4d8b784	Enable black - blackify all the python files, - enable black in pre-commit, - add a black tox environment.	2020-04-08 16:31:22 +02:00
Gautier Pugnonblanc Yann	e5fea84c55	review corrections	2020-02-20 09:13:49 +01:00
Gautier Pugnonblanc Yann	60adc424be	add annotation type in some lister file	2020-02-17 15:58:34 +01:00
Antoine Lambert	99fcd2b3f5	docs: Fix sphinx warnings Related to T2188	2020-01-17 16:15:11 +01:00
Antoine R. Dumont (@ardumont)	5ab9d67d67	core: Align listers' task output (hg/git tasks) with expected format Related to T2134 Related to D2409 Related to D2410	2019-12-09 15:12:17 +01:00
Antoine R. Dumont (@ardumont)	4a9608f31c	lister/tasks: Standardize return statements The following commit adapts the return statements from both lister and their associated tasks. This standardizes on what other modules (e.g. both dvcs and package loaders) do.	2019-12-02 15:49:38 +01:00
Nicolas Dandrimont	ff7fdf24db	Use a uniform User-Agent on all listers This also adds tests to make sure that we properly send our version number to upstreams.	2019-11-22 15:49:23 +01:00
Nicolas Dandrimont	78105940ff	Stop binding tasks to a specific instance of the celery app The celery.shared_task decorator allows late-binding of tasks to any celery app, which is well suited for our "task plugin" architecture.	2019-10-18 18:02:25 +02:00
Antoine R. Dumont (@ardumont)	a8cde12d72	tests: Update pytest_plugin according to latest version change	2019-10-14 18:20:15 +02:00
Antoine R. Dumont (@ardumont)	394658e53b	cgit.tests: Check the tasks from the scheduler	2019-10-09 17:57:57 +02:00
David Douard	bd11830328	cgit: reduce the batch size to 10 and add a bit of logging Since the CGit lister now perform an HTTP query for each git repos listed in the main index, it is significantly slower, so reducing the time between database commits make sense, and won't overload the database. With a bit of logging, it makes it easier to follow/debug the progress of a listing.	2019-09-04 15:37:40 +02:00
David Douard	8d9deeb8f8	plugins: add support for scheduler's task-type declaration Add a new register-task-types cli that will create missing task-type entries in the scheduler according to: - only create missing task-types (do not update them), but check that the backend_name field is consistent, - each SWHTask-based task declared in a module listed in the 'task_modules' plugin registry field will be checked and added if needed; tasks whose name starts with an underscore will not be added, - added task-type will have: - the 'type' field is derived from the task's function name (with underscores replaced with dashes), - the description field is the first line of that function's docstring, - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with a simple pattern matching to have decent default values for full/incremental tasks), - these default values can be overloaded via the 'task_type' plugin registry entry. For this, we had to rename all task names (eg. `cran_lister` -> `list_cran`). Comes with some tests.	2019-09-04 15:36:08 +02:00
David Douard	e3c0ea9d90	implement listers as plugins Listers are declared as plugins via the `swh.workers` entry_point. As such, the registry function is expected to return a dict with the `task_modules` field (as for generic worker plugins), plus: - `lister`: the lister class, - `models`: list of SQLAlchemy models used by this lister, - `init` (optional): hook (callable) used to initialize the lister's state (typically, create/initialize the database for this lister). If not set, the default implementation creates database tables (after optionally having deleted existing ones) according to models declared in the `models` register field. There is no need to explicitly add lister task modules in the main `conftest` module, but any new/extra lister to be tested must be registered (the tested lister module must be properly installed in the test environment). Also refactor a bit the cli tools: - add support for the standard --config-file option at the 'lister' group level, - move the --db-url to the 'lister' group, - drop the --lister option for the `swh lister db-init` cli tool: initializing (especially with --drop-tables) the database for a single lister is unreliable, since all tables are created using a single MetaData (in the same namespace).	2019-09-03 15:02:24 +02:00
David Douard	8785fc1a4e	cgit: fix cgit's task module and tests forgot some `url_prefix` there.	2019-09-03 12:01:55 +02:00
David Douard	3816b4d3bf	cgit: rewrite the CGit lister Simplify the code: - do only inherit from ListerBase - implement HTTP queries directly using requests - get rid of convoluted code Make the origin_url gathered from the git repo's "project" page instead of building it from the 'url_prefix' hack. Now, the lister WILL make substantially more requests, since it will make one request per listed git repo, but the provided origin_url should be pretty reliable now. When several URLs are provided as clonable URLs, choose the http/https one first, otherwise, choose the first one of the list. Add proper tests for the cgit lister. Also, get rid of the 'time_updated' column in the model.	2019-09-02 12:29:31 +02:00
Archit Agrawal	0bf24469b7	swh.lister.cgit: Remove repo page visit step Remove the need to visit every page and extract the origin url by introducing a parameter url_prefix. The origin url is in format <prefix>/<repo_name> where The prefix is same for all the repos for a particular cgit instance.	2019-06-28 20:02:07 +05:30
Archit Agrawal	7e3c79bb1d	swh.lister.cgit: Add pagination support Some cgit instances have pagination. Modify the lister to find all the pages and list all the repos from all the pages.	2019-06-28 19:27:25 +05:30
Archit Agrawal	b972a2a88d	swh.lister.cgit Implemented a lister to list the repos for a given CGit instance. Closes T1659	2019-06-28 19:27:25 +05:30

48 commits