swh-lister

Author	SHA1	Message	Date
Antoine Lambert	24bc671679	cgit: Enable to retry throttled HTTP requests Related to T3645	2021-10-22 15:15:05 +02:00
Antoine Lambert	6c12350863	pattern: Use URL network location as instance name when not provided Make the instance parameter of the base pattern lister optional and set lister name to URL network location when not provided. It simplifies lister creation when the associated forge type has a lot of instances in the wild (e.g. gitlab or cgit) while giving more details about the listed forge instance. Also process listers for forges with multiple instances (cgit, gitea, gitlab, phabricator and tuleap) to ensure URL network location will be used when instance parameter is not provided. Related to T3403	2021-07-13 12:33:49 +02:00
Antoine R. Dumont (@ardumont)	729e76168f	cgit/lister: Fix error when a missing version is not provided Related to [1] [1] https://sentry.softwareheritage.org/share/issue/afe7279f9f2d4bdc86f4b1b068a281a5/	2021-05-28 12:09:56 +02:00
Vincent SELLIER	8e4dd178f1	cgit: remove the repository urls' trailing / Ensure the behavior is the same when a base url is provided or not Related to T3013#57810	2021-02-01 17:31:08 +01:00
Antoine R. Dumont (@ardumont)	2e22073558	cgit: Compute origin urls out of a base git url when provided. This adds a second behavior to the cgit lister to actually compute origin urls instead of parsing them out of another http request on git detailed page. This new behavior is expected to be the default behavior. The old behavior is kept for now and is expected to be used as fallback if too many false negatives are returned. Related to T2999	2021-01-29 15:33:24 +01:00
Antoine R. Dumont (@ardumont)	ae17b6b9a0	Make stateless lister constructors compatible with credentials In effect, it just allows to add credentials to cgit, cran and pypi listers. This fixes instances of error [1] [1] https://sentry.softwareheritage.org/share/issue/2c35a9f129cf4982a2dd003a232d507a/ Related to T2998	2021-01-28 14:42:56 +01:00
Vincent SELLIER	f6f9f1ca28	cgit: Don't stop the listing when a repository page is not available Related to T2988	2021-01-27 14:52:04 +01:00
Vincent SELLIER	91fcde8341	cgit: Add support for last_update information during listing Related to T2988	2021-01-27 14:17:17 +01:00
Vincent SELLIER	d62e77c1b4	cgit lister: Add missing types on the init method Related to T2984	2021-01-25 18:52:59 +01:00
Antoine Lambert	ea8ecee541	tests: Fix errors after swh-scheduler API update The PaginatedListedOriginList model has been updated in rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins field has been renamed to results.	2021-01-25 17:11:54 +01:00
Vincent SELLIER	e4a590fc7f	Port cgit lister to the new lister api Related to T2984	2021-01-25 14:57:45 +01:00
Antoine R. Dumont (@ardumont)	5d4b38999d	lister.cgit.tests: Clarify lister configuration	2020-10-30 13:30:14 +01:00
Antoine Lambert	22f7181294	python: Reorder imports with isort Related to T2610	2020-09-17 17:48:27 +02:00
Antoine R. Dumont (@ardumont)	5a5b7ef70b	tests: Separate lister instantiations Prior to this commit, all listers were instantiated at the same time even if only one was needed. This commit separates those instantiations. The only drawback to this is the db model initialization which now happens at each lister instantiation. This can be dealt with if needed at another time though.	2020-09-02 12:49:00 +02:00
Antoine R. Dumont (@ardumont)	9437a643ad	pytest: Define plugin and declare it in the root conftest Then drop all unneeded and indirect imports	2020-09-02 12:25:15 +02:00
Nicolas Dandrimont	c9963d4302	Use the new names for the swh.scheduler test fixtures	2020-07-09 17:06:50 +02:00
David Douard	93a4d8b784	Enable black - blackify all the python files, - enable black in pre-commit, - add a black tox environment.	2020-04-08 16:31:22 +02:00
Gautier Pugnonblanc Yann	e5fea84c55	review corrections	2020-02-20 09:13:49 +01:00
Gautier Pugnonblanc Yann	60adc424be	add type annotations in some lister files	2020-02-17 15:58:34 +01:00
Antoine Lambert	99fcd2b3f5	docs: Fix sphinx warnings Related to T2188	2020-01-17 16:15:11 +01:00
Antoine R. Dumont (@ardumont)	5ab9d67d67	core: Align listers' task output (hg/git tasks) with expected format Related to T2134 Related to D2409 Related to D2410	2019-12-09 15:12:17 +01:00
Antoine R. Dumont (@ardumont)	4a9608f31c	lister/tasks: Standardize return statements The following commit adapts the return statements from both lister and their associated tasks. This standardizes on what other modules (e.g. both dvcs and package loaders) do.	2019-12-02 15:49:38 +01:00
Nicolas Dandrimont	ff7fdf24db	Use a uniform User-Agent on all listers This also adds tests to make sure that we properly send our version number to upstreams.	2019-11-22 15:49:23 +01:00
Nicolas Dandrimont	78105940ff	Stop binding tasks to a specific instance of the celery app The celery.shared_task decorator allows late-binding of tasks to any celery app, which is well suited for our "task plugin" architecture.	2019-10-18 18:02:25 +02:00
Antoine R. Dumont (@ardumont)	a8cde12d72	tests: Update pytest_plugin according to latest version change	2019-10-14 18:20:15 +02:00
Antoine R. Dumont (@ardumont)	394658e53b	cgit.tests: Check the tasks from the scheduler	2019-10-09 17:57:57 +02:00
David Douard	bd11830328	cgit: reduce the batch size to 10 and add a bit of logging Since the CGit lister now performs an HTTP query for each git repo listed in the main index, it is significantly slower, so reducing the time between database commits makes sense, and won't overload the database. With a bit of logging, it makes it easier to follow/debug the progress of a listing.	2019-09-04 15:37:40 +02:00
David Douard	8d9deeb8f8	plugins: add support for scheduler's task-type declaration Add a new register-task-types cli that will create missing task-type entries in the scheduler according to: - only create missing task-types (do not update them), but check that the backend_name field is consistent, - each SWHTask-based task declared in a module listed in the 'task_modules' plugin registry field will be checked and added if needed; tasks whose names start with an underscore will not be added, - added task-type will have: - the 'type' field is derived from the task's function name (with underscores replaced with dashes), - the description field is the first line of that function's docstring, - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with a simple pattern matching to have decent default values for full/incremental tasks), - these default values can be overloaded via the 'task_type' plugin registry entry. For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`). Comes with some tests.	2019-09-04 15:36:08 +02:00
David Douard	e3c0ea9d90	implement listers as plugins Listers are declared as plugins via the `swh.workers` entry_point. As such, the registry function is expected to return a dict with the `task_modules` field (as for generic worker plugins), plus: - `lister`: the lister class, - `models`: list of SQLAlchemy models used by this lister, - `init` (optional): hook (callable) used to initialize the lister's state (typically, create/initialize the database for this lister). If not set, the default implementation creates database tables (after optionally having deleted existing ones) according to models declared in the `models` register field. There is no need to explicitly add lister task modules in the main `conftest` module, but any new/extra lister to be tested must be registered (the tested lister module must be properly installed in the test environment). Also refactor a bit the cli tools: - add support for the standard --config-file option at the 'lister' group level, - move the --db-url to the 'lister' group, - drop the --lister option for the `swh lister db-init` cli tool: initializing (especially with --drop-tables) the database for a single lister is unreliable, since all tables are created using a single MetaData (in the same namespace).	2019-09-03 15:02:24 +02:00
David Douard	8785fc1a4e	cgit: fix cgit's task module and tests forgot some `url_prefix` there.	2019-09-03 12:01:55 +02:00
David Douard	3816b4d3bf	cgit: rewrite the CGit lister Simplify the code: - do only inherit from ListerBase - implement HTTP queries directly using requests - get rid of convoluted code Make the origin_url gathered from the git repo's "project" page instead of building it from the 'url_prefix' hack. Now, the lister WILL make substantially more requests, since it will make one request per listed git repo, but the provided origin_url should be pretty reliable now. When several urls are provided as clonable URLs, choose the http/https one first, otherwise, choose the first one of the list. Add proper tests for the cgit lister. Also, get rid of the 'time_updated' column in the model.	2019-09-02 12:29:31 +02:00
Archit Agrawal	0bf24469b7	swh.lister.cgit: Remove repo page visit step Remove the need to visit every page and extract the origin url by introducing a parameter url_prefix. The origin url is in the format <prefix>/<repo_name>, where the prefix is the same for all the repos of a particular cgit instance.	2019-06-28 20:02:07 +05:30
Archit Agrawal	7e3c79bb1d	swh.lister.cgit: Add pagination support Some cgit instances have pagination. Modify lister to find all the pages and list all the repos from all the pages.	2019-06-28 19:27:25 +05:30
Archit Agrawal	b972a2a88d	swh.lister.cgit Implemented a lister to list the repos for a given CGit instance. Closes T1659	2019-06-28 19:27:25 +05:30

34 commits