swh-lister

Author	SHA1	Message	Date
Stefano Zacchiroli	7dfd811e16	CRAN lister: make shelling out decoding compatible with Python 3.5	2019-10-28 15:35:21 +01:00
Stefano Zacchiroli	974f80f966	typing: minimal changes to make a no-op mypy run pass	2019-10-28 15:35:21 +01:00
Nicolas Dandrimont	78105940ff	Stop binding tasks to a specific instance of the celery app The celery.shared_task decorator allows late-binding of tasks to any celery app, which is well suited for our "task plugin" architecture.	2019-10-18 18:02:25 +02:00
Antoine R. Dumont (@ardumont)	a64ae9641d	debian.lister: Add integration test which checks scheduled tasks Related T2032	2019-10-15 12:21:24 +02:00
Antoine R. Dumont (@ardumont)	960868badb	pypi.tests: Remove trailing _ in test method name	2019-10-15 12:19:10 +02:00
Antoine R. Dumont (@ardumont)	b73c657ea7	npm.lister: Align docstrings	2019-10-15 11:19:48 +02:00
Antoine R. Dumont (@ardumont)	b4867ccda9	npm.tests: Add an integration test on listing with pagination Related T2032	2019-10-15 10:49:29 +02:00
Antoine R. Dumont (@ardumont)	a8cde12d72	tests: Update pytest_plugin according to latest version change	2019-10-14 18:20:15 +02:00
Antoine R. Dumont (@ardumont)	fcd8521622	tests/conftest: Log the db url used by tests	2019-10-14 14:47:56 +02:00
Antoine R. Dumont (@ardumont)	6b1c3d1fee	lister.core.db_utils: Remove dead code	2019-10-12 03:40:59 +02:00
Antoine R. Dumont (@ardumont)	f92ac83646	bitbucket.lister: Add integration test which checks scheduled tasks Related T2032	2019-10-12 03:39:47 +02:00
Antoine R. Dumont (@ardumont)	0b8b1419e1	github.lister: Add integration test which checks scheduled tasks Related T2032	2019-10-12 03:28:39 +02:00
Antoine R. Dumont (@ardumont)	1889875f67	gitlab.lister: Add integration test which checks scheduled tasks Related T2032	2019-10-12 03:11:31 +02:00
Antoine R. Dumont (@ardumont)	903b644c63	phabricator.lister: Add integration test which checks scheduled tasks Related T2032	2019-10-11 15:30:39 +02:00
Antoine R. Dumont (@ardumont)	f3bf9ae50f	packagist.lister: Add integration test which checks scheduled tasks Related T2032	2019-10-11 14:52:56 +02:00
Antoine R. Dumont (@ardumont)	678b7ea5bd	npm.lister: Add integration test which checks the scheduled tasks Related T2032	2019-10-11 14:07:40 +02:00
Antoine R. Dumont (@ardumont)	599af25ad6	pypi.lister: Add integration test which checks the scheduled tasks Related T2032	2019-10-11 13:24:55 +02:00
Antoine R. Dumont (@ardumont)	8d50e0d941	cran.lister: Fix cran lister and add proper integration test Which checks the cran lister tasks written in the scheduler. Related `d30d574dbe` Related `5ea9d5ed39` Related T2032	2019-10-11 13:19:22 +02:00
Antoine R. Dumont (@ardumont)	ef2c1847e4	gnu.lister: Move tests datadir into its dedicated folder Related D2076#inline-13551	2019-10-10 11:50:11 +02:00
Antoine R. Dumont (@ardumont)	0f0b840178	gnu.tests: Checks lister output from scheduler This also adds an swh-listers fixture which allows retrieving a test-ready lister from its name (e.g. gnu). Those listers have access to a scheduler fixture so we can check the listing output from the scheduler instance.	2019-10-09 18:23:51 +02:00
Antoine R. Dumont (@ardumont)	394658e53b	cgit.tests: Check the tasks from the scheduler	2019-10-09 17:57:57 +02:00
Antoine R. Dumont (@ardumont)	04ca318680	simple_lister: Extract common behavior in base class	2019-10-09 17:35:12 +02:00
Antoine R. Dumont (@ardumont)	61ce38a0b0	core.models: Fix typo	2019-10-09 17:35:12 +02:00
Antoine R. Dumont (@ardumont)	3ce6c5c6ef	lister/gnu: Modify gnu lister's loading task creation Loader Task signature for the loader gnu is now: - args: - package - package urls - kwargs: tarballs: List of Dict with keys archive (unchanged), 'time' (was 'date'), length (new)	2019-10-04 11:05:58 +02:00
Antoine R. Dumont (@ardumont)	00bb6c7bbf	lister/gnu: Remove unneeded get_file method	2019-10-04 11:02:17 +02:00
Antoine R. Dumont (@ardumont)	322c9dc7e2	lister/gnu: Modify default policy to oneshot	2019-10-04 11:01:49 +02:00
Antoine R. Dumont (@ardumont)	d30d574dbe	cran.lister: Refactor and fix cran lister Prior to this commit, the code was actually duplicated with an old version which would not work. Related D1492#41287	2019-10-02 11:06:59 +02:00
Antoine Lambert	04d8fdf8df	github/lister: Prevent erroneous scheduler tasks disabling Closes T2014	2019-09-19 14:30:30 +02:00
Antoine Lambert	7572228f7c	listers: Ensure run can be called without bounds arguments Closes T2001	2019-09-17 15:09:04 +02:00
Antoine Lambert	4c8d7baf75	phabricator/lister: Prevent erroneous scheduler tasks disabling Previously, the Phabricator lister was disabling some loading tasks while it was not supposed to. More precisely, due to an invalid index provided to a database query, the latest created scheduler task was disabled each time a new page of results was provided to the lister by the Phabricator API. Moreover, database queries were not filtered according to the Phabricator instance resulting in possible disabling of scheduler tasks from other instances. Closes T2000	2019-09-16 20:05:48 +02:00
Antoine Lambert	e83902c2a3	phabricator/lister: Fix get_next_target_from_response return type Without that fix, errors are raised when one wants to list Phabricator repositories in a specific index range. The issue is due to a comparison between a string and an integer. So convert next extracted repository index to integer to match the corresponding model type. Closes T1997	2019-09-16 13:36:46 +02:00
Antoine Lambert	1ebe762ea6	phabricator/lister: Do not override max_index when bootstrapping Turns out all newly listed repositories were filtered out because of that. Consequently, no entries in the listers database and no scheduler loading tasks were created when listing a Phabricator instance. Closes T1999	2019-09-16 13:34:22 +02:00
Antoine Lambert	7c8f4dc9a8	packagist/lister: Fix typos in docstring	2019-09-12 20:46:42 +02:00
David Douard	780c0ef999	lister/base: remove the reference to the storage from ListerBase it is not used anymore.	2019-09-05 10:39:50 +02:00
David Douard	b810876ef8	tasks: normalize the url argument name of most listers Since all the listing tasks accept a url as first argument (whatever the argument name is), it makes sense to use a simple common argument name for this. I've chosen 'url' instead of api_baseurl/forge_url/url. Also kill now useless `new_lister()` functions.	2019-09-04 15:38:01 +02:00
David Douard	631b8e7668	models: use the same declarative base class for all models This is needed to fix the db-init implementation so the debian loader (which does use the SQLBase from swh.storage) has its models declared in the MetaData used by the initialize() function.	2019-09-04 15:37:40 +02:00
David Douard	bd11830328	cgit: reduce the batch size to 10 and add a bit of logging Since the CGit lister now performs an HTTP query for each git repo listed in the main index, it is significantly slower, so reducing the time between database commits makes sense, and won't overload the database. With a bit of logging, it makes it easier to follow/debug the progress of a listing.	2019-09-04 15:37:40 +02:00
David Douard	8d9deeb8f8	plugins: add support for scheduler's task-type declaration Add a new register-task-types cli that will create missing task-type entries in the scheduler according to: - only create missing task-types (do not update them), but check that the backend_name field is consistent, - each SWHTask-based task declared in a module listed in the 'task_modules' plugin registry field will be checked and added if needed; tasks whose name starts with an underscore will not be added, - added task-type will have: - the 'type' field is derived from the task's function name (with underscores replaced with dashes), - the description field is the first line of that function's docstring, - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with a simple pattern matching to have decent default values for full/incremental tasks), - these default values can be overloaded via the 'task_type' plugin registry entry. For this, we had to rename all task names (eg. `cran_lister` -> `list_cran`). Comes with some tests.	2019-09-04 15:36:08 +02:00
David Douard	e3c0ea9d90	implement listers as plugins Listers are declared as plugins via the `swh.workers` entry_point. As such, the registry function is expected to return a dict with the `task_modules` field (as for generic worker plugins), plus: - `lister`: the lister class, - `models`: list of SQLAlchemy models used by this lister, - `init` (optional): hook (callable) used to initialize the lister's state (typically, create/initialize the database for this lister). If not set, the default implementation creates database tables (after optionally having deleted existing ones) according to models declared in the `models` register field. There is no need to explicitly add lister task modules in the main `conftest` module, but any new/extra lister to be tested must be registered (the tested lister module must be properly installed in the test environment). Also refactor a bit the cli tools: - add support for the standard --config-file option at the 'lister' group level, - move the --db-url to the 'lister' group, - drop the --lister option for the `swh lister db-init` cli tool: initializing (especially with --drop-tables) the database for a single lister is unreliable, since all tables are created using a single MetaData (in the same namespace).	2019-09-03 15:02:24 +02:00
David Douard	c67a926f26	npm: make NpmVisitModel use the main declarative base class from core.models This is needed by the (refactored) db init mechanism, since the latter uses the main declarative base class (thus the main MetaData instance) to gather tables to be created/dropped.	2019-09-03 15:02:24 +02:00
David Douard	342964eda7	phabricator: fix the FullPhabricatorLister task forgot the forge_url -> api_baseurl renaming in there.	2019-09-03 12:01:55 +02:00
David Douard	8785fc1a4e	cgit: fix cgit's task module and tests forgot some `url_prefix` there.	2019-09-03 12:01:55 +02:00
David Douard	87cec2f5c3	phabricator: refactor PhabricatorLister's constructor - use the 'standard' api_baseurl as init argument, - make it optional, with default to forge.softwareheritage.org, - use origin_url as id.	2019-09-02 12:29:38 +02:00
David Douard	befe9a6d57	gitlab: make GitLabLister's api_baseurl init argument optional and simplify a bit the code of the constructor.	2019-09-02 12:29:38 +02:00
David Douard	b87cd5d309	github: make GitHubLister's api_baseurl init argument optional	2019-09-02 12:29:38 +02:00
David Douard	8950b0b32d	bitbucket: make BitBucketLister's api_baseurl init argument optional	2019-09-02 12:29:38 +02:00
David Douard	22f2f2c43c	core: make it possible to specify the api_baseurl init argument in override_config This is required to be able to make lister class instantiation easier and more reliable, especially in the context of cli tools like 'swh lister run', for which we want to be able to specify any lister init argument as extra parameter of the command.	2019-09-02 12:29:38 +02:00
David Douard	3816b4d3bf	cgit: rewrite the CGit lister Simplify the code: - do only inherit from ListerBase - implement HTTP queries directly using requests - get rid of convoluted code Make the origin_url gathered from the git repo's "project" page instead of building it from the 'url_prefix' hack. Now, the lister WILL make substantially more requests, since it will make one request per listed git repo, but the provided origin_url should be pretty reliable now. When several URLs are provided as clonable URLs, choose the http/https one first, otherwise, choose the first one of the list. Add proper tests for the cgit lister. Also, get rid of the 'time_updated' column in the model.	2019-09-02 12:29:31 +02:00
David Douard	e0ce68377d	bitbucket: simplify a bit BitBucketLister's constructor get rid of the "smart" flush_packet_db computation.	2019-08-30 17:56:19 +02:00
David Douard	d807d15f65	phabricator: randomly select the API token in the provided list instead of picking the first one, so this behavior is consistent with ListerHttpTransport's one.	2019-08-30 17:56:19 +02:00
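
Commit 8d9deeb8f8 above describes how task-type entries are derived from task functions: underscores in the function name become dashes, the description is the first docstring line, and names starting with an underscore are skipped. The following is only a rough sketch of that naming rule, not the code from swh.lister.cli; `task_type_entries`, the `_helper` function and the docstrings are hypothetical names made up for illustration.

```python
from typing import Callable, Dict, Iterable


def task_type_entries(tasks: Iterable[Callable]) -> Dict[str, str]:
    """Map derived task-type names to descriptions (illustrative helper)."""
    entries = {}
    for task in tasks:
        name = task.__name__
        # Tasks whose name starts with an underscore are not registered.
        if name.startswith("_"):
            continue
        # The description is the first line of the function's docstring.
        doc = (task.__doc__ or "").strip()
        description = doc.splitlines()[0] if doc else ""
        # The type is the function name with underscores turned into dashes.
        entries[name.replace("_", "-")] = description
    return entries


def list_cran():
    """Full CRAN listing (hypothetical docstring).

    Further lines are ignored for the description.
    """


def _helper():
    """Private task, skipped by the rule above."""


entries = task_type_entries([list_cran, _helper])
```

Here `entries` holds a single `list-cran` key; the default values and the `task_type` registry override mentioned in the commit message are left out of this sketch.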

286 commits
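
Commit 3816b4d3bf in the log states the rule the rewritten CGit lister uses when a repository page offers several clonable URLs: take an http/https one first, otherwise the first of the list. A minimal sketch of that selection rule, assuming the URLs are already available as a list of strings; `pick_origin_url` and the example URLs are hypothetical and not taken from the lister's code.

```python
from typing import List, Optional


def pick_origin_url(clone_urls: List[str]) -> Optional[str]:
    """Return the preferred clone URL, or None for an empty list."""
    for url in clone_urls:
        # Prefer the first URL served over http or https.
        if url.startswith(("http://", "https://")):
            return url
    # No http(s) URL available: fall back to the first URL listed, if any.
    return clone_urls[0] if clone_urls else None


preferred = pick_origin_url(
    ["git://example.org/foo.git", "https://example.org/foo.git"]
)
fallback = pick_origin_url(
    ["git://example.org/foo.git", "ssh://git@example.org/foo.git"]
)
```

Returning None for an empty list is a choice made for this sketch; the commit message does not say what the lister does in that case.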