swh-lister

Author	SHA1	Message	Date
Antoine Lambert	4c8d7baf75	phabricator/lister: Prevent erroneous scheduler tasks disabling Previously, the Phabricator lister was disabling some loading tasks while it was not supposed to. More precisely, due to an invalid index provided to a database query, the latest created scheduler task was disabled each time a new page of results was provided to the lister by the Phabricator API. Moreover, database queries were not filtered according to the Phabricator instance resulting in possible disabling of scheduler tasks from other instances. Closes T2000	2019-09-16 20:05:48 +02:00
Antoine Lambert	e83902c2a3	phabricator/lister: Fix get_next_target_from_response return type Without that fix, errors are raised when one wants to list Phabricator repositories in a specific index range. The issue is due to a comparison between a string and an integer. So convert next extracted repository index to integer to match the corresponding model type. Closes T1997	2019-09-16 13:36:46 +02:00
Antoine Lambert	1ebe762ea6	phabricator/lister: Do not override max_index when bootstrapping Turns out all newly listed repositories were filtered out because of that. Consequently, no entries in the listers database and no scheduler loading tasks were created when listing a Phabricator instance. Closes T1999	2019-09-16 13:34:22 +02:00
Antoine Lambert	7c8f4dc9a8	packagist/lister: Fix typos in docstring	2019-09-12 20:46:42 +02:00
Antoine R. Dumont (@ardumont)	7377439c5e	MANIFEST.in: Include cgit tests data folder	2019-09-09 12:13:45 +02:00
Antoine R. Dumont (@ardumont)	481b30c540	docs: Fix toc Related T1984	2019-09-06 12:31:06 +02:00
David Douard	780c0ef999	lister/base: remove the reference to the storage from ListerBase it is not used anymore.	2019-09-05 10:39:50 +02:00
David Douard	b810876ef8	tasks: normalize the url argument name of most lister Since all the listing tasks accepts an url as first argument (whatever the argument name is), it makes sense to use a simple common argument name for this. I've chosen 'url' instead of api_baseurl/forge_url/url. Also kill now useless `new_lister()` functions.	2019-09-04 15:38:01 +02:00
David Douard	631b8e7668	models: use the same declarative base class for all models This is needed to fix the db-init implementation so the debian loader (which does use the SQLBase from swh.storage) have its models declared in the MetaData used by the initialize() function.	2019-09-04 15:37:40 +02:00
David Douard	bd11830328	cgit: reduce the batch size to 10 and add a bit of logging Since the CGit lister now performs an HTTP query for each git repo listed in the main index, it is significantly slower, so reducing the time between database commits makes sense, and won't overload the database. A bit of logging makes it easier to follow/debug the progress of a listing.	2019-09-04 15:37:40 +02:00
David Douard	8d9deeb8f8	plugins: add support for scheduler's task-type declaration Add a new register-task-types cli that will create missing task-type entries in the scheduler according to: - only create missing task-types (do not update them), but check that the backend_name field is consistent, - each SWHTask-based task declared in a module listed in the 'task_modules' plugin registry field will be checked and added if needed; tasks whose names start with an underscore will not be added, - added task-type will have: - the 'type' field is derived from the task's function name (with underscores replaced with dashes), - the description field is the first line of that function's docstring, - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with a simple pattern matching to have decent default values for full/incremental tasks), - these default values can be overloaded via the 'task_type' plugin registry entry. For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`). Comes with some tests.	2019-09-04 15:36:08 +02:00
David Douard	e3c0ea9d90	implement listers as plugins Listers are declared as plugins via the `swh.workers` entry_point. As such, the registry function is expected to return a dict with the `task_modules` field (as for generic worker plugins), plus: - `lister`: the lister class, - `models`: list of SQLAlchemy models used by this lister, - `init` (optional): hook (callable) used to initialize the lister's state (typically, create/initialize the database for this lister). If not set, the default implementation creates database tables (after optionally having deleted existing ones) according to models declared in the `models` register field. There is no need to explicitly add lister task modules in the main `conftest` module, but any new/extra lister to be tested must be registered (the tested lister module must be properly installed in the test environment). Also refactor a bit the cli tools: - add support for the standard --config-file option at the 'lister' group level, - move the --db-url to the 'lister' group, - drop the --lister option for the `swh lister db-init` cli tool: initializing (especially with --drop-tables) the database for a single lister is unreliable, since all tables are created using a single MetaData (in the same namespace).	2019-09-03 15:02:24 +02:00
David Douard	c67a926f26	npm: make NpmVisitModel use the main declarative base class from core.models This is needed by the (refactored) db init mechanism, since the latter uses the main declarative base class (thus the main MetaData instance) to gather tables to be created/dropped.	2019-09-03 15:02:24 +02:00
David Douard	342964eda7	phabricator: fix the FullPhabricatorLister task forgot the forge_url -> api_baseurl renaming in there.	2019-09-03 12:01:55 +02:00
David Douard	8785fc1a4e	cgit: fix cgit's task module and tests forgot some `url_prefix` there.	2019-09-03 12:01:55 +02:00
David Douard	87cec2f5c3	phabricator: refactor PhabricatorLister's constructor - use the 'standard' api_baseurl as init argument, - make it optional, with default to forge.softwareheritage.org, - use origin_url as id.	2019-09-02 12:29:38 +02:00
David Douard	befe9a6d57	gitlab: make GitLabLister's api_baseurl init argument optional and simplify a bit the code of the constructor.	2019-09-02 12:29:38 +02:00
David Douard	b87cd5d309	github: make GitHubLister's api_baseurl init argument optional	2019-09-02 12:29:38 +02:00
David Douard	8950b0b32d	bitbucket: make BitBucketLister's api_baseurl init argument optional	2019-09-02 12:29:38 +02:00
David Douard	22f2f2c43c	core: make it possible to specify the api_baseurl init argument in override_config This is required to make lister class instantiation easier and more reliable, especially in the context of cli tools like 'swh lister run', for which we want to be able to specify any lister init argument as extra parameter of the command.	2019-09-02 12:29:38 +02:00
David Douard	3816b4d3bf	cgit: rewrite the CGit lister Simplify the code: - do only inherit from ListerBase - implement HTTP queries directly using requests - get rid of convoluted code Make the origin_url gathered from the git repo's "project" page instead of building it from the 'url_prefix' hack. Now, the lister WILL make substantially more requests, since it will make one request per listed git repo, but the provided origin_url should be pretty reliable now. When several URLs are provided as clonable URLs, choose the http/https one first, otherwise, choose the first one of the list. Add proper tests for the cgit lister. Also, get rid of the 'time_updated' column in the model.	2019-09-02 12:29:31 +02:00
David Douard	e0ce68377d	bitbucket: simplify a bit BitBucketLister's constructor get rid of the "smart" flush_packet_db computation.	2019-08-30 17:56:19 +02:00
David Douard	d807d15f65	phabricator: randomly select the API token in the provided list instead of picking the first one, so this behavior is consistent with ListerHttpTransport's one.	2019-08-30 17:56:19 +02:00
David Douard	814779404c	phabricator: small refactoring/simplification of the request_params method and get rid of the unneeded _build_query_params method.	2019-08-30 17:56:19 +02:00
David Douard	83d138759c	phabricator: kill PhabricatorLister's api_token argument stick to the existing credentials mechanism provided by ListerHttpTransport.	2019-08-30 17:56:19 +02:00
David Douard	6f56d2c8d7	core: move credentials' docstring from request_params to request_instance_credentials and fix empty values returned by the latter (empty list instead of empty dict).	2019-08-30 17:56:19 +02:00
Antoine R. Dumont (@ardumont)	09f3605a7e	docs: Remove spurious blank spaces	2019-08-29 09:57:59 +02:00
Antoine R. Dumont (@ardumont)	4b2ab0488a	cli: Unify new_lister method name to get_lister	2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)	dee9fe93bf	cli: Bootstrap tests on cli	2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)	e0664c10cd	lister.cli: Allow to list forges with policy and priority Example use case: swh lister run --lister gitlab \ --priority high \ --policy oneshot \ --db-url postgresql://postgres@localhost:5432/swh-listers \ api_baseurl=https://gitlab.ow2.org/api/v4/ Related T1919	2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)	87d2a16df0	listers: Allow to override policy and priority for scheduled tasks Prior to this commit, the policy and priority were hard-coded. The default values are now the old hard-coded values. This will allow to develop a cli to trigger forges listing with oneshot policy and some priority tasks. Thus ingesting those faster and without manual intervention as we currently do.	2019-08-28 11:57:10 +02:00
Archit Agrawal	5727f15cf3	swh.lister.packagist Implement a packagist lister to list the names and metadata url of all the packages. Closes 1776	2019-07-19 19:59:30 +05:30
Archit Agrawal	08ade29e6d	swh.lister.pypi: Add tests Add tests for pypi lister Closes T1890	2019-07-18 17:13:13 +05:30
Archit Agrawal	f424f07c7e	swh.lister.core: Add test for simple lister There were previously no tests for the listers which are using the class SimpleLister (like pypi). Refactored test_lister.py of lister core to accommodate tests for SimpleLister while keeping the tests for other listers undisturbed.	2019-07-18 17:13:13 +05:30
Stefano Zacchiroli	9c97291abd	add code of conduct document	2019-07-11 16:29:36 +02:00
Stefano Zacchiroli	60a6f12bfe	CONTRIBUTORS: add Sushant Sushant	2019-07-04 14:41:24 +02:00
Stefano Zacchiroli	bb2dc77788	bitbucket lister: fix typo in docstring	2019-07-04 14:40:02 +02:00
Stefano Zacchiroli	226dfe945f	CONTRIBUTORS: add Avi Kelman	2019-07-04 14:39:37 +02:00
Antoine R. Dumont (@ardumont)	6bd5cca151	MANIFEST.in: Include *.txt samples for tests to run during packaging	2019-06-28 18:21:30 +02:00
Antoine R. Dumont (@ardumont)	897a19ad84	MANIFEST.in: Include *.html samples for tests to run	2019-06-28 18:19:21 +02:00
Antoine R. Dumont (@ardumont)	c507948da8	bin: Drop dead code	2019-06-28 18:17:15 +02:00
Antoine R. Dumont (@ardumont)	32c5cf22c2	Add Archit Agrawal as contributors	2019-06-28 17:44:02 +02:00
Archit Agrawal	0bf24469b7	swh.lister.cgit: Remove repo page visit step Remove the need to visit every page and extract the origin url by introducing a parameter url_prefix. The origin url is in the format <prefix>/<repo_name>, where the prefix is the same for all the repos of a particular cgit instance.	2019-06-28 20:02:07 +05:30
Archit Agrawal	7e3c79bb1d	swh.lister.cgit: Add pagination support Some cgit instances paginate their results. Modify the lister to find all the pages and list all the repos from all the pages.	2019-06-28 19:27:25 +05:30
Archit Agrawal	b972a2a88d	swh.lister.cgit Implemented a lister to list the repos for a given CGit instance. Closes T1659	2019-06-28 19:27:25 +05:30
Antoine Lambert	d85bcdac5b	simple_lister: Split models into smaller chunks to avoid oversized db transactions Related T1659	2019-06-28 15:44:47 +02:00
Archit Agrawal	5ea9d5ed39	swh.lister.cran: Add description in task_dict Add description in task_dict method because the only metadata that can be found for a package at CRAN is its description. That can only be achieved through the built-in API in R, which the lister is already using. Hence, instead of getting metadata in the loader, it is passed by the lister.	2019-06-27 14:57:51 +05:30
Valentin Lorentz	52b1de87c5	Finish dropping the 'description' column. I missed some in `aef7d5952e`.	2019-06-26 14:46:27 +02:00
Antoine R. Dumont (@ardumont)	e54531510c	indexing_lister: Add docstrings to flush_packet_db & default_min_bound Related D1635	2019-06-26 11:27:41 +02:00
Antoine R. Dumont (@ardumont)	3d473c307c	lister: Type correctly the 'indexable' column instead of converting that column as a string As a side effect, bitbucket wise, we provided improperly the after query parameter as a date not url encoded. This resulted in improper api response from bitbucket's (we received from time to time the same next index as the current one). Related T1826	2019-06-26 10:58:54 +02:00


580 commits
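
Commit 3816b4d3bf states the rule the rewritten CGit lister uses when a repository page offers several clonable URLs: prefer an http/https one, otherwise take the first of the list. The sketch below illustrates only that selection rule. The function name `pick_origin_url` and the example URLs are assumptions made for illustration and are not taken from the swh.lister code.

```python
from typing import List, Optional
from urllib.parse import urlparse


def pick_origin_url(clone_urls: List[str]) -> Optional[str]:
    """Pick the origin URL among clonable URLs (hypothetical helper)."""
    # Prefer the first URL served over http or https.
    for url in clone_urls:
        if urlparse(url).scheme in ("http", "https"):
            return url
    # Otherwise fall back to the first URL of the list, if any.
    return clone_urls[0] if clone_urls else None


# Example with made-up URLs: the https URL wins over the git:// one.
chosen = pick_origin_url(
    ["git://example.org/repo.git", "https://example.org/repo.git"]
)
```

Returning `None` for an empty list is a choice made for this sketch; the commit message does not say how that case is handled.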