Commit graph

253 commits

Author SHA1 Message Date
David Douard
780c0ef999 lister/base: remove the reference to the storage from ListerBase
it is not used anymore.
2019-09-05 10:39:50 +02:00
David Douard
b810876ef8 tasks: normalize the url argument name of most lister
Since all the listing tasks accept a URL as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.

Also kill the now-useless `new_lister()` functions.
2019-09-04 15:38:01 +02:00
David Douard
631b8e7668 models: use the same declarative base class for all models
This is needed to fix the db-init implementation so the debian loader (which
does use the SQLBase from swh.storage) has its models declared in the
MetaData used by the initialize() function.
2019-09-04 15:37:40 +02:00
David Douard
bd11830328 cgit: reduce the batch size to 10 and add a bit of logging
Since the CGit lister now performs an HTTP query for each git repo listed in
the main index, it is significantly slower, so reducing the time between
database commits makes sense, and won't overload the database.

With a bit of logging, it makes it easier to follow/debug the progress of
a listing.
2019-09-04 15:37:40 +02:00
David Douard
8d9deeb8f8 plugins: add support for scheduler's task-type declaration
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:

- only create missing task-types (do not update them), but check that the
  backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
  plugin registry field will be checked and added if needed; tasks whose name
  starts with an underscore will not be added,
- added task-type will have:
  - the 'type' field is derived from the task's function name (with underscores
    replaced with dashes),
  - the description field is the first line of that function's docstring,
  - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
    a simple pattern matching to have decent default values for full/incremental
    tasks),
  - these default values can be overloaded via the 'task_type' plugin registry
    entry.

For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`).

Comes with some tests.
2019-09-04 15:36:08 +02:00
David Douard
e3c0ea9d90 implement listers as plugins
Listers are declared as plugins via the `swh.workers` entry_point.

As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:

- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optional): hook (callable) used to initialize the lister's state
  (typically, create/initialize the database for this lister).
  If not set, the default implementation creates database tables (after
  optionally having deleted existing ones) according to models declared in
  the `models` register field.

There is no need to explicitly add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).

Also refactor a bit the cli tools:
- add support for the standard --config-file option at the 'lister' group
  level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
  initializing (especially with --drop-tables) the database for a single
  lister is unreliable, since all tables are created using a single MetaData
  (in the same namespace).
2019-09-03 15:02:24 +02:00
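As a rough illustration of the registry function described in commit e3c0ea9d90 above, the sketch below returns a dict with the fields named in the message. Everything else is an assumption: `ExampleLister`, `ExampleModel`, the module name `example_lister.tasks`, and the `check_registry_entry` helper are hypothetical and do not come from swh.lister.

```python
class ExampleLister:
    """Hypothetical lister class standing in for a real one."""


class ExampleModel:
    """Hypothetical stand-in for a SQLAlchemy model."""


def register():
    """Registry function, as it might be exposed via the `swh.workers` entry point."""
    return {
        "lister": ExampleLister,                   # the lister class
        "models": [ExampleModel],                  # models used by this lister
        "task_modules": ["example_lister.tasks"],  # assumed to be module names
        # "init" is optional; when absent, the default table creation applies.
    }


REQUIRED_FIELDS = ("lister", "models", "task_modules")


def check_registry_entry(entry):
    """Check the required fields and return the optional `init` hook (or None)."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise ValueError("missing registry fields: " + ", ".join(missing))
    return entry.get("init")


init_hook = check_registry_entry(register())
print(init_hook)
```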
David Douard
c67a926f26 npm: make NpmVisitModel use the main declarative base class from core.models
This is needed by the (refactored) db init mechanism, since the latter uses
the main declarative base class (thus the main MetaData instance) to gather
tables to be created/dropped.
2019-09-03 15:02:24 +02:00
David Douard
342964eda7 phabricator: fix the FullPhabricatorLister task
forgot the forge_url -> api_baseurl renaming in there.
2019-09-03 12:01:55 +02:00
David Douard
8785fc1a4e cgit: fix cgit's task module and tests
forgot some `url_prefix` there.
2019-09-03 12:01:55 +02:00
David Douard
87cec2f5c3 phabricator: refactor PhabricatorLister's constructor
- use the 'standard' api_baseurl as init argument,
- make it optional, with default to forge.softwareheritage.org,
- use origin_url as id.
2019-09-02 12:29:38 +02:00
David Douard
befe9a6d57 gitlab: make GitLabLister's api_baseurl init argument optional
and simplify a bit the code of the constructor.
2019-09-02 12:29:38 +02:00
David Douard
b87cd5d309 github: make GitHubLister's api_baseurl init argument optional 2019-09-02 12:29:38 +02:00
David Douard
8950b0b32d bitbucket: make BitBucketLister's api_baseurl init argument optional 2019-09-02 12:29:38 +02:00
David Douard
22f2f2c43c core: make it possible to specify the api_baseurl init argument in override_config
This is required to make lister class instantiation easier and more
reliable, especially in the context of cli tools like 'swh lister run', for which
we want to be able to specify any lister init argument as extra parameter of the
command.
2019-09-02 12:29:38 +02:00
David Douard
3816b4d3bf cgit: rewrite the CGit lister
Simplify the code:
- do only inherit from ListerBase
- implement HTTP queries directly using requests
- get rid of convoluted code

Make the origin_url gathered from the git repo's "project" page instead of
building it from the 'url_prefix' hack. Now, the lister WILL make substantially
more requests, since it will make one request per listed git repo, but
the provided origin_url should be pretty reliable now.

When several clonable URLs are provided, choose the http/https one first,
otherwise, choose the first one of the list.

Add proper tests for the cgit lister.

Also, get rid of the 'time_updated' column in the model.
2019-09-02 12:29:31 +02:00
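The URL selection rule stated in commit 3816b4d3bf above (prefer an http/https clone URL, otherwise take the first one) can be sketched as follows. The function name `pick_origin_url` and the sample URLs are hypothetical; this is not the lister's actual code.

```python
def pick_origin_url(clone_urls):
    """Pick the origin URL among the clonable URLs of a repository.

    Prefers the first http/https URL; falls back to the first URL of the
    list. Returns None for an empty list (an assumption of this sketch).
    """
    for url in clone_urls:
        if url.startswith(("http://", "https://")):
            return url
    return clone_urls[0] if clone_urls else None


print(pick_origin_url(["git://example.org/repo.git",
                       "https://example.org/repo.git"]))
print(pick_origin_url(["git://example.org/repo.git",
                       "ssh://git@example.org/repo.git"]))
```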
David Douard
e0ce68377d bitbucket: simplify a bit BitBucketLister's constructor
get rid of the "smart" flush_packet_db computation.
2019-08-30 17:56:19 +02:00
David Douard
d807d15f65 phabricator: randomly select the API token in the provided list
instead of picking the first one, so this behavior is consistent with
ListerHttpTransport's one.
2019-08-30 17:56:19 +02:00
David Douard
814779404c phabricator: small refactoring/simplification of the request_params method
and get rid of the unneeded _build_query_params method.
2019-08-30 17:56:19 +02:00
David Douard
83d138759c phabricator: kill PhabricatorLister's api_token argument
stick to the existing credentials mechanism provided by ListerHttpTransport.
2019-08-30 17:56:19 +02:00
David Douard
6f56d2c8d7 core: move credentials' docstring from request_params to request_instance_credentials
and fix empty values returned by the latter (empty list instead of empty dict).
2019-08-30 17:56:19 +02:00
Antoine R. Dumont (@ardumont)
4b2ab0488a
cli: Unify new_lister method name to get_lister 2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)
dee9fe93bf
cli: Bootstrap tests on cli 2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)
e0664c10cd
lister.cli: Allow to list forges with policy and priority
Example use case:

swh lister run --lister gitlab \
               --priority high \
               --policy oneshot \
               --db-url postgresql://postgres@localhost:5432/swh-listers \
               api_baseurl=https://gitlab.ow2.org/api/v4/

Related T1919
2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)
87d2a16df0
listers: Allow to override policy and priority for scheduled tasks
Prior to this commit, the policy and priority were hard-coded.
The default values are now the old hard-coded values.

This will allow developing a cli to trigger forge listings with oneshot policy
and some priority tasks, thus ingesting those faster and without manual
intervention as we currently do.
2019-08-28 11:57:10 +02:00
Archit Agrawal
5727f15cf3 swh.lister.packagist
Implement a packagist lister to list the
names and metadata url of all the
packages.

Closes 1776
2019-07-19 19:59:30 +05:30
Archit Agrawal
08ade29e6d swh.lister.pypi: Add tests
Add tests for pypi lister
Closes T1890
2019-07-18 17:13:13 +05:30
Archit Agrawal
f424f07c7e swh.lister.core: Add test for simple lister
There were previously no tests for the listers
which are using the class SimpleLister (like pypi).
Refactored test_lister.py of lister core to
accommodate tests for SimpleLister, keeping the tests
undisturbed for other listers.
2019-07-18 17:13:13 +05:30
Stefano Zacchiroli
bb2dc77788 bitbucket lister: fix typo in docstring 2019-07-04 14:40:02 +02:00
Archit Agrawal
0bf24469b7 swh.lister.cgit: Remove repo page visit step
Remove the need to visit every page and extract the
origin url by introducing a parameter url_prefix.
The origin url is in format <prefix>/<repo_name> where
the prefix is the same for all the repos of a particular
cgit instance.
2019-06-28 20:02:07 +05:30
Archit Agrawal
7e3c79bb1d swh.lister.cgit: Add pagination support
Some cgit instances have pagination. Modify
lister to find all the pages and list all the repos
from all the pages.
2019-06-28 19:27:25 +05:30
Archit Agrawal
b972a2a88d swh.lister.cgit
Implemented a lister to list the repos for a given CGit instance.

Closes T1659
2019-06-28 19:27:25 +05:30
Antoine Lambert
d85bcdac5b simple_lister: Split models into smaller chunks to avoid oversized db transactions
Related T1659
2019-06-28 15:44:47 +02:00
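The idea behind commit d85bcdac5b above, splitting a large list of models into smaller chunks so that each database transaction stays bounded, can be sketched with a generic helper. The name `chunked` and the chunk size used here are illustrative assumptions, not values from the commit.

```python
def chunked(items, size):
    """Yield successive slices of `items` holding at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


models = list(range(7))  # stand-in for a list of model dicts
for chunk in chunked(models, 3):
    # In a lister, each chunk would be injected and committed separately.
    print(chunk)
```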
Archit Agrawal
5ea9d5ed39 swh.lister.cran: Add description in task_dict
Add description in task_dict method because
the only metadata that can be found for a
package at CRAN is its description. That can
only be achieved from the built-in API in R,
which the lister is already using. Hence instead of
getting metadata in loader, it is passed
by lister.
2019-06-27 14:57:51 +05:30
Valentin Lorentz
52b1de87c5 Finish dropping the 'description' column.
I missed some in aef7d5952e.
2019-06-26 14:46:27 +02:00
Antoine R. Dumont (@ardumont)
e54531510c
indexing_lister: Add docstrings to flush_packet_db & default_min_bound
Related D1635
2019-06-26 11:27:41 +02:00
Antoine R. Dumont (@ardumont)
3d473c307c
lister: Type correctly the 'indexable' column
instead of converting that column as a string

As a side effect, bitbucket-wise, we improperly provided the after query
parameter as a date that was not URL-encoded. This resulted in improper API
responses from bitbucket (from time to time we received the same next index as
the current one).

Related T1826
2019-06-26 10:58:54 +02:00
Antoine R. Dumont (@ardumont)
b99617f976
relister: Fix consistently the behavior for the first time relisting
If nothing has been done prior to a full relisting, there is actually nothing
to list. So the relister in question does nothing.

In that context, the IndexingLister class's `db_partition_indices` method now
returns an empty list instead of raising a ValueError when there is nothing to
list.

Related T1826
Related e129e48
2019-06-25 14:48:17 +02:00
Antoine R. Dumont (@ardumont)
6662ae8db5
indexing_lister: Allow to define flush packet size
Prior to this commit, indexing lister instances were flushing every packet of
20. This can now be defined per sub classes.
2019-06-25 14:48:16 +02:00
Antoine R. Dumont (@ardumont)
5ec3067b0d
Clean up code
- Remove unneeded return instructions
- Clarify tests code regarding request_index computations
2019-06-25 14:48:13 +02:00
Antoine R. Dumont (@ardumont)
45428c25df
bitbucket: Unify logging instructions 2019-06-25 14:09:59 +02:00
Antoine R. Dumont (@ardumont)
9aa8a6f7ae
bitbucket: Allow to specify the number of repos per api request
This is independent but still, it somehow fixes the issue occurring on T1826.

Related T1826
2019-06-21 17:50:23 +02:00
Antoine R. Dumont (@ardumont)
e129e48c31
bitbucket: Fix full lister with fallback [start, end] if not provided
Related T1826
2019-06-21 15:46:51 +02:00
Antoine R. Dumont (@ardumont)
b3463ecddc
Drop SWH prefix in classes everywhere
It's redundant with the swh modules in itself.
2019-06-20 19:08:46 +02:00
Archit Agrawal
f76b96b825 swh.lister.gnu: Change origin type to tar
Change origin type from 'gnu' to 'tar'
2019-06-19 17:21:02 +05:30
Valentin Lorentz
aef7d5952e Remove columns 'description' and 'origin_id'.
They are useless.
2019-06-19 10:29:15 +02:00
Antoine R. Dumont (@ardumont)
e13912f711
phabricator_tests: Add missing headers 2019-06-18 14:41:50 +02:00
Antoine R. Dumont (@ardumont)
df2754e5a6
phabricator.tasks: Remove unused code
Related T1824
Related P438
2019-06-18 14:41:20 +02:00
Antoine R. Dumont (@ardumont)
af681ac128
phabricator: model: Reference the forge's instance name in model
As phabricator is an "instance" lister (there exists multiple instances of
phabricator in the wild), we need to reference that information.

In effect, this aligns phabricator lister with for example the gitlab one.

Related T1801
Related P434
2019-06-18 07:19:14 +02:00
Antoine R. Dumont (@ardumont)
fc92c79b7e
models: Unify tablenames using singular as main archive's convention
Related P434
2019-06-18 07:18:34 +02:00
Antoine R. Dumont (@ardumont)
6d11705908
phabricator.lister: Use credentials setup from configuration file
Prior to this commit, this expected the api.token to be provided at task
initialization. That behavior has been kept for cli purposes. It's no good for
production purposes though (as this leaks the credentials in the scheduler db).

So now, the credentials are fetched from the lister's configuration file as the
other listers do.

Another change is the authentication mechanism, which is slightly different. It's
not using a basic `auth` mechanism. It's expecting an `api.token` query
parameter so the `request_params` is overridden to provide that.

Related T1809
2019-06-17 16:19:23 +02:00