Commit graph

285 commits

Author SHA1 Message Date
Stefano Zacchiroli
7dfd811e16 CRAN lister: make shelling out decoding compatible with Python 3.5 2019-10-28 15:35:21 +01:00
Stefano Zacchiroli
974f80f966 typing: minimal changes to make a no-op mypy run pass 2019-10-28 15:35:21 +01:00
Nicolas Dandrimont
78105940ff Stop binding tasks to a specific instance of the celery app
The celery.shared_task decorator allows late-binding of tasks to any celery app,
which is well suited for our "task plugin" architecture.
2019-10-18 18:02:25 +02:00
Antoine R. Dumont (@ardumont)
a64ae9641d
debian.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-15 12:21:24 +02:00
Antoine R. Dumont (@ardumont)
960868badb
pypi.tests: Remove trailing _ in test method name 2019-10-15 12:19:10 +02:00
Antoine R. Dumont (@ardumont)
b73c657ea7
npm.lister: Align docstrings 2019-10-15 11:19:48 +02:00
Antoine R. Dumont (@ardumont)
b4867ccda9
npm.tests: Add an integration test on listing with pagination
Related T2032
2019-10-15 10:49:29 +02:00
Antoine R. Dumont (@ardumont)
a8cde12d72
tests: Update pytest_plugin according to latest version change 2019-10-14 18:20:15 +02:00
Antoine R. Dumont (@ardumont)
fcd8521622
tests/conftest: Log the db url used by tests 2019-10-14 14:47:56 +02:00
Antoine R. Dumont (@ardumont)
6b1c3d1fee
lister.core.db_utils: Remove dead code 2019-10-12 03:40:59 +02:00
Antoine R. Dumont (@ardumont)
f92ac83646
bitbucket.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-12 03:39:47 +02:00
Antoine R. Dumont (@ardumont)
0b8b1419e1
github.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-12 03:28:39 +02:00
Antoine R. Dumont (@ardumont)
1889875f67
gitlab.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-12 03:11:31 +02:00
Antoine R. Dumont (@ardumont)
903b644c63
phabricator.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-11 15:30:39 +02:00
Antoine R. Dumont (@ardumont)
f3bf9ae50f
packagist.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-11 14:52:56 +02:00
Antoine R. Dumont (@ardumont)
678b7ea5bd
npm.lister: Add integration test which checks the scheduled tasks
Related T2032
2019-10-11 14:07:40 +02:00
Antoine R. Dumont (@ardumont)
599af25ad6
pypi.lister: Add integration test which checks the scheduled tasks
Related T2032
2019-10-11 13:24:55 +02:00
Antoine R. Dumont (@ardumont)
8d50e0d941
cran.lister: Fix cran lister and add proper integration test
Which checks the cran lister tasks written in the scheduler.

Related d30d574dbe
Related 5ea9d5ed39

Related T2032
2019-10-11 13:19:22 +02:00
Antoine R. Dumont (@ardumont)
ef2c1847e4
gnu.lister: Move tests datadir into its dedicated folder
Related D2076#inline-13551
2019-10-10 11:50:11 +02:00
Antoine R. Dumont (@ardumont)
0f0b840178
gnu.tests: Checks lister output from scheduler
This also adds an swh-listers fixture which allows retrieving a test-ready
lister from its name (e.g. gnu). Those listers have access to a scheduler
fixture so we can check the listing output from the scheduler instance.
2019-10-09 18:23:51 +02:00
Antoine R. Dumont (@ardumont)
394658e53b
cgit.tests: Check the tasks from the scheduler 2019-10-09 17:57:57 +02:00
Antoine R. Dumont (@ardumont)
04ca318680
simple_lister: Extract common behavior in base class 2019-10-09 17:35:12 +02:00
Antoine R. Dumont (@ardumont)
61ce38a0b0
core.models: Fix typo 2019-10-09 17:35:12 +02:00
Antoine R. Dumont (@ardumont)
3ce6c5c6ef
lister/gnu: Modify gnu lister's loading task creation
Loader Task signature for the loader gnu is now:
- args:
  - package
  - package urls

- kwargs:
  tarballs: List of Dict with keys archive (unchanged), 'time' (was 'date'),
      length (new)
2019-10-04 11:05:58 +02:00
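The task signature described above can be sketched as follows. This is only an illustration: the helper name `make_gnu_task_arguments`, the parameter names and the choice of a single URL string for "package urls" are assumptions; only the `tarballs` keyword and its keys `archive`, `time` and `length` come from the commit message.

```python
from typing import Any, Dict, List

# Keys each tarball entry is expected to carry, per the commit message.
REQUIRED_TARBALL_KEYS = {"archive", "time", "length"}


def make_gnu_task_arguments(
    package: str, package_url: str, tarballs: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Build the args/kwargs pair for a GNU loading task (hypothetical helper)."""
    for tarball in tarballs:
        missing = REQUIRED_TARBALL_KEYS - set(tarball)
        if missing:
            raise ValueError(
                "tarball entry is missing keys: %s" % sorted(missing)
            )
    return {
        "args": [package, package_url],
        "kwargs": {"tarballs": tarballs},
    }
```

An entry still using the old `date` key would be rejected by this sketch, which makes the `date` to `time` rename visible at task creation time.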
Antoine R. Dumont (@ardumont)
00bb6c7bbf
lister/gnu: Remove unneeded get_file method 2019-10-04 11:02:17 +02:00
Antoine R. Dumont (@ardumont)
322c9dc7e2
lister/gnu: Modify default policy to oneshot 2019-10-04 11:01:49 +02:00
Antoine R. Dumont (@ardumont)
d30d574dbe
cran.lister: Refactor and fix cran lister
Prior to this commit, the code was actually duplicated with an old version
which would not work.

Related D1492#41287
2019-10-02 11:06:59 +02:00
Antoine Lambert
04d8fdf8df github/lister: Prevent erroneous scheduler tasks disabling
Closes T2014
2019-09-19 14:30:30 +02:00
Antoine Lambert
7572228f7c listers: Ensure run can be called without bounds arguments
Closes T2001
2019-09-17 15:09:04 +02:00
Antoine Lambert
4c8d7baf75 phabricator/lister: Prevent erroneous scheduler tasks disabling
Previously, the Phabricator lister was disabling some loading tasks while it was not
supposed to. More precisely, due to an invalid index provided to a database query,
the latest created scheduler task was disabled each time a new page of results was
provided to the lister by the Phabricator API. Moreover, database queries were not
filtered according to the Phabricator instance resulting in possible disabling of
scheduler tasks from other instances.

Closes T2000
2019-09-16 20:05:48 +02:00
Antoine Lambert
e83902c2a3 phabricator/lister: Fix get_next_target_from_response return type
Without that fix, errors are raised when one wants to list Phabricator repositories
in a specific index range. The issue is due to a comparison between a string and
an integer. So convert next extracted repository index to integer to match the
corresponding model type.

Closes T1997
2019-09-16 13:36:46 +02:00
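A minimal sketch of the kind of conversion this fix describes. The function name `next_index_from_response` and the response layout (`result.cursor.after` holding the next index as a string) are assumptions made for illustration, not the lister's actual code.

```python
from typing import Any, Dict, Optional


def next_index_from_response(body: Dict[str, Any]) -> Optional[int]:
    """Return the next repository index as an int, or None on the last page.

    The paging cursor is assumed to carry the index as a string, while the
    model column it is compared against is an integer.
    """
    after = body["result"]["cursor"]["after"]
    if after is None:
        return None
    # Convert so later comparisons against integer bounds do not raise.
    return int(after)
```

With the value converted, a bounds check such as `next_index_from_response(body) <= max_bound` compares two integers instead of a string and an integer.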
Antoine Lambert
1ebe762ea6 phabricator/lister: Do not override max_index when bootstrapping
Turns out all newly listed repositories were filtered out because of that.
Consequently, no entries in the listers database and no scheduler loading
tasks were created when listing a Phabricator instance.

Closes T1999
2019-09-16 13:34:22 +02:00
Antoine Lambert
7c8f4dc9a8 packagist/lister: Fix typos in docstring 2019-09-12 20:46:42 +02:00
David Douard
780c0ef999 lister/base: remove the reference to the storage from ListerBase
it is not used anymore.
2019-09-05 10:39:50 +02:00
David Douard
b810876ef8 tasks: normalize the url argument name of most listers
Since all the listing tasks accept a url as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.

Also kill now useless `new_lister()` functions.
2019-09-04 15:38:01 +02:00
David Douard
631b8e7668 models: use the same declarative base class for all models
This is needed to fix the db-init implementation so the debian loader (which
does use the SQLBase from swh.storage) has its models declared in the
MetaData used by the initialize() function.
2019-09-04 15:37:40 +02:00
David Douard
bd11830328 cgit: reduce the batch size to 10 and add a bit of logging
Since the CGit lister now performs an HTTP query for each git repo listed in
the main index, it is significantly slower, so reducing the time between
database commits makes sense, and won't overload the database.

A bit of logging makes it easier to follow/debug the progress of
a listing.
2019-09-04 15:37:40 +02:00
David Douard
8d9deeb8f8 plugins: add support for scheduler's task-type declaration
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:

- only create missing task-types (do not update them), but check that the
  backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
  plugin registry field will be checked and added if needed; tasks whose name
  starts with an underscore will not be added,
- added task-type will have:
  - the 'type' field is derived from the task's function name (with underscores
    replaced with dashes),
  - the description field is the first line of that function's docstring,
  - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
    a simple pattern matching to have decent default values for full/incremental
    tasks),
  - these default values can be overloaded via the 'task_type' plugin registry
    entry.

For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`).

Comes with some tests.
2019-09-04 15:36:08 +02:00
David Douard
e3c0ea9d90 implement listers as plugins
Listers are declared as plugins via the `swh.workers` entry_point.

As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:

- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optional): hook (callable) used to initialize the lister's state
  (typically, create/initialize the database for this lister).
  If not set, the default implementation creates database tables (after
  optionally having deleted existing ones) according to models declared in
  the `models` register field.

There is no need to explicitly add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).

Also refactor the cli tools a bit:
- add support for the standard --config-file option at the 'lister' group
  level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
  initializing (especially with --drop-tables) the database for a single
  lister is unreliable, since all tables are created using a single MetaData
  (in the same namespace).
2019-09-03 15:02:24 +02:00
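A rough sketch of the registry dict described above and of how the optional `init` hook could fall back to a default. Everything here is illustrative: `ExampleLister`, `ExampleModel`, the module name `example_lister.tasks`, the `init_lister` helper and the hook signature are assumptions, and the default hook only returns a message instead of touching a database.

```python
from typing import Any, Dict

REQUIRED_FIELDS = ("lister", "models", "task_modules")


class ExampleLister:
    """Hypothetical lister class."""


class ExampleModel:
    """Hypothetical stand-in for a SQLAlchemy model."""


def register() -> Dict[str, Any]:
    """Registry function, as would be referenced by a `swh.workers` entry point."""
    return {
        "lister": ExampleLister,
        "models": [ExampleModel],
        "task_modules": ["example_lister.tasks"],
    }


def default_init(entry: Dict[str, Any], drop_tables: bool = False) -> str:
    """Stand-in for the default behaviour: (re)create tables for the models."""
    names = ", ".join(model.__name__ for model in entry["models"])
    action = "recreate" if drop_tables else "create"
    return "%s tables for %s" % (action, names)


def init_lister(entry: Dict[str, Any], drop_tables: bool = False) -> str:
    """Run the plugin's `init` hook if present, else the default one."""
    missing = [field for field in REQUIRED_FIELDS if field not in entry]
    if missing:
        raise KeyError("registry entry is missing fields: %s" % missing)
    init = entry.get("init", default_init)
    return init(entry, drop_tables=drop_tables)
```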
David Douard
c67a926f26 npm: make NpmVisitModel use the main declarative base class from core.models
This is needed by the (refactored) db init mechanism, since the latter uses
the main declarative base class (thus the main MetaData instance) to gather
tables to be created/dropped.
2019-09-03 15:02:24 +02:00
David Douard
342964eda7 phabricator: fix the FullPhabricatorLister task
forgot the forge_url -> api_baseurl renaming in there.
2019-09-03 12:01:55 +02:00
David Douard
8785fc1a4e cgit: fix cgit's task module and tests
forgot some `url_prefix` there.
2019-09-03 12:01:55 +02:00
David Douard
87cec2f5c3 phabricator: refactor PhabricatorLister's constructor
- use the 'standard' api_baseurl as init argument,
- make it optional, with default to forge.softwareheritage.org,
- use origin_url as id.
2019-09-02 12:29:38 +02:00
David Douard
befe9a6d57 gitlab: make GitLabLister's api_baseurl init argument optional
and simplify the constructor code a bit.
2019-09-02 12:29:38 +02:00
David Douard
b87cd5d309 github: make GitHubLister's api_baseurl init argument optional 2019-09-02 12:29:38 +02:00
David Douard
8950b0b32d bitbucket: make BitBucketLister's api_baseurl init argument optional 2019-09-02 12:29:38 +02:00
David Douard
22f2f2c43c core: make it possible to specify the api_baseurl init argument in override_config
This is required to make lister class instantiation easier and more
reliable, especially in the context of cli tools like 'swh lister run', for which
we want to be able to specify any lister init argument as extra parameter of the
command.
2019-09-02 12:29:38 +02:00
David Douard
3816b4d3bf cgit: rewrite the CGit lister
Simplify the code:
- do only inherit from ListerBase
- implement HTTP queries directly using requests
- get rid of convoluted code

Make the origin_url gathered from the git repo's "project" page instead of
building it from the 'url_prefix' hack. Now, the lister WILL make substantially
more requests, since it will make one request per listed git repo, but
the provided origin_url should be pretty reliable now.

When several clonable URLs are provided, choose the http/https one first;
otherwise, choose the first one in the list.

Add proper tests for the cgit lister.

Also, get rid of the 'time_updated' column in the model.
2019-09-02 12:29:31 +02:00
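The clone URL selection rule described above can be sketched as follows. The function name `pick_origin_url` is hypothetical, and returning `None` for an empty list is an assumption the commit message does not cover.

```python
from typing import List, Optional


def pick_origin_url(clone_urls: List[str]) -> Optional[str]:
    """Prefer the first http/https clone URL, else the first URL listed."""
    if not clone_urls:
        return None  # assumption: nothing to choose from
    for url in clone_urls:
        if url.startswith(("http://", "https://")):
            return url
    return clone_urls[0]
```

For instance, given `["git://example.org/repo.git", "https://example.org/repo.git"]`, this returns the https URL; given only git and ssh URLs, it returns the first of them.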
David Douard
e0ce68377d bitbucket: simplify a bit BitBucketLister's constructor
get rid of the "smart" flush_packet_db computation.
2019-08-30 17:56:19 +02:00
David Douard
d807d15f65 phabricator: randomly select the API token in the provided list
instead of picking the first one, so this behavior is consistent with
ListerHttpTransport's one.
2019-08-30 17:56:19 +02:00
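A small sketch of random token selection as described above. The helper name `pick_api_token` and its optional `rng` parameter (useful for reproducible tests) are assumptions, not the lister's actual interface.

```python
import random
from typing import Optional, Sequence


def pick_api_token(
    tokens: Sequence[str], rng: Optional[random.Random] = None
) -> str:
    """Pick one API token at random instead of always using the first one."""
    if not tokens:
        raise ValueError("no API token provided")
    chooser = rng if rng is not None else random
    return chooser.choice(tokens)
```

Spreading requests over all configured tokens this way avoids exhausting the rate limit of the first token while the others sit unused.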