This also adds an swh-listers fixture which allows to retrieve a test ready
lister from its name (e.g gnu). Those listers have access to a scheduler
fixture so we can check the listing output from the scheduler instance.
Loader Task signature for the loader gnu is now:
- args:
- package
- package urls
- kwargs:
tarballs: List of Dict with keys archive (unchanged), 'time' (was 'date'),
length (new)
Previously, the Phabricator lister was disabling some loading tasks while it was not
supposed to. More precisely, due to an invalid index provided to a database query,
the latest created scheduler task was disabled each time a new page of results was
provided to the lister by the Phabricator API. Moreover, database queries were not
filtered according to the Phabricator instance resulting in possible disabling of
scheduler tasks from other instances.
Closes T2000
Without that fix, errors are raised when one wants to list Phabricator repositories
in a specific index range. The issue is due to a comparison between a string and
an integer. So convert next extracted repository index to integer to match the
corresponding model type.
Closes T1997
Turns out all newly listed repositories were filtered out because of that.
Consequently, no entries in the listers database and no scheduler loading
tasks were created when listing a Phabricator instance.
Closes T1999
Since all the listing tasks accepts an url as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.
Also kill now useless `new_lister()` functions.
This is needed to fix the db-init implementation so the debian loader (which
does use the SQLBase from swh.storage) have its models declared in the
MetaData used by the initialize() function.
Since the CGit lister now perform an HTTP query for each git repos listed in
the main index, it is significantly slower, so reducing the time between
database commits make sense, and won't overload the database.
With a bit of logging, it makes it easier to follow/debug the progress of
a listing.
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:
- only create missing task-types (do not update them), but check that the
backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
plugin registry field will be checked and added if needed; tasks which name
start wit an underscore will not be added,
- added task-type will have:
- the 'type' field is derived from the task's function name (with underscores
replaced with dashes),
- the description field is the first line of that function's docstring,
- default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
a simple pattern matching to have decent default values for full/incremental
tasks),
- these default values can be overloaded via the 'task_type' plugin registry
entry.
For this, we had to rename all tasks names (eg. `cran_lister` -> `list_cran`).
Comes with some tests.
Listers are declared as plugins via the `swh.workers` entry_point.
As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:
- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optionnal): hook (callable) used to initialize the lister's state
(typically, create/initialize the database for this lister).
If not set, the default implementation creates database tables (after
optionally having deleted exisintg ones) according to models declared in
the `models` register field.
There is no need for explicitely add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).
Also refactor a bit the cli tools:
- add support for the standard --config-file option at the 'lister' group
level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
initializing (especially with --drop-tables) the database for a single
lister is unreliable, since all tables are created using a sibgle MetaData
(in the same namespace).
This is needed by the (refactored) db init mechanism, since this later uses
the main declarative base class (thus the main MetaData instance) to gather
tables to be created/dropped.
This is required to be able to make lister classes instanciation easier and more
reliable, especially in the context of cli tools like 'swh lister run', for which
we want to be able to specify any lister init argument as extra parameter of the
command.
Simplify the code:
- do only inherit from ListerBase
- implement HTTP queries directly using requests
- get rid of convoluted code
Make the origin_url gathered from the git repo's "project" page instead of
building it from the 'url_prefix' hack. Now, the lister WILL make substancially
more requests, since it will make one request per listed git repo, but
the provided origin_url should be pretty reliable now.
When several url are provided as clonable URLs, choose the http/https one first,
otherwise, choose the first one of the list.
Add proper tests for the cgit lister.
Also, get rid of the 'time_updated' column in the model.