Since the python-debian 1.0 release, calling Sources.iter_paragraphs
returns an extra paragraph that does not have the expected schema,
so make sure to ignore it.
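A minimal sketch of that filtering, assuming paragraphs behave like mappings and that `Package` and `Version` are the fields the lister relies on (the helper name and the required-field tuple are illustrative, not taken from the lister code):

```python
from typing import Iterable, Iterator, Mapping

# Assumption: fields a well-formed source package paragraph must carry.
REQUIRED_FIELDS = ("Package", "Version")


def valid_paragraphs(
    paragraphs: Iterable[Mapping[str, str]]
) -> Iterator[Mapping[str, str]]:
    """Yield only the paragraphs that expose every required field."""
    for paragraph in paragraphs:
        if all(field in paragraph for field in REQUIRED_FIELDS):
            yield paragraph


paragraphs = [
    {"Package": "hello", "Version": "2.10-2"},
    {},  # stand-in for the extra paragraph without the expected schema
]
kept = list(valid_paragraphs(paragraphs))
```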
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.
Add a new test checking that lister classes have the mandatory parameters
declared in their constructors. The purpose is to avoid deployment issues on
staging or production environments, as celery tasks can fail to be executed if
mandatory parameters are not handled by listers.
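Such a check could look roughly like the sketch below. It relies on `inspect.signature` from the standard library; `ExampleLister`, `IncompleteLister` and the helper name are hypothetical and only serve to exercise the check.

```python
import inspect

MANDATORY_PARAMETERS = {"scheduler", "url", "instance", "credentials"}


class ExampleLister:
    """Hypothetical lister declaring every mandatory parameter."""

    def __init__(self, scheduler, url=None, instance=None, credentials=None):
        self.scheduler = scheduler
        self.url = url
        self.instance = instance
        self.credentials = credentials


class IncompleteLister:
    """Hypothetical lister that forgot the credentials parameter."""

    def __init__(self, scheduler, url=None, instance=None):
        self.scheduler = scheduler


def missing_mandatory_parameters(lister_class):
    """Return the mandatory parameters absent from the constructor signature."""
    declared = set(inspect.signature(lister_class.__init__).parameters)
    return MANDATORY_PARAMETERS - declared


missing_ok = missing_mandatory_parameters(ExampleLister)
missing_ko = missing_mandatory_parameters(IncompleteLister)
```

A real test would iterate over the registered lister classes rather than over hand-written examples.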
Related to swh/infra/sysadm-environment#5030.
mypy doesn't catch that multiple uses of
`self.listed_origins[origin_url]` in the same statement should be identical.
Using a temporary local variable for it seems to help.
Numerous listers were using the same page_request method, or an equivalent,
in their implementation, so deduplicate that code by adding an http_request
method to the base lister class: swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting. An HTTPError
is raised if a request ends with an error.
All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
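The overall shape of such a method can be sketched as follows. This is not the actual swh.lister code: the `http_retry` decorator here is a simplified stand-in with made-up parameters, `HTTPError` is a local class used in place of the requests one so the sketch has no dependency, and the fake session only exists to show a retry happening.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


class HTTPError(Exception):
    """Local stand-in for the requests HTTPError exception."""


def http_retry(max_attempts=3, delay=0.0):
    """Hypothetical decorator: call again when an HTTPError is raised."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except HTTPError:
                    if attempt == max_attempts:
                        raise
                    logger.warning("Attempt %d failed, retrying", attempt)
                    time.sleep(delay)

        return wrapper

    return decorator


class Lister:
    def __init__(self, session):
        # Any object exposing a requests.Session-like `request` method.
        self.session = session

    @http_retry()
    def http_request(self, url, method="GET", **kwargs):
        logger.debug("Fetching URL %s with params %s", url, kwargs.get("params"))
        response = self.session.request(method, url, **kwargs)
        if response.status_code >= 400:
            logger.warning("HTTP status %s for URL %s", response.status_code, url)
        response.raise_for_status()  # raises on error status codes
        return response


class FakeResponse:
    def __init__(self, status_code):
        self.status_code = status_code

    def raise_for_status(self):
        if self.status_code >= 400:
            raise HTTPError(f"HTTP {self.status_code}")


class FakeSession:
    """Test double returning the given status codes in order."""

    def __init__(self, statuses):
        self.statuses = list(statuses)
        self.calls = 0

    def request(self, method, url, **kwargs):
        self.calls += 1
        return FakeResponse(self.statuses.pop(0))


session = FakeSession([503, 200])
response = Lister(session).http_request("https://example.org/api")
```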
A debian package can have sources coming from multiple suites,
so make sure to update the last_update field in the
ListedOrigin model if the suite currently being processed has a more
recent modification time for its sources index.
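The update rule amounts to keeping the most recent modification time seen so far. A small sketch, with a hypothetical helper name and made-up suite timestamps:

```python
from datetime import datetime, timezone
from typing import Optional


def merge_last_update(current: Optional[datetime], suite_mtime: datetime) -> datetime:
    """Keep the most recent modification time seen across suites."""
    if current is None or suite_mtime > current:
        return suite_mtime
    return current


# Hypothetical sources index modification times, one per suite.
suite_mtimes = {
    "stretch": datetime(2020, 1, 10, tzinfo=timezone.utc),
    "buster": datetime(2021, 3, 5, tzinfo=timezone.utc),
    "bullseye": datetime(2020, 11, 20, tzinfo=timezone.utc),
}

last_update = None
for suite, mtime in suite_mtimes.items():
    last_update = merge_last_update(last_update, mtime)
```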
Related to T2400
Use the value of the "Last-Modified" header from the HTTP response
to the debian sources index request.
This prevents creating loading tasks for debian packages with no
changes since the last listing.
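Turning that header into a datetime only needs the standard library. In this sketch the header value is a made-up example and the `headers` dict stands in for the headers of a real response:

```python
from email.utils import parsedate_to_datetime

# Made-up header value, in the date format HTTP uses for Last-Modified.
headers = {"Last-Modified": "Sat, 06 Mar 2021 10:15:30 GMT"}

# A "GMT" date is parsed into a timezone-aware datetime with a zero UTC offset.
last_modified = parsedate_to_datetime(headers["Last-Modified"])
```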
Related to T2400
Not all debian suites have the same set of components.
So prefer to log that a component is missing for a suite instead of
raising an exception that would stop the listing.
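The behaviour can be illustrated with the sketch below. The `published` mapping is a hypothetical stand-in for what a mirror actually ships; the real lister finds out about a missing component when fetching its sources index.

```python
import logging
from typing import Dict, Iterable, List, Set, Tuple

logger = logging.getLogger(__name__)


def available_pairs(
    suites: Iterable[str],
    components: Iterable[str],
    published: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Return the (suite, component) pairs to list, logging the missing ones."""
    components = list(components)
    pairs = []
    for suite in suites:
        for component in components:
            if component not in published.get(suite, set()):
                # Log and move on instead of raising, so the listing continues.
                logger.debug(
                    "Component %s is missing for suite %s, skipping it",
                    component,
                    suite,
                )
                continue
            pairs.append((suite, component))
    return pairs


# Hypothetical mirror content: the second suite only ships "main".
published = {"suite-a": {"main", "contrib"}, "suite-b": {"main"}}
pairs = available_pairs(["suite-a", "suite-b"], ["main", "contrib"], published)
```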
For a given package, the debian lister generates a dictionary mapping
distribution and version to a list of files to be processed by the
debian loader.
For each file to process, the debian loader expects to find a URI
in order to download it and then use its content to ingest package
source code into the archive.
However, it turns out these URIs were not computed by the lister
in its current implementation, making any debian loading task fail
due to this missing info.
So add the computation of these URIs and ensure they are provided
in the debian loader input parameters.
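One plausible way to compute such URIs is to join the mirror URL, the package directory and the file name. The sketch below assumes exactly that; the function name, the directory layout and the file names are illustrative, not the lister's actual code.

```python
from typing import Dict, Iterable


def compute_file_uris(
    mirror_url: str, directory: str, filenames: Iterable[str]
) -> Dict[str, str]:
    """Map each file name to the URI it can be downloaded from."""
    base = f"{mirror_url.rstrip('/')}/{directory.strip('/')}"
    return {name: f"{base}/{name}" for name in filenames}


# Illustrative values only.
uris = compute_file_uris(
    "http://deb.debian.org/debian/",
    "pool/main/h/hello",
    ["hello_2.10-2.dsc", "hello_2.10.orig.tar.gz"],
)
```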
Related to T2400
Some distributions (e.g. debian-security) have a slightly different URL
for retrieving source package metadata.
So add a new URL template to try when downloading such data.
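The idea of trying several templates in turn can be sketched as below. Both templates are assumptions made for illustration (the second one stands for a layout with an extra `updates` directory), as are the function name and the default compression suffix.

```python
from typing import List

# Assumption: illustrative layouts, tried in this order.
URL_TEMPLATES = [
    "{mirror_url}/dists/{suite}/{component}/source/Sources{compression}",
    "{mirror_url}/dists/{suite}/updates/{component}/source/Sources{compression}",
]


def candidate_index_urls(
    mirror_url: str, suite: str, component: str, compression: str = ".xz"
) -> List[str]:
    """Build every sources index URL worth trying for a suite and component."""
    return [
        template.format(
            mirror_url=mirror_url.rstrip("/"),
            suite=suite,
            component=component,
            compression=compression,
        )
        for template in URL_TEMPLATES
    ]


urls = candidate_index_urls(
    "http://security.debian.org/debian-security", "stretch", "main"
)
```

The caller would request each URL in order and keep the first one that succeeds.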
Related to T3032#58239
Port debian lister to `swh.lister.pattern.Lister` API.
The new implementation will produce one instance of ListedOrigin model
per package, notably containing the set of parameters expected by the
debian loader.
The lister is also stateful, meaning only new packages and those with
newly found versions since the last listing will be returned.
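The stateful behaviour boils down to a set difference per package. In this sketch the state is assumed to be a mapping from package name to the set of versions already seen; the real lister state may be shaped differently.

```python
from typing import Dict, Set


def packages_to_list(
    state: Dict[str, Set[str]], listed: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return packages that are new or expose versions missing from the state."""
    updated = {}
    for name, versions in listed.items():
        new_versions = versions - state.get(name, set())
        if new_versions:
            updated[name] = new_versions
    return updated


# Hypothetical state from the previous listing and content of the current one.
state = {"hello": {"2.10-1"}, "zlib": {"1.2.11"}}
listed = {"hello": {"2.10-1", "2.10-2"}, "zlib": {"1.2.11"}, "curl": {"7.74.0"}}

to_list = packages_to_list(state, listed)
```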
Closes T2979
The newly introduced scheduler fixture truncates tables between tests. The debian
tests unfortunately share state, and they broke when that changed [1]. This fixes the
tests by avoiding the truncation of the scheduler db table "task".
Ideally, those tests should be reworked to avoid sharing state between tests.
[1] https://jenkins.softwareheritage.org/job/DLS/job/tests/1043
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.
The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
The following commit adapts the return statements of both the listers and their
associated tasks. This standardizes on what other modules (e.g. both dvcs and
package loaders) do.
This is no longer shared between the new debian loader and the lister.
The swh.storage.schemata module is still part of the swh.storage module though,
as it is still a dependency of the current swh.loader.debian production
loader. This will be cleaned up later.
Related to D2135
Since all the listing tasks accept a URL as their first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
it. I've chosen 'url' instead of api_baseurl/forge_url/url.
Also kill the now useless `new_lister()` functions.
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:
- only create missing task-types (do not update them), but check that the
  backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
  plugin registry field will be checked and added if needed; tasks whose name
  starts with an underscore will not be added,
- added task-types will have:
  - a 'type' field derived from the task's function name (with underscores
    replaced with dashes),
  - a description field set to the first line of that function's docstring,
  - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
    a simple pattern matching to have decent default values for full/incremental
    tasks),
  - these default values can be overloaded via the 'task_type' plugin registry
    entry.
For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`).
Comes with some tests.
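The derivation of the 'type' and description fields can be sketched with plain functions. The helper name and the example docstrings are hypothetical, and the default values taken from swh.lister.cli.DEFAULT_TASK_TYPE are left out on purpose.

```python
from typing import Callable, Dict, Optional


def task_type_from_function(func: Callable) -> Optional[Dict[str, str]]:
    """Derive the task-type fields from a task function, or None if skipped."""
    name = func.__name__
    if name.startswith("_"):
        # Tasks whose name starts with an underscore are not registered.
        return None
    doc = (func.__doc__ or "").strip()
    description = doc.splitlines()[0] if doc else ""
    return {"type": name.replace("_", "-"), "description": description}


def list_cran():
    """Lister task for the CRAN registry

    The remaining docstring lines are not part of the description.
    """


def _private_helper():
    """Not registered because of the leading underscore"""


cran_task_type = task_type_from_function(list_cran)
helper_task_type = task_type_from_function(_private_helper)
```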
Listers are declared as plugins via the `swh.workers` entry_point.
As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:
- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optional): hook (callable) used to initialize the lister's state
  (typically, create/initialize the database for this lister).
  If not set, the default implementation creates database tables (after
  optionally having deleted existing ones) according to models declared in
  the `models` register field.
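A registry function following the description above could look like this sketch. The class names and the module path are hypothetical, and `ExampleModel` merely stands in for a SQLAlchemy model.

```python
from typing import Any, Dict


class ExampleLister:
    """Hypothetical lister class."""


class ExampleModel:
    """Hypothetical stand-in for a SQLAlchemy model."""


def register() -> Dict[str, Any]:
    """Registry function meant to be referenced by a `swh.workers` entry point."""
    return {
        "lister": ExampleLister,
        "models": [ExampleModel],
        "task_modules": ["example_lister.tasks"],  # hypothetical module path
        # "init" is optional; leaving it out selects the default behaviour.
    }


registry = register()
```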
There is no need to explicitly add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).
Also refactor the cli tools a bit:
- add support for the standard --config-file option at the 'lister' group
  level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
  initializing (especially with --drop-tables) the database for a single
  lister is unreliable, since all tables are created using a single MetaData
  (in the same namespace).