The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read a RDS file from Python instead of rpy2.
As swh-lister was still packaged for debian at the time, the choice of using
rpy2 instead was made as a debian package is available for it while it is not
for pyreadr.
Now debian packaging was dropped for swh-lister we can reinstate the pyreadr
based implementation which has the advantages of being faster and not depending
on the R language runtime.
Related to swh/meta#1709.
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.
Add a new test checking listers classes have mandatory parameters declared
in their constructors. The purpose is to avoid deployment issues on staging
or production environment as celery tasks can fail to be executed if mandatory
parameters are not handled by listers.
Reated to swh/infra/sysadm-environment#5030.
Previously, the lister was relying on the use of the CRANtools R module
but it has the drawback to only list the latest version of each registered
package in the CRAN registry.
In order to get all possible versions for each CRAN package, prefer to exploit
the content of the weekly dump of the CRAN database in RDS format.
To read the content of the RDS file from Python, the rpy2 package is used as
it has the advantage to be packaged in debian.
Related to swh/meta#1709.
In listers collecting artifacts for each package to load, add artifacts
checksums, when that info is available, in parameters sent to loaders
in order to check downloaded artifact integrity.
A CRAN package can appear twice in the JSON list returned by the
list_all_packages.R script, most recent version of the package
appearing first.
So handle that edge case to avoid error when sending origins to
the scheduler.
R package last update date can be found in the "Packaged" field of
package info returned by tools::CRAN_package_db().
So retrieve it and parse it as a datetime to provide as last_update
parameter value in ListedOrigin model.
Closes T2989
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.
The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
Prior to this commit, we sent the origin url as a versioned artifact.
Now we send the origin url as a CRAN's canonical one, and the associated list
of artifacts found there (only 1 today).
The following commit adapts the return statements from both lister and their
associated tasks. This standardizes on what other modules (e.g. both dvcs and
package loaders) do.
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:
- only create missing task-types (do not update them), but check that the
backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
plugin registry field will be checked and added if needed; tasks which name
start wit an underscore will not be added,
- added task-type will have:
- the 'type' field is derived from the task's function name (with underscores
replaced with dashes),
- the description field is the first line of that function's docstring,
- default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
a simple pattern matching to have decent default values for full/incremental
tasks),
- these default values can be overloaded via the 'task_type' plugin registry
entry.
For this, we had to rename all tasks names (eg. `cran_lister` -> `list_cran`).
Comes with some tests.
Listers are declared as plugins via the `swh.workers` entry_point.
As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:
- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optionnal): hook (callable) used to initialize the lister's state
(typically, create/initialize the database for this lister).
If not set, the default implementation creates database tables (after
optionally having deleted exisintg ones) according to models declared in
the `models` register field.
There is no need for explicitely add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).
Also refactor a bit the cli tools:
- add support for the standard --config-file option at the 'lister' group
level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
initializing (especially with --drop-tables) the database for a single
lister is unreliable, since all tables are created using a sibgle MetaData
(in the same namespace).
Prior to this commit, the policy and priority were hard-coded.
The default values are now the old hard-coded values.
This will allow to develop a cli to trigger forges listing with oneshot policy
and some priority tasks. Thus ingesting those faster and without manual
interventation as we currently do.
Add description in task_dict method because
the only metadata that can be found for a
package at CRAN is its decsription. That can
only br achived from the build in API in R,
which ister is already using. Hence instead of
getting metadata in loader, it is passed
by lister.