Recent changes in base Lister class implementation turn the call to
self.scheduler.update_lister mandatory to update the last termination
date for a lister.
It has some side effects on the GitHub lister as there is one incremental
instance plus multiple range ones relisting previously discovered repos
executed in parallel.
Range GitHub listers should not override the shared incremental lister
state as StaleData exceptions might be raised otherwise, so override
the set_state_in_scheduler Lister method to ensure that.
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.
Add a new test checking listers classes have mandatory parameters declared
in their constructors. The purpose is to avoid deployment issues on staging
or production environment as celery tasks can fail to be executed if mandatory
parameters are not handled by listers.
Reated to swh/infra/sysadm-environment#5030.
requests_ratelimited fixture from swh-core was renamed to
github_requests_ratelimited.
remaining_requests parameter was added to the github_response_callback
function from swh-core, making it no longer compatible with requests_mock
callback for json responses.
Deploying the nixguix lister, I realized that even though the credentials configuration
is properly set for all listers, the listers actually requiring github origin
canonicalization do not have access to the github credentials. It's lost during the
constructor to only focus on the lister's credentials. Which currently translates to
listers being rate-limited.
This commit fixes it by pushing the self.github_session instantiation in the constructor
when the lister explicitely requires the github session. Hence lifting the rate limit
for maven, packagist, nixguix, and github listers.
Related to infra/sysadm-environment#4655
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.
The following template is used to generate it:
"Software Heritage <lister_name> lister v<swh-lister version>
(+https://www.softwareheritage.org/contact)"
In some circumstances, GitHub will return two separate repos with the
same html_url in the same page. This makes the lister fail with a
cardinality error.
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
This replaces the test data with some manually generated answers, which allows
us to test a few more cases for instantiating the lister.
This also expands test coverage to test behavior on rate-limited requests.
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.
The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
The following commit adapts the return statements from both lister and their
associated tasks. This standardizes on what other modules (e.g. both dvcs and
package loaders) do.
Since all the listing tasks accepts an url as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.
Also kill now useless `new_lister()` functions.
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:
- only create missing task-types (do not update them), but check that the
backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
plugin registry field will be checked and added if needed; tasks which name
start wit an underscore will not be added,
- added task-type will have:
- the 'type' field is derived from the task's function name (with underscores
replaced with dashes),
- the description field is the first line of that function's docstring,
- default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
a simple pattern matching to have decent default values for full/incremental
tasks),
- these default values can be overloaded via the 'task_type' plugin registry
entry.
For this, we had to rename all tasks names (eg. `cran_lister` -> `list_cran`).
Comes with some tests.
Listers are declared as plugins via the `swh.workers` entry_point.
As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:
- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optionnal): hook (callable) used to initialize the lister's state
(typically, create/initialize the database for this lister).
If not set, the default implementation creates database tables (after
optionally having deleted exisintg ones) according to models declared in
the `models` register field.
There is no need for explicitely add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).
Also refactor a bit the cli tools:
- add support for the standard --config-file option at the 'lister' group
level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
initializing (especially with --drop-tables) the database for a single
lister is unreliable, since all tables are created using a sibgle MetaData
(in the same namespace).