GitLab API can return errors 500 when listing projects
(see https://gitlab.com/gitlab-org/gitlab/-/issues/262629).
To avoid ending the listing prematurely, skip buggy URLs and move
to next pages.
Related to T3442
Increase number of origins per page to the maximum value allowed
by GitLab API (100) to send less requests.
Ask for simple responses to reduce size of JSON data.
Temporarily server failures can happen when listing a GitLab instance,
HTTP status codes 502, 503 or 520 are returned in that case.
So adapt lister requests retry policy to execute requests again when
such errors are encountered.
Related to T3442
Make the instance parameter of the base pattern lister optional and set
lister name to URL network location when not provided.
It simplifies lister creation when associated forge type have a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.
Also process listers for forge with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.
Related to T3403
The previous pagination implementation has a hard-coded limit server side [1]
[1]
```
{"error":"Offset pagination has a maximum allowed offset of 50000 for requests that return objects of type Project. Remaining records can be retrieved using keyset pagination."}
```
Related to T2994
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
Note that the current implementation will start back the new visit from the last
next_page link seen (that's what is stored in the lister state to avoid computing back
the url). This means that this page will be seen at least 2 times, on the first visit
and on the next. This should not pose any problems as the listing is idempotent.
Related to T2987
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.
The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
The following commit adapts the return statements from both lister and their
associated tasks. This standardizes on what other modules (e.g. both dvcs and
package loaders) do.
Since all the listing tasks accepts an url as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.
Also kill now useless `new_lister()` functions.
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:
- only create missing task-types (do not update them), but check that the
backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
plugin registry field will be checked and added if needed; tasks which name
start wit an underscore will not be added,
- added task-type will have:
- the 'type' field is derived from the task's function name (with underscores
replaced with dashes),
- the description field is the first line of that function's docstring,
- default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
a simple pattern matching to have decent default values for full/incremental
tasks),
- these default values can be overloaded via the 'task_type' plugin registry
entry.
For this, we had to rename all tasks names (eg. `cran_lister` -> `list_cran`).
Comes with some tests.
Listers are declared as plugins via the `swh.workers` entry_point.
As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:
- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optionnal): hook (callable) used to initialize the lister's state
(typically, create/initialize the database for this lister).
If not set, the default implementation creates database tables (after
optionally having deleted exisintg ones) according to models declared in
the `models` register field.
There is no need for explicitely add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).
Also refactor a bit the cli tools:
- add support for the standard --config-file option at the 'lister' group
level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
initializing (especially with --drop-tables) the database for a single
lister is unreliable, since all tables are created using a sibgle MetaData
(in the same namespace).
instead of converting that column as a string
As a side effect, bitbucket wise, we provided improperly the after query
parameter as a date not url encoded. This resulted in improper api response from
bitbucket's (we received from time to time the same next index as the current
one).
Related T1826
If nothing has been done prior to a full relisting, there is actually nothing
to list. So the relister in question does nothing.
In that context, the IndexingLister class's `db_partition_indices` method now
returns an empty list instead of raising a ValueError when there is nothing to
list.
Related T1826
Related e129e48
This should have been removed along with the code in b816212.
The request authentication has been reworked so that all listers use the same
credentials dict.
Related b816212
Related T1772
using the host of the given url.
This allows to create a lister task by simply specify the API base url
and prevent 'inconsistent by default' behavior, eg. with:
swh-scheduler task add swh-lister-gitlab-full \
api_baseurl=https://0xacab.org/api/v4
the created task does not use 'gitlab' as instance name (but '0xacab.org'
here).
It's still possible to explicitely specify the instance name if needed.