The latest tenacity release introduced internal changes that broke the
mocking of sleep calls in tests.
Fix it by mocking time.sleep directly (this did not work previously).
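A minimal sketch of mocking time.sleep directly in a test. It does not use
tenacity: `call_with_retries` is a hypothetical stand-in for any retrying
code that ends up calling time.sleep between attempts.

```python
import time
from unittest import mock


def call_with_retries(func, attempts=3, delay=1.0):
    """Hypothetical retry helper: call func, sleeping between failed attempts."""
    for attempt in range(attempts):
        try:
            return func()
        except RuntimeError:
            if attempt == attempts - 1:
                raise
            # Looked up on the time module at call time, so patching
            # "time.sleep" replaces it.
            time.sleep(delay)


def run_without_sleeping(func):
    """Run func through the retry helper with time.sleep mocked out."""
    with mock.patch("time.sleep") as fake_sleep:
        result = call_with_retries(func)
    return result, fake_sleep.call_count
```

The mock also records how many times a sleep was requested, so a test can
assert on the retry behaviour without actually waiting.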
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` property will always be set. So we need to ensure it is
not `None` before looking for the return code, for example.
This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better match this particular aspect. See:
https://github.com/python/typeshed/pull/10875
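A minimal sketch of the guard described above. The helper name
`status_code_of` is hypothetical, and the function is duck-typed on a
`response` attribute so that it runs without `requests` installed; in real
code the exception would be a `requests.HTTPError`.

```python
from typing import Optional


def status_code_of(exc: Exception) -> Optional[int]:
    """Return the HTTP status code carried by exc, or None if there is none."""
    # The response attribute may be missing or None, so check before using it.
    response = getattr(exc, "response", None)
    if response is None:
        return None
    return response.status_code
```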
This moves this rather elementary logic into the lister's scope. It will simplify
and unify the CLI calls between the lister and scheduler CLIs. It will also reduce
erroneous operations, which can happen for example in add-forge-now.
With this change, we only have to provide the type and the instance, and
everything will be scheduled properly.
Refs. swh/devel/swh-lister#4693
Some GitLab instances use specific namespaces for transient repositories
that it doesn't make sense to archive (for example, gitlab.org has a set
of QA namespaces used for integration testing of their production
deployments; drupal has an `issues/` namespace with forks of repos that
are only used for collaboration on merge requests, and aren't that
useful to be archived).
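A minimal sketch of how such namespaces could be skipped, assuming a
hypothetical list of ignored path prefixes and a hypothetical `is_ignored`
helper; neither name is taken from the lister itself.

```python
from typing import Iterable, List


def is_ignored(path_with_namespace: str, ignored_prefixes: Iterable[str]) -> bool:
    """Return True when the repository path starts with an ignored prefix."""
    return any(path_with_namespace.startswith(prefix) for prefix in ignored_prefixes)


def filter_repositories(paths: Iterable[str], ignored_prefixes: List[str]) -> List[str]:
    """Keep only the repository paths that are outside the ignored namespaces."""
    return [path for path in paths if not is_ignored(path, ignored_prefixes)]
```

Matching on a prefix that ends with `/` avoids accidentally excluding a
namespace that merely starts with the same characters.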
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.
The following template is used to generate it:
"Software Heritage <lister_name> lister v<swh-lister version>
(+https://www.softwareheritage.org/contact)"
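A minimal sketch of filling in that template. The names
`USER_AGENT_TEMPLATE` and `build_user_agent` are hypothetical, and the sketch
assumes the two lines of the template are joined by a single space.

```python
# Assumption: the template above is a single line once the wrap is removed.
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version} "
    "(+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name: str, version: str) -> str:
    """Return the header value for the given lister name and swh-lister version."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)
```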
Numerous listers were using the same page_request method or an equivalent
in their implementation, so deduplicate that code by adding an
http_request method to the base lister class: swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting. An HTTPError is
raised if a request ends up with an error.
All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
Instead of retrying HTTP requests only for the 429 status code by default,
use the generic retry policy, which also retries on status codes >= 500
and on ConnectionError exceptions.
Rename throttling_retry decorator to http_retry to reflect this change.
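A minimal sketch of such a retry predicate, not the actual http_retry
implementation. The name `is_retryable` is hypothetical, the exception is
duck-typed on a `response` attribute, and the builtin ConnectionError stands
in for the one defined by `requests` so that the sketch runs with the
standard library only.

```python
from typing import Tuple, Type


def is_retryable(
    exc: BaseException,
    connection_error_types: Tuple[Type[BaseException], ...] = (ConnectionError,),
) -> bool:
    """Return True when a failed request looks worth retrying."""
    # Connection-level failures carry no HTTP response at all.
    if isinstance(exc, connection_error_types):
        return True
    response = getattr(exc, "response", None)
    if response is None:
        return False
    status = response.status_code
    # 429 means rate limiting; 5xx covers server-side failures.
    return status == 429 or status >= 500
```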
The GitLab API can return 500 errors when listing projects
(see https://gitlab.com/gitlab-org/gitlab/-/issues/262629).
To avoid ending the listing prematurely, skip buggy URLs and move
on to the next pages.
Related to T3442
Increase the number of origins per page to the maximum value allowed
by the GitLab API (100) to send fewer requests.
Ask for simple responses to reduce the size of the JSON data.
Temporary server failures can happen when listing a GitLab instance;
HTTP status codes 502, 503 or 520 are returned in that case.
So adapt the lister's request retry policy to execute requests again when
such errors are encountered.
Related to T3442
Make the instance parameter of the base pattern lister optional and set
the lister name to the URL network location when it is not provided.
It simplifies lister creation when the associated forge type has a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.
Also process listers for forge with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.
Related to T3403
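A minimal sketch of that fallback, using urllib.parse from the standard
library. The helper name `lister_instance_name` and the example URL are
hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def lister_instance_name(url: str, instance: Optional[str] = None) -> str:
    """Return the explicit instance name, or the URL network location otherwise."""
    if instance is not None:
        return instance
    return urlparse(url).netloc
```

For `https://gitlab.example.org/api/v4` with no instance given, the name
falls back to `gitlab.example.org`.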
The previous (offset-based) pagination implementation ran into a hard-coded server-side limit [1].
[1]
```
{"error":"Offset pagination has a maximum allowed offset of 50000 for requests that return objects of type Project. Remaining records can be retrieved using keyset pagination."}
```
Related to T2994
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
Note that the current implementation will resume the new visit from the last
next_page link seen (that is what is stored in the lister state, to avoid recomputing
the url). This means that this page will be seen at least twice: on the first visit
and on the next one. This should not pose any problems as the listing is idempotent.
Related to T2987
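A minimal sketch of that resume behaviour, not the lister's actual code. The
`ListerState` dataclass, its `last_seen_next_link` field and the `iter_pages`
helper are hypothetical, and a plain dict mapping each page URL to its next
link stands in for the remote API.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, Optional


@dataclass
class ListerState:
    # Last next_page link seen during a visit (None before the first visit).
    last_seen_next_link: Optional[str] = None


def iter_pages(
    next_links: Dict[str, Optional[str]], first_url: str, state: ListerState
) -> Iterator[str]:
    """Yield page URLs, resuming from the link stored in the state if any."""
    url = state.last_seen_next_link or first_url
    while url is not None:
        yield url
        next_url = next_links[url]
        if next_url is not None:
            # Remember where to resume on the next visit.
            state.last_seen_next_link = next_url
        url = next_url
```

With three pages, a first full visit yields all three and stores the link to
the third; the next visit starts from that stored link, so the third page is
listed a second time.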
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.
The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
The following commit adapts the return statements of both the listers and their
associated tasks. This aligns them with what other modules (e.g. both the dvcs and
package loaders) do.