With or without retry (for a future version of swh.core).
This skips the origin when this sporadically happens. It should get picked up by another
listing eventually.
The listing currently fails to finish when the github server hangs up on the
process. Adding this behavior lets the lister skip the issue without breaking the listing.
The current lister implementation retrieves very little metadata with the hard-coded /p/
base url (404 on almost all packages). The packagist api implementation must have evolved
since the initial implementation of the lister (and the first deployment on staging).
Following the upstream documentation [1], it's sensible to first use the /p2/ url scheme
as it's performant from the packagist api side. The lister then falls back to the
/p2/+~dev url scheme, then the /p/ scheme and finally the /packages/ base url if previous
results are either not found or empty (which is different from no modification since the
last visit).
It keeps the initial implementation behavior of stopping immediately if a 304
NotModifiedSince is returned by the server.
[1] https://repo.packagist.org/apidoc
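The fallback order described above can be sketched as follows. This is only an
illustration, not the lister's actual code: candidate_urls, first_usable and the
fetch callable are hypothetical names, and the exact url templates are assumptions
derived from the scheme names given in the message.

```python
from typing import Callable, Iterator, Optional, Tuple

# Assumption for illustration: the base url and the ".json" templates below
# are not taken from the lister, only from the scheme names in the message.
BASE_URL = "https://repo.packagist.org"


def candidate_urls(package: str) -> Iterator[str]:
    """Yield metadata urls in the fallback order described in the message."""
    yield f"{BASE_URL}/p2/{package}.json"
    yield f"{BASE_URL}/p2/{package}~dev.json"
    yield f"{BASE_URL}/p/{package}.json"
    yield f"{BASE_URL}/packages/{package}.json"


def first_usable(
    package: str, fetch: Callable[[str], Tuple[int, Optional[dict]]]
) -> Optional[dict]:
    """Return the first non-empty metadata payload, or None.

    `fetch` is a hypothetical callable returning (http_status, json_payload).
    """
    for url in candidate_urls(package):
        status, payload = fetch(url)
        if status == 304:
            # Not modified since the last visit: stop immediately.
            return None
        if status == 200 and payload:
            return payload
        # Not found or empty: try the next url scheme.
    return None
```

The 304 case is kept separate from the "not found or empty" case so that an
unchanged package does not trigger requests on the remaining url schemes.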
Prior to this commit, the newly introduced check on url validity was consuming the
stream of origins. In effect, origin records were no longer written regularly.
For all listers, this meant origins were flushed only at the end of the listing,
which could take a while for some (e.g. the packagist lister has currently been running
for more than 12h without writing anything in the scheduler).
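One way to validate urls without draining the stream is to keep the check itself
lazy. The sketch below is an assumption about the general technique, not the
actual fix: is_valid_url, valid_origins and batches are hypothetical helpers, and
origins are reduced to plain url strings.

```python
from typing import Iterable, Iterator, List
from urllib.parse import urlparse


def is_valid_url(url: str) -> bool:
    """Very rough validity check (hypothetical): scheme and host are present."""
    parsed = urlparse(url)
    return bool(parsed.scheme and parsed.netloc)


def valid_origins(origins: Iterable[str]) -> Iterator[str]:
    """Filter origins one by one instead of materializing the whole stream."""
    for url in origins:
        if is_valid_url(url):
            yield url


def batches(origins: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group origins into batches that can be flushed as soon as they are full."""
    batch: List[str] = []
    for url in origins:
        batch.append(url)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

With this shape, the first batch is available as soon as enough valid origins have
been produced, so records can be written regularly during a long listing.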
This lister is very close to the cgit & gitweb implementations. However, the dom data is
again structured differently, so this implementation stands on its own.
Refs. swh/meta#5048
Gitiles instances deliberately return a malformed json output (json prefixed with
``)]}'\n``) [2]. The lister deals with it to properly parse the json response
nonetheless. It drops the prefix and then parses the json.
If at some point they drop this prefix to return json directly, the lister will be able
to deal with it too. There are 2 tests, one with the 'standard' gitiles format and another
with standard json, to account for both cases.
Refs. swh/meta#5045
[2] https://github.com/google/gitiles/issues/263
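The prefix handling described above can be sketched as follows. The function name
parse_gitiles_json is hypothetical and this is not necessarily how the lister
implements it; only the prefix value comes from the message.

```python
import json
from typing import Any

# Prefix that gitiles puts in front of its json output, as quoted in the message.
GITILES_PREFIX = ")]}'\n"


def parse_gitiles_json(text: str) -> Any:
    """Parse a gitiles response, with or without the leading prefix."""
    if text.startswith(GITILES_PREFIX):
        text = text[len(GITILES_PREFIX):]
    return json.loads(text)
```

Checking for the prefix before dropping it is what lets the same code accept plain
json, should the prefix be removed upstream one day.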
Depending on the instance, some specific heuristics apply. Some instances:
- have summary pages which do not list metadata_url (so some
computation happens to list git:// origins which are cloneable)
- have summary pages which reference metadata_url as multiple comma-separated urls
- list relative urls of the repository, so we need to join them with the main instance url
to have complete cloneable origins (or summary pages)
- list "down" http origins (cloning those won't work), so the lister lists those as
cloneable https ones (when the main url is behind https).
Refs. swh/devel/swh-lister#1800
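Three of the heuristics above (comma-separated urls, relative urls and http origins
behind an https instance) can be sketched together. The helper name
candidate_clone_urls and its signature are hypothetical; this is a simplified
illustration, not the lister's code.

```python
from typing import List
from urllib.parse import urljoin, urlparse


def candidate_clone_urls(instance_url: str, metadata_url: str) -> List[str]:
    """Turn a raw metadata_url value into a list of absolute clone urls."""
    instance_is_https = urlparse(instance_url).scheme == "https"
    urls = []
    # The value may hold several comma-separated urls.
    for raw in metadata_url.split(","):
        raw = raw.strip()
        if not raw:
            continue
        # Relative urls are resolved against the main instance url;
        # absolute urls are returned unchanged by urljoin.
        url = urljoin(instance_url, raw)
        # Assumption: when the instance itself is served over https,
        # plain http origins are rewritten to https.
        if instance_is_https and url.startswith("http://"):
            url = "https://" + url[len("http://"):]
        urls.append(url)
    return urls
```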
When relisting an opam instance and the opam root directory is already
populated, the '--set-default' parameter must be provided otherwise the
following error is reported:
No switch is currently set. Please use 'opam switch' to set or install a switch
Related to swh/infra/sysadm-environment#4971.
Use subprocess.run instead of subprocess.call and subprocess.Popen to
call opam commands, and set the check parameter to True in order to raise
a CalledProcessError exception when an opam command fails.
This should help spotting issues with the opam lister.
Related to swh/infra/sysadm-environment#4971.
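The effect of check=True can be seen with a small sketch. To stay runnable without
opam installed, it runs the current python interpreter as a stand-in for an opam
command; run_command is a hypothetical wrapper name.

```python
import subprocess
import sys
from typing import List


def run_command(args: List[str]) -> subprocess.CompletedProcess:
    """Run a command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(args, check=True, capture_output=True, text=True)


# Stand-in for a failing opam command: a process exiting with status 3.
failing_command = [sys.executable, "-c", "import sys; sys.exit(3)"]

returncode = None
try:
    run_command(failing_command)
except subprocess.CalledProcessError as exc:
    # With subprocess.call, this failure would only show up as a return
    # value that the caller has to remember to inspect.
    returncode = exc.returncode
```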
Unlike gitea listing, which does not require the q query parameter,
gogs listing requires it.
After reading the gogs source code, the /repos/search endpoint generates
a sql request of the form: "SELECT * FROM repos WHERE name LIKE '%{q}%'".
By setting the q parameter value to "_", which the LIKE clause treats as a
single-character wildcard, all repositories are ensured to be returned.
Fixes #4698.
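The wildcard behavior of "_" in a LIKE clause can be checked with a small sketch.
It uses an in-memory sqlite database purely for illustration (the database used by
a given gogs instance may differ); the table content and the search helper are
made up for the example.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repos (name TEXT)")
conn.executemany(
    "INSERT INTO repos VALUES (?)",
    [("alpha",), ("b",), ("my_repo",)],
)


def search(q: str) -> List[str]:
    """Mimic a query of the form: name LIKE '%{q}%'."""
    rows = conn.execute(
        "SELECT name FROM repos WHERE name LIKE ? ORDER BY name",
        (f"%{q}%",),
    )
    return [name for (name,) in rows]


# "_" matches any single character, so '%_%' matches every non-empty name,
# including names that do not contain a literal underscore.
all_repos = search("_")
```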
Pagure is a git-centered forge, python based using pygit2.
Its REST API makes it easy to list all projects hosted in an
instance so the lister implementation is quite simple.
Related to swh/meta#5043.
The default behavior of subprocess is to pull executables from a
hardcoded list, which doesn't work when opam is installed manually in
the user's home directory.
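One possible way to handle this, shown here as an assumption rather than as the
change actually made, is to resolve the executable against the current PATH with
shutil.which before handing it to subprocess. find_executable is a hypothetical
helper; the demonstration looks up the running python interpreter instead of opam
so that it works on any machine.

```python
import os
import shutil
import sys
from typing import Optional


def find_executable(name: str, path: Optional[str] = None) -> str:
    """Return the full path of an executable, searching PATH by default."""
    resolved = shutil.which(name, path=path)
    if resolved is None:
        searched = path if path is not None else os.environ.get("PATH", "")
        raise EnvironmentError(f"No {name} executable found in path {searched}")
    return resolved


# Demonstration: look up the running interpreter in its own directory.
interpreter_dir = os.path.dirname(sys.executable)
interpreter_name = os.path.basename(sys.executable)
found = find_executable(interpreter_name, path=interpreter_dir)
```

The resolved path can then be used as the first element of the argument list
passed to subprocess.run, which removes any dependency on a default search path.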
mypy doesn't catch that multiple uses of
`self.listed_origins[origin_url]` in the same statement should be identical.
Using a temporary local variable for it seems to help.
Prior to this, it was sending only 'directory' types for all vcs trees. Multiple
directory loaders now exist whose visit types currently diverge, so the scheduling
would not happen correctly without it. This commit is the required adaptation for the
scheduling to work appropriately.
Refs. swh/meta#4979
This pushes the rather elementary logic within the lister's scope. This will simplify
and unify cli calls between the lister and scheduler clis. This will also reduce
erroneous operations which can happen, for example, in the add-forge-now.
With the following, we will only have to provide the type and the instance, then
everything will be scheduled properly.
Refs. swh/devel/swh-lister#4693
Starting with the first url. As soon as one detection succeeds, this stops and yields
the result. Otherwise, continue with the detection on the next mirror url.
This should fix the current misbehavior [1] when multiple mirror urls are not ok but the
first one is.
[1] https://gitlab.softwareheritage.org/swh/infra/sysadm-environment/-/issues/4868#note_137483
Refs. swh/infra/sysadm-environment#4868
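The "stop at the first successful detection" behavior can be sketched as follows.
detect_first and the detect callable are hypothetical names, and what a detection
returns is left abstract since the message does not specify it.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

T = TypeVar("T")


def detect_first(
    urls: Iterable[str], detect: Callable[[str], Optional[T]]
) -> Optional[Tuple[str, T]]:
    """Try each mirror url in order and return the first successful detection.

    `detect` is a hypothetical callable returning None when detection fails.
    """
    for url in urls:
        result = detect(url)
        if result is not None:
            # First success: stop here, the remaining mirrors are not tried.
            return url, result
    return None
```

Since the loop returns as soon as a detection succeeds, broken mirrors listed after
a working first url are never contacted.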
Instead of fully consuming the get_origins_from_page generator into
a list and truncating it, prefer to consume the generator origin by
origin and abort the process when the max number of origins per page
is reached.
Indeed, some non-trivial listers like the cgit one can perform costly
processing, an HTTP request for instance, for each origin in a page.
So it is better not to consume the full generator in one go, to avoid such
side effects.
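The difference between truncating a list and aborting the consumption early can be
sketched as follows. limited_origins is a hypothetical helper and the per-origin
cost is simulated; itertools.islice would give the same laziness.

```python
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def limited_origins(origins: Iterable[T], max_origins: Optional[int]) -> Iterator[T]:
    """Yield origins one by one and stop once max_origins have been yielded."""
    if max_origins is not None and max_origins <= 0:
        return
    for count, origin in enumerate(origins, start=1):
        yield origin
        if max_origins is not None and count >= max_origins:
            # Abort here: the underlying generator is not advanced any further,
            # so no costly processing happens for the remaining origins.
            break
```

By contrast, list(origins)[:max_origins] would run the costly per-origin processing
for the whole page before discarding most of the results.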
This unifies it with the other lister task modules. This also allows the cgit task to
be scheduled by the add-forge-now scheduler cli.
Refs. swh/infra/sysadm-environment#4813
Some URLs of the repositories endpoint from the BitBucket REST API 2.0
can return an error 500. In that case, skip the buggy repositories
page and get the next one to continue listing and avoid ending it
prematurely.
Related to #4239
The requests_ratelimited fixture from swh-core was renamed to
github_requests_ratelimited.
A remaining_requests parameter was added to the github_response_callback
function from swh-core, making it no longer compatible with the requests_mock
callback for json responses.
In order to remove warnings about /apidoc/*.rst files being included
multiple times in toc when building full swh documentation, prefer to
include module indices only when building standalone package documentation.
Also include them the proper sphinx way.
Related to T4496