Some Guix packages correspond to subset exports of a Subversion source
tree at a given revision, typically the TeX Live ones.
In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export, but also use a unique origin URL
for each package to archive; otherwise the same URL would be shared by
all of them and only a single package would be archived.
Related to swh/infra/sysadm-environment#5263.
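One way to give each package a distinct origin URL is to encode the exported sub-paths into the URL itself. The sketch below only illustrates the idea; the function name and the query-string scheme are hypothetical, not the actual loader API:

```python
from urllib.parse import quote

def unique_svn_origin_url(base_url, sub_paths):
    """Derive a distinct origin URL per package by appending the sorted
    exported sub-paths as a query string (illustrative scheme only)."""
    if not sub_paths:
        return base_url
    return f"{base_url}?{quote(','.join(sorted(sub_paths)))}"
```

With this, two packages exported from the same repository and revision, but with different sub-paths, get different origin URLs and are archived separately.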
In addition to query parameters, also check whether any part of the URL
path contains a tarball filename.
This fixes the detection of some tarball URLs provided in the Guix manifest.
Related to swh/meta#3781.
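The check can be sketched as follows, assuming a fixed set of tarball extensions (the real detection covers more of them):

```python
from urllib.parse import urlparse

# A small, non-exhaustive sample of tarball extensions for illustration.
TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")

def path_contains_tarball(url):
    """Return True if any component of the URL path looks like a
    tarball filename, not just the last one."""
    parts = urlparse(url).path.split("/")
    return any(part.endswith(TARBALL_EXTENSIONS) for part in parts)
```

This catches URLs such as `https://example.org/foo-1.0.tar.gz/download`, where the tarball name is a middle path component.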
Guix now provides a "submodule" info in the sources.json file it
produces, so use it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
required.
Related to swh/devel/swh-loader-git#4751.
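The mapping can be sketched as below; the field names (`git_url`, `git_ref`, `submodule`) are assumptions about the sources.json schema, not a confirmed spec:

```python
def git_checkout_loader_kwargs(source):
    """Map a sources.json entry to git-checkout loader parameters,
    forwarding the 'submodule' flag as 'submodules' (hypothetical
    field names, for illustration only)."""
    return {
        "url": source["git_url"],
        "ref": source.get("git_ref"),
        # Only ask the loader to fetch submodules when the manifest says so.
        "submodules": bool(source.get("submodule", False)),
    }
```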
Prior to this, only the 'directory' visit type was sent for all VCS trees. Multiple
directory loaders now exist and their visit types currently diverge, so scheduling
would not happen correctly without this change. This commit is the adaptation required
for the scheduling to work appropriately.
Refs. swh/meta#4979
Detection starts with the first URL. As soon as one detection succeeds, it stops and
yields the result; otherwise, detection continues with the next mirror URL.
This should fix the current misbehavior [1] seen when multiple mirror URLs are not ok
but the first one is.
[1] https://gitlab.softwareheritage.org/swh/infra/sysadm-environment/-/issues/4868#note_137483
Refs. swh/infra/sysadm-environment#4868
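The iteration logic amounts to a short-circuiting loop over the mirrors; a minimal sketch (the `detect` callable stands in for the actual artifact detection):

```python
def detect_from_mirrors(urls, detect):
    """Run detection on each mirror URL in order and return the first
    successful (non-None) result; return None if every mirror fails."""
    for url in urls:
        result = detect(url)
        if result is not None:
            return result
    return None
```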
While deploying the nixguix lister, I realized that even though the credentials
configuration is properly set for all listers, the listers actually requiring GitHub
origin canonicalization do not have access to the GitHub credentials: they are dropped
in the constructor, which only keeps the lister's own credentials. This currently
translates to those listers being rate-limited.
This commit fixes it by moving the self.github_session instantiation into the
constructor when the lister explicitly requires the GitHub session, hence lifting the
rate limit for the maven, packagist, nixguix, and github listers.
Related to infra/sysadm-environment#4655
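The shape of the fix can be sketched as an opt-in flag on the base constructor; class and attribute names below are simplified stand-ins for the actual lister classes, and the session is a placeholder rather than a real authenticated client:

```python
class BaseLister:
    """Simplified base lister: builds the GitHub session in the
    constructor itself, so GitHub credentials are no longer lost."""

    GITHUB_SESSION_REQUIRED = False  # subclasses opt in

    def __init__(self, credentials=None):
        self.credentials = credentials or {}
        # Instantiated here, before anything can drop the credentials,
        # so canonicalization runs with authenticated rate limits.
        self.github_session = (
            self._make_github_session() if self.GITHUB_SESSION_REQUIRED else None
        )

    def _make_github_session(self):
        # Placeholder for an authenticated GitHub session object.
        return {"authenticated": True}

class NixGuixLister(BaseLister):
    GITHUB_SESSION_REQUIRED = True  # needs origin canonicalization
```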
Prior to this, some URLs were detected as files because their version names were
wrongly detected as extensions, hence not matching tarball extensions.
Related to T3781
For now, those can be faulty, as the manifest is missing 'critical' information about
how to recompute the hash (e.g. filesystem layout, executable bit, ...).
Related to T4608
Related to T3781
In that case, this falls back to using the "outputHash" field, which is equivalent to
the integrity one except that it applies to the "recursive" outputHashMode. This adds
the necessary assertions around this case so correct data is sent to loaders as well.
Related to T3781
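The fallback plus assertion can be sketched as below; the function name is illustrative, while `integrity`, `outputHash`, and `outputHashMode` are the manifest field names mentioned above:

```python
def pick_checksum(source):
    """Return (field, value) for the checksum to send to the loader:
    prefer 'integrity', fall back to 'outputHash' only in the
    'recursive' outputHashMode case (sketch of the fallback logic)."""
    if source.get("integrity"):
        return "integrity", source["integrity"]
    # Guard the fallback so loaders never receive a hash whose
    # computation mode they cannot interpret.
    assert source.get("outputHashMode") == "recursive", (
        "outputHash fallback is only valid for recursive outputHashMode"
    )
    return "outputHash", source["outputHash"]
```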
This actually includes all query parameter values as candidate paths to check. When a
path has an extension, it is pattern-matched against tarball extensions. When no
extension is detected, the behavior is as before: fall back to a HEAD query on the URL
to gather more information about the file.
Prior to this commit, only a hard-coded list of values (for the hard-coded keys: file,
f, name, url) detected through Docker runs was inspected. The new approach should
decrease future misdetections when new, unknown "keys" show up in the wild.
Related to T3781
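Collecting every query parameter value as a candidate path can be sketched with the standard library; the extension list here is a small illustrative sample, and the HEAD fallback is omitted:

```python
from urllib.parse import urlparse, parse_qs

TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")

def candidate_paths(url):
    """Collect the URL path plus *all* query parameter values as
    candidate paths, instead of a hard-coded list of keys."""
    parsed = urlparse(url)
    paths = [parsed.path]
    for values in parse_qs(parsed.query).values():
        paths.extend(values)
    return paths

def looks_like_tarball(url):
    """Pattern-match every candidate path against tarball extensions."""
    return any(p.endswith(TARBALL_EXTENSIONS) for p in candidate_paths(url))
```

Because `parse_qs` iterates over every key, a manifest URL like `?download=foo-1.0.tar.gz` is caught even though `download` was not in the old hard-coded key list.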
Without this, some Git repositories are detected as files (due to upstream
misqualification too). This makes some extra effort to detect those to avoid sending
noise to loaders.
This also refactors some common code to build vcs artifacts to avoid duplication.
Related to T3781
Without this, some tarballs hidden within query parameters are not detected. This
makes some extra effort to detect those to avoid sending noise to loaders.
Related to T3781
Without this distinction, the current directory or content loaders would fail the
download, as they expect the checksums to be about the tarball. When a recursive
"integrity" is provided, it is actually about the uncompressed tarball, as per the
nix-store computation.
It's detailed within the code.
Related to T3294
Related to T3781
Some origins are listed as plain URLs while they are not; they are possibly VCS
repositories. So this commit tries to detect and deal with those when possible. When
that is not possible, they are skipped.
Related to T3781
Related to P1470
The end goal is to ingest the origins sparsely, which would avoid hitting the various
servers around the same time for origins colocated in the upstream manifest (especially
file or tarball ones).
Related to T3781
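One possible "sparse" ordering is a round-robin over hosts, so consecutive requests rarely target the same server; this is only a sketch of the idea, not the actual scheduling code:

```python
from collections import defaultdict, deque
from urllib.parse import urlparse

def interleave_by_host(origins):
    """Reorder origin URLs by round-robin over their hosts, so that
    consecutive origins rarely share a server (illustrative only)."""
    buckets = defaultdict(deque)
    for url in origins:
        buckets[urlparse(url).netloc].append(url)
    result = []
    queues = deque(buckets.values())
    while queues:
        q = queues.popleft()
        result.append(q.popleft())
        if q:  # host still has origins pending: requeue at the back
            queues.append(q)
    return result
```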