Instead of using an undocumented rubygems HTTP endpoint that only
gives us the names of the gems, prefer to exploit the daily PostgreSQL
dump of the rubygems.org database.
It makes it possible to list all gems but also all versions of a gem and
their release artifacts. For each release artifact, the following info is
extracted: version, download URL, sha256 checksum, release date
plus a couple of extra metadata.
The lister will now set the list of artifacts and the list of metadata as
extra loader arguments when sending a listed origin to the scheduler database.
A last_update date is also computed, which should ensure loading tasks
for rubygems are scheduled only when new releases are available since
the last loading.
Note that the lister will spawn a temporary postgres instance, so this
requires the initdb executable from a postgres server installation to be
available in the execution environment.
Related to T1777
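The initdb requirement above can be checked before starting a listing.
This is a minimal sketch, not the lister's own code: find_initdb is a
hypothetical helper, and it only looks at PATH, whereas some distributions
install initdb in a versioned directory outside PATH.

```python
import shutil
from typing import Optional


def find_initdb(executable: str = "initdb") -> Optional[str]:
    """Return the absolute path of the executable, or None if not on PATH.

    Hypothetical helper for illustration only.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_initdb()
    if path is None:
        print("initdb not found on PATH; a temporary postgres instance "
              "cannot be spawned")
    else:
        print(f"initdb found at {path}")
```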
For now, those can be faulty as the manifest is missing 'critical' information about how
to recompute the hash (e.g. fs layout, executable bit, ...).
Related to T4608
Related to T3781
In order to reduce the number of HTTP API calls made by the loader, download
a crates.io database dump and parse its CSV files to get a last_update
value for each version of a crate.
Those values are sent to the loader through extra_loader_arguments
'crates_metadata'.
'artifacts' and 'crates_metadata' now use "version" as key.
Related to T4104, D8171
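The CSV parsing described above can be sketched as follows. The sample data
and its column names (crate_name, num, updated_at) are a simplified,
hypothetical layout and not the actual dump schema; last_update_by_version is
likewise a made-up name. The result is keyed by version, as stated above.

```python
import csv
import io
from collections import defaultdict

# Hypothetical, simplified excerpt; the real dump layout may differ.
SAMPLE_VERSIONS_CSV = """crate_name,num,updated_at
rand,0.1.1,2015-02-03T06:17:14.147783
rand,0.1.2,2015-02-03T11:15:19.001762
regex,0.1.0,2014-12-13T22:10:11.303311
"""


def last_update_by_version(csv_text: str) -> dict:
    """Map each crate name to {version: {"last_update": timestamp}}."""
    result = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Use the version string as key so a loader can look it up directly.
        result[row["crate_name"]][row["num"]] = {
            "last_update": row["updated_at"],
        }
    return dict(result)


if __name__ == "__main__":
    for crate, versions in last_update_by_version(SAMPLE_VERSIONS_CSV).items():
        print(crate, versions)
```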
In that case, this falls back to using the "outputHash", which is a field equivalent to
the integrity one except it's for the "recursive" outputHashMode. This adds the necessary
assertions around this case so correct data is sent to loaders as well.
Related to T3781
This actually includes all query param values as paths to check. When paths have
extensions, it then pattern matches against tarballs if any. When no extension is
detected, it does as before and falls back to a HEAD query on the url to have more
information on the file.
Prior to this commit, this only looked over a hard-coded list of values (for hard-coded
keys: file, f, name, url) detected through docker runs. This way of doing it should
decrease future misdetections (when new unknown "keys" show up in the wild).
Related to T3781
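The idea of checking every query param value as a path can be sketched as
below. This is an illustration under assumptions, not the lister's code: the
function names, the returned labels and the extension list are hypothetical,
and the "unknown" case stands for the point where a HEAD query would be made.

```python
from pathlib import PurePosixPath
from urllib.parse import parse_qsl, urlparse

# Hypothetical, non-exhaustive list of tarball extensions.
TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".tar", ".zip")


def candidate_paths(url: str) -> list:
    """Return the url path followed by every query param value."""
    parsed = urlparse(url)
    paths = [parsed.path]
    paths.extend(value for _key, value in parse_qsl(parsed.query))
    return paths


def classify(url: str) -> str:
    """Return "tarball", "file" or "unknown" (no extension detected)."""
    has_extension = False
    for path in candidate_paths(url):
        lowered = path.lower()
        if lowered.endswith(TARBALL_EXTENSIONS):
            return "tarball"
        if PurePosixPath(lowered).suffix:
            has_extension = True
    # "unknown" is where a HEAD query on the url would be needed.
    return "file" if has_extension else "unknown"


if __name__ == "__main__":
    print(classify("https://example.org/download.php?file=foo-1.0.tar.gz"))
    print(classify("https://example.org/get?id=42"))
```

Since all param values are inspected regardless of their key, a tarball
passed through a previously unseen key is still matched.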
Without this, some git repositories are detected as files (due to upstream
misqualification too). This does some extra effort to detect those to avoid sending
noise to loaders.
This also refactors some common code to build vcs artifacts to avoid duplication.
Related to T3781
Without this, some tarballs hidden within query parameters are not detected. This does
some extra effort to detect those to avoid sending noise to loaders.
Related to T3781
Without this distinction the current directory or content loader will fail the download
as they currently expect the checksums to be about the tarball. When a recursive
"integrity" is provided, it's actually about the uncompressed tarball as per the
nix-store computation.
It's detailed within the code.
Related to T3294
Related to T3781
Some origins are listed as urls while they are not. They are possibly vcs. So this
commit tries to detect and deal with those if possible. If not possible, they are
skipped.
Related to T3781
Related to P1470
The end goal is to ingest the origins sparsely, which would avoid hitting the various
servers around the same time for colocated origins in the upstream manifest (especially
file or tarball).
Related to T3781
In listers collecting artifacts for each package to load, add artifacts
checksums, when that info is available, in parameters sent to loaders
in order to check downloaded artifact integrity.
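A loader receiving such checksums could verify a downloaded artifact along
these lines. This is a hedged sketch: check_integrity and the shape of the
checksums mapping (algorithm name to hex digest) are assumptions made for
illustration, not the actual loader interface.

```python
import hashlib


def check_integrity(data: bytes, checksums: dict) -> bool:
    """Return True if data matches every expected hex digest.

    checksums maps a hashlib algorithm name to a hex digest (assumed shape).
    """
    for algo, expected in checksums.items():
        if hashlib.new(algo, data).hexdigest() != expected.lower():
            return False
    return True


if __name__ == "__main__":
    artifact = b"example artifact content"
    expected = {"sha256": hashlib.sha256(artifact).hexdigest()}
    print(check_integrity(artifact, expected))
    print(check_integrity(b"tampered content", expected))
```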
Prefer to execute the lister through a celery task as it also makes it
possible to catch possible issues with the task implementation.
Also use docker compose v2 commands.
Previously, the run method was returning the total count of ListedOrigin
objects sent to scheduler database.
However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).
This change ensures an accurate count of listed origins by maintaining
a set of origin URLs associated with the sent ListedOrigin objects.
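The counting approach can be illustrated with a small sketch. The
ListedOrigin class below is a hypothetical stand-in with only two fields, and
count_unique_origins is a made-up name; the real run method is not shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ListedOrigin:
    """Hypothetical stand-in for the scheduler's ListedOrigin model."""
    url: str
    visit_type: str


def count_unique_origins(pages) -> int:
    """Count distinct origin URLs across all pages of listed origins."""
    seen_urls = set()
    for page in pages:
        for origin in page:
            # The same URL may show up in several pages; count it once.
            seen_urls.add(origin.url)
    return len(seen_urls)


if __name__ == "__main__":
    pages = [
        [ListedOrigin("https://example.org/a", "git"),
         ListedOrigin("https://example.org/b", "git")],
        [ListedOrigin("https://example.org/a", "git")],
    ]
    print(count_unique_origins(pages))
```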
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.
The following template is used to generate it:
"Software Heritage <lister_name> lister v<swh-lister version>
(+https://www.softwareheritage.org/contact)"
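As an illustration, the template above could be filled in as follows.
build_user_agent is a hypothetical helper, the lister name and version are
example values, and a single space is assumed where the template wraps onto
a second line.

```python
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version} "
    "(+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name: str, version: str) -> str:
    """Fill the User-Agent template (hypothetical helper)."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)


if __name__ == "__main__":
    # Example values only; not an actual swh-lister release number.
    print(build_user_agent("pubdev", "1.2.3"))
```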
Numerous listers were using the same page_request method or equivalent
in their implementation so prefer to deduplicate that code by adding
an http_request method in base lister class: swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting. An HTTPError
will also be raised if a request ends up with an error.
All listers using that new method now benefit from request retries when
an HTTP error occurs thanks to the use of the http_retry decorator.
Instead of retrying HTTP requests only for the 429 status code by default,
prefer to use the generic retry policy, which also retries on status
codes >= 500 and on ConnectionError exceptions.
Rename throttling_retry decorator to http_retry to reflect this change.
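The retry policy described above can be sketched with the standard library
only. This is not the http_retry implementation: the HTTPError class is a
hypothetical stand-in for the one from requests, the builtin ConnectionError
stands in for the requests exception of the same name, and
http_retry_sketch with its parameters is made up for illustration.

```python
import functools
import time


class HTTPError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(exc: BaseException) -> bool:
    """Retry on connection errors, 429 and any status code >= 500."""
    if isinstance(exc, ConnectionError):
        return True
    if isinstance(exc, HTTPError):
        return exc.status_code == 429 or exc.status_code >= 500
    return False


def http_retry_sketch(max_attempts: int = 3, delay: float = 0.0):
    """Decorator retrying the wrapped call while the error is retryable."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-retryable errors.
                    if attempt == max_attempts or not is_retryable(exc):
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```

A fixed delay is used here for brevity; a real policy would more likely
back off between attempts.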
For some forges, the default tab for a repository detail is not the
summary tab so the clone urls are not detected and the repository
is ignored.
Related to T4544
In order to get a last_update for each ListedOrigin sent to scheduler
database, send an extra HTTP request for each listed package to the
/api/packages/<package_name> endpoint of pub.dev API.
A pub.dev developer informed us that this endpoint is heavily used and cached
so there is no particular issue in periodically querying it for each package
in a row.
Simplify code for downloading the packages index as gzip and deflate
transfer-encodings are automatically decoded by requests. Also
do not stream the response, as it is only a couple of megabytes, and
store HTTP responses in memory.
Also add more debug logs to track lister execution.