Some GitLab instances use specific namespaces for transient repositories
which it makes no sense to archive (for example, gitlab.org has a set
of QA namespaces used for integration testing of their production
deployments; drupal has an `issues/` namespace with forks of repos that
are only used for collaboration on merge requests, and are not worth
archiving).
This removes one more manual step from the add forge now validation
process: we can add the relevant origins to the staging scheduler
without enabling them at all.
This will allow more automation of the staging add forge now process:
for known-good listers, we can limit the number of origins being
processed and reduce the number of manual steps taken for each instance.
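A minimal sketch of what this enables, assuming the scheduler's
ListedOrigin model exposes an `enabled` flag (my reading of the model;
import path and field names may differ, values are hypothetical):

```python
from uuid import uuid4

from swh.scheduler.model import ListedOrigin  # assumed import path

# Record an origin in the scheduler without enabling it for loading
origin = ListedOrigin(
    lister_id=uuid4(),
    url="https://gitlab.example.org/qa/transient-repo",
    visit_type="git",
    enabled=False,
)
```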
The SQL dump contains ownership instructions that can't be run if you
don't have the right users in your database clusters. When someone has a
psqlrc with ON_ERROR_STOP, this fails the load of the dump.
Use the opportunity to raise an exception when psql returns a non-zero
exit code, rather than continuing with an empty/inconsistent database.
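A rough sketch of the intended behavior (function name hypothetical,
command-line details assumed):

```python
import subprocess

def load_sql_dump(db_name: str, dump_path: str) -> None:
    # --no-psqlrc keeps a user's psqlrc (e.g. ON_ERROR_STOP) from
    # interfering with the load; check=True raises CalledProcessError on
    # a non-zero exit code instead of silently continuing with an
    # empty/inconsistent database.
    subprocess.run(
        ["psql", "--no-psqlrc", "-d", db_name, "-f", dump_path],
        check=True,
    )
```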
In a similar way to the debian lister, use the following scheme in the
packages dictionary provided to the generic rpm loader:
- dict keys are package versions prefixed by the fedora release and
  edition in which they were found (fedora{release}/{edition}/{version});
  they will be used as branch names targeting releases in the snapshot
  created by the rpm loader
- version fields in dict values are the packages' intrinsic versions
  parsed from the package repository metadata, excluding any ".fcXY"
  suffix so the loader does not create multiple releases targeting the
  same directory; they will be used as release names in the snapshot
  created by the rpm loader (see the sketch below)
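For illustration, a hypothetical entry for a package found in Fedora
35's Everything edition might look like:

```python
packages = {
    # key fedora{release}/{edition}/{version}: used as a snapshot branch name
    "fedora35/Everything/4.18.0-2": {
        # intrinsic version with any ".fcXY" suffix stripped,
        # used as a release name in the snapshot
        "version": "4.18.0-2",
        # other artifact fields (url, checksums, ...) omitted
    },
}
```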
Related to T4448
While deploying the nixguix lister, I realized that even though the credentials
configuration is properly set for all listers, the listers actually requiring
github origin canonicalization do not have access to the github credentials:
they are dropped in the constructor, which only keeps the lister's own
credentials. This currently translates into those listers being rate-limited.
This commit fixes it by moving the self.github_session instantiation into the
constructor when the lister explicitly requires the github session, hence
lifting the rate limit for the maven, packagist, nixguix, and github listers.
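A simplified sketch of the fix (class shape, credentials layout, and
the GitHubSession import path are assumptions, not the exact swh-lister
code):

```python
from swh.core.github.utils import GitHubSession  # assumed import path

class StatelessLister:
    LISTER_NAME = "nixguix"  # example value

    def __init__(self, credentials: dict, with_github_session: bool = False):
        # Build the GitHub session from the *full* credentials mapping
        # before narrowing it down to this lister's own entries, so the
        # github credentials are no longer lost.
        self.github_session = (
            GitHubSession(
                user_agent="swh-lister",  # hypothetical value
                credentials=credentials.get("github", {}).get("github"),
            )
            if with_github_session
            else None
        )
        self.credentials = credentials.get(self.LISTER_NAME, {})
```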
Related to infra/sysadm-environment#4655
Prior to this, some urls were detected as files because their version suffix
was wrongly treated as a file extension, which then failed to match the known
tarball extensions.
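An illustrative reproduction (URL hypothetical; pathlib shown as one way
this kind of suffix extraction goes wrong):

```python
from pathlib import Path

# ".0" is picked up as the file extension, so the URL is classified as a
# plain file even though it matches no tarball extension
Path("https://example.org/project-0.1.0").suffix  # -> ".0"
```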
Related to T3781
swh-scheduler will deduplicate listed origins according to their URL
and visit type but not according to their extra loader arguments.
Previously, listed origins were yielded after each processed artifact
in a page, so some package version info could be lost to the
deduplication process.
Ensure listed origins are yielded only once all artifacts in a page
have been processed.
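A minimal sketch of the new behavior (ListedOrigin reduced to a local
stand-in dataclass; field names and the visit type are simplified):

```python
from dataclasses import dataclass, field

@dataclass
class ListedOrigin:  # stand-in for swh.scheduler.model.ListedOrigin
    url: str
    visit_type: str
    extra_loader_arguments: dict = field(default_factory=dict)

def get_origins_from_page(page):
    # accumulate every artifact (package version) per origin URL first
    artifacts_by_url: dict = {}
    for artifact in page:
        artifacts_by_url.setdefault(artifact["url"], []).append(artifact)
    # yield one origin per URL only once the whole page has been
    # processed, so the scheduler's URL + visit-type deduplication
    # cannot drop version info
    for url, artifacts in artifacts_by_url.items():
        yield ListedOrigin(
            url=url,
            visit_type="example",  # hypothetical visit type
            extra_loader_arguments={"artifacts": artifacts},
        )
```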
Prior to this commit, the lister assumed authentication was required. There
exist public gogs instances which do not require it.
This also updates the documentation to mention the usual api location. This is
useful when people want to trigger a listing as a pre-flight check.
This also drops a repetitive instruction in the gitea lister.
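A sketch of the relaxed requirement (helper name hypothetical):

```python
from typing import Optional

def auth_headers(api_token: Optional[str]) -> dict:
    # Only send an Authorization header when an API token is configured;
    # public gogs instances can be listed without one.
    return {"Authorization": f"token {api_token}"} if api_token else {}
```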
Co-authored with Antoine Lambert (@anlambert) <anlambert@softwareheritage.org>.
Related to infra/sysadm-environment#4644
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.
Also change flake8's repo config to github (the gitlab mirror
being outdated).
The CPAN API can return versions that are not of str type: either
int or float.
When a version equals 0, it means that CPAN failed to parse it, so we
try to extract it from the release name in that case.
Otherwise we ensure the version is converted to str type.
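A minimal sketch of the normalization (helper name hypothetical):

```python
def normalize_cpan_version(version, release_name: str) -> str:
    # the CPAN API may hand back an int or float instead of a str
    if version == 0:
        # version CPAN failed to parse: recover it from the release
        # name, e.g. "Some-Module-1.23" -> "1.23"
        return release_name.rsplit("-", 1)[-1]
    return str(version)
```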
Related to T2833
Instead of querying the metacpan distribution endpoint to list origins,
prefer the release endpoint, which enables listing all artifacts
associated with CPAN packages by scrolling through results.
Compared to the previous implementation, it enables computing a
last_update date for all CPAN packages, but also obtaining artifact
sha256 checksums that will be used by the CPAN loader to check
download integrity.
As the multiple versions of a module are spread across multiple pages
of the CPAN API, origins are sent to the scheduler once all pages have
been processed; it is also faster to proceed that way.
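A rough sketch of scrolling the (Elasticsearch-backed) release endpoint;
the URL, parameters, and field names here are assumptions, not
necessarily the exact ones the lister uses:

```python
import requests

def process(release: dict) -> None:
    # hypothetical per-artifact processing
    print(release.get("name"), release.get("version"))

session = requests.Session()
base_url = "https://fastapi.metacpan.org/v1"

# open a scroll context on the release endpoint and fetch the first page
resp = session.post(
    f"{base_url}/release/_search",
    params={"scroll": "1m", "size": 1000},
    json={"_source": ["name", "version", "checksum_sha256", "date",
                      "download_url"]},
).json()

while hits := resp["hits"]["hits"]:
    for hit in hits:
        process(hit["_source"])
    # fetch the next page of the scroll context
    resp = session.post(
        f"{base_url}/_search/scroll",
        json={"scroll": "1m", "scroll_id": resp["_scroll_id"]},
    ).json()
```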
Related to T2833
Instead of using an undocumented rubygems HTTP endpoint that only
gives us the names of the gems, prefer to exploit the daily PostgreSQL
dump of the rubygems.org database.
It enables listing all gems, but also all versions of a gem and its
release artifacts. For each release artifact, the following info is
extracted: version, download URL, sha256 checksum and release date,
plus a couple of extra metadata fields.
The lister will now set the list of artifacts and the list of metadata
as extra loader arguments when sending a listed origin to the scheduler
database. A last_update date is also computed, which should ensure
loading tasks for rubygems are scheduled only when new releases are
available since the last loading.
To be noted, the lister spawns a temporary postgres instance, so this
requires the initdb executable from a postgres server installation to
be available in the execution environment.
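A rough sketch of the temporary instance (flags, port handling, and the
dump path are assumptions):

```python
import subprocess
from tempfile import TemporaryDirectory

with TemporaryDirectory(prefix="rubygems-") as datadir:
    # initialize a throwaway cluster; needs initdb from a server install
    subprocess.run(["initdb", "-D", datadir], check=True)
    # listen only on a unix socket inside the temporary directory
    subprocess.run(
        ["pg_ctl", "-D", datadir,
         "-o", f"-k {datadir} -c listen_addresses=", "start"],
        check=True,
    )
    try:
        # load the daily rubygems.org dump (path hypothetical), then
        # query gems/versions/artifacts from the resulting database
        subprocess.run(
            ["psql", "-h", datadir, "-d", "postgres",
             "-f", "rubygems_dump.sql"],
            check=True,
        )
    finally:
        subprocess.run(["pg_ctl", "-D", datadir, "stop"], check=True)
```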
Related to T1777
For now, those can be faulty as the manifest is missing 'critical' information about how
to recompute the hash (e.g. fs layout, executable bit, ...).
Related to T4608
Related to T3781
In order to reduce the number of HTTP API calls made by the loader,
download a crates.io database dump and parse its csv files to get a
last_update value for each version of a crate.
Those values are sent to the loader through the 'crates_metadata'
extra loader argument.
'artifacts' and 'crates_metadata' now use "version" as key.
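A sketch of the dump parsing (file and column names are my reading of
the crates.io dump layout, not verified against the lister code):

```python
import csv

# map crate id -> {version -> last_update} from the dump's versions.csv
last_updates: dict = {}
with open("versions.csv", newline="") as f:
    for row in csv.DictReader(f):
        versions = last_updates.setdefault(row["crate_id"], {})
        versions[row["num"]] = row["updated_at"]
```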
Related to T4104, D8171
In that case, this falls back to using the "outputHash", which is a field
equivalent to the integrity one except that it is for the "recursive"
outputHashMode. This adds the necessary assertions around this case so
correct data is sent to loaders as well.
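A sketch of the fallback, using the Nix derivation field names mentioned
above (helper name and surrounding structure hypothetical):

```python
def checksums_from(artifact: dict) -> dict:
    integrity = artifact.get("integrity")
    if integrity is None:
        # fall back to outputHash, which plays the same role for the
        # "recursive" outputHashMode
        assert artifact.get("outputHashMode") == "recursive", artifact
        integrity = artifact["outputHash"]
    assert integrity is not None
    return {"integrity": integrity}
```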
Related to T3781
This actually includes all query param values as paths to check. When those
paths have extensions, they are pattern matched against known tarball
extensions. When no extension is detected, it behaves as before and falls
back to a HEAD query on the url to gather more information on the file.
Prior to this commit, this only looked at a hard-coded list of values (for
the hard-coded keys file, f, name and url) detected through docker runs.
This new approach should decrease future misdetections (when new unknown
"keys" show up in the wild).
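A sketch of the broadened detection (extension list truncated; helper
name hypothetical):

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse

TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".zip")  # subset

def tarball_guess(url: str):
    parsed = urlparse(url)
    # every query parameter value is now a candidate path, not just the
    # values of the hard-coded keys file, f, name and url
    candidates = [parsed.path] + [
        value for values in parse_qs(parsed.query).values()
        for value in values
    ]
    with_extension = [c for c in candidates if Path(c).suffix]
    if with_extension:
        return any(c.endswith(TARBALL_EXTENSIONS) for c in with_extension)
    return None  # unknown: fall back to a HEAD query on the url, as before
```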
Related to T3781
Without this, some git repositories are detected as files (also due to
upstream misqualification). This makes some extra effort to detect those,
to avoid sending noise to loaders.
This also refactors some common code to build vcs artifacts to avoid duplication.
Related to T3781
Without this, some tarballs hidden within query parameters are not detected.
This makes some extra effort to detect those, to avoid sending noise to
loaders.
Related to T3781
Without this distinction, the current directory or content loaders will fail
the download, as they currently expect the checksums to apply to the tarball.
When a recursive "integrity" is provided, it actually applies to the
uncompressed tarball, as per the nix-store computation.
This is detailed within the code.
Related to T3294
Related to T3781
Some origins are listed as plain file urls while they are not: they are
possibly vcs repositories. So this commit tries to detect and deal with
those if possible. If not possible, they are skipped.
Related to T3781
Related to P1470
The end goal is to ingest the origins sparsely, which would avoid hitting
the various servers around the same time for colocated origins in the
upstream manifest (especially file or tarball ones).
Related to T3781