After a first attempt with D7812, this one uses a different strategy to
retrieve origins.
Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensures that we have a list of "official" packages.
Parse metadata from the 'desc' file of each package to build origin URLs.
Scrape the origin URL to get artifact metadata listing all versions of a package.
It also fetches and extracts unofficial 'arm' packages from archlinuxarm.org, but in that case we cannot get all versions of an arm package.
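For illustration, a minimal sketch of parsing such a 'desc' file (assuming the usual pacman %FIELD% block layout; the origin URL template shown in the comment is hypothetical):

    def parse_desc(content: str) -> dict:
        # 'desc' files consist of %FIELD% headers followed by value lines,
        # separated by blank lines, e.g. "%NAME%\nzlib\n\n%VERSION%\n1:1.2.12-2\n"
        fields, key = {}, None
        for line in content.splitlines():
            line = line.strip()
            if not line:
                key = None
            elif line.startswith("%") and line.endswith("%"):
                key = line.strip("%")
                fields[key] = []
            elif key is not None:
                fields[key].append(line)
        return fields

    # hypothetical origin URL built from the parsed metadata, e.g.:
    # f"https://archlinux.org/packages/{repo}/{arch}/{fields['NAME'][0]}"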
Related T4233
That means detected GitHub URLs {https,git,http}://github.com/${user_repo}(.git) are
canonicalized to the https://github.com/${user_repo} format.
This avoids duplication of origins.
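A possible sketch of that canonicalization (the exact pattern used by the lister may differ):

    import re

    GITHUB_URL = re.compile(
        r"^(?:git|https?)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
    )

    def canonicalize(url: str) -> str:
        match = GITHUB_URL.match(url)
        if match:
            return "https://github.com/" + match.group("user_repo")
        return url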
Related to T4232
Pass the raw bytes of the pom file content to xmltodict.parse and let
it do the string decoding based on the encoding declared in the pom file.
If the string decoding fails due to an invalid declared encoding,
xml.parsers.expat.ExpatError will be raised and caught by the lister,
which ignores the pom file and continues listing.
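Roughly (a sketch, not the exact lister code):

    import xml.parsers.expat
    import xmltodict

    def parse_pom(pom_bytes: bytes):
        try:
            # expat decodes the bytes using the encoding declared in the
            # XML prolog of the pom file
            return xmltodict.parse(pom_bytes)
        except xml.parsers.expat.ExpatError:
            # invalid or unsupported declared encoding: skip this pom file
            return None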
Related to T3874
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
in the lister.
So handle that case to avoid a premature exit of the listing process.
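Something along these lines (a sketch; the millisecond timestamp unit is an assumption about the index export format):

    from datetime import datetime, timezone

    def to_last_update(mtime):
        # mtime can be null in the maven index; in that case no last_update is set
        if not mtime:
            return None
        return datetime.fromtimestamp(int(mtime) / 1000, tz=timezone.utc)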
Related to T3874
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create the associated loading tasks.
In that case, the groupId and artifactId are not used by the lister,
so it is better to remove their extraction; this also prevents errors
when that information is missing from pom files.
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.
This is not the way Software Heritage decided to archive sources
coming from package managers. Instead, one origin should be created
per package, and all its versions should be found as releases in the
snapshot produced by the package loader.
So modify the maven lister to create one origin per package,
grouping all its versions.
This change also modifies the way incremental listing is handled:
ListedOrigin instances will be yielded only if new versions of a
package have been discovered since the last listing.
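The idea can be sketched as follows (field names and state handling are simplified assumptions; the real lister yields ListedOrigin instances and persists known versions in its state):

    from collections import defaultdict
    from typing import Dict, Iterator, List, Set

    def group_and_filter(
        artifacts: List[dict], known_versions: Dict[str, Set[str]]
    ) -> Iterator[dict]:
        # group every discovered artifact under a single origin URL per package
        by_origin: Dict[str, List[dict]] = defaultdict(list)
        for artifact in artifacts:
            by_origin[artifact["origin_url"]].append(artifact)
        for origin_url, origin_artifacts in by_origin.items():
            versions = {a["version"] for a in origin_artifacts}
            # only yield an origin if new versions appeared since the last listing
            if versions - known_versions.get(origin_url, set()):
                yield {"url": origin_url, "artifacts": origin_artifacts}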
Tests have been updated to reflect these changes.
Related to T3874
Previously we had as many origins as versions for a crate package; the URL was a link
to a specific crate version package.
Refactor to have one origin per package name and add an 'artifacts' entry to
extra_loader_arguments that lists all versions, package URLs and checksums.
The origin URL is now a link to the related HTTP API endpoint for a package name.
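For illustration, the resulting ListedOrigin data roughly looks like this (URL shapes and artifact field names are approximations, not the lister's exact output):

    {
        "url": "https://crates.io/api/v1/crates/rand",
        "visit_type": "crates",
        "extra_loader_arguments": {
            "artifacts": [
                {
                    "version": "0.8.5",
                    "package_url": "https://static.crates.io/crates/rand/rand-0.8.5.crate",
                    "checksum": "<sha256 of the crate file>",
                },
                # ... one entry per published version
            ],
        },
    }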
Related to T4104
The Attic folder that can sometimes be found in a CVS repository
is a special one used by CVS to store RCS files and should not be
considered as a valid module name when listing CVS projects.
This aligns the behavior with other listers (e.g. sourceforge, ...) to continue listing
if some information is not retrievable at all.
Related to T3874
The Crates lister retrieves crate packages for the Rust language.
It basically fetches https://github.com/rust-lang/crates.io-index.git
into a temporary directory and then walks through each file to get
each crate's info.
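A condensed sketch of that walk (assuming the crates.io index layout of one JSON document per line, one line per published version):

    import json
    import os

    def walk_index(index_dir: str):
        for root, _, files in os.walk(index_dir):
            if ".git" in root:
                continue
            for name in files:
                if name == "config.json":  # index configuration, not a crate
                    continue
                with open(os.path.join(root, name)) as f:
                    for line in f:
                        info = json.loads(line)
                        yield info["name"], info["vers"], info["cksum"]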
Commit 6a7479553e modified the origin URLs for CVS projects
hosted on SourceForge, but it also broke incremental listing
due to a no longer valid assertion, so fix that issue.
CVS projects are different from other VCS ones: they use the rsync
protocol, a list of modules needs to be fetched from an info page,
and multiple origin URLs can be produced for the same project.
Related to T3789
Prior to this commit, the listing could fail when reading either a page or the page of
results (the Launchpad API raises RestfulError). It now retries when that kind of
exception happens. If the error persists (after multiple retries with exponential
backoff), the listing continues nonetheless (with warning logs).
Note that if the page ends up being empty, it is no longer accounted for.
This actually allows the listing to finish in case of issues.
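A sketch of the retry logic (using tenacity here; the exact retry helper and exception import used by the lister may differ):

    import logging

    from lazr.restfulclient.errors import RestfulError
    from tenacity import (
        retry,
        retry_if_exception_type,
        stop_after_attempt,
        wait_exponential,
    )

    logger = logging.getLogger(__name__)

    @retry(
        retry=retry_if_exception_type(RestfulError),
        wait=wait_exponential(multiplier=1),
        stop=stop_after_attempt(5),
        reraise=True,
    )
    def read_entries(page):
        # iterating over a Launchpad collection page can raise RestfulError
        return list(page)

    def read_entries_or_skip(page):
        try:
            return read_entries(page)
        except RestfulError as exc:
            # retries exhausted: log a warning and keep listing
            logger.warning("Skipping page after repeated errors: %s", exc)
            return []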
Related to T3945
Bazaar support was removed a long time ago and predates a lot of the new
mechanisms in place in the API. Unfortunately, it looks like a lot of
the URLs are offline now, but there are still a few projects that can be
listed, so this is pretty low-effort.
I would like to use it as the metadata authority URI in the loader,
instead of '{p_url.scheme}://{p_url.netloc}/', which I do not think
is accurate, as it is possible to have multiple Maven instances at
the same netloc.
A Debian package can have sources coming from multiple suites,
so we need to make sure to update the last_update field in the
ListedOrigin model if the currently processed suite has a greater
modification time for its sources index.
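In essence (a simplified sketch):

    from datetime import datetime
    from typing import Optional

    def newest_last_update(
        current: Optional[datetime], suite_index_mtime: Optional[datetime]
    ) -> Optional[datetime]:
        # keep the most recent sources index modification time across suites
        if suite_index_mtime is None:
            return current
        if current is None or suite_index_mtime > current:
            return suite_index_mtime
        return current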
Related to T2400
Use the value of the "Last-Modified" header from the HTTP response
resulting from the Debian sources index HTTP request.
This prevents creating loading tasks for Debian packages with no
changes since the last listing.
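For example (a sketch; the real lister may obtain and parse the header differently):

    from datetime import datetime
    from email.utils import parsedate_to_datetime
    from typing import Optional

    import requests

    def index_last_modified(sources_index_url: str) -> Optional[datetime]:
        response = requests.get(sources_index_url)
        value = response.headers.get("Last-Modified")
        # e.g. "Wed, 01 Jun 2022 07:28:00 GMT" -> timezone-aware datetime
        return parsedate_to_datetime(value) if value else None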
Related to T2400
Not all Debian suites have the same set of components.
So prefer to log that a component is missing for a suite instead of
raising an exception that would stop the listing.
For a given package, the Debian lister generates a dictionary mapping
distribution and version to a list of files to be processed by the
Debian loader.
For each file to process, the Debian loader expects to find a URI
in order to download it and then use its content to ingest the package
source code into the archive.
However, it turns out these URIs were not computed by the lister
in its current implementation, making any Debian loading task fail
due to this missing info.
So add the computation of these URIs and ensure they will be provided
in the Debian loader input parameters.
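The URIs are essentially the mirror URL joined with the per-file directory and file name found in the Sources index, along these lines (a sketch; parameter names are illustrative):

    from typing import List

    def file_uris(mirror_url: str, directory: str, filenames: List[str]) -> List[str]:
        # e.g. <mirror_url>/pool/main/z/zlib/zlib_<version>.orig.tar.xz
        return [f"{mirror_url.rstrip('/')}/{directory}/{name}" for name in filenames]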
Related to T2400
In some circumstances, GitHub will return two separate repos with the
same html_url on the same page. This makes the lister fail with a
cardinality error.
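One way to guard against this is to deduplicate repositories by URL within a page before yielding origins, for instance (a sketch, not necessarily the fix that was applied):

    def dedup_repos(page_repos):
        # keep only the first occurrence of each html_url within a page
        seen, unique = set(), []
        for repo in page_repos:
            if repo["html_url"] not in seen:
                seen.add(repo["html_url"])
                unique.append(repo)
        return unique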
The Maven lister retrieves the Maven Central indexes, exports them in a
convenient text format, and parses them to identify all source archives and
pom files in the Maven repository. Then the pom files are downloaded and
analysed to find and yield any SCM reference.
Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.
Related to T1724