Numerous listers were using the same page_request method, or an equivalent,
in their implementations, so deduplicate that code by adding an
http_request method to the base lister class swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request, logs
some useful info for debugging and error reporting, and raises an
HTTPError if a request ends up in error.
All listers using that new method now benefit from automatic retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
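A minimal sketch of what such a wrapper may look like; the logging details
and session handling here are illustrative, not the exact
swh.lister.pattern implementation:

    import logging

    import requests

    logger = logging.getLogger(__name__)

    class Lister:
        def __init__(self) -> None:
            self.session = requests.Session()

        def http_request(
            self, url: str, method: str = "GET", **kwargs
        ) -> requests.Response:
            logger.debug("Fetching URL %s with params %s", url, kwargs.get("params"))
            response = self.session.request(method, url, **kwargs)
            if response.status_code not in (200, 304):
                logger.warning(
                    "Unexpected HTTP status code %s on %s: %s",
                    response.status_code,
                    response.url,
                    response.content[:1024],
                )
            # raises requests.HTTPError on 4xx/5xx responses
            response.raise_for_status()
            return response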
Instead of retrying HTTP requests only for the 429 status code by default,
prefer the generic retry policy, which also retries for status
codes >= 500 as well as on ConnectionError exceptions.
Rename throttling_retry decorator to http_retry to reflect this change.
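A minimal sketch of such a policy, assuming tenacity is used for the retry
machinery; the predicate and backoff parameters are illustrative:

    from requests.exceptions import ConnectionError, HTTPError
    from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

    def is_retryable(exception: BaseException) -> bool:
        if isinstance(exception, ConnectionError):
            return True
        if isinstance(exception, HTTPError) and exception.response is not None:
            status = exception.response.status_code
            return status == 429 or status >= 500
        return False

    # retries on 429, on server errors (>= 500) and on connection errors
    http_retry = retry(
        retry=retry_if_exception(is_retryable),
        wait=wait_exponential(min=1, max=60),
        stop=stop_after_attempt(5),
        reraise=True,
    )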
For some forges, the default tab on a repository detail page is not the
summary tab, so the clone urls are not detected and the repository
is ignored. See the sketch below.
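A minimal sketch of the workaround, assuming a cgit-style forge where clone
urls appear as rel="vcs-git" anchors on the summary tab; the helper names and
control flow are illustrative, not the exact fix:

    import requests
    from bs4 import BeautifulSoup

    def find_clone_urls(repository_url: str) -> list:
        def page(url: str) -> BeautifulSoup:
            return BeautifulSoup(requests.get(url).text, features="html.parser")

        def clone_urls(soup: BeautifulSoup) -> list:
            # cgit exposes clone urls as <a rel="vcs-git"> anchors
            return [a["href"] for a in soup.find_all("a", rel="vcs-git")]

        soup = page(repository_url)
        urls = clone_urls(soup)
        if not urls:
            # the default tab is not the summary one: follow the summary tab link
            summary_tab = soup.find("a", string="summary")
            if summary_tab is not None:
                summary_url = requests.compat.urljoin(repository_url, summary_tab["href"])
                urls = clone_urls(page(summary_url))
        return urls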
Related to T4544
In order to get a last_update value for each ListedOrigin sent to the
scheduler database, send an extra HTTP request for each listed package to
the /api/packages/<package_name> endpoint of the pub.dev API.
A pub.dev developer informed us that this endpoint is heavily used and
cached, so there is no particular issue with querying it for each package
in a row periodically.
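A minimal sketch of that per-package lookup; the JSON field layout is an
assumption about the pub.dev response:

    from datetime import datetime

    import requests

    def get_last_update(package_name: str) -> datetime:
        response = requests.get(f"https://pub.dev/api/packages/{package_name}")
        response.raise_for_status()
        # publication date of the latest version (assumed field layout)
        published = response.json()["latest"]["published"]
        return datetime.fromisoformat(published.replace("Z", "+00:00"))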
Simplify the code for downloading the packages index, as the gzip and
deflate transfer-encodings are automatically decoded by requests; also
do not stream a response that only weighs a couple of megabytes, and
store HTTP responses in memory instead.
Also add more debug logs to track lister execution.
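A minimal sketch of the simplified download; the index url is a placeholder:

    import requests

    index_url = "https://example.org/packages.index"  # placeholder
    response = requests.get(index_url)
    response.raise_for_status()
    # requests already decoded any gzip/deflate encoding, and the index only
    # weighs a couple of megabytes, so holding it in memory is fine
    index = response.text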
By using a single equality assertion instead of checking len() and then
comparing element by element with zip(), pytest can find the common/missing
elements and print them nicely when the two lists are unequal.
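For illustration:

    expected = ["https://example.org/a", "https://example.org/b"]
    actual = ["https://example.org/a", "https://example.org/b"]

    # before: element-wise checks hide which items actually differ
    assert len(expected) == len(actual)
    for expected_item, actual_item in zip(expected, actual):
        assert expected_item == actual_item

    # after: pytest's assertion rewriting diffs the two lists and reports
    # the common and missing elements directly
    assert expected == actual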
This removes code and adds support for incremental pagination.
While both are essentially the same lister now, it still makes sense to
keep the Gitea lister separate, in order to:
1. display them in different categories on https://archive.softwareheritage.org/
2. support possible divergence of APIs in the future
xmltodict cannot parse POM files with a multi-byte encoding, so prefer to
use the XML parser of BeautifulSoup, based on lxml, instead.
Also drop the xmltodict requirement as it is no longer used in the
swh-lister codebase.
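A minimal sketch of the replacement parsing; the sample document and field
lookup are illustrative:

    from bs4 import BeautifulSoup

    pom_bytes = (
        b"<project><scm>"
        b"<connection>scm:git:https://example.org/repo.git</connection>"
        b"</scm></project>"
    )
    # features="xml" selects the lxml-based XML parser, which handles
    # multi-byte encodings declared in the document
    pom = BeautifulSoup(pom_bytes, features="xml")
    connection = pom.find("connection")
    scm_url = connection.text if connection is not None else None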
After a first attempt with D7812, this one uses a different strategy to
retrieve origins.
Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensures that we have a list of "official" packages.
Parse metadata from each 'desc' file to build origin urls (see the sketch below).
Scrape the origin url to get the artifacts metadata listing all versions of a package.
It also fetches and extracts unofficial 'arm' packages from archlinuxarm.org, but in that case we cannot get all versions of an arm package.
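A minimal sketch of parsing one 'desc' file, based on the %FIELD% block
layout of Arch package databases; the field handling is illustrative:

    def parse_desc(content: str) -> dict:
        """Map each %FIELD% of an Arch 'desc' file to its list of values."""
        fields = {}
        for block in content.strip().split("\n\n"):
            lines = block.splitlines()
            if lines and lines[0].startswith("%") and lines[0].endswith("%"):
                fields[lines[0].strip("%")] = lines[1:]
        return fields

    desc = "%NAME%\ngzip\n\n%VERSION%\n1.12-1\n\n%URL%\nhttps://www.gnu.org/software/gzip/\n"
    assert parse_desc(desc)["NAME"] == ["gzip"]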
Related to T4233
That means detected github urls of the form {https,git,http}://github.com/${user_repo}(.git)
are canonicalized to the https://github.com/${user_repo} format.
This avoids duplication of origins.
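A minimal sketch of such a canonicalization; the regular expression is
illustrative:

    import re

    GITHUB_URL = re.compile(
        r"^(?:https?|git)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
    )

    def canonical_github_origin(url: str) -> str:
        match = GITHUB_URL.match(url)
        return f"https://github.com/{match.group('user_repo')}" if match else url

    assert (
        canonical_github_origin("git://github.com/python/cpython.git")
        == "https://github.com/python/cpython"
    )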
Related to T4232
Pass the raw bytes of the pom file content to xmltodict.parse and let
it do the string decoding based on the encoding declared in the pom file.
If the string decoding fails due to an invalid declared encoding, an
xml.parsers.expat.ExpatError will be raised and caught by the lister,
which ignores the pom file and continues listing.
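A minimal sketch of that error handling; the logging and control flow are
illustrative:

    import logging
    from xml.parsers.expat import ExpatError

    import xmltodict

    logger = logging.getLogger(__name__)

    def parse_pom(pom_bytes: bytes):
        try:
            # xmltodict decodes the bytes itself, honoring the declared encoding
            return xmltodict.parse(pom_bytes)
        except ExpatError:
            logger.info("Invalid pom file, skipping it")
            return None  # the caller moves on to the next pom file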
Related to T3874
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
in the lister.
So handle that case to avoid a premature exit of the listing process.
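A minimal sketch of the guard; the field name and millisecond unit are
assumptions about the maven index records:

    from datetime import datetime, timezone
    from typing import Optional

    def jar_last_update(record: dict) -> Optional[datetime]:
        mtime = record.get("time")  # may be null in the maven index
        if mtime is None:
            return None
        return datetime.fromtimestamp(mtime / 1000, tz=timezone.utc)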
Related to T3874
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create the associated loading tasks.
The groupId and artifactId are not used by the lister, so it is better
to remove their extraction; this also prevents errors when that info
is missing in pom files.
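A minimal sketch of extracting only the scm URL; the field layout follows
the POM format and the helper name is illustrative:

    import xmltodict

    def extract_scm_url(pom_bytes: bytes):
        project = xmltodict.parse(pom_bytes).get("project") or {}
        scm = project.get("scm") or {}
        # e.g. "scm:git:https://github.com/user/repo.git"; no groupId or
        # artifactId lookup, so their absence cannot break the parsing
        return scm.get("connection")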
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.
This is not how Software Heritage decided to archive sources
coming from package managers. Instead, one origin should be created
per package, and all its versions should be found as releases in the
snapshot produced by the package loader.
So modify the maven lister to create one origin per package,
grouping all its versions.
This change also modifies the way incremental listing is handled:
ListedOrigin instances are now yielded only if new versions of a
package were discovered since the last listing.
Tests have been updated to reflect these changes.
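A minimal sketch of the grouping and of the incremental filtering; the
names and state layout are illustrative:

    from collections import defaultdict

    def group_and_filter(discovered_artifacts, seen_versions):
        """Yield (package, artifacts) pairs for packages with unseen versions.

        discovered_artifacts: iterable of dicts with "package" and "version" keys
        seen_versions: mapping from package name to the versions already listed
        """
        by_package = defaultdict(list)
        for artifact in discovered_artifacts:
            by_package[artifact["package"]].append(artifact)
        for package, artifacts in by_package.items():
            versions = {artifact["version"] for artifact in artifacts}
            if versions - seen_versions.get(package, set()):
                yield package, artifacts  # one origin per package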
Related to T3874
Previously we had as many origins as versions for a crate package, and the
url was a link to a specific crate version package.
Refactor to have one origin per package name, and add an 'artifacts' entry to
extra_loader_arguments listing, for each version, the package url and checksum.
The origin url is now a link to the related http api endpoint for the package name.
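A minimal sketch of the new origin shape; the url patterns and artifact
fields are assumptions for illustration:

    origin = {
        # http api endpoint for the package name
        "url": "https://crates.io/api/v1/crates/rand",
        "visit_type": "crates",
        "extra_loader_arguments": {
            "artifacts": [
                {
                    "version": "0.8.5",
                    "url": "https://crates.io/api/v1/crates/rand/0.8.5/download",
                    "checksum": "<sha256 of the crate file>",
                },
            ],
        },
    }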
Related to T4104