The latest beautifulsoup4 release (4.13) seems to have fixed issues
related to unexpected encodings in XML files, so a test that was
previously passing is now failing.
Update that test to check that the origin URL and visit type can be
successfully extracted from a POM file with an unexpected encoding.
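A minimal sketch of what the updated test asserts (the fixture path and
POM content are assumptions, not the actual swh-lister test code):

    from bs4 import BeautifulSoup

    def test_origin_extraction_from_pom_with_unexpected_encoding():
        # POM bytes whose actual encoding differs from the declared one
        with open("data/pom-unexpected-encoding.xml", "rb") as f:
            pom = BeautifulSoup(f.read(), "xml")
        connection = pom.select_one("project scm connection")
        assert connection is not None
        # a "scm:git:<url>" value yields both the visit type and origin URL
        _, visit_type, url = connection.text.split(":", 2)
        assert visit_type == "git"
        assert url.startswith("https://")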
The latest tenacity release introduces internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep, which was not working previously.
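For instance, with the pytest-mock mocker fixture (a sketch, not the
actual test code):

    def test_rate_limited_lister_retries(mocker):
        # tenacity's default sleep ends up calling time.sleep, so patching
        # it globally makes the retry loop run instantly in tests
        mocked_sleep = mocker.patch("time.sleep")
        # ... exercise a lister call that gets rate-limited and retried ...
        assert mocked_sleep.call_count > 0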
Since the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors were reported
related to beautifulsoup4 typing.
As the return type of the bs4 find method is the union
Tag | NavigableString | None, isinstance calls must be used to narrow
the type, which is not great.
So prefer the select_one method instead, which returns Optional[Tag]:
a simple None check is enough to ensure the typing is correct.
In a similar manner, replace uses of the find_all method with the select
method, which also has the advantage of simplifying the code.
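A simplified illustration of the difference:

    from bs4 import BeautifulSoup, Tag

    pom = BeautifulSoup(b"<project><url>https://example.org</url></project>", "xml")

    # find returns Tag | NavigableString | None, so an isinstance check is
    # needed to narrow the type before accessing Tag attributes
    found = pom.find("url")
    if isinstance(found, Tag):
        print(found.text)

    # select_one returns Optional[Tag], so a None check is enough
    selected = pom.select_one("project > url")
    if selected is not None:
        print(selected.text)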
While deploying the nixguix lister, I realized that even though the credentials
configuration is properly set for all listers, the listers actually requiring
github origin canonicalization do not have access to the github credentials:
those are dropped in the constructor, which only keeps the lister's own
credentials. This currently translates to listers being rate-limited.
This commit fixes it by pushing the self.github_session instantiation into the
constructor when the lister explicitly requires the github session, hence
lifting the rate limit for the maven, packagist, nixguix, and github listers.
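The shape of the fix, heavily simplified (the import path, signatures, and
credentials layout below are assumptions for illustration; the real
constructor in swh.lister.pattern.Lister takes more parameters):

    from swh.lister.github.utils import GitHubSession

    class Lister:
        LISTER_NAME = "maven"

        def __init__(self, credentials=None, with_github_session=False):
            all_credentials = credentials or {}
            # instantiate the github session from the full credentials dict,
            # before it gets narrowed down to this lister's own entries
            self.github_session = (
                GitHubSession(credentials=all_credentials.get("github"))
                if with_github_session
                else None
            )
            self.credentials = all_credentials.get(self.LISTER_NAME, {})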
Related to infra/sysadm-environment#4655
The User-Agent HTTP header value will now contain the lister name as well
as a link to our contact form, so that sysadmins can easily reach us if needed.
The following template is used to generate it:
"Software Heritage <lister_name> lister v<swh-lister version>
(+https://www.softwareheritage.org/contact)"
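For example (the version lookup below is illustrative):

    import requests
    from importlib.metadata import version

    session = requests.Session()
    session.headers["User-Agent"] = (
        f"Software Heritage maven lister v{version('swh.lister')}"
        " (+https://www.softwareheritage.org/contact)"
    )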
Numerous listers were using the same page_request method, or an equivalent,
in their implementation, so deduplicate that code by adding an http_request
method to the base lister class swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting; an HTTPError is also
raised if a request ends up in error.
All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
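The helper looks roughly like this (simplified; in addition, the real
method is wrapped with the http_retry decorator described below):

    import logging

    import requests

    logger = logging.getLogger(__name__)

    class Lister:
        def __init__(self):
            self.session = requests.Session()

        def http_request(self, url: str, method: str = "GET", **kwargs) -> requests.Response:
            logger.debug("Fetching URL %s with params %s", url, kwargs.get("params"))
            response = self.session.request(method, url, **kwargs)
            if response.status_code not in (200, 304):
                logger.warning(
                    "Unexpected HTTP status code %s on %s",
                    response.status_code,
                    response.url,
                )
            response.raise_for_status()  # raises HTTPError on 4xx/5xx
            return response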
Instead of retrying HTTP requests only for the 429 status code by default,
prefer to use a generic retry policy that also retries on status
codes >= 500 as well as on ConnectionError exceptions.
Rename the throttling_retry decorator to http_retry to reflect this change.
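A sketch of such a policy built with tenacity (the actual http_retry
implementation may differ in its details):

    from requests.exceptions import ConnectionError, HTTPError
    from tenacity import retry, retry_if_exception, wait_exponential

    def is_retryable(exc: BaseException) -> bool:
        if isinstance(exc, HTTPError) and exc.response is not None:
            status = exc.response.status_code
            return status == 429 or status >= 500
        return isinstance(exc, ConnectionError)

    http_retry = retry(
        retry=retry_if_exception(is_retryable),
        wait=wait_exponential(),
        reraise=True,
    )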
xmltodict cannot parse POM files with multi-byte encodings, so prefer to
use the lxml-based XML parser of BeautifulSoup instead.
Also drop the xmltodict requirement as it is no longer used in the
swh-lister codebase.
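An illustrative snippet (requires lxml to be installed):

    from bs4 import BeautifulSoup

    # a POM declaring a multi-byte encoding
    pom_bytes = (
        '<?xml version="1.0" encoding="UTF-16"?>'
        "<project><scm><connection>"
        "scm:git:https://forge.example.org/project.git"
        "</connection></scm></project>"
    ).encode("utf-16")

    pom = BeautifulSoup(pom_bytes, "xml")  # lxml-based XML parser
    connection = pom.select_one("project scm connection")
    if connection is not None:
        print(connection.text)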
That means detected github urls of the form
{https,git,http}://github.com/${user_repo}(.git) are canonicalized to
the https://github.com/${user_repo} format.
This avoids duplication of origins.
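Expressed as a standalone regex sketch (an illustration of the rule above,
not necessarily how the lister implements it):

    import re

    GITHUB_URL = re.compile(
        r"^(?:https?|git)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
    )

    def canonicalize_github_url(url: str) -> str:
        match = GITHUB_URL.match(url)
        return f"https://github.com/{match.group('user_repo')}" if match else url

    assert (
        canonicalize_github_url("git://github.com/user/repo.git")
        == "https://github.com/user/repo"
    )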
Related to T4232
Pass the raw bytes of the pom file content to xmltodict.parse and let
it do the string decoding based on the encoding declared in the pom file.
If the string decoding fails due to an invalid declared encoding, an
xml.parsers.expat.ExpatError will be raised and caught by the lister,
which ignores the pom file and continues listing.
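A sketch of that error handling (function name is illustrative):

    import logging
    from xml.parsers.expat import ExpatError

    import xmltodict

    logger = logging.getLogger(__name__)

    def parse_pom(pom_url: str, pom_bytes: bytes):
        try:
            # xmltodict decodes the bytes itself, using the encoding
            # declared in the pom's XML prolog
            return xmltodict.parse(pom_bytes)
        except ExpatError as error:
            # invalid declared encoding or malformed XML: skip this pom
            logger.info("Cannot parse POM %s: %s", pom_url, error)
            return None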
Related to T3874
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
in the lister.
So handle that case to avoid a premature exit of the listing process.
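A sketch of the guard, assuming the timestamp is expressed in milliseconds
since the epoch (field handling is simplified here):

    from datetime import datetime, timezone
    from typing import Optional

    def parse_last_update(mtime: Optional[str]) -> Optional[datetime]:
        # the modification time can be null in the maven index: return
        # None instead of failing so the listing keeps going
        if not mtime:
            return None
        return datetime.fromtimestamp(int(mtime) / 1000, tz=timezone.utc)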
Related to T3874
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create the associated loading tasks.
The groupId and artifactId are not used by the lister in that case, so
better to remove their extraction; this will also prevent errors when
that info is missing from pom files.
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.
This is not the way Software Heritage decided to archive sources
coming from package managers. Instead, one origin should be created
per package, and all its versions should be found as releases in the
snapshot produced by the package loader.
So modify the maven lister to create one origin per package, grouping
all its versions.
This change also modifies the way incremental listing is handled:
ListedOrigin instances will be yielded only if we discovered new
versions of a package since the last listing.
Tests have been updated to reflect these changes.
Related to T3874
This aligns the behavior with that of other listers (e.g. sourceforge, ...):
continue listing even when some information is not retrievable at all.
Related to T3874
I would like to use it as the metadata authority URI in the loader,
instead of '{p_url.scheme}://{p_url.netloc}/', which I do not think
is accurate, as it is possible to have multiple Maven instances at
the same netloc.
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all source archives and
pom files in the maven repository. The pom files are then downloaded and
analysed to find and yield any scm reference.
Note: this is a new version of the maven lister diff D6133, which takes
into account the initial round of reviews.
Related to T1724