Commit graph

35 commits

Author SHA1 Message Date
Antoine Lambert
4b3a12fe76 maven, sourceforge: Fix mypy errors 2025-02-17 13:44:30 +01:00
Antoine Lambert
a3d66736a4 maven: Update test that is now failing since beautifulsoup4 4.13
The latest beautifulsoup4 release (4.13) seems to have fixed issues
related to unexpected encodings in XML files, so a test that was
passing previously is now failing.

Update that test to check origin URL and visit type can be
successfully extracted from a POM file with unexpected encoding.
2025-02-10 14:28:33 +01:00
Antoine Lambert
a7607abcf9 tests: Fix mocking of sleep calls with tenacity 8.4.2
The latest tenacity release adds some internal changes that broke the
mocking of sleep calls in tests.

Fix it by directly mocking time.sleep (which was not working previously).
2024-06-28 18:15:36 +02:00
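
The approach of mocking time.sleep directly can be sketched with the standard library alone. This is a minimal illustration, not the actual swh-lister test: wait_and_retry is a hypothetical function standing in for code that sleeps between attempts, and tenacity itself is not used here.

```python
import time
from unittest import mock


def wait_and_retry(attempts: int) -> int:
    """Hypothetical stand-in for code that sleeps between retries."""
    for _ in range(attempts):
        time.sleep(60)  # would make a real test very slow
    return attempts


def test_no_real_sleep():
    # Patch the sleep function on the time module itself, so any code
    # calling time.sleep(...) through the module gets the mock instead.
    with mock.patch("time.sleep") as mocked_sleep:
        assert wait_and_retry(3) == 3
    assert mocked_sleep.call_count == 3
```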
Antoine Lambert
41407e0eff Use beautifulsoup4 CSS selectors to simplify code and type checking
Because the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), mypy reported some errors
related to beautifulsoup4 typing.

The find method of bs4 returns the following union:
Tag | NavigableString | None, so isinstance calls must be used to ensure
proper typing, which is not great.

So prefer the select_one method instead: it returns Optional[Tag], so a
simple None check is enough to ensure typing is correct.
In a similar manner, replace uses of the find_all method with the select method.

This also has the advantage of simplifying the code.
2024-04-16 11:22:51 +02:00
David Douard
714fccc3c7 python: Fix black formatting after bump to 23.1.0 in pre-commit 2023-12-05 10:33:07 +01:00
Valentin Lorentz
1c964cccd3 maven/README: Fix links 2023-09-14 12:03:12 +00:00
Antoine Lambert
fc2bd1e937 mypy: Bump to 1.0.1 and fix new typing errors
Related to swh/meta#4960
2023-02-17 17:56:07 +01:00
Valentin Lorentz
62b0835193 maven: Discard templated origin URLs 2023-02-10 10:43:36 +00:00
Nicolas Dandrimont
e785e67315 Hook up recently introduced options to all listers
Hopefully one day we'll be able to replace all of this mess with PEP692
TypedDict kwargs, but that's only on track for Python 3.12.
2022-12-05 16:33:45 +01:00
Valentin Lorentz
8ea4200909 Validate origin URLs before sending to the scheduler 2022-11-04 15:58:45 +01:00
Antoine R. Dumont (@ardumont)
92d494261f lister: Make sure listers that require github tokens can use them
Deploying the nixguix lister, I realized that even though the credentials configuration
is properly set for all listers, the listers actually requiring github origin
canonicalization do not have access to the github credentials. They are lost in the
constructor, which only keeps the lister's own credentials. This currently translates to
listers being rate-limited.

This commit fixes it by pushing the self.github_session instantiation into the constructor
when the lister explicitly requires the github session, hence lifting the rate limit
for the maven, packagist, nixguix, and github listers.

Related to infra/sysadm-environment#4655
2022-10-26 17:23:40 +02:00
Valentin Lorentz
db2f2f8265 maven: Use real data from github API + rely on requests_mock_datadir 2022-10-13 18:28:17 +02:00
Valentin Lorentz
f7ac524a55 maven: Use requests_mock_datadir to simplify mocking. 2022-10-13 17:57:55 +02:00
Valentin Lorentz
3dbe77156c maven: Make assertions more useful
With set equality, pytest can diff both operands, whereas failures of
plain equality comparisons are harder to read.
2022-10-13 17:41:11 +02:00
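
A minimal sketch of the set-equality idea, using hypothetical URLs rather than the lister's real test data. When the assertion fails, pytest reports which items appear in only one of the two sets.

```python
# Hypothetical data; the real tests compare values produced by the maven lister.
expected_urls = {
    "https://example.org/repo/a",
    "https://example.org/repo/b",
}
listed_urls = ["https://example.org/repo/b", "https://example.org/repo/a"]


def test_listed_urls_match():
    # Comparing two sets ignores ordering, and on failure pytest lists
    # the extra items found on each side of the comparison.
    assert set(listed_urls) == expected_urls
```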
Antoine Lambert
d5c30a3ce3 Update value of User-Agent HTTP request header used by listers
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.

The following template is used to generate it:

"Software Heritage <lister_name> lister v<swh-lister version>
 (+https://www.softwareheritage.org/contact)"
2022-09-26 10:48:40 +02:00
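
The template above could be filled in roughly as follows. The function and constant names are assumptions made for illustration; only the template text comes from the commit message.

```python
# Template text taken from the commit message; names below are hypothetical.
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version} "
    "(+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name: str, version: str) -> str:
    """Return the User-Agent header value for a given lister."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)


# Example with a made-up version number
print(build_user_agent("maven", "1.2.3"))
```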
Antoine Lambert
db6ce12e9e Refactor and deduplicate HTTP requests code in listers
Numerous listers were using the same page_request method or equivalent
in their implementation so prefer to deduplicate that code by adding
an http_request method in base lister class: swh.lister.pattern.Lister.

That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting. An HTTPError
is raised if a request ends up with an error.

All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
2022-09-26 10:48:40 +02:00
Antoine Lambert
9c55acd286 Use generic HTTP retry policy by default and rename dedicated decorator
Instead of retrying HTTP requests only for the 429 status code by default,
prefer to use the generic retry policy, which also retries on status
codes >= 500 and on ConnectionError exceptions.

Rename throttling_retry decorator to http_retry to reflect this change.
2022-09-26 10:48:40 +02:00
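
The retry policy described above (429, status codes >= 500, connection errors) can be sketched with the standard library only. This is not the swh-lister http_retry decorator: HTTPStatusError, is_retryable and http_retry_sketch are hypothetical names, and the built-in ConnectionError stands in for the one from requests.

```python
import time
from functools import wraps


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(exc: BaseException) -> bool:
    """Retry on connection errors, on 429 and on any status code >= 500."""
    if isinstance(exc, ConnectionError):
        return True
    if isinstance(exc, HTTPStatusError):
        return exc.status_code == 429 or exc.status_code >= 500
    return False


def http_retry_sketch(max_attempts: int = 3, delay: float = 0.0):
    """Decorator retrying the wrapped call while the error is retryable."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-retryable errors
                    if attempt == max_attempts or not is_retryable(exc):
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator
```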
Antoine Lambert
cee6bcb514 maven: Use BeautifulSoup instead of xmltodict for parsing pom files
xmltodict cannot parse POM files with multi-byte encoding so prefer to
use the XML parser of BeautifulSoup based on lxml instead.

Also drop xmltodict requirement as it is no longer used in swh-lister
codebase.
2022-08-09 11:11:45 +02:00
Antoine R. Dumont (@ardumont)
263db667d0 Adapt maven lister to list canonical gh urls if any
That means detected github urls {https,git,http}://github.com/${user_repo}(.git) are
canonicalized to https://github.com/${user_repo} format.

This avoids duplication of origins.

Related to T4232
2022-05-23 14:47:11 +02:00
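
The URL normalization described above can be approximated with a regular expression. This is only a string-level sketch under stated assumptions: canonical_github_url is a hypothetical name, and the actual lister may resolve canonical URLs differently (for instance through the GitHub API).

```python
import re
from typing import Optional

# Matches {https,git,http}://github.com/<user>/<repo> with an optional
# ".git" suffix and an optional trailing slash.
_GITHUB_URL_RE = re.compile(
    r"^(?:https|http|git)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
)


def canonical_github_url(url: str) -> Optional[str]:
    """Return https://github.com/<user>/<repo>, or None if not a GitHub URL."""
    match = _GITHUB_URL_RE.match(url.strip())
    if match is None:
        return None
    return f"https://github.com/{match.group('user_repo')}"
```

With this sketch, git://github.com/user/repo.git and http://github.com/user/repo both map to https://github.com/user/repo, so they would be listed as a single origin.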
Antoine Lambert
3f6c7edc24 maven: Prevent UnicodeDecodeError when processing pom file
Pass the raw bytes of pom file content in xmltodict.parse and let
it do the string decoding based on the encoding declared in pom file.

If the string decoding fails due to an invalid declared encoding,
xml.parsers.expat.ExpatError is raised and caught by
the lister, which ignores the pom file and continues listing.

Related to T3874
2022-05-02 14:01:58 +02:00
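
A minimal sketch of the same idea with the standard library: the raw bytes are handed to an expat-based parser, which decodes them according to the XML declaration. xml.dom.minidom is used here as a stand-in for xmltodict, and parse_pom is a hypothetical helper, not the lister's code.

```python
import logging
from typing import Optional
from xml.dom import minidom
from xml.parsers.expat import ExpatError

logger = logging.getLogger(__name__)


def parse_pom(raw: bytes) -> Optional[minidom.Document]:
    """Parse raw pom bytes, returning None if the content cannot be parsed."""
    try:
        # Pass bytes, not str: the parser honours the declared encoding
        return minidom.parseString(raw)
    except ExpatError as exc:
        # Ignore the pom file and let the caller continue listing
        logger.info("Skipping unparsable pom file: %s", exc)
        return None


# Content declared as ISO-8859-1 and encoded accordingly parses fine
latin1_pom = (
    '<?xml version="1.0" encoding="ISO-8859-1"?>'
    "<project><name>caf\u00e9</name></project>"
).encode("latin-1")

# Content declared as UTF-8 but containing an invalid UTF-8 byte is skipped
mismatched_pom = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b"<project><name>caf\xe9</name></project>"
)
```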
Antoine Lambert
0222a8f5c4 maven: Handle null mtime value in index for jar archive
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
in the lister.

So handle that case to avoid premature exit of the listing process.

Related to T3874
2022-04-29 13:59:17 +02:00
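
Handling a missing modification time could look roughly like this. The input format is an assumption made for illustration (epoch milliseconds as a string, with None, an empty string or "null" meaning no value), and parse_mtime is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_mtime(raw_mtime: Optional[str]) -> Optional[datetime]:
    """Convert an index mtime to an aware datetime, or None if it is missing.

    Assumed format: epoch milliseconds as a string, or None/""/"null".
    """
    if raw_mtime is None or raw_mtime.strip() in ("", "null"):
        # No usable value: report it as missing instead of failing
        return None
    return datetime.fromtimestamp(int(raw_mtime) / 1000, tz=timezone.utc)
```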
Antoine Lambert
378613ad82 maven: Remove extraction of groupId and artifactId from pom files
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create associated loading tasks.

In that case, the groupId and artifactId are not used by the lister,
so it is better to remove their extraction. It also prevents errors when
that info is missing from pom files.
2022-04-29 11:15:03 +02:00
Antoine Lambert
22bcd9deb2 maven: Create one origin per package instead of one per package version
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.

This is not the way Software Heritage decided to archive sources
coming from package managers. Instead one origin should be created
per package and all its versions should be found as releases in the
snapshot produced by the package loader.

So modify the maven lister in order to create one origin per package
grouping all its versions.

This change also modifies the way incremental listing is handled:
ListedOrigin instances will be yielded only if new versions of a
package have been discovered since the last listing.

Tests have been updated to reflect these changes.

Related to T3874
2022-04-29 10:57:04 +02:00
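
The grouping and incremental behaviour described above can be sketched as follows. The data shapes and function names are assumptions for illustration: archives are modelled as (package URL, version) pairs, and the state from the previous listing as a mapping from package URL to known versions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Set, Tuple


def group_versions(archives: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Group (package_url, version) pairs into one entry per package."""
    grouped: Dict[str, Set[str]] = defaultdict(set)
    for package_url, version in archives:
        grouped[package_url].add(version)
    return dict(grouped)


def origins_with_new_versions(
    current: Dict[str, Set[str]], previous: Dict[str, Set[str]]
) -> Iterator[str]:
    """Yield package URLs that gained versions since the previous listing."""
    for package_url, versions in sorted(current.items()):
        if versions - previous.get(package_url, set()):
            yield package_url
```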
Antoine Lambert
334c54091e maven: Remove duplicated code related to setting instance from netloc
That processing is already handled in the base Lister class constructor.
2022-04-25 17:31:02 +02:00
Antoine R. Dumont (@ardumont)
10bb8db345 maven: Fix argument of type 'NoneType' is not iterable
Related to T3874
2022-04-14 15:33:24 +02:00
Antoine R. Dumont (@ardumont)
7c8428d01c maven: Continue listing if unable to retrieve pom information
This aligns the behavior with other listers (e.g. sourceforge, ...) to continue listing
if some information is not retrievable at all.

Related to T3874
2022-04-13 17:59:20 +02:00
Antoine R. Dumont (@ardumont)
e4b27a1e98 maven: log error message when not able to retrieve the index to read
Without this, the lister legitimately cannot list anything.
2022-04-13 17:41:44 +02:00
Antoine Lambert
d38e05cff7 python: Reformat code with black 22.3.0
Related to T3922
2022-04-08 15:15:09 +02:00
Antoine R. Dumont (@ardumont)
7ff1390378 maven: Fix last update datetime
We need to avoid using naive datetime as this fails during conversion.

Related to T3746
Related to P1280
2022-02-09 16:59:40 +01:00
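
The difference between naive and timezone-aware datetimes can be shown with a short sketch. It assumes the naive value is already expressed in UTC; to_aware_utc is a hypothetical helper and the commit does not say which conversion was failing.

```python
from datetime import datetime, timezone


def to_aware_utc(value: datetime) -> datetime:
    """Attach UTC to a naive datetime (assumes the value is already in UTC)."""
    if value.tzinfo is not None:
        return value
    return value.replace(tzinfo=timezone.utc)


naive = datetime(2022, 2, 9, 16, 59, 40)  # no tzinfo: ambiguous instant
aware = to_aware_utc(naive)  # same wall-clock time, explicit UTC offset
```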
Boris Baldassari
d4e1e8212a maven: Fix undef last_update in ListedOrigins. 2022-02-08 07:51:01 +01:00
Boris Baldassari
24eeabfade maven: dismiss origins if they are malformed - e.g. wrong pom scm format, add test. 2022-02-08 07:51:01 +01:00
Antoine R. Dumont (@ardumont)
a599493b48 maven: Let logging instruction do the formatting 2022-01-25 11:32:23 +01:00
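
Letting the logging call do the formatting means passing the values as arguments instead of building the string beforehand. A minimal sketch, with a hypothetical logger name, function and message:

```python
import logging

logger = logging.getLogger("maven_lister_sketch")  # hypothetical logger name


def log_scm_url(scm_url: str, pom_url: str) -> None:
    # The template and its arguments are passed separately, so the string
    # is only interpolated if a handler actually processes the record.
    logger.debug("Found scm url %s in pom file %s", scm_url, pom_url)
```

Compared with an f-string, this avoids formatting work when the DEBUG level is disabled and keeps the raw template available to log handlers.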
Antoine R. Dumont (@ardumont)
8667b04abc maven: Add more debug logging instruction
And log the metadata dictionary.
2022-01-25 11:32:23 +01:00
Valentin Lorentz
fa7ecc8fbd maven: Pass the base URL of the Maven instance to the loader
I would like to use it as the metadata authority URI in the loader,
instead of '{p_url.scheme}://{p_url.netloc}/', which I do not think
is accurate, as it is possible to have multiple Maven instances at
the same netloc.
2021-12-07 13:51:00 +01:00
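
The concern above can be illustrated with two hypothetical Maven instances hosted at the same netloc: deriving the authority from scheme and netloc alone makes them indistinguishable, while their base URLs differ. The URLs and the function name are made up for this sketch.

```python
from urllib.parse import urlparse


def netloc_authority(url: str) -> str:
    """Build '{scheme}://{netloc}/' from a URL, dropping the path."""
    p_url = urlparse(url)
    return f"{p_url.scheme}://{p_url.netloc}/"


# Two hypothetical Maven instances sharing the same host
base_url_a = "https://repo.example.org/maven-a/"
base_url_b = "https://repo.example.org/maven-b/"

# Both collapse to the same value, so the instances cannot be told apart
same_authority = netloc_authority(base_url_a) == netloc_authority(base_url_b)
```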
Boris Baldassari
8991c625ea lister: Add new maven lister
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724
2021-11-29 17:33:13 +01:00