The latest beautifulsoup4 release (4.13) seems to have fixed issues
related to unexpected encodings in XML files, so a test that was
previously passing is now failing.
Update that test to check that the origin URL and visit type can be
successfully extracted from a POM file with an unexpected encoding.
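A minimal sketch of what the updated test asserts (the fixture path and
POM content are assumptions, not the actual swh-lister test code):

    from bs4 import BeautifulSoup

    def test_origin_extraction_from_pom_with_unexpected_encoding():
        # POM bytes whose actual encoding differs from the declared one
        with open("data/pom-unexpected-encoding.xml", "rb") as f:
            pom = BeautifulSoup(f.read(), "xml")
        connection = pom.select_one("project scm connection")
        assert connection is not None
        # a "scm:git:<url>" value yields both the visit type and origin URL
        _, visit_type, url = connection.text.split(":", 2)
        assert visit_type == "git"
        assert url.startswith("https://")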
The latest tenacity release introduces internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep, which was not working previously.
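For instance, with the pytest-mock mocker fixture (a sketch, not the
actual test code):

    def test_rate_limited_lister_retries(mocker):
        # tenacity's default sleep ends up calling time.sleep, so patching
        # it globally makes the retry loop run instantly in tests
        mocked_sleep = mocker.patch("time.sleep")
        # ... exercise a lister call that gets rate-limited and retried ...
        assert mocked_sleep.call_count > 0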
Since the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors were reported
related to beautifulsoup4 typing.
As the return type of the bs4 find method is the union
Tag | NavigableString | None, isinstance calls must be used to narrow
the type, which is not great.
So prefer the select_one method instead, which returns Optional[Tag]:
a simple None check is enough to ensure the typing is correct.
In a similar manner, replace uses of the find_all method with the select
method, which also has the advantage of simplifying the code.
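A simplified illustration of the difference:

    from bs4 import BeautifulSoup, Tag

    pom = BeautifulSoup(b"<project><url>https://example.org</url></project>", "xml")

    # find returns Tag | NavigableString | None, so an isinstance check is
    # needed to narrow the type before accessing Tag attributes
    found = pom.find("url")
    if isinstance(found, Tag):
        print(found.text)

    # select_one returns Optional[Tag], so a None check is enough
    selected = pom.select_one("project > url")
    if selected is not None:
        print(selected.text)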
While deploying the nixguix lister, I realized that even though the credentials
configuration is properly set for all listers, the listers actually requiring
github origin canonicalization do not have access to the github credentials:
those are dropped in the constructor, which only keeps the lister's own
credentials. This currently translates to listers being rate-limited.
This commit fixes it by pushing the self.github_session instantiation into the
constructor when the lister explicitly requires the github session, hence
lifting the rate limit for the maven, packagist, nixguix, and github listers.
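The shape of the fix, heavily simplified (the import path, signatures, and
credentials layout below are assumptions for illustration; the real
constructor in swh.lister.pattern.Lister takes more parameters):

    from swh.lister.github.utils import GitHubSession

    class Lister:
        LISTER_NAME = "maven"

        def __init__(self, credentials=None, with_github_session=False):
            all_credentials = credentials or {}
            # instantiate the github session from the full credentials dict,
            # before it gets narrowed down to this lister's own entries
            self.github_session = (
                GitHubSession(credentials=all_credentials.get("github"))
                if with_github_session
                else None
            )
            self.credentials = all_credentials.get(self.LISTER_NAME, {})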
Related to infra/sysadm-environment#4655
The User-Agent HTTP header value will now contain the lister name as well
as a link to our contact form, so that sysadmins can easily reach us if needed.
The following template is used to generate it:
"Software Heritage <lister_name> lister v<swh-lister version>
(+https://www.softwareheritage.org/contact)"
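For example (the version lookup below is illustrative):

    import requests
    from importlib.metadata import version

    session = requests.Session()
    session.headers["User-Agent"] = (
        f"Software Heritage maven lister v{version('swh.lister')}"
        " (+https://www.softwareheritage.org/contact)"
    )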
Numerous listers were using the same page_request method, or an equivalent,
in their implementation, so deduplicate that code by adding an http_request
method to the base lister class swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting; an HTTPError is also
raised if a request ends up in error.
All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
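The helper looks roughly like this (simplified; in addition, the real
method is wrapped with the http_retry decorator described below):

    import logging

    import requests

    logger = logging.getLogger(__name__)

    class Lister:
        def __init__(self):
            self.session = requests.Session()

        def http_request(self, url: str, method: str = "GET", **kwargs) -> requests.Response:
            logger.debug("Fetching URL %s with params %s", url, kwargs.get("params"))
            response = self.session.request(method, url, **kwargs)
            if response.status_code not in (200, 304):
                logger.warning(
                    "Unexpected HTTP status code %s on %s",
                    response.status_code,
                    response.url,
                )
            response.raise_for_status()  # raises HTTPError on 4xx/5xx
            return response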
Instead of retrying HTTP requests only for the 429 status code by default,
prefer to use a generic retry policy that also retries on status
codes >= 500 as well as on ConnectionError exceptions.
Rename the throttling_retry decorator to http_retry to reflect this change.
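A sketch of such a policy built with tenacity (the actual http_retry
implementation may differ in its details):

    from requests.exceptions import ConnectionError, HTTPError
    from tenacity import retry, retry_if_exception, wait_exponential

    def is_retryable(exc: BaseException) -> bool:
        if isinstance(exc, HTTPError) and exc.response is not None:
            status = exc.response.status_code
            return status == 429 or status >= 500
        return isinstance(exc, ConnectionError)

    http_retry = retry(
        retry=retry_if_exception(is_retryable),
        wait=wait_exponential(),
        reraise=True,
    )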
xmltodict cannot parse POM files with multi-byte encodings, so prefer to
use the lxml-based XML parser of BeautifulSoup instead.
Also drop the xmltodict requirement as it is no longer used in the
swh-lister codebase.
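An illustrative snippet (requires lxml to be installed):

    from bs4 import BeautifulSoup

    # a POM declaring a multi-byte encoding
    pom_bytes = (
        '<?xml version="1.0" encoding="UTF-16"?>'
        "<project><scm><connection>"
        "scm:git:https://forge.example.org/project.git"
        "</connection></scm></project>"
    ).encode("utf-16")

    pom = BeautifulSoup(pom_bytes, "xml")  # lxml-based XML parser
    connection = pom.select_one("project scm connection")
    if connection is not None:
        print(connection.text)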
That means detected github urls of the form
{https,git,http}://github.com/${user_repo}(.git) are canonicalized to
the https://github.com/${user_repo} format.
This avoids duplication of origins.
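Expressed as a standalone regex sketch (an illustration of the rule above,
not necessarily how the lister implements it):

    import re

    GITHUB_URL = re.compile(
        r"^(?:https?|git)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
    )

    def canonicalize_github_url(url: str) -> str:
        match = GITHUB_URL.match(url)
        return f"https://github.com/{match.group('user_repo')}" if match else url

    assert (
        canonicalize_github_url("git://github.com/user/repo.git")
        == "https://github.com/user/repo"
    )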
Related to T4232
Pass the raw bytes of the pom file content to xmltodict.parse and let
it do the string decoding based on the encoding declared in the pom file.
If the string decoding fails due to an invalid declared encoding, an
xml.parsers.expat.ExpatError will be raised and caught by the lister,
which ignores the pom file and continues listing.
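A sketch of that error handling (function name is illustrative):

    import logging
    from xml.parsers.expat import ExpatError

    import xmltodict

    logger = logging.getLogger(__name__)

    def parse_pom(pom_url: str, pom_bytes: bytes):
        try:
            # xmltodict decodes the bytes itself, using the encoding
            # declared in the pom's XML prolog
            return xmltodict.parse(pom_bytes)
        except ExpatError as error:
            # invalid declared encoding or malformed XML: skip this pom
            logger.info("Cannot parse POM %s: %s", pom_url, error)
            return None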
Related to T3874
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
in the lister.
So handle that case to avoid a premature exit of the listing process.
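A sketch of the guard, assuming the timestamp is expressed in milliseconds
since the epoch (field handling is simplified here):

    from datetime import datetime, timezone
    from typing import Optional

    def parse_last_update(mtime: Optional[str]) -> Optional[datetime]:
        # the modification time can be null in the maven index: return
        # None instead of failing so the listing keeps going
        if not mtime:
            return None
        return datetime.fromtimestamp(int(mtime) / 1000, tz=timezone.utc)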
Related to T3874
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create the associated loading tasks.
The groupId and artifactId are not used by the lister in that case, so
better to remove their extraction; this will also prevent errors when
that info is missing from pom files.
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.
This is not the way Software Heritage decided to archive sources
coming from package managers. Instead, one origin should be created
per package, and all its versions should be found as releases in the
snapshot produced by the package loader.
So modify the maven lister to create one origin per package, grouping
all its versions.
This change also modifies the way incremental listing is handled:
ListedOrigin instances will be yielded only if we discovered new
versions of a package since the last listing.
Tests have been updated to reflect these changes.
Related to T3874
This aligns the behavior with that of other listers (e.g. sourceforge, ...):
continue listing even when some information is not retrievable at all.
Related to T3874
I would like to use it as the metadata authority URI in the loader,
instead of '{p_url.scheme}://{p_url.netloc}/', which I do not think
is accurate, as it is possible to have multiple Maven instances at
the same netloc.
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all source archives and
pom files in the maven repository. The pom files are then downloaded and
analysed to find and yield any scm reference.
Note: this is a new version of the maven lister diff D6133, which takes
into account the initial round of reviews.
Related to T1724