Numerous listers were using the same page_request method, or an equivalent,
in their implementations, so deduplicate that code by adding an
http_request method to the base lister class swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request, logs
some useful info for debugging and error reporting, and raises an
HTTPError if a request ends up in error.
All listers using that new method now benefit from automatic retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
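A minimal sketch of what such a wrapper may look like; the logging details
and session handling here are illustrative, not the exact
swh.lister.pattern implementation:

    import logging

    import requests

    logger = logging.getLogger(__name__)

    class Lister:
        def __init__(self) -> None:
            self.session = requests.Session()

        def http_request(
            self, url: str, method: str = "GET", **kwargs
        ) -> requests.Response:
            logger.debug("Fetching URL %s with params %s", url, kwargs.get("params"))
            response = self.session.request(method, url, **kwargs)
            if response.status_code not in (200, 304):
                logger.warning(
                    "Unexpected HTTP status code %s on %s: %s",
                    response.status_code,
                    response.url,
                    response.content[:1024],
                )
            # raises requests.HTTPError on 4xx/5xx responses
            response.raise_for_status()
            return response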
Instead of retrying HTTP requests only for the 429 status code by default,
prefer the generic retry policy, which also retries for status
codes >= 500 as well as on ConnectionError exceptions.
Rename throttling_retry decorator to http_retry to reflect this change.
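A minimal sketch of such a policy, assuming tenacity is used for the retry
machinery; the predicate and backoff parameters are illustrative:

    from requests.exceptions import ConnectionError, HTTPError
    from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

    def is_retryable(exception: BaseException) -> bool:
        if isinstance(exception, ConnectionError):
            return True
        if isinstance(exception, HTTPError) and exception.response is not None:
            status = exception.response.status_code
            return status == 429 or status >= 500
        return False

    # retries on 429, on server errors (>= 500) and on connection errors
    http_retry = retry(
        retry=retry_if_exception(is_retryable),
        wait=wait_exponential(min=1, max=60),
        stop=stop_after_attempt(5),
        reraise=True,
    )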
For some forges, the default tab on a repository detail page is not the
summary tab, so the clone urls are not detected and the repository
is ignored. See the sketch below.
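A minimal sketch of the workaround, assuming a cgit-style forge where clone
urls appear as rel="vcs-git" anchors on the summary tab; the helper names and
control flow are illustrative, not the exact fix:

    import requests
    from bs4 import BeautifulSoup

    def find_clone_urls(repository_url: str) -> list:
        def page(url: str) -> BeautifulSoup:
            return BeautifulSoup(requests.get(url).text, features="html.parser")

        def clone_urls(soup: BeautifulSoup) -> list:
            # cgit exposes clone urls as <a rel="vcs-git"> anchors
            return [a["href"] for a in soup.find_all("a", rel="vcs-git")]

        soup = page(repository_url)
        urls = clone_urls(soup)
        if not urls:
            # the default tab is not the summary one: follow the summary tab link
            summary_tab = soup.find("a", string="summary")
            if summary_tab is not None:
                summary_url = requests.compat.urljoin(repository_url, summary_tab["href"])
                urls = clone_urls(page(summary_url))
        return urls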
Related to T4544
In order to get a last_update value for each ListedOrigin sent to the
scheduler database, send an extra HTTP request for each listed package to
the /api/packages/<package_name> endpoint of the pub.dev API.
A pub.dev developer informed us that this endpoint is heavily used and
cached, so there is no particular issue with querying it for each package
in a row periodically.
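A minimal sketch of that per-package lookup; the JSON field layout is an
assumption about the pub.dev response:

    from datetime import datetime

    import requests

    def get_last_update(package_name: str) -> datetime:
        response = requests.get(f"https://pub.dev/api/packages/{package_name}")
        response.raise_for_status()
        # publication date of the latest version (assumed field layout)
        published = response.json()["latest"]["published"]
        return datetime.fromisoformat(published.replace("Z", "+00:00"))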
Simplify the code for downloading the packages index, as the gzip and
deflate transfer-encodings are automatically decoded by requests; also
do not stream a response that only weighs a couple of megabytes, and
store HTTP responses in memory instead.
Also add more debug logs to track lister execution.
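A minimal sketch of the simplified download; the index url is a placeholder:

    import requests

    index_url = "https://example.org/packages.index"  # placeholder
    response = requests.get(index_url)
    response.raise_for_status()
    # requests already decoded any gzip/deflate encoding, and the index only
    # weighs a couple of megabytes, so holding it in memory is fine
    index = response.text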
By using a single equality assertion instead of checking len() and then
comparing element by element with zip(), pytest can find the common/missing
elements and print them nicely when the two lists are unequal.
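For illustration:

    expected = ["https://example.org/a", "https://example.org/b"]
    actual = ["https://example.org/a", "https://example.org/b"]

    # before: element-wise checks hide which items actually differ
    assert len(expected) == len(actual)
    for expected_item, actual_item in zip(expected, actual):
        assert expected_item == actual_item

    # after: pytest's assertion rewriting diffs the two lists and reports
    # the common and missing elements directly
    assert expected == actual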
This removes code and adds support for incremental pagination.
While both are essentially the same lister now, it still makes sense to
keep the Gitea lister separate, in order to:
1. display them in different categories on https://archive.softwareheritage.org/
2. support possible divergence of APIs in the future
xmltodict cannot parse POM files with a multi-byte encoding, so prefer to
use the XML parser of BeautifulSoup, based on lxml, instead.
Also drop the xmltodict requirement as it is no longer used in the
swh-lister codebase.
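A minimal sketch of the replacement parsing; the sample document and field
lookup are illustrative:

    from bs4 import BeautifulSoup

    pom_bytes = (
        b"<project><scm>"
        b"<connection>scm:git:https://example.org/repo.git</connection>"
        b"</scm></project>"
    )
    # features="xml" selects the lxml-based XML parser, which handles
    # multi-byte encodings declared in the document
    pom = BeautifulSoup(pom_bytes, features="xml")
    connection = pom.find("connection")
    scm_url = connection.text if connection is not None else None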
After a first attempt with D7812, this one uses a different strategy to
retrieve origins.
Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensures that we have a list of "official" packages.
Parse metadata from each 'desc' file to build origin urls (see the sketch below).
Scrape the origin url to get the artifacts metadata listing all versions of a package.
It also fetches and extracts unofficial 'arm' packages from archlinuxarm.org, but in that case we cannot get all versions of an arm package.
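A minimal sketch of parsing one 'desc' file, based on the %FIELD% block
layout of Arch package databases; the field handling is illustrative:

    def parse_desc(content: str) -> dict:
        """Map each %FIELD% of an Arch 'desc' file to its list of values."""
        fields = {}
        for block in content.strip().split("\n\n"):
            lines = block.splitlines()
            if lines and lines[0].startswith("%") and lines[0].endswith("%"):
                fields[lines[0].strip("%")] = lines[1:]
        return fields

    desc = "%NAME%\ngzip\n\n%VERSION%\n1.12-1\n\n%URL%\nhttps://www.gnu.org/software/gzip/\n"
    assert parse_desc(desc)["NAME"] == ["gzip"]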
Related to T4233
That means detected github urls of the form {https,git,http}://github.com/${user_repo}(.git)
are canonicalized to the https://github.com/${user_repo} format.
This avoids duplication of origins.
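A minimal sketch of such a canonicalization; the regular expression is
illustrative:

    import re

    GITHUB_URL = re.compile(
        r"^(?:https?|git)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
    )

    def canonical_github_origin(url: str) -> str:
        match = GITHUB_URL.match(url)
        return f"https://github.com/{match.group('user_repo')}" if match else url

    assert (
        canonical_github_origin("git://github.com/python/cpython.git")
        == "https://github.com/python/cpython"
    )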
Related to T4232
Pass the raw bytes of the pom file content to xmltodict.parse and let
it do the string decoding based on the encoding declared in the pom file.
If the string decoding fails due to an invalid declared encoding, an
xml.parsers.expat.ExpatError will be raised and caught by the lister,
which ignores the pom file and continues listing.
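A minimal sketch of that error handling; the logging and control flow are
illustrative:

    import logging
    from xml.parsers.expat import ExpatError

    import xmltodict

    logger = logging.getLogger(__name__)

    def parse_pom(pom_bytes: bytes):
        try:
            # xmltodict decodes the bytes itself, honoring the declared encoding
            return xmltodict.parse(pom_bytes)
        except ExpatError:
            logger.info("Invalid pom file, skipping it")
            return None  # the caller moves on to the next pom file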
Related to T3874
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
in the lister.
So handle that case to avoid a premature exit of the listing process.
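A minimal sketch of the guard; the field name and millisecond unit are
assumptions about the maven index records:

    from datetime import datetime, timezone
    from typing import Optional

    def jar_last_update(record: dict) -> Optional[datetime]:
        mtime = record.get("time")  # may be null in the maven index
        if mtime is None:
            return None
        return datetime.fromtimestamp(mtime / 1000, tz=timezone.utc)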
Related to T3874
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create the associated loading tasks.
The groupId and artifactId are not used by the lister, so it is better
to remove their extraction; this also prevents errors when that info
is missing in pom files.
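A minimal sketch of extracting only the scm URL; the field layout follows
the POM format and the helper name is illustrative:

    import xmltodict

    def extract_scm_url(pom_bytes: bytes):
        project = xmltodict.parse(pom_bytes).get("project") or {}
        scm = project.get("scm") or {}
        # e.g. "scm:git:https://github.com/user/repo.git"; no groupId or
        # artifactId lookup, so their absence cannot break the parsing
        return scm.get("connection")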
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.
This is not how Software Heritage decided to archive sources
coming from package managers. Instead, one origin should be created
per package, and all its versions should be found as releases in the
snapshot produced by the package loader.
So modify the maven lister to create one origin per package,
grouping all its versions.
This change also modifies the way incremental listing is handled:
ListedOrigin instances are now yielded only if new versions of a
package were discovered since the last listing.
Tests have been updated to reflect these changes.
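A minimal sketch of the grouping and of the incremental filtering; the
names and state layout are illustrative:

    from collections import defaultdict

    def group_and_filter(discovered_artifacts, seen_versions):
        """Yield (package, artifacts) pairs for packages with unseen versions.

        discovered_artifacts: iterable of dicts with "package" and "version" keys
        seen_versions: mapping from package name to the versions already listed
        """
        by_package = defaultdict(list)
        for artifact in discovered_artifacts:
            by_package[artifact["package"]].append(artifact)
        for package, artifacts in by_package.items():
            versions = {artifact["version"] for artifact in artifacts}
            if versions - seen_versions.get(package, set()):
                yield package, artifacts  # one origin per package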
Related to T3874
Previously we had as many origins as versions for a crate package, and the
url was a link to a specific crate version package.
Refactor to have one origin per package name, and add an 'artifacts' entry to
extra_loader_arguments listing, for each version, the package url and checksum.
The origin url is now a link to the related http api endpoint for the package name.
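A minimal sketch of the new origin shape; the url patterns and artifact
fields are assumptions for illustration:

    origin = {
        # http api endpoint for the package name
        "url": "https://crates.io/api/v1/crates/rand",
        "visit_type": "crates",
        "extra_loader_arguments": {
            "artifacts": [
                {
                    "version": "0.8.5",
                    "url": "https://crates.io/api/v1/crates/rand/0.8.5/download",
                    "checksum": "<sha256 of the crate file>",
                },
            ],
        },
    }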
Related to T4104