swh-lister

Author	SHA1	Message	Date
Antoine Lambert	108816f232	rubygems: Use gems database dump to improve listing output Instead of using an undocumented rubygems HTTP endpoint that only gives us the names of the gems, prefer to exploit the daily PostgreSQL dump of the rubygems.org database. It enables listing all gems but also all versions of a gem and its release artifacts. For each release artifact, the following info is extracted: version, download URL, sha256 checksum, release date plus a couple of extra metadata. The lister will now set list of artifacts and list of metadata as extra loader arguments when sending a listed origin to the scheduler database. A last_update date is also computed which should ensure loading tasks for rubygems will be scheduled only when new releases are available since last loadings. To be noted, the lister will spawn a temporary postgres instance so this requires the initdb executable from postgres server installation to be available in the execution environment. Related to T1777	2022-10-07 16:54:48 +02:00
Antoine R. Dumont (@ardumont)	c22f41a6d7	nixguix: Exclude faulty "recursive" file origins from listing For now, those can be faulty as the manifest is missing 'critical' information about how to recompute the hash (e.g. fs layout, executable bit, ...). Related to T4608 Related to T3781	2022-10-07 14:33:38 +02:00
Antoine R. Dumont (@ardumont)	5a53243bd3	nixguix: Refactor by renaming success or failure the different datasets It's more explicit that way. Related to T3781	2022-10-05 22:55:54 +02:00
Franck Bret	4a09f660b3	Crates.io: Add last_update for each version of a crate In order to reduce the number of HTTP API calls made by the loader, download a crates.io database dump, and parse its csv files to get a last_update value for each version of a crate. Those values are sent to the loader through extra_loader_arguments 'crates_metadata'. 'artifacts' and 'crates_metadata' now use "version" as key. Related T4104, D8171	2022-10-05 17:10:28 +02:00
Antoine R. Dumont (@ardumont)	2e6e282d44	nixguix: Deal with manifest entries without an integrity field In that case, this falls back to using the "outputHash" which is an equivalent field of the integrity one except it's for "recursive" outputHashMode. This adds the necessary assertions around this case so correct data is sent to loaders as well. Related to T3781	2022-10-05 16:11:38 +02:00
Antoine R. Dumont (@ardumont)	f2377c283a	nixguix: Improve is_tarball detection pattern This actually includes all query param values as paths to check. When paths have extensions, it then pattern matches against tarballs if any. When no extension is detected, it does as before and falls back to a HEAD query on the url to get more information on the file. Prior to this commit, this only looked over a hard-coded list of values (for hard-coded keys: file, f, name, url) detected through docker runs. This way of doing it should decrease future misdetections (when new unknown "keys" show up in the wild). Related to T3781	2022-10-05 12:00:43 +02:00
Antoine R. Dumont (@ardumont)	2ee103e2bc	nixguix: Improve further tarball detection The current content type detection was a bit off mostly for content which includes charset. This commit fixes it. Related to T3781	2022-10-05 11:11:08 +02:00
Antoine R. Dumont (@ardumont)	ff80a91f0a	nixguix: Improve git origins detection Without this, some git repositories are detected as file (due to upstream misqualification too). This does some extra effort to detect those to avoid sending noise to loaders. This also refactors some common code to build vcs artifacts to avoid duplication. Related to T3781	2022-10-05 10:09:52 +02:00
Antoine R. Dumont (@ardumont)	2fbd66778f	nixguix: Improve tarball detection Without this, some tarballs hidden within query parameters are not detected. This does some extra effort to detect those to avoid sending noise to loaders. Related to T3781	2022-10-05 10:09:52 +02:00
Antoine R. Dumont (@ardumont)	944d4b5b60	nixguix: Add support for listing origins with "recursive" integrity Without this distinction the current directory or content loader will fail the download as they currently expect the checksums to be about the tarball. When a recursive "integrity" is provided, it's actually about the uncompressed tarball as per the nix-store computation. It's detailed within the code. Related to T3294 Related to T3781	2022-10-04 17:58:50 +02:00
Antoine R. Dumont (@ardumont)	5daead68ad	nixguix: Add support for pseudo url with missing schema Related to T3294 Related to T3781	2022-10-04 16:21:38 +02:00
Antoine R. Dumont (@ardumont)	0f8f293f96	nixguix: Deal with connection error with server When that arises, we skip the origins. Related to T3781	2022-10-04 14:57:01 +02:00
Antoine R. Dumont (@ardumont)	d92474bbda	nixguix: Refactor by cleaning up unneeded code Related to T3781	2022-10-04 14:45:57 +02:00
Antoine R. Dumont (@ardumont)	06b11dd5f6	nixguix: Deal with impossible communication with server When that arises, we skip the origins. Related to T3781	2022-10-04 14:07:42 +02:00
Antoine R. Dumont (@ardumont)	a94b75f366	nixguix: Deal with mistyped origins Some origins are listed as urls while they are not. They are possibly vcs. So this commit tries to detect and deal with those if possible. If not possible, they are skipped. Related to T3781 Related to P1470	2022-10-04 13:58:39 +02:00
Antoine R. Dumont (@ardumont)	1b4fe51f62	nixguix: Randomize order of listed origins The end goal is to ingest sparsely the origins, that would avoid hitting the various servers around the same time for colocated origins in the upstream manifest (especially file or tarball). Related to T3781	2022-10-04 11:54:12 +02:00
Antoine R. Dumont (@ardumont)	94b6dbea0a	nixguix: Document lister Related to T3781	2022-10-03 18:26:36 +02:00
Antoine R. Dumont (@ardumont)	6d2e7aa178	nixguix: Register task Related to T3781	2022-10-03 18:26:36 +02:00
Antoine R. Dumont (@ardumont)	fbfdf88ea4	nixguix: Add lister Related to T3781	2022-10-03 18:26:36 +02:00
Antoine Lambert	fa1205c4df	Send package artifact checksums to loaders when info is available In listers collecting artifacts for each package to load, add artifacts checksums, when that info is available, in parameters sent to loaders in order to check downloaded artifact integrity.	2022-09-30 18:44:11 +02:00
Franck Bret	6f40d2c1a5	Conda: switch artifacts from dict to list 'artifacts' extra_loader_arguments should be a list	2022-09-30 15:55:53 +02:00
Franck Bret	52ccf49e11	RubyGems: List origins from https://rubygems.org Related T1777	2022-09-29 14:19:06 +02:00
Antoine Lambert	dabb1a2ae5	Update instructions for running a lister in docker Prefer to execute lister through a celery task as it also enables to catch possible issues with task implementation. Also use docker compose v2 commands.	2022-09-29 11:26:40 +02:00
Antoine Lambert	5426883c49	debian: Remove no longer needed code to get accurate origins count The base lister class now ensures the count of listed origins will be accurate.	2022-09-29 11:14:42 +02:00
Antoine Lambert	8d85b2e4e8	pattern: Ensure accurate origin counts returned by run method Previously, the run method was returning the total count of ListedOrigin objects sent to scheduler database. However, some listers can send multiple ListedOrigin objects for a given origin URL during the listing process, for instance when an origin is contained in multiple pages (e.g. gogs listing) or when the listing is gathering multiple versions of an origin spread across multiple pages (e.g. maven listing). This change ensures an accurate count of listed origins by maintaining a set of origin URLs associated to the sent ListedOrigin objects.	2022-09-29 11:14:08 +02:00
Franck Bret	3928fc9ee9	Nuget: Lister for NuGet the package manager for .NET Related T1718	2022-09-27 14:56:36 +02:00
Franck Bret	cd596eb2b4	Puppet: Lister for Puppet modules The puppet lister retrieves origins from https://forge.puppet.com/modules Related T4519	2022-09-27 14:44:13 +02:00
Franck Bret	a4aec3894e	Cpan: List Perl module origins from cpan.org Related T2833	2022-09-27 14:29:33 +02:00
Franck Bret	6696a8424a	Hackage: List origins from hackage.haskell.org, The Haskell Package Repository Use http api point to get package names and build origin urls.	2022-09-27 14:22:03 +02:00
Franck Bret	8ff418fbc2	Conda: List origins for Anaconda, the package manager that provides tooling for datascience Related T4547	2022-09-27 14:17:26 +02:00
Antoine R. Dumont (@ardumont)	fd1a4244a0	cgit/tests: Rename readme.md to readme With the extension, the readme is included in the swh-docs build and fails. It's not intended for the documentation build so renaming it keeps it out of the doc build loop. This fixes build [1]. [1] https://jenkins.softwareheritage.org/view/all/job/DDOC/job/dev/2395/	2022-09-26 13:22:10 +02:00
Antoine Lambert	d5c30a3ce3	Update value of User-Agent HTTP request header used by listers That HTTP header value will now contain the lister name but also a link to our contact form in order for sysadmins to easily reach us if needed. The following template is used to generate it: "Software Heritage <lister_name> lister v<swh-lister version> (+https://www.softwareheritage.org/contact)"	2022-09-26 10:48:40 +02:00
Antoine Lambert	db6ce12e9e	Refactor and deduplicate HTTP requests code in listers Numerous listers were using the same page_request method or equivalent in their implementation so prefer to deduplicate that code by adding an http_request method in base lister class: swh.lister.pattern.Lister. That method simply wraps a call to requests.Session.request and logs some useful info for debugging and error reporting, also an HTTPError will be raised if a request ends up with an error. All listers using that new method now benefit from request retries when an HTTP error occurs thanks to the use of the http_retry decorator.	2022-09-26 10:48:40 +02:00
Antoine Lambert	9c55acd286	Use generic HTTP retry policy by default and rename dedicated decorator Instead of retrying HTTP requests only for 429 status code by default, prefer to use the generic retry policy enabling to also retry for status codes >= 500 but also on ConnectionError exceptions. Rename throttling_retry decorator to http_retry to reflect this change.	2022-09-26 10:48:40 +02:00
Vincent SELLIER	9b3e565cf7	cgit: Ensure the clone url is searched on the right tab For some forges, the default tab for a repository detail is not the summary tab so the clone urls are not detected and the repository is ignored Related to T4544	2022-09-20 17:01:49 +02:00
KShivendu	bd35d54398	gogs: Skip pages with error 500 This also affects the gitea lister	2022-09-20 19:05:20 +05:30
Antoine Lambert	fa65f270ed	golang: Update lister name Align with other lister names by turning it to lowercase.	2022-09-19 13:17:40 +02:00
Antoine Lambert	f1a1b30fd1	arch: Set log level to debug for URL requests	2022-09-13 12:09:13 +02:00
Antoine Lambert	a55f171ed5	arch: Use tempfile module to create temporary directory It ensures created temporary directories will be removed once they are no longer needed.	2022-09-13 12:08:02 +02:00
Antoine R. Dumont (@ardumont)	67211adb60	pubdev.lister: Decrease verbosity This matches other lister verbosity. Related to T4517	2022-09-09 12:31:43 +02:00
Antoine Lambert	c819cc237d	pubdev: Update User-Agent request header value Use a value that matches good practice recommended by pub.dev REST API doc. https://github.com/dart-lang/pub/blob/master/doc/repository-spec-v2.md	2022-09-07 12:15:34 +02:00
Antoine Lambert	44560c2383	pubdev: Retrieve last publication date for each listed package In order to get a last_update for each ListedOrigin sent to scheduler database, send an extra HTTP request for each listed package to the /api/packages/<package_name> endpoint of pub.dev API. A pub.dev developer informed us that endpoint is heavily used and cached so there is no particular issue with querying that endpoint for each package in a row periodically.	2022-09-02 16:50:12 +02:00
Antoine Lambert	49b79b0759	pubdev: Modify origin URL for listed packages Use https://pub.dev/packages/<package_name> instead of https://pub.dev/api/packages/<package_name>	2022-09-02 16:48:29 +02:00
Antoine Lambert	b6c69e5075	aur: Create also a git origin for each listed package repository It will enable to archive the history of the PKGBUILD file associated to the AUR package.	2022-09-02 15:58:05 +02:00
Antoine Lambert	d76fbb3447	aur: Modify origin URL for listed packages Use https://aur.archlinux.org/packages/<package_name> instead of https://aur.archlinux.org/<package_name>.git	2022-09-02 15:57:57 +02:00
Antoine Lambert	92baa2b45c	aur: Store packages index in memory instead of disk Simplify code for downloading packages index as gzip and deflate transfer-encodings are automatically decoded by requests, also do not stream response for a couple of megabytes and store HTTP responses in memory. Also add more debug logs to track lister execution.	2022-09-02 15:48:20 +02:00
Antoine Lambert	7638f2028b	golang/tests: Fix black formatting	2022-09-01 11:47:35 +02:00
Raphaël Gomès	c6ce862d32	Add incremental function to Golang Lister	2022-08-30 14:32:18 +02:00
Raphaël Gomès	60405e78ae	Add non-incremental Golang modules lister This uses https://index.golang.org. An associated loader will be sent in the near future, as well as an incremental version of this lister. [1] https://go.dev/ref/mod#goproxy-protocol	2022-08-30 14:32:02 +02:00
Franck Bret	0acf5b0f4f	Arch: Add throttling retry for scraping and resources download	2022-08-30 09:50:29 +02:00
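
The origin-counting change described in commit 8d85b2e4e8 (counting distinct origin URLs rather than every ListedOrigin object sent) can be illustrated with a minimal sketch. This is not the code from swh.lister.pattern: `ListedOriginStub` and `count_unique_origins` are hypothetical names introduced here for illustration, and the example URLs are made up.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ListedOriginStub:
    """Hypothetical stand-in for a ListedOrigin record (not the real swh class)."""

    url: str
    visit_type: str


def count_unique_origins(pages: Iterable[List[ListedOriginStub]]) -> int:
    """Count distinct origin URLs across all pages of a listing."""
    seen_urls: Set[str] = set()
    for page in pages:
        for origin in page:
            # The same URL may show up on several pages; a set keeps one copy.
            seen_urls.add(origin.url)
    return len(seen_urls)


# Made-up listing where one origin appears on two different pages.
pages = [
    [
        ListedOriginStub("https://example.org/a", "git"),
        ListedOriginStub("https://example.org/b", "git"),
    ],
    [ListedOriginStub("https://example.org/a", "git")],
]

total_sent = sum(len(page) for page in pages)  # naive count of objects sent
unique = count_unique_origins(pages)  # count of distinct origin URLs
print(total_sent, unique)
```

In this sketch the naive count includes the repeated origin while the set-based count does not, which is the discrepancy the commit message describes for multi-page listings.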

876 commits
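
The User-Agent template quoted in commit d5c30a3ce3 can be rendered with plain string formatting, as in the sketch below. The template text is taken from the commit message; the helper name `build_user_agent`, the placeholder names, and the example version string are assumptions made for illustration, not the actual swh-lister implementation.

```python
# Template text as quoted in the commit message, with the <...> placeholders
# turned into str.format fields (the field names are an assumption).
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version} "
    "(+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name: str, version: str) -> str:
    """Hypothetical helper: fill the template for a given lister and version."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)


# Example headers mapping; "0.0.0" is a placeholder version, not a real release.
headers = {"User-Agent": build_user_agent("pubdev", "0.0.0")}
print(headers["User-Agent"])
```

A mapping like `headers` could then be passed to whichever HTTP client a lister uses.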