Some GitLab instances use specific namespaces for transient repositories
which it makes no sense to archive (for example, gitlab.org has a set
of QA namespaces used for integration testing of their production
deployments; drupal has an `issues/` namespace with forks of repos that
are only used for collaboration on merge requests, and are not worth
archiving).
This removes one more manual step from the add forge now validation
process: we can add the relevant origins to the staging scheduler
without enabling them at all.
This will allow more automation of the staging add forge now process:
for known-good listers, we can limit the number of origins being
processed and reduce the number of manual steps taken for each instance.
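A minimal sketch of what this enables, assuming the scheduler's
ListedOrigin model exposes an `enabled` flag (my reading of the model;
import path and field names may differ, values are hypothetical):

```python
from uuid import uuid4

from swh.scheduler.model import ListedOrigin  # assumed import path

# Record an origin in the scheduler without enabling it for loading
origin = ListedOrigin(
    lister_id=uuid4(),
    url="https://gitlab.example.org/qa/transient-repo",
    visit_type="git",
    enabled=False,
)
```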
The SQL dump contains ownership instructions that can't be run if you
don't have the right users in your database clusters. When someone has a
psqlrc with ON_ERROR_STOP, this fails the load of the dump.
Use the opportunity to raise an exception when psql returns a non-zero
exit code, rather than continuing with an empty/inconsistent database.
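A rough sketch of the intended behavior (function name hypothetical,
command-line details assumed):

```python
import subprocess

def load_sql_dump(db_name: str, dump_path: str) -> None:
    # --no-psqlrc keeps a user's psqlrc (e.g. ON_ERROR_STOP) from
    # interfering with the load; check=True raises CalledProcessError on
    # a non-zero exit code instead of silently continuing with an
    # empty/inconsistent database.
    subprocess.run(
        ["psql", "--no-psqlrc", "-d", db_name, "-f", dump_path],
        check=True,
    )
```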
In a similar way to the debian lister, use the following scheme in the
packages dictionary provided to the generic rpm loader:
- dict keys are package versions prefixed by the fedora release and
  edition in which they were found (fedora{release}/{edition}/{version});
  they will be used as branch names targeting releases in the snapshot
  created by the rpm loader
- version fields in dict values are the packages' intrinsic versions
  parsed from the package repository metadata, excluding any ".fcXY"
  suffix so the loader does not create multiple releases targeting the
  same directory; they will be used as release names in the snapshot
  created by the rpm loader (see the sketch below)
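For illustration, a hypothetical entry for a package found in Fedora
35's Everything edition might look like:

```python
packages = {
    # key fedora{release}/{edition}/{version}: used as a snapshot branch name
    "fedora35/Everything/4.18.0-2": {
        # intrinsic version with any ".fcXY" suffix stripped,
        # used as a release name in the snapshot
        "version": "4.18.0-2",
        # other artifact fields (url, checksums, ...) omitted
    },
}
```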
Related to T4448
While deploying the nixguix lister, I realized that even though the credentials
configuration is properly set for all listers, the listers actually requiring
github origin canonicalization do not have access to the github credentials:
they are dropped in the constructor, which only keeps the lister's own
credentials. This currently translates into those listers being rate-limited.
This commit fixes it by moving the self.github_session instantiation into the
constructor when the lister explicitly requires the github session, hence
lifting the rate limit for the maven, packagist, nixguix, and github listers.
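A simplified sketch of the fix (class shape, credentials layout, and
the GitHubSession import path are assumptions, not the exact swh-lister
code):

```python
from swh.core.github.utils import GitHubSession  # assumed import path

class StatelessLister:
    LISTER_NAME = "nixguix"  # example value

    def __init__(self, credentials: dict, with_github_session: bool = False):
        # Build the GitHub session from the *full* credentials mapping
        # before narrowing it down to this lister's own entries, so the
        # github credentials are no longer lost.
        self.github_session = (
            GitHubSession(
                user_agent="swh-lister",  # hypothetical value
                credentials=credentials.get("github", {}).get("github"),
            )
            if with_github_session
            else None
        )
        self.credentials = credentials.get(self.LISTER_NAME, {})
```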
Related to infra/sysadm-environment#4655
Prior to this, some urls were detected as files because their version suffix
was wrongly treated as a file extension, which then failed to match the known
tarball extensions.
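An illustrative reproduction (URL hypothetical; pathlib shown as one way
this kind of suffix extraction goes wrong):

```python
from pathlib import Path

# ".0" is picked up as the file extension, so the URL is classified as a
# plain file even though it matches no tarball extension
Path("https://example.org/project-0.1.0").suffix  # -> ".0"
```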
Related to T3781
swh-scheduler will deduplicate listed origins according to their URL
and visit type but not according to their extra loader arguments.
Previously, listed origins were yielded after each processed artifact
in a page, so some package version info could be lost to the
deduplication process.
Ensure listed origins are yielded only once all artifacts in a page
have been processed.
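A minimal sketch of the new behavior (ListedOrigin reduced to a local
stand-in dataclass; field names and the visit type are simplified):

```python
from dataclasses import dataclass, field

@dataclass
class ListedOrigin:  # stand-in for swh.scheduler.model.ListedOrigin
    url: str
    visit_type: str
    extra_loader_arguments: dict = field(default_factory=dict)

def get_origins_from_page(page):
    # accumulate every artifact (package version) per origin URL first
    artifacts_by_url: dict = {}
    for artifact in page:
        artifacts_by_url.setdefault(artifact["url"], []).append(artifact)
    # yield one origin per URL only once the whole page has been
    # processed, so the scheduler's URL + visit-type deduplication
    # cannot drop version info
    for url, artifacts in artifacts_by_url.items():
        yield ListedOrigin(
            url=url,
            visit_type="example",  # hypothetical visit type
            extra_loader_arguments={"artifacts": artifacts},
        )
```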
Prior to this commit, the lister assumed authentication was required. There
exist public gogs instances which do not require it.
This also updates the documentation to mention the usual api location. This is
useful when people want to trigger a listing as a pre-flight check.
This also drops a repetitive instruction in the gitea lister.
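A sketch of the relaxed requirement (helper name hypothetical):

```python
from typing import Optional

def auth_headers(api_token: Optional[str]) -> dict:
    # Only send an Authorization header when an API token is configured;
    # public gogs instances can be listed without one.
    return {"Authorization": f"token {api_token}"} if api_token else {}
```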
Co-authored with Antoine Lambert (@anlambert) <anlambert@softwareheritage.org>.
Related to infra/sysadm-environment#4644
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.
Also change flake8's repo config to github (the gitlab mirror
being outdated).
The CPAN API can return versions that are not of str type: either
int or float.
When a version equals 0, it means that CPAN failed to parse it, so we
try to extract it from the release name in that case.
Otherwise we ensure the version is converted to str type.
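A minimal sketch of the normalization (helper name hypothetical):

```python
def normalize_cpan_version(version, release_name: str) -> str:
    # the CPAN API may hand back an int or float instead of a str
    if version == 0:
        # version CPAN failed to parse: recover it from the release
        # name, e.g. "Some-Module-1.23" -> "1.23"
        return release_name.rsplit("-", 1)[-1]
    return str(version)
```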
Related to T2833
Instead of querying the metacpan distribution endpoint to list origins,
prefer the release endpoint, which enables listing all artifacts
associated with CPAN packages by scrolling through results.
Compared to the previous implementation, it enables computing a
last_update date for all CPAN packages, but also obtaining artifact
sha256 checksums that will be used by the CPAN loader to check
download integrity.
As the multiple versions of a module are spread across multiple pages
of the CPAN API, origins are sent to the scheduler once all pages have
been processed; it is also faster to proceed that way.
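A rough sketch of scrolling the (Elasticsearch-backed) release endpoint;
the URL, parameters, and field names here are assumptions, not
necessarily the exact ones the lister uses:

```python
import requests

def process(release: dict) -> None:
    # hypothetical per-artifact processing
    print(release.get("name"), release.get("version"))

session = requests.Session()
base_url = "https://fastapi.metacpan.org/v1"

# open a scroll context on the release endpoint and fetch the first page
resp = session.post(
    f"{base_url}/release/_search",
    params={"scroll": "1m", "size": 1000},
    json={"_source": ["name", "version", "checksum_sha256", "date",
                      "download_url"]},
).json()

while hits := resp["hits"]["hits"]:
    for hit in hits:
        process(hit["_source"])
    # fetch the next page of the scroll context
    resp = session.post(
        f"{base_url}/_search/scroll",
        json={"scroll": "1m", "scroll_id": resp["_scroll_id"]},
    ).json()
```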
Related to T2833
Instead of using an undocumented rubygems HTTP endpoint that only
gives us the names of the gems, prefer to exploit the daily PostgreSQL
dump of the rubygems.org database.
It enables listing all gems, but also all versions of a gem and its
release artifacts. For each release artifact, the following info is
extracted: version, download URL, sha256 checksum and release date,
plus a couple of extra metadata fields.
The lister will now set the list of artifacts and the list of metadata
as extra loader arguments when sending a listed origin to the scheduler
database. A last_update date is also computed, which should ensure
loading tasks for rubygems are scheduled only when new releases are
available since the last loading.
To be noted, the lister spawns a temporary postgres instance, so this
requires the initdb executable from a postgres server installation to
be available in the execution environment.
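A rough sketch of the temporary instance (flags, port handling, and the
dump path are assumptions):

```python
import subprocess
from tempfile import TemporaryDirectory

with TemporaryDirectory(prefix="rubygems-") as datadir:
    # initialize a throwaway cluster; needs initdb from a server install
    subprocess.run(["initdb", "-D", datadir], check=True)
    # listen only on a unix socket inside the temporary directory
    subprocess.run(
        ["pg_ctl", "-D", datadir,
         "-o", f"-k {datadir} -c listen_addresses=", "start"],
        check=True,
    )
    try:
        # load the daily rubygems.org dump (path hypothetical), then
        # query gems/versions/artifacts from the resulting database
        subprocess.run(
            ["psql", "-h", datadir, "-d", "postgres",
             "-f", "rubygems_dump.sql"],
            check=True,
        )
    finally:
        subprocess.run(["pg_ctl", "-D", datadir, "stop"], check=True)
```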
Related to T1777
For now, those can be faulty as the manifest is missing 'critical' information about how
to recompute the hash (e.g. fs layout, executable bit, ...).
Related to T4608
Related to T3781
In order to reduce the number of HTTP API calls made by the loader,
download a crates.io database dump and parse its csv files to get a
last_update value for each version of a crate.
Those values are sent to the loader through the 'crates_metadata'
extra loader argument.
'artifacts' and 'crates_metadata' now use "version" as key.
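A sketch of the dump parsing (file and column names are my reading of
the crates.io dump layout, not verified against the lister code):

```python
import csv

# map crate id -> {version -> last_update} from the dump's versions.csv
last_updates: dict = {}
with open("versions.csv", newline="") as f:
    for row in csv.DictReader(f):
        versions = last_updates.setdefault(row["crate_id"], {})
        versions[row["num"]] = row["updated_at"]
```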
Related to T4104, D8171
In that case, this falls back to using the "outputHash", which is a field
equivalent to the integrity one except that it is for the "recursive"
outputHashMode. This adds the necessary assertions around this case so
correct data is sent to loaders as well.
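A sketch of the fallback, using the Nix derivation field names mentioned
above (helper name and surrounding structure hypothetical):

```python
def checksums_from(artifact: dict) -> dict:
    integrity = artifact.get("integrity")
    if integrity is None:
        # fall back to outputHash, which plays the same role for the
        # "recursive" outputHashMode
        assert artifact.get("outputHashMode") == "recursive", artifact
        integrity = artifact["outputHash"]
    assert integrity is not None
    return {"integrity": integrity}
```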
Related to T3781
This actually includes all query param values as paths to check. When those
paths have extensions, they are pattern matched against known tarball
extensions. When no extension is detected, it behaves as before and falls
back to a HEAD query on the url to gather more information on the file.
Prior to this commit, this only looked at a hard-coded list of values (for
the hard-coded keys file, f, name and url) detected through docker runs.
This new approach should decrease future misdetections (when new unknown
"keys" show up in the wild).
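A sketch of the broadened detection (extension list truncated; helper
name hypothetical):

```python
from pathlib import Path
from urllib.parse import parse_qs, urlparse

TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".zip")  # subset

def tarball_guess(url: str):
    parsed = urlparse(url)
    # every query parameter value is now a candidate path, not just the
    # values of the hard-coded keys file, f, name and url
    candidates = [parsed.path] + [
        value for values in parse_qs(parsed.query).values()
        for value in values
    ]
    with_extension = [c for c in candidates if Path(c).suffix]
    if with_extension:
        return any(c.endswith(TARBALL_EXTENSIONS) for c in with_extension)
    return None  # unknown: fall back to a HEAD query on the url, as before
```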
Related to T3781
Without this, some git repositories are detected as files (also due to
upstream misqualification). This makes some extra effort to detect those,
to avoid sending noise to loaders.
This also refactors some common code to build vcs artifacts to avoid duplication.
Related to T3781
Without this, some tarballs hidden within query parameters are not detected.
This makes some extra effort to detect those, to avoid sending noise to
loaders.
Related to T3781
Without this distinction, the current directory or content loaders will fail
the download, as they currently expect the checksums to apply to the tarball.
When a recursive "integrity" is provided, it actually applies to the
uncompressed tarball, as per the nix-store computation.
This is detailed within the code.
Related to T3294
Related to T3781
Some origins are listed as plain file urls while they are not: they are
possibly vcs repositories. So this commit tries to detect and deal with
those if possible. If not possible, they are skipped.
Related to T3781
Related to P1470
The end goal is to ingest the origins sparsely, which would avoid hitting
the various servers around the same time for colocated origins in the
upstream manifest (especially file or tarball ones).
Related to T3781