Commit graph

36 commits

Author SHA1 Message Date
Antoine Lambert
fdeb086f77 nixguix: Handle creation of svn-export visit types on svn sub-trees
Some Guix packages correspond to subset exports of a subversion source
tree at a given revision, typically the TeX Live ones.

In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export but also use a unique origin URL
for each package to archive as otherwise the same one would be used
and only a single package would be archived.

Related to swh/infra/sysadm-environment#5263.
2024-03-14 16:23:32 +01:00
Antoine Lambert
65e51e2925 nixguix: Update heuristic checking if URL targets a tarball file
In addition to query parameters, also check if any part of the URL path
contains a tarball filename.

It fixes the detection of some tarball URLs provided in the Guix manifest.

Related to swh/meta#3781.
2024-01-18 15:07:11 +01:00
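The heuristic described in commit 65e51e2925 could be sketched as follows. This is a minimal illustration, not the lister's actual code: the function names and the short extension list are assumptions made for the example.

```python
from urllib.parse import parse_qsl, urlparse

# Hypothetical, deliberately short list of tarball extensions.
TARBALL_EXTENSIONS = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def looks_like_tarball(name: str) -> bool:
    """Return True if the given name ends with a known tarball extension."""
    return name.lower().endswith(TARBALL_EXTENSIONS)


def url_targets_tarball(url: str) -> bool:
    """Check every path segment and every query parameter value of the URL."""
    parsed = urlparse(url)
    path_parts = [part for part in parsed.path.split("/") if part]
    query_values = [value for _, value in parse_qsl(parsed.query)]
    return any(looks_like_tarball(c) for c in path_parts + query_values)
```

Checking every path segment, rather than only the last one, catches URLs where the tarball filename is followed by a suffix such as "/download".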
Antoine Lambert
f814e1179d nixguix: Exploit new submodule info in sources.json from Guix
Guix now provides a "submodule" info in the sources.json file it
produces, so exploit it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
required.

Related to swh/devel/swh-loader-git#4751.
2024-01-08 16:11:02 +01:00
Antoine R. Dumont (@ardumont)
e0bcb64e0f
nixguix/lister: Rename listed origin visit type to tarball-directory
For the ones coming from a tarball. This matches the change that happened in the
associated directory loader.

Refs. swh/infra/sysadm-environment#4906
2023-06-08 11:24:38 +02:00
Antoine R. Dumont (@ardumont)
197fb3400b
lister.nixguix: Propagate the origin reference to the loader
Without this, the loader will fail.

Refs. swh/meta#4979
2023-06-07 16:41:14 +02:00
Antoine R. Dumont (@ardumont)
0756c44ea3
Adapt directory loader visit type depending on the vcs tree to ingest
Prior to this, it was sending only 'directory' types for all vcs trees. Multiple
directory loaders now exist whose visit types currently diverge, so the scheduling
would not happen correctly without it. This commit is the required adaptation for the
scheduling to work appropriately.

Refs. swh/meta#4979
2023-06-05 13:16:52 +02:00
Antoine R. Dumont (@ardumont)
9f252fc85f
nixguix/lister: Deal with directory with recursive checksums
Those will be ingested by the loader as "directory" with "nar" checksum layouts.

Refs. swh/infra/sysadm-environment#4868

Refs. swh/meta#4979
2023-05-31 14:22:44 +02:00
Antoine R. Dumont (@ardumont)
5ebc57912f
lister/nixguix: Make artifact nature check happen on all urls
Starting with the first url. As soon as one detection succeeds, this stops and yields
the result. Otherwise, continue with the detection on the next mirror url.

This should fix the current misbehavior [1] when multiple mirror urls are not ok but the
first one is.

[1] https://gitlab.softwareheritage.org/swh/infra/sysadm-environment/-/issues/4868#note_137483

Refs. swh/infra/sysadm-environment#4868
2023-04-27 18:16:20 +02:00
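The "first successful detection wins" loop described in commit 5ebc57912f could look roughly like this. The exception class, the function name and the `detect` callback are hypothetical stand-ins for whatever the lister uses.

```python
from typing import Callable, Iterable, Optional, Tuple


class ArtifactNatureUndetected(Exception):
    """Hypothetical error raised when the nature of a URL cannot be determined."""


def detect_on_mirrors(
    urls: Iterable[str], detect: Callable[[str], str]
) -> Optional[Tuple[str, str]]:
    """Try each mirror URL in order and return the first (url, nature) found."""
    for url in urls:
        try:
            nature = detect(url)
        except ArtifactNatureUndetected:
            # Detection failed on this mirror, try the next one.
            continue
        return url, nature
    # No mirror allowed a detection.
    return None
```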
Antoine R. Dumont (@ardumont)
bf826618b4
nixguix/lister: Rename checksums_computation to checksum_layout
Refs. swh/meta#4979
2023-04-26 11:12:13 +02:00
Antoine Lambert
fc2bd1e937 mypy: Bump to 1.0.1 and fix new typing errors
Related to swh/meta#4960
2023-02-17 17:56:07 +01:00
Nicolas Dandrimont
e785e67315 Hook up recently introduced options to all listers
Hopefully one day we'll be able to replace all of this mess with PEP692
TypedDict kwargs, but that's only on track for Python 3.12.
2022-12-05 16:33:45 +01:00
Valentin Lorentz
e8699422d7 nixguix: Reject Git SSH URLs and pseudo-URLs
For consistency with Maven and Packagist listers
2022-11-04 15:58:50 +01:00
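A rough sketch of such a filter is given below. It assumes that "pseudo-URLs" means scp-like addresses such as git@host:path, which the commit message does not spell out; the function name and the scheme list are illustrative only.

```python
from urllib.parse import urlparse

# Assumed set of SSH-based schemes to reject.
SSH_SCHEMES = {"ssh", "git+ssh"}


def is_ssh_or_pseudo_url(url: str) -> bool:
    """Return True for SSH URLs and scp-like pseudo-URLs (git@host:path)."""
    if urlparse(url).scheme in SSH_SCHEMES:
        return True
    # scp-like syntax: no "://" separator, but a user@host:path shape.
    return "://" not in url and "@" in url and ":" in url
```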
Antoine R. Dumont (@ardumont)
92d494261f
lister: Make sure lister that requires github tokens can use it
Deploying the nixguix lister, I realized that even though the credentials configuration
is properly set for all listers, the listers actually requiring github origin
canonicalization do not have access to the github credentials. They are lost in the
constructor, which keeps only the lister's own credentials. This currently translates to
listers being rate-limited.

This commit fixes it by pushing the self.github_session instantiation in the constructor
when the lister explicitly requires the github session. Hence lifting the rate limit
for maven, packagist, nixguix, and github listers.

Related to infra/sysadm-environment#4655
2022-10-26 17:23:40 +02:00
Antoine R. Dumont (@ardumont)
81688ca17e
nixguix: Use content-disposition from http head request if provided
As a last fallback after the content-type check, instead of raising immediately.

Related to T3781
2022-10-26 11:58:54 +02:00
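The Content-Disposition fallback from commit 81688ca17e might be sketched like this. It relies on the standard library's email.message.Message to parse the header value; the suffix list and the function name are assumptions made for the example.

```python
from email.message import Message

# Hypothetical, deliberately short list of tarball suffixes.
TARBALL_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def tarball_from_content_disposition(value: str) -> bool:
    """Return True if the header value names a file with a tarball suffix."""
    msg = Message()
    msg["Content-Disposition"] = value
    # get_filename() reads the "filename" parameter of Content-Disposition.
    filename = msg.get_filename()
    return filename is not None and filename.lower().endswith(TARBALL_SUFFIXES)
```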
Antoine R. Dumont (@ardumont)
026fea21da
nixguix: Deal with edge case url with version instead of extension
Prior to this, some urls were detected as file because their version names were wrongly
detected as extensions, hence not matching tarball extensions.

Related to T3781
2022-10-26 10:06:16 +02:00
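To illustrate the edge case from commit 026fea21da: for a URL ending in "foo-1.2.3", a naive suffix lookup returns ".3". One possible guard, shown here as an assumption rather than the rule the lister applies, is to ignore suffixes made only of digits.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def file_extension(url: str) -> str:
    """Return the extension of the last path segment, or "" if none is found."""
    name = PurePosixPath(urlparse(url).path).name
    suffix = PurePosixPath(name).suffix
    # A purely numeric suffix (".3" in "foo-1.2.3") is treated as part of a
    # version number, not as a file extension.
    if suffix[1:].isdigit():
        return ""
    return suffix
```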
Antoine R. Dumont (@ardumont)
ca4ab7f277
nixguix: Allow lister to ignore specific extensions
Those extensions can be extended through configuration. They default to some binary
format already encountered during docker runs.

Related to T3781
2022-10-25 12:09:01 +02:00
Antoine R. Dumont (@ardumont)
d96a39d5b0
nixguix/test: Add all supported tarball extensions to test manifest
Next step is to add some extensions filtering so might as well harden the test dataset
first.

Related to T3781
2022-10-25 11:28:56 +02:00
Antoine R. Dumont (@ardumont)
31eb5f637f
Add support for more tarball recognition based on extensions
This requires those extensions to be supported by loaders too (in
swh.core.tarball).

Related to T3781
2022-10-25 09:50:31 +02:00
Antoine Lambert
0baaf68cff nixguix: Fix typo detected by codespell
2022-10-19 14:47:36 +02:00
Antoine R. Dumont (@ardumont)
c22f41a6d7
nixguix: Exclude faulty "recursive" file origins from listing
For now, those can be faulty as the manifest is missing 'critical' information about how
to recompute the hash (e.g. fs layout, executable bit, ...).

Related to T4608
Related to T3781
2022-10-07 14:33:38 +02:00
Antoine R. Dumont (@ardumont)
5a53243bd3
nixguix: Refactor by renaming success or failure the different datasets
It's more explicit that way.

Related to T3781
2022-10-05 22:55:54 +02:00
Antoine R. Dumont (@ardumont)
2e6e282d44
nixguix: Deal with manifest entries without an integrity field
In that case, this falls back to using the "outputHash", a field equivalent to the
integrity one except it's for "recursive" outputHashMode. This adds the necessary
assertions around this case so correct data is sent to loaders as well.

Related to T3781
2022-10-05 16:11:38 +02:00
Antoine R. Dumont (@ardumont)
f2377c283a
nixguix: Improve is_tarball detection pattern
This actually includes all query param values as paths to check. When paths have
extensions, it then pattern matches against tarballs if any. When no extension is
detected, it does as before and falls back to a HEAD request on the url to get more
information on the file.

Prior to this commit, this only looked over a hard-coded list of values (for hard-coded
keys: file, f, name, url) detected through docker runs. This way of doing it should
decrease future misdetections (when new unknown "keys" show up in the wild).

Related to T3781
2022-10-05 12:00:43 +02:00
Antoine R. Dumont (@ardumont)
2ee103e2bc
nixguix: Improve further tarball detection
The current content type detection was a bit off, mostly for content types which include
a charset. This commit fixes it.

Related to T3781
2022-10-05 11:11:08 +02:00
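As an illustration of the charset issue from commit 2ee103e2bc, a header such as "application/x-gzip; charset=UTF-8" does not compare equal to a bare media type unless the parameters are stripped first. The sketch below uses hypothetical names and an assumed set of MIME types.

```python
# Assumed set of MIME types considered as tarballs for this example.
TARBALL_MIME_TYPES = frozenset(
    {"application/gzip", "application/x-gzip", "application/x-tar", "application/zip"}
)


def media_type(content_type_header: str) -> str:
    """Strip parameters such as "charset=..." and normalize the case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_tarball_content_type(content_type_header: str) -> bool:
    """Compare only the bare media type against the known tarball types."""
    return media_type(content_type_header) in TARBALL_MIME_TYPES
```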
Antoine R. Dumont (@ardumont)
ff80a91f0a
nixguix: Improve git origins detection
Without this, some git repositories are detected as file (due to upstream
misqualification too). This does some extra effort to detect those to avoid sending
noise to loaders.

This also refactors some common code to build vcs artifacts to avoid duplication.

Related to T3781
2022-10-05 10:09:52 +02:00
Antoine R. Dumont (@ardumont)
2fbd66778f
nixguix: Improve tarball detection
Without this, some tarballs hidden within query parameters are not detected. This does
some extra effort to detect those to avoid sending noise to loaders.

Related to T3781
2022-10-05 10:09:52 +02:00
Antoine R. Dumont (@ardumont)
944d4b5b60
nixguix: Add support for listing origins with "recursive" integrity
Without this distinction the current directory or content loader will fail the download
as they currently expect the checksums to be about the tarball. When a recursive
"integrity" is provided, it's actually about the uncompressed tarball as per the
nix-store computation.

It's detailed within the code.

Related to T3294
Related to T3781
2022-10-04 17:58:50 +02:00
Antoine R. Dumont (@ardumont)
5daead68ad
nixguix: Add support for pseudo url with missing schema
Related to T3294
Related to T3781
2022-10-04 16:21:38 +02:00
Antoine R. Dumont (@ardumont)
0f8f293f96
nixguix: Deal with connection error with server
When that arises, we skip the origins.

Related to T3781
2022-10-04 14:57:01 +02:00
Antoine R. Dumont (@ardumont)
d92474bbda
nixguix: Refactor by cleaning up unneeded code
Related to T3781
2022-10-04 14:45:57 +02:00
Antoine R. Dumont (@ardumont)
06b11dd5f6
nixguix: Deal with impossible communication with server
When that arises, we skip the origins.

Related to T3781
2022-10-04 14:07:42 +02:00
Antoine R. Dumont (@ardumont)
a94b75f366
nixguix: Deal with mistyped origins
Some origins are listed as urls while they are not. They are possibly vcs. So this
commit tries to detect and deal with those if possible. If not possible, they are
skipped.

Related to T3781
Related to P1470
2022-10-04 13:58:39 +02:00
Antoine R. Dumont (@ardumont)
1b4fe51f62
nixguix: Randomize order of listed origins
The end goal is to ingest the origins sparsely, which avoids hitting the various
servers around the same time for colocated origins in the upstream manifest (especially
file or tarball).

Related to T3781
2022-10-04 11:54:12 +02:00
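The randomization from commit 1b4fe51f62 can be pictured with a small sketch. The function name and the optional seed parameter are assumptions added for the example, and the lister may shuffle at a different stage.

```python
import random
from typing import Iterable, List, Optional


def shuffled_origins(origins: Iterable[str], seed: Optional[int] = None) -> List[str]:
    """Return the origins in random order without mutating the input."""
    rng = random.Random(seed)
    result = list(origins)
    rng.shuffle(result)
    return result
```

Shuffling spreads origins that sit next to each other in the manifest, and often share a server, across the listing output.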
Antoine R. Dumont (@ardumont)
94b6dbea0a
nixguix: Document lister
Related to T3781
2022-10-03 18:26:36 +02:00
Antoine R. Dumont (@ardumont)
6d2e7aa178
nixguix: Register task
Related to T3781
2022-10-03 18:26:36 +02:00
Antoine R. Dumont (@ardumont)
fbfdf88ea4
nixguix: Add lister
Related to T3781
2022-10-03 18:26:36 +02:00