Commit graph

821 commits

Author SHA1 Message Date
Antoine R. Dumont (@ardumont)
f2377c283a
nixguix: Improve is_tarball detection pattern
This now treats every query parameter value as a path to check. When a path has an
extension, it is pattern-matched against known tarball extensions. When no extension is
detected, it falls back, as before, to a HEAD request on the URL to get more information
on the file.

Prior to this commit, this only looked over a hard-coded list of values (for hard-coded
keys: file, f, name, url) detected through docker runs. This way of doing it should
decrease future misdetections (when new unknown "keys" show up in the wild).

Related to T3781
2022-10-05 12:00:43 +02:00
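
The query-parameter check described in commit f2377c283a can be sketched roughly as follows. This is a minimal illustration, not the lister's actual code: the function names and the suffix list are assumptions, and the HEAD-request fallback is only signalled by returning None.

```python
from pathlib import PurePosixPath
from typing import List, Optional
from urllib.parse import parse_qs, urlparse

# Hypothetical, non-exhaustive list of tarball suffixes (an assumption).
TARBALL_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def candidate_paths(url: str) -> List[str]:
    """Return the URL path plus every query parameter value, whatever its key."""
    parsed = urlparse(url)
    paths = [parsed.path]
    for values in parse_qs(parsed.query).values():
        paths.extend(values)
    return paths


def looks_like_tarball(url: str) -> Optional[bool]:
    """Return True or False when an extension settles the question.

    Return None when no candidate has an extension, meaning the caller
    would need to fall back to a HEAD request on the URL.
    """
    seen_extension = False
    for path in candidate_paths(url):
        name = PurePosixPath(path).name.lower()
        if name.endswith(TARBALL_SUFFIXES):
            return True
        if PurePosixPath(name).suffix:
            seen_extension = True
    return False if seen_extension else None
```

Because every query value is inspected, a tarball passed under a previously unseen key is still detected without extending a hard-coded key list.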
Antoine R. Dumont (@ardumont)
2ee103e2bc
nixguix: Improve further tarball detection
Content type detection was slightly off, mostly for content types that include a
charset. This commit fixes it.

Related to T3781
2022-10-05 11:11:08 +02:00
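
One way to handle a Content-Type header that carries a charset, as commit 2ee103e2bc describes, is to compare only the media type and ignore the parameters. The sketch below assumes this approach; the helper names and the set of media types are hypothetical and do not come from the lister.

```python
# Hypothetical set of media types treated as tarballs (an assumption).
TARBALL_MEDIA_TYPES = {
    "application/gzip",
    "application/x-gzip",
    "application/x-tar",
    "application/zip",
}


def media_type(content_type_header: str) -> str:
    """Strip parameters such as '; charset=utf-8' and normalize the case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def is_tarball_content_type(content_type_header: str) -> bool:
    """Compare only the media type, so a trailing charset cannot break the match."""
    return media_type(content_type_header) in TARBALL_MEDIA_TYPES
```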
Antoine R. Dumont (@ardumont)
ff80a91f0a
nixguix: Improve git origins detection
Without this, some git repositories are detected as files (partly due to upstream
misqualification). This makes an extra effort to detect them, to avoid sending
noise to loaders.

This also refactors some common code to build vcs artifacts to avoid duplication.

Related to T3781
2022-10-05 10:09:52 +02:00
Antoine R. Dumont (@ardumont)
2fbd66778f
nixguix: Improve tarball detection
Without this, some tarballs hidden within query parameters are not detected. This makes
an extra effort to detect them, to avoid sending noise to loaders.

Related to T3781
2022-10-05 10:09:52 +02:00
Antoine R. Dumont (@ardumont)
944d4b5b60
nixguix: Add support for listing origins with "recursive" integrity
Without this distinction the current directory or content loader will fail the download
as they currently expect the checksums to be about the tarball. When a recursive
"integrity" is provided, it's actually about the uncompressed tarball as per the
nix-store computation.

It's detailed within the code.

Related to T3294
Related to T3781
2022-10-04 17:58:50 +02:00
Antoine R. Dumont (@ardumont)
5daead68ad
nixguix: Add support for pseudo url with missing schema
Related to T3294
Related to T3781
2022-10-04 16:21:38 +02:00
Antoine R. Dumont (@ardumont)
0f8f293f96
nixguix: Deal with connection error with server
When that arises, we skip the origins.

Related to T3781
2022-10-04 14:57:01 +02:00
Antoine R. Dumont (@ardumont)
d92474bbda
nixguix: Refactor by cleaning up unneeded code
Related to T3781
2022-10-04 14:45:57 +02:00
Antoine R. Dumont (@ardumont)
06b11dd5f6
nixguix: Deal with impossible communication with server
When that arises, we skip the origins.

Related to T3781
2022-10-04 14:07:42 +02:00
Antoine R. Dumont (@ardumont)
a94b75f366
nixguix: Deal with mistyped origins
Some origins are listed as urls while they are not. They are possibly vcs. So this
commit tries to detect and deal with those if possible. If not possible, they are
skipped.

Related to T3781
Related to P1470
2022-10-04 13:58:39 +02:00
Antoine R. Dumont (@ardumont)
1b4fe51f62
nixguix: Randomize order of listed origins
The end goal is to ingest the origins sparsely, which avoids hitting the various
servers around the same time for origins colocated in the upstream manifest (especially
file or tarball).

Related to T3781
2022-10-04 11:54:12 +02:00
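
The randomization in commit 1b4fe51f62 amounts to shuffling the listed origins before they are sent. A minimal sketch, assuming the origins fit in memory; the function name and the optional seed parameter are illustrative only:

```python
import random
from typing import Iterable, List, Optional


def randomized(origins: Iterable[str], seed: Optional[int] = None) -> List[str]:
    """Return the origins in random order, leaving the input untouched.

    Shuffling spreads out origins that sit next to each other in the
    manifest, so the same server is less likely to be hit in a burst.
    """
    rng = random.Random(seed)  # the seed exists only to make tests repeatable
    result = list(origins)
    rng.shuffle(result)
    return result
```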
Antoine R. Dumont (@ardumont)
94b6dbea0a
nixguix: Document lister
Related to T3781
2022-10-03 18:26:36 +02:00
Antoine R. Dumont (@ardumont)
6d2e7aa178
nixguix: Register task
Related to T3781
2022-10-03 18:26:36 +02:00
Antoine R. Dumont (@ardumont)
fbfdf88ea4
nixguix: Add lister
Related to T3781
2022-10-03 18:26:36 +02:00
Antoine Lambert
fa1205c4df Send package artifact checksums to loaders when info is available
In listers collecting artifacts for each package to load, add artifacts
checksums, when that info is available, in parameters sent to loaders
in order to check downloaded artifact integrity.
2022-09-30 18:44:11 +02:00
Franck Bret
6f40d2c1a5 Conda: switch artifacts from dict to list
'artifacts' extra_loader_arguments should be a list
2022-09-30 15:55:53 +02:00
Franck Bret
52ccf49e11 RubyGems: List origins from https://rubygems.org
Related T1777
2022-09-29 14:19:06 +02:00
Antoine Lambert
dabb1a2ae5 Update instructions for running a lister in docker
Prefer to execute the lister through a celery task, as it also makes it possible to
catch issues with the task implementation.

Also use docker compose v2 commands.
2022-09-29 11:26:40 +02:00
Antoine Lambert
5426883c49 debian: Remove no longer needed code to get accurate origins count
The base lister class now ensures the count of listed origins will
be accurate.
2022-09-29 11:14:42 +02:00
Antoine Lambert
8d85b2e4e8 pattern: Ensure accurate origin counts returned by run method
Previously, the run method was returning the total count of ListedOrigin
objects sent to scheduler database.

However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).

This change ensures an accurate count of listed origins by maintaining
a set of origin URLs associated with the sent ListedOrigin objects.
2022-09-29 11:14:08 +02:00
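
The counting approach from commit 8d85b2e4e8 can be illustrated with a set of origin URLs. This sketch assumes each page is simply a list of URLs; the real lister deals with ListedOrigin objects, and the function name here is hypothetical.

```python
from typing import Iterable, List


def count_listed_origins(pages: Iterable[List[str]]) -> int:
    """Count distinct origin URLs across pages.

    The same origin may show up on several pages, so counting every
    object sent would overestimate the number of origins.
    """
    seen_urls = set()
    for page in pages:
        seen_urls.update(page)
    return len(seen_urls)
```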
Franck Bret
3928fc9ee9 Nuget: Lister for NuGet the package manager for .NET
Related T1718
2022-09-27 14:56:36 +02:00
Franck Bret
cd596eb2b4 Puppet: Lister for Puppet modules
The puppet lister retrieves origins from https://forge.puppet.com/modules

Related T4519
2022-09-27 14:44:13 +02:00
Franck Bret
a4aec3894e Cpan: List Perl module origins from cpan.org
Related T2833
2022-09-27 14:29:33 +02:00
Franck Bret
6696a8424a Hackage: List origins from hackage.haskell.org, The Haskell Package Repository
Use http api point to get package names and build origin urls.
2022-09-27 14:22:03 +02:00
Franck Bret
8ff418fbc2 Conda: List origins for Anaconda, the package manager that provides tooling for datascience
Related T4547
2022-09-27 14:17:26 +02:00
Antoine R. Dumont (@ardumont)
fd1a4244a0
cgit/tests: Rename readme.md to readme
With the extension, the readme is included in the swh-docs build and fails. It's not
intended for the documentation build, so renaming it keeps it out of the doc build loop.

This fixes build [1].

[1] https://jenkins.softwareheritage.org/view/all/job/DDOC/job/dev/2395/
2022-09-26 13:22:10 +02:00
Antoine Lambert
d5c30a3ce3 Update value of User-Agent HTTP request header used by listers
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.

The following template is used to generate it:

"Software Heritage <lister_name> lister v<swh-lister version>
 (+https://www.softwareheritage.org/contact)"
2022-09-26 10:48:40 +02:00
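
The template quoted in commit d5c30a3ce3 can be filled in with ordinary string formatting. In this sketch the constant and function names are assumptions, and the version passed in the test is an arbitrary example rather than a real release.

```python
# Template text taken from the commit message; the names below are illustrative.
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version} "
    "(+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name: str, version: str) -> str:
    """Fill in the lister name and the swh-lister version."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)
```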
Antoine Lambert
db6ce12e9e Refactor and deduplicate HTTP requests code in listers
Numerous listers were using the same page_request method or equivalent
in their implementation so prefer to deduplicate that code by adding
an http_request method in base lister class: swh.lister.pattern.Lister.

That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting. An HTTPError
is raised if a request ends up with an error.

All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
2022-09-26 10:48:40 +02:00
Antoine Lambert
9c55acd286 Use generic HTTP retry policy by default and rename dedicated decorator
Instead of retrying HTTP requests only for the 429 status code by default,
prefer the generic retry policy, which also retries for status
codes >= 500 and on ConnectionError exceptions.

Rename throttling_retry decorator to http_retry to reflect this change.
2022-09-26 10:48:40 +02:00
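
The retry conditions named in commit 9c55acd286 (status 429, status codes >= 500, connection errors) can be expressed as a small predicate. This is only a sketch of the decision rule, not the http_retry decorator itself: the function name is hypothetical, and Python's built-in ConnectionError stands in for whatever exception the HTTP client raises.

```python
from typing import Optional


def is_retryable(
    status_code: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> bool:
    """Decide whether a failed HTTP request is worth retrying."""
    # Assumption: the built-in ConnectionError stands in for the client's own class.
    if isinstance(error, ConnectionError):
        return True
    if status_code is None:
        return False
    # 429 means the client is being throttled; 5xx means a server-side failure.
    return status_code == 429 or status_code >= 500
```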
Vincent SELLIER
9b3e565cf7
cgit: Ensure the clone url is searched on the right tab
For some forges, the default tab for a repository detail is not the
summary tab so the clone urls are not detected and the repository
is ignored

Related to T4544
2022-09-20 17:01:49 +02:00
KShivendu
bd35d54398 gogs: Skip pages with error 500
This also affects the gitea lister
2022-09-20 19:05:20 +05:30
Antoine Lambert
fa65f270ed golang: Update lister name
Align with other lister names by turning it to lowercase.
2022-09-19 13:17:40 +02:00
Antoine Lambert
f1a1b30fd1 arch: Set log level to debug for URL requests 2022-09-13 12:09:13 +02:00
Antoine Lambert
a55f171ed5 arch: Use tempfile module to create temporary directory
It ensures created temporary directories will be removed once they
are no longer needed.
2022-09-13 12:08:02 +02:00
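
The cleanup guarantee mentioned in commit a55f171ed5 is what tempfile.TemporaryDirectory provides when used as a context manager. A minimal sketch, with a hypothetical function and file name:

```python
import os
import tempfile


def download_into_temp_dir() -> str:
    """Create a scratch directory, use it, and return its (now removed) path."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # Hypothetical work: write a file that only matters during the listing.
        index_path = os.path.join(tmp_dir, "index.txt")
        with open(index_path, "w") as index_file:
            index_file.write("package-a\npackage-b\n")
        assert os.path.exists(index_path)
    # Leaving the with block deletes the directory and everything in it.
    return tmp_dir
```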
Antoine R. Dumont (@ardumont)
67211adb60
pubdev.lister: Decrease verbosity
This matches other lister verbosity.

Related to T4517
2022-09-09 12:31:43 +02:00
Antoine Lambert
c819cc237d pubdev: Update User-Agent request header value
Use a value that matches good practice recommended by pub.dev REST API doc.

https://github.com/dart-lang/pub/blob/master/doc/repository-spec-v2.md
2022-09-07 12:15:34 +02:00
Antoine Lambert
44560c2383 pubdev: Retrieve last publication date for each listed package
In order to get a last_update for each ListedOrigin sent to scheduler
database, send an extra HTTP request for each listed package to the
/api/packages/<package_name> endpoint of pub.dev API.

A pub.dev developer informed us that this endpoint is heavily used and cached,
so there is no particular issue with querying it for each package
in a row periodically.
2022-09-02 16:50:12 +02:00
Antoine Lambert
49b79b0759 pubdev: Modify origin URL for listed packages
Use https://pub.dev/packages/<package_name> instead of
https://pub.dev/api/packages/<package_name>
2022-09-02 16:48:29 +02:00
Antoine Lambert
b6c69e5075 aur: Create also a git origin for each listed package repository
This makes it possible to archive the history of the PKGBUILD file associated
with the AUR package.
2022-09-02 15:58:05 +02:00
Antoine Lambert
d76fbb3447 aur: Modify origin URL for listed packages
Use https://aur.archlinux.org/packages/<package_name> instead
of https://aur.archlinux.org/<package_name>.git
2022-09-02 15:57:57 +02:00
Antoine Lambert
92baa2b45c aur: Store packages index in memory instead of disk
Simplify the code for downloading the packages index, as gzip and deflate
transfer-encodings are automatically decoded by requests. Also,
do not stream the response for a couple of megabytes; store
HTTP responses in memory instead.

Also add more debug logs to track lister execution.
2022-09-02 15:48:20 +02:00
Antoine Lambert
7638f2028b golang/tests: Fix black formatting 2022-09-01 11:47:35 +02:00
Raphaël Gomès
c6ce862d32 Add incremental function to Golang Lister 2022-08-30 14:32:18 +02:00
Raphaël Gomès
60405e78ae Add non-incremental Golang modules lister
This uses https://index.golang.org. An associated loader will be sent in
the near future, as well as an incremental version of this lister.

[1] https://go.dev/ref/mod#goproxy-protocol
2022-08-30 14:32:02 +02:00
Franck Bret
0acf5b0f4f Arch: Add throttling retry for scraping and resources download 2022-08-30 09:50:29 +02:00
Franck Bret
b7b11887a0 Bower: Set VISIT_TYPE as 'git'
Origin URLs for Bower are git repositories. Set VISIT_TYPE to 'git'.
No need for a specific 'Bower' package loader.
2022-08-29 17:15:09 +02:00
Franck Bret
ceae8c42b5 Bower: List origins from registry.bower.io 2022-08-29 15:55:00 +02:00
Franck Bret
5410b6e3f3 Pub.dev lister for Dart and Flutter packages
Stateless lister for https://pub.dev based on http api to list package names
2022-08-26 10:24:08 +02:00
Valentin Lorentz
ce72969de5 aur: Simplify pathlib logic 2022-08-25 09:41:50 +02:00
Valentin Lorentz
766fbbcc91 arch: Un-nest long method 2022-08-25 09:41:44 +02:00