Commit graph

893 commits

Author SHA1 Message Date
Antoine Lambert
9c55acd286 Use generic HTTP retry policy by default and rename dedicated decorator
Instead of retrying HTTP requests only on status code 429 by default,
use the generic retry policy, which also retries on status codes >= 500
and on ConnectionError exceptions.

Rename throttling_retry decorator to http_retry to reflect this change.
2022-09-26 10:48:40 +02:00
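The retry rule described in the commit above (429, status codes >= 500, connection errors) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the decorator shipped in swh-lister: `HTTPStatusError`, `is_retryable` and `retry_sketch` are hypothetical names, and the built-in `ConnectionError` stands in for whatever exception the real HTTP client raises.

```python
import functools
import time


class HTTPStatusError(Exception):
    """Hypothetical exception carrying an HTTP status code."""

    def __init__(self, status_code):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(exc):
    """Return True for status 429, any status >= 500, or a connection error."""
    if isinstance(exc, ConnectionError):
        return True
    if isinstance(exc, HTTPStatusError):
        return exc.status_code == 429 or exc.status_code >= 500
    return False


def retry_sketch(max_attempts=3, delay=0.0):
    """Hypothetical decorator: retry the call while is_retryable() holds."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on a non-retryable error.
                    if attempt == max_attempts or not is_retryable(exc):
                        raise
                    time.sleep(delay)

        return wrapper

    return decorator
```

A function decorated with `@retry_sketch(max_attempts=3)` is called again after a 503 or a dropped connection, while a 404 is re-raised immediately.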
Vincent SELLIER
9b3e565cf7 cgit: Ensure the clone url is searched on the right tab
For some forges, the default tab of a repository's detail page is not
the summary tab, so the clone URLs are not detected and the repository
is ignored.

Related to T4544
2022-09-20 17:01:49 +02:00
KShivendu
bd35d54398 gogs: Skip pages with error 500
This also affects the gitea lister
2022-09-20 19:05:20 +05:30
Antoine Lambert
fa65f270ed golang: Update lister name
Align with other lister names by turning it to lowercase.
2022-09-19 13:17:40 +02:00
Antoine Lambert
f1a1b30fd1 arch: Set log level to debug for URL requests 2022-09-13 12:09:13 +02:00
Antoine Lambert
a55f171ed5 arch: Use tempfile module to create temporary directory
It ensures created temporary directories will be removed once they
are no longer needed.
2022-09-13 12:08:02 +02:00
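A minimal sketch of the pattern named in the commit above, using only the standard library. The helper `write_and_count` and the file names are hypothetical and are not taken from the arch lister; the point is only that the directory disappears when the `with` block exits.

```python
import tempfile
from pathlib import Path


def write_and_count(names):
    """Create files in a temporary directory and report what happened.

    Returns the number of files seen inside the directory and whether the
    directory still exists after the context manager has exited.
    """
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        for name in names:
            (tmp_path / name).write_text("placeholder")
        count = len(list(tmp_path.iterdir()))
    # The directory and its contents are removed once the block is left.
    return count, tmp_path.exists()
```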
Antoine R. Dumont (@ardumont)
67211adb60 pubdev.lister: Decrease verbosity
This matches other lister verbosity.

Related to T4517
2022-09-09 12:31:43 +02:00
Antoine Lambert
c819cc237d pubdev: Update User-Agent request header value
Use a value that follows the good practice recommended by the pub.dev REST API documentation.

https://github.com/dart-lang/pub/blob/master/doc/repository-spec-v2.md
2022-09-07 12:15:34 +02:00
Antoine Lambert
44560c2383 pubdev: Retrieve last publication date for each listed package
In order to get a last_update for each ListedOrigin sent to scheduler
database, send an extra HTTP request for each listed package to the
/api/packages/<package_name> endpoint of pub.dev API.

A pub.dev developer informed us that this endpoint is heavily used and
cached, so there is no particular issue with periodically querying it
for each package in a row.
2022-09-02 16:50:12 +02:00
Antoine Lambert
49b79b0759 pubdev: Modify origin URL for listed packages
Use https://pub.dev/packages/<package_name> instead of
https://pub.dev/api/packages/<package_name>
2022-09-02 16:48:29 +02:00
Antoine Lambert
b6c69e5075 aur: Create also a git origin for each listed package repository
This makes it possible to archive the history of the PKGBUILD file
associated with the AUR package.
2022-09-02 15:58:05 +02:00
Antoine Lambert
d76fbb3447 aur: Modify origin URL for listed packages
Use https://aur.archlinux.org/packages/<package_name> instead
of https://aur.archlinux.org/<package_name>.git
2022-09-02 15:57:57 +02:00
Antoine Lambert
92baa2b45c aur: Store packages index in memory instead of disk
Simplify the code for downloading the packages index, as gzip and deflate
transfer-encodings are automatically decoded by requests. Also, do not
stream the response for a couple of megabytes; store HTTP responses
in memory instead.

Also add more debug logs to track lister execution.
2022-09-02 15:48:20 +02:00
Antoine Lambert
7638f2028b golang/tests: Fix black formatting 2022-09-01 11:47:35 +02:00
Raphaël Gomès
c6ce862d32 Add incremental function to Golang Lister 2022-08-30 14:32:18 +02:00
Raphaël Gomès
60405e78ae Add non-incremental Golang modules lister
This uses https://index.golang.org. An associated loader will be sent in
the near future, as well as an incremental version of this lister.

[1] https://go.dev/ref/mod#goproxy-protocol
2022-08-30 14:32:02 +02:00
Franck Bret
0acf5b0f4f Arch: Add throttling retry for scraping and resources download 2022-08-30 09:50:29 +02:00
Franck Bret
b7b11887a0 Bower: Set VISIT_TYPE as 'git'
Origin URLs for Bower are git repositories. Set the VISIT_TYPE as 'git'.
No need for a specific 'Bower' package loader.
2022-08-29 17:15:09 +02:00
Franck Bret
ceae8c42b5 Bower: List origins from registry.bower.io 2022-08-29 15:55:00 +02:00
Franck Bret
5410b6e3f3 Pub.dev lister for Dart and Flutter packages
Stateless lister for https://pub.dev, based on its HTTP API, that lists package names.
2022-08-26 10:24:08 +02:00
Valentin Lorentz
ce72969de5 aur: Simplify pathlib logic 2022-08-25 09:41:50 +02:00
Valentin Lorentz
766fbbcc91 arch: Un-nest long method 2022-08-25 09:41:44 +02:00
Valentin Lorentz
b7ec6cb120 tests: Simplify origin comparison and improve pytest diff on failure
By using a single equality instead of checking len() then zip()
to check one by one, pytest can find the common/missing elements
and print them nicely when the two lists are unequal.
2022-08-24 17:21:24 +02:00
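The two comparison styles contrasted in the commit above can be sketched as follows. Both helper names are hypothetical and the sketch assumes the two lists are already in a comparable order; it is not the test code from the repository.

```python
def compare_one_by_one(expected, actual):
    """Older style: check the length, then compare pairs in order."""
    assert len(expected) == len(actual)
    for expected_item, actual_item in zip(expected, actual):
        assert expected_item == actual_item


def compare_at_once(expected, actual):
    """Newer style: a single equality, so pytest can diff the two lists."""
    assert expected == actual
```

With the single equality, a failing test run under pytest reports which elements differ or are missing instead of stopping at a bare length mismatch.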
Valentin Lorentz
4b511b4181 arch: Use lazy interpolation in logging statements 2022-08-23 13:43:07 +02:00
Valentin Lorentz
31c44330e8 gogs: Lower unnecessarily verbose logging statement 2022-08-23 13:40:19 +02:00
Valentin Lorentz
17a219ece0 gitea: Inherit from Gogs lister
This removes code and adds support for incremental pagination.

While both are essentially the same lister now, it still makes sense to
keep the Gitea lister separate, in order to:

1. display them in different categories on https://archive.softwareheritage.org/
2. support possible divergence of APIs in the future
2022-08-23 13:38:32 +02:00
Valentin Lorentz
dde7865ac4 arch: Fix broken ref 2022-08-19 19:07:55 +02:00
Franck Bret
7dd412e553 arch: Extra_loader_arguments consistency + documentation
Split extra_loader_arguments artifacts into artifacts and arch_metadata.
Add lister documentation at module level.

Related T4233
2022-08-19 15:43:58 +02:00
Valentin Lorentz
3ab90cc0cd aur: Fix broken ref 2022-08-19 14:14:37 +02:00
Franck Bret
97b353bf0b Arch User Repository (AUR) lister
Add 'aur' module to swh-lister with data fixtures and tests.
For now, origin URLs are the packages' VCS (Git) URLs.
2022-08-19 12:43:15 +02:00
KShivendu
6a53a6ad06 feat: Make the Gogs lister incremental 2022-08-17 15:01:32 +05:30
Antoine Lambert
cee6bcb514 maven: Use BeautifulSoup instead of xmltodict for parsing pom files
xmltodict cannot parse POM files with multi-byte encoding so prefer to
use the XML parser of BeautifulSoup based on lxml instead.

Also drop xmltodict requirement as it is no longer used in swh-lister
codebase.
2022-08-09 11:11:45 +02:00
Valentin Lorentz
d51bce0a1c crates: Fix broken ref 2022-08-08 20:43:54 +02:00
Franck Bret
751c3df1b7 crates: Add a developer documentation at module level
Mainly move documentation content from docs/user to crates module
(See D8199 for details)

Related T4104
2022-08-08 14:48:45 +02:00
Franck Bret
a6f796b268 crates.lister: Implement incremental mode
Add incremental mode support based on a 'last_commit' state, used to get
new package versions from a git diff over a range of commits.
2022-08-05 13:41:57 +02:00
KShivendu
d34a6232a6 gogs: Introduce Gogs lister 2022-08-03 16:22:06 +05:30
Franck Bret
1bf11aa26d Add arch lister module (origins from archives).
After a first attempt with D7812, this one uses a different strategy to
retrieve origins.

Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensures that we have a list of "official" packages.
Parse metadata from the 'desc' file to build origin URLs.
Scrape the origin URL to get artifacts metadata that list all versions of a package.

It also fetches and extracts unofficial 'arm' packages from archlinuxarm.org, but in this case we cannot get all versions of an arm package.

Related T4233
2022-06-15 09:11:57 +02:00
Antoine R. Dumont (@ardumont)
263db667d0 Adapt maven lister to list canonical gh urls if any
That means detected github urls {https,git,http}://github.com/${user_repo}(.git) are
canonicalized to https://github.com/${user_repo} format.

This avoids duplication of origins.

Related to T4232
2022-05-23 14:47:11 +02:00
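The canonicalization rule described in the commit above can be sketched with a regular expression. This is an illustration only, not the maven lister's code: the function name `canonical_github_url` and the pattern are hypothetical, and the sketch assumes `${user_repo}` has the form `user/repo`.

```python
import re

# Assumed shape: {https,git,http}://github.com/<user>/<repo>, optional ".git".
_GITHUB_URL = re.compile(
    r"^(?:https?|git)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
)


def canonical_github_url(url):
    """Return https://github.com/<user>/<repo> for recognized GitHub URLs.

    URLs that do not match the assumed pattern are returned unchanged.
    """
    match = _GITHUB_URL.match(url)
    if match is None:
        return url
    return f"https://github.com/{match.group('user_repo')}"
```

Mapping every variant to one form means that two pom files pointing at the same repository through different schemes yield a single origin.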
Antoine R. Dumont (@ardumont)
2ffe9c2aea Use swh.core.github.pytest_plugin in github tests
Related to T4232
2022-05-20 16:06:11 +02:00
Pratyush Desai
aa8c8cb3bc add strict asyncio_mode in pytest.ini 2022-05-09 12:13:28 +02:00
Antoine Lambert
3f6c7edc24 maven: Prevent UnicodeDecodeError when processing pom file
Pass the raw bytes of pom file content in xmltodict.parse and let
it do the string decoding based on the encoding declared in pom file.

If string decoding fails due to an invalid declared encoding,
xml.parsers.expat.ExpatError is raised and caught by the lister,
which ignores the pom file and continues listing.

Related to T3874
2022-05-02 14:01:58 +02:00
Antoine Lambert
0222a8f5c4 maven: Handle null mtime value in index for jar archive
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
by the lister.

So handle that case to avoid premature exit of the listing process.

Related to T3874
2022-04-29 13:59:17 +02:00
Antoine Lambert
378613ad82 maven: Remove extraction of groupId and artifactId from pom files
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create associated loading tasks.

In that case, the groupId and artifactId are not used by the lister,
so it is better to remove their extraction; this also prevents errors
when that information is missing from pom files.
2022-04-29 11:15:03 +02:00
Antoine Lambert
22bcd9deb2 maven: Create one origin per package instead of one per package version
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.

This is not the way Software Heritage decided to archive sources
coming from package managers. Instead one origin should be created
per package and all its versions should be found as releases in the
snapshot produced by the package loader.

So modify the maven lister in order to create one origin per package
grouping all its versions.

This change also modifies the way incremental listing is handled:
ListedOrigin instances will be yielded only if we discovered new
versions of a package since the last listing.

Tests have been updated to reflect these changes.

Related to T3874
2022-04-29 10:57:04 +02:00
Franck Bret
985b71e80c crates: Create one origin per package instead of per version
Previously we had as many origins as versions of a crate package; each URL was
a link to a specific version of the crate package.

Refactor to have one origin per package name and add an 'artifacts' entry to
extra_loader_arguments that lists all versions, with package URL and checksum.
The origin URL is now a link to the related HTTP API endpoint for a package name.

Related to T4104
2022-04-28 16:10:33 +02:00
Valentin Lorentz
c251594a1f Bump mypy to v0.942 2022-04-26 13:05:44 +02:00
Valentin Lorentz
d715aaf903 Make user_agent a parameter of GitHubSession
So it can be set when used by other packages
2022-04-26 11:08:53 +02:00
Valentin Lorentz
2d04244cc9 Move GitHubSession from github/lister.py to github/utils.py
So it can be reused by other packages without importing lister.py itself
2022-04-26 11:08:49 +02:00
Valentin Lorentz
9ee4a99f15 github: Refactor rate-limiting out of the GitHubLister class
This will allow the GitHub Metadata Fetcher to reuse the logic
by importing the GitHubSession class.
2022-04-26 11:08:45 +02:00
Antoine Lambert
334c54091e maven: Remove duplicated code related to setting instance from netloc
That processing is already handled in the base Lister class constructor.
2022-04-25 17:31:02 +02:00