19 commits

Author SHA1 Message Date
Antoine Lambert
213a4a152f crates: Bump chunk size when downloading database dump
It allows faster download of the database dump located at
https://static.crates.io/db-dump.tar.gz.
2025-04-15 12:17:57 +02:00
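The effect of a larger chunk size can be sketched as follows. This is a minimal illustration, not the lister's code: `copy_in_chunks` and `CHUNK_SIZE` are hypothetical names, the commit does not state the actual value, and an in-memory stream stands in for the HTTP response so the example runs offline.

```python
import io
from typing import BinaryIO

# Hypothetical value: the commit message does not give the real chunk size.
CHUNK_SIZE = 1024 * 1024  # 1 MiB


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst chunk by chunk; return how many reads returned data."""
    reads = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        reads += 1
    return reads


# A 3 MiB payload stands in for the (much larger) database dump.
payload = b"x" * (3 * 1024 * 1024)
small_reads = copy_in_chunks(io.BytesIO(payload), io.BytesIO(), chunk_size=1024)
large_reads = copy_in_chunks(io.BytesIO(payload), io.BytesIO(), chunk_size=CHUNK_SIZE)
```

With fewer, larger reads the per-chunk overhead is paid less often, which is presumably where the speedup comes from.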
Antoine Lambert
5003e6588f crates: Remove crates metadata as loader argument
This extrinsic metadata can be fetched directly by the loader
through the crates Web API, which also exposes more metadata fields.
2024-08-27 12:28:05 +02:00
Antoine Lambert
42e76ee62e crates: Speed up listing by processing crates in batch
Instead of having a single crate and its versions info per page,
prefer to have up to 1000 crates per page to significantly speed up
the listing process.
2024-08-27 12:28:05 +02:00
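Paging crates in batches of up to 1000 can be sketched as below. The helper name `batched` and the sample crate names are assumptions made for illustration; only the page size of 1000 comes from the commit message.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

PAGE_SIZE = 1000  # upper bound per page, as stated in the commit message


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        page = list(islice(iterator, size))
        if not page:
            return
        yield page


# Hypothetical crate names standing in for the parsed dump content.
crate_names = [f"crate-{i}" for i in range(2500)]
pages = list(batched(crate_names, PAGE_SIZE))
page_sizes = [len(page) for page in pages]
```

Here 2500 crates yield three pages instead of 2500 single-crate pages.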
Antoine Lambert
c6aa490fc1 crates: Record lister state only if all crates were processed
Previously, the lister state was recorded even when errors occurred
while listing crates, because the finalize method is called whether or
not an exception was raised during listing.

As a consequence some crates could be missed as the incremental listing
restarts from the dump date of the last processed crate database.

So ensure all crates have been processed by the lister before recording
its state.
2024-08-27 12:28:05 +02:00
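The guard described in c6aa490fc1 can be sketched roughly as follows. `SketchLister`, its attributes and `fail_on_bad` are hypothetical; the real lister class and its state model are not shown in this log.

```python
from typing import Callable, Iterable, Optional


class SketchLister:
    """Hypothetical lister that records its state only after a complete run."""

    def __init__(self, process: Callable[[str], None]) -> None:
        self.process = process
        self.all_processed = False
        self.dump_date: Optional[str] = None
        self.recorded_state: Optional[str] = None

    def list_crates(self, crates: Iterable[str], dump_date: str) -> None:
        self.dump_date = dump_date
        for crate in crates:
            self.process(crate)
        # Only reached when no exception interrupted the loop above.
        self.all_processed = True

    def finalize(self) -> None:
        # Called in every case, so the guard must live here.
        if self.all_processed:
            self.recorded_state = self.dump_date

    def run(self, crates: Iterable[str], dump_date: str) -> None:
        try:
            self.list_crates(crates, dump_date)
        finally:
            self.finalize()


def fail_on_bad(name: str) -> None:
    """Stand-in for per-crate processing that fails on one input."""
    if name == "bad":
        raise ValueError(name)
```

A run that raises midway leaves `recorded_state` unset, so the next incremental listing starts again from the previous dump date rather than skipping crates.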
Antoine Lambert
aafaebd5de crates: Use looseversion.LooseVersion2 to parse crate versions
packaging.version.parse is dedicated to parsing Python package version
numbers, but crate versions do not necessarily follow Python version
number conventions and thus some crate versions cannot be parsed.

Prefer to use looseversion.LooseVersion2 instead, which is a drop-in
replacement for the deprecated distutils.version.LooseVersion and can
parse all kinds of version numbers.
2024-08-27 12:28:05 +02:00
Antoine Lambert
b2ece7ca63 crates: Bump csv field size limit
A size limit of 1000000 was not enough to properly process
all CSV crates data so bump to a higher value.
2024-08-27 12:28:05 +02:00
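Raising the limit relies on the standard library's `csv.field_size_limit`. In the sketch below the new limit and the sample column names are assumptions; the commit does not say which value was finally chosen.

```python
import csv
import io

# A single field larger than the csv module's default limit (131072 characters).
big_field = "x" * 200_000
data = io.StringIO(f"id,readme\n1,{big_field}\n")

# Hypothetical new limit; csv.field_size_limit returns the previous one.
previous_limit = csv.field_size_limit(10_000_000)

rows = list(csv.DictReader(data))
readme_length = len(rows[0]["readme"])
```

Without the call to `csv.field_size_limit`, reading this row fails with a `csv.Error` reporting that the field is larger than the field limit.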
David Douard
714fccc3c7 python: Fix black formatting after bump to 23.1.0 in pre-commit
2023-12-05 10:33:07 +01:00
Antoine Lambert
6e7bc49ec7 Harmonize listers parameters and add test to check mandatory ones
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.

Add a new test checking that lister classes have the mandatory parameters
declared in their constructors. The purpose is to avoid deployment issues on
staging or production environments, as celery tasks can fail to be executed if
mandatory parameters are not handled by listers.

Related to swh/infra/sysadm-environment#5030.
2023-09-06 11:55:34 +02:00
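A check of this kind can be sketched with `inspect.signature`. The two lister classes and the helper `missing_parameters` are hypothetical; only the four parameter names come from the commit message.

```python
import inspect
from typing import Set

# Parameter names listed in the commit message.
MANDATORY = {"scheduler", "url", "instance", "credentials"}


class GoodLister:
    """Hypothetical lister declaring every mandatory parameter."""

    def __init__(self, scheduler, url=None, instance=None, credentials=None):
        self.scheduler = scheduler


class BadLister:
    """Hypothetical lister missing two mandatory parameters."""

    def __init__(self, scheduler, url=None):
        self.scheduler = scheduler


def missing_parameters(lister_class: type) -> Set[str]:
    """Return the mandatory parameters absent from the constructor signature."""
    declared = set(inspect.signature(lister_class.__init__).parameters)
    return MANDATORY - declared
```

In a test suite, the same helper could be applied to every registered lister class, with an assertion that the returned set is empty.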
Valentin Lorentz
0e7fdf482c crates: Don't extract unused files
The files we use weigh 440MB, and there are ~600MB of files we don't use
2023-06-20 16:06:21 +02:00
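Selective extraction can be sketched with the standard `tarfile` module. The file names in `WANTED` and the layout of the sample archive are assumptions for illustration, and a tiny in-memory archive stands in for the real dump.

```python
import io
import tarfile
import tempfile
from pathlib import Path
from typing import List

# Hypothetical set of files the lister actually reads.
WANTED = {"crates.csv", "versions.csv"}


def build_sample_dump() -> bytes:
    """Build a small gzip-compressed tar archive in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name in ("data/crates.csv", "data/versions.csv", "data/unused.csv"):
            content = b"id\n1\n"
            info = tarfile.TarInfo(name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def extract_wanted(archive: bytes, dest: Path) -> List[str]:
    """Extract only the wanted members and return the extracted file names."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        members = [m for m in tar.getmembers() if Path(m.name).name in WANTED]
        tar.extractall(path=dest, members=members)
    return sorted(p.name for p in dest.rglob("*") if p.is_file())


with tempfile.TemporaryDirectory() as tmp:
    extracted = extract_wanted(build_sample_dump(), Path(tmp))
```

Passing `members` to `extractall` skips writing the unused files to disk, which is the saving the commit message refers to.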
Nicolas Dandrimont
e785e67315 Hook up recently introduced options to all listers
Hopefully one day we'll be able to replace all of this mess with PEP692
TypedDict kwargs, but that's only on track for Python 3.12.
2022-12-05 16:33:45 +01:00
Franck Bret
4a09f660b3 Crates.io: Add last_update for each version of a crate
In order to reduce the number of HTTP API calls made by the loader, download a
crates.io database dump, and parse its CSV files to get a last_update
value for each version of a crate.
Those values are sent to the loader through extra_loader_arguments
'crates_metadata'.

'artifacts' and 'crates_metadata' now use "version" as key.

Related T4104, D8171
2022-10-05 17:10:28 +02:00
Antoine Lambert
dabb1a2ae5 Update instructions for running a lister in docker
Prefer to execute a lister through a celery task, as it also helps to
catch possible issues with the task implementation.

Also use docker compose v2 commands.
2022-09-29 11:26:40 +02:00
Valentin Lorentz
b7ec6cb120 tests: Simplify origin comparison and improve pytest diff on failure
By using a single equality instead of checking len() then zip()
to check one by one, pytest can find the common/missing elements
and print them nicely when the two lists are unequal.
2022-08-24 17:21:24 +02:00
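The two comparison styles can be contrasted in a short sketch. The function names and the sample origins are hypothetical and do not reproduce the actual test code.

```python
from typing import Dict, List

Origin = Dict[str, str]


def check_one_by_one(actual: List[Origin], expected: List[Origin]) -> None:
    """Older style: compare lengths, then compare elements pairwise."""
    assert len(actual) == len(expected)
    for got, wanted in zip(actual, expected):
        assert got == wanted


def check_single_equality(actual: List[Origin], expected: List[Origin]) -> None:
    """Newer style: one equality, so pytest can diff the two whole lists."""
    assert actual == expected


# Hypothetical origins, sorted so the comparison does not depend on order.
expected_origins = [
    {"url": "https://example.org/crates/rand", "visit_type": "crates"},
    {"url": "https://example.org/crates/regex", "visit_type": "crates"},
]
listed_origins = sorted(reversed(expected_origins), key=lambda o: o["url"])

check_one_by_one(listed_origins, expected_origins)
check_single_equality(listed_origins, expected_origins)
```

Both checks accept the same inputs. The difference shows up on failure: the first stops at the length assertion or the first mismatching pair, while the second lets pytest report which elements are missing or extra.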
Valentin Lorentz
d51bce0a1c crates: Fix broken ref
2022-08-08 20:43:54 +02:00
Franck Bret
751c3df1b7 crates: Add a developer documentation at module level
Mainly move documentation content from docs/user to crates module
(See D8199 for details)

Related T4104
2022-08-08 14:48:45 +02:00
Franck Bret
a6f796b268 crates.lister: Implement incremental mode
Add incremental mode support based on a 'last_commit' state, used to get
new package versions from git diff range of commits.
2022-08-05 13:41:57 +02:00
Franck Bret
985b71e80c crates: Create one origin per package instead of per version
Previously we had as many origins as versions for a crate package, and the url
was a link to a specific crate version package.

Refactor to have one origin per package name and add an 'artifacts' entry to
extra_loader_arguments that lists all versions, package url and checksum.
Origin url is now a link to the related http api endpoint for a package name.

Related to T4104
2022-04-28 16:10:33 +02:00
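Grouping versions under one origin per package can be sketched as follows. The URL patterns, sample rows and checksums are assumptions for illustration; only the 'artifacts' entry with version, url and checksum comes from the commit message.

```python
from collections import defaultdict
from typing import Dict, List

# Assumed URL patterns; the real lister may build them differently.
API_URL = "https://crates.io/api/v1/crates/{name}"
CRATE_URL = "https://static.crates.io/crates/{name}/{name}-{version}.crate"

# Hypothetical per-version rows, as a per-version lister would have produced.
version_rows = [
    {"name": "rand", "version": "0.1.1", "checksum": "aaa"},
    {"name": "rand", "version": "0.1.2", "checksum": "bbb"},
    {"name": "regex", "version": "0.1.0", "checksum": "ccc"},
]


def group_by_package(rows: List[Dict[str, str]]) -> List[dict]:
    """Build one origin per package name, listing all versions as artifacts."""
    artifacts: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        artifacts[row["name"]].append(
            {
                "version": row["version"],
                "url": CRATE_URL.format(name=row["name"], version=row["version"]),
                "checksum": row["checksum"],
            }
        )
    return [
        {
            "url": API_URL.format(name=name),
            "extra_loader_arguments": {"artifacts": package_artifacts},
        }
        for name, package_artifacts in artifacts.items()
    ]


origins = group_by_package(version_rows)
```

Three version rows collapse into two origins, one per package name.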
Antoine Lambert
d38e05cff7 python: Reformat code with black 22.3.0
Related to T3922
2022-04-08 15:15:09 +02:00
Franck Bret
fea6fc04aa lister: Add new rust crates lister
The Crates lister retrieves crate packages for the Rust language.

It basically fetches https://github.com/rust-lang/crates.io-index.git
to a temp directory and then walks through each file to get the
crate's info.
2022-03-28 08:42:31 +02:00
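Walking a checkout of the index can be sketched as below. The sketch assumes each index file holds one JSON object per line with `name`, `vers` and `cksum` fields, skips hidden paths and `config.json`, and uses a temporary directory in place of the cloned repository. `iter_crate_versions` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator, Tuple


def iter_crate_versions(index_dir: Path) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, version, checksum) for every entry found in the index."""
    for path in sorted(index_dir.rglob("*")):
        relative = path.relative_to(index_dir)
        # Skip directories, hidden paths such as .git, and the index config.
        if not path.is_file() or any(p.startswith(".") for p in relative.parts):
            continue
        if path.name == "config.json":
            continue
        for line in path.read_text().splitlines():
            if line.strip():
                entry = json.loads(line)
                yield entry["name"], entry["vers"], entry["cksum"]


with tempfile.TemporaryDirectory() as tmp:
    index = Path(tmp)
    # Hypothetical sample file mimicking the index layout for one crate.
    crate_file = index / "ra" / "nd" / "rand"
    crate_file.parent.mkdir(parents=True)
    crate_file.write_text(
        '{"name": "rand", "vers": "0.1.1", "cksum": "aaa"}\n'
        '{"name": "rand", "vers": "0.1.2", "cksum": "bbb"}\n'
    )
    (index / "config.json").write_text("{}")
    versions = list(iter_crate_versions(index))
```

Each line of an index file describes one published version, so a single file yields every version of its crate.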