18 commits

Author SHA1 Message Date
Antoine Lambert
41407e0eff Use beautifulsoup4 CSS selectors to simplify code and type checking
Because the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), mypy reported some errors
related to beautifulsoup4 typing.

As the returned type for the find method of bs4 is the following union:
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing which is not great.

So prefer the select_one method instead: it returns Optional[Tag], so a simple
None check is enough to ensure typing is correct.
In a similar manner, replace uses of the find_all method with the select method.

This also has the advantage of simplifying the code.
2024-04-16 11:22:51 +02:00
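
The typing difference described in this commit message can be illustrated with a minimal sketch. The HTML snippet, the CSS selector and the helper names below are hypothetical and are not taken from the lister code; only select_one and select are real beautifulsoup4 methods.

```python
from typing import List, Optional

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical page content, used only for illustration.
HTML = """
<table id="versions">
  <tr><td><a href="pkg-1.0.tar.gz">pkg-1.0</a></td></tr>
  <tr><td><a href="pkg-1.1.tar.gz">pkg-1.1</a></td></tr>
</table>
"""


def first_link_text(html: str) -> Optional[str]:
    """Return the text of the first matching link, or None if there is none."""
    soup = BeautifulSoup(html, "html.parser")
    # select_one returns Optional[Tag]: a plain None check narrows the type,
    # no isinstance call is needed.
    link: Optional[Tag] = soup.select_one("table#versions a")
    if link is None:
        return None
    return link.get_text(strip=True)


def all_link_texts(html: str) -> List[str]:
    """Return the text of every matching link."""
    soup = BeautifulSoup(html, "html.parser")
    # select returns a list of Tag objects, so each item is already typed.
    return [a.get_text(strip=True) for a in soup.select("table#versions a")]
```

With find, the same lookup would need an isinstance(result, Tag) check before calling Tag methods to satisfy mypy.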
Antoine Lambert
42d8e24d7e arch/lister: Drop artifact size approximation from the listing
The approximation makes the current loader ingestion fail, as the size must be an exact
value (when provided, it is checked against the download operation).

Refs. swh/infra/sysadm-environment#4746
2023-11-14 10:40:40 +01:00
Antoine Lambert
6e7bc49ec7 Harmonize listers parameters and add test to check mandatory ones
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.

Add a new test checking that lister classes have the mandatory parameters declared
in their constructors. The purpose is to avoid deployment issues on the staging
or production environments, as celery tasks can fail to execute if mandatory
parameters are not handled by listers.

Related to swh/infra/sysadm-environment#5030.
2023-09-06 11:55:34 +02:00
Nicolas Dandrimont
e785e67315 Hook up recently introduced options to all listers
Hopefully one day we'll be able to replace all of this mess with PEP692
TypedDict kwargs, but that's only on track for Python 3.12.
2022-12-05 16:33:45 +01:00
Antoine Lambert
fa1205c4df Send package artifact checksums to loaders when info is available
In listers collecting artifacts for each package to load, add artifacts
checksums, when that info is available, in parameters sent to loaders
in order to check downloaded artifact integrity.
2022-09-30 18:44:11 +02:00
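
As an illustration of the integrity check this commit enables on the loader side, here is a minimal sketch. The function name and the shape of the checksums mapping are assumptions made for the example, not the actual loader API.

```python
import hashlib
from typing import Dict


def verify_checksums(data: bytes, expected: Dict[str, str]) -> bool:
    """Check downloaded bytes against a mapping of hash algorithm to hex digest.

    The {"sha256": "<hex>"} layout is an assumption for this sketch.
    """
    for algo, hexdigest in expected.items():
        # hashlib.new accepts any algorithm name known to hashlib.
        if hashlib.new(algo, data).hexdigest() != hexdigest.lower():
            return False
    return True
```

A mismatch on any provided checksum is treated as a corrupted or tampered download.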
Antoine Lambert
dabb1a2ae5 Update instructions for running a lister in docker
Prefer to execute the lister through a celery task, as this also makes it
possible to catch issues with the task implementation.

Also use docker compose v2 commands.
2022-09-29 11:26:40 +02:00
Antoine Lambert
8d85b2e4e8 pattern: Ensure accurate origin counts returned by run method
Previously, the run method was returning the total count of ListedOrigin
objects sent to scheduler database.

However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).

This change ensures an accurate count of listed origins by maintaining
a set of origin URLs associated with the sent ListedOrigin objects.
2022-09-29 11:14:08 +02:00
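
The counting strategy described in this commit message can be sketched as follows. The function name and the representation of pages as lists of URL strings are simplifications for the example; the real run method works on ListedOrigin objects.

```python
from typing import Iterable, List, Set


def count_listed_origins(pages: Iterable[List[str]]) -> int:
    """Count distinct origin URLs across all listing pages."""
    seen_urls: Set[str] = set()
    for page in pages:
        # The same origin URL may show up on several pages; the set keeps
        # a single entry for it.
        seen_urls.update(page)
    return len(seen_urls)
```

Counting the objects sent instead would report 4 for two pages that share one origin out of three, where the set-based count reports 3.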
Antoine Lambert
db6ce12e9e Refactor and deduplicate HTTP requests code in listers
Numerous listers were using the same page_request method or equivalent
in their implementation so prefer to deduplicate that code by adding
an http_request method in base lister class: swh.lister.pattern.Lister.

That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting. An HTTPError
is raised if a request ends with an error.

All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
2022-09-26 10:48:40 +02:00
Antoine Lambert
9c55acd286 Use generic HTTP retry policy by default and rename dedicated decorator
Instead of retrying HTTP requests only for the 429 status code by default,
prefer to use the generic retry policy, which also retries on status
codes >= 500 and on ConnectionError exceptions.

Rename throttling_retry decorator to http_retry to reflect this change.
2022-09-26 10:48:40 +02:00
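
The retry policy described in this commit message can be sketched with the standard library only. This is not the swh implementation: the HTTPError class, the is_retryable predicate and the http_retry_sketch decorator are hypothetical stand-ins, and the builtin ConnectionError is used here in place of the exception raised by the requests library.

```python
import functools
import time
from typing import Any, Callable


class HTTPError(Exception):
    """Hypothetical stand-in for an HTTP error carrying a status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code


def is_retryable(exc: BaseException) -> bool:
    """Generic policy: retry on 429, on status codes >= 500, and on connection errors."""
    if isinstance(exc, ConnectionError):
        return True
    if isinstance(exc, HTTPError):
        return exc.status_code == 429 or exc.status_code >= 500
    return False


def http_retry_sketch(max_attempts: int = 3, delay: float = 0.0) -> Callable:
    """Retry the decorated callable while the raised exception is retryable."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    # Give up on the last attempt or on non-retryable errors.
                    if attempt == max_attempts or not is_retryable(exc):
                        raise
                    time.sleep(delay * attempt)

        return wrapper

    return decorator
```

Under this policy a 404 response fails immediately, while a 503 or a dropped connection is retried until the attempt budget is exhausted.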
Antoine Lambert
f1a1b30fd1 arch: Set log level to debug for URL requests
2022-09-13 12:09:13 +02:00
Antoine Lambert
a55f171ed5 arch: Use tempfile module to create temporary directory
It ensures created temporary directories will be removed once they
are no longer needed.
2022-09-13 12:08:02 +02:00
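
The cleanup behaviour this commit relies on can be shown with a minimal sketch. The function name and the file written inside the directory are illustrative only.

```python
import tempfile
from pathlib import Path
from typing import Tuple


def temporary_workdir_demo() -> Tuple[Path, bool]:
    """Create a temporary directory, use it, and let the context manager remove it."""
    with tempfile.TemporaryDirectory() as tmpdir:
        workdir = Path(tmpdir)
        # Illustrative file; a lister would extract downloaded archives here.
        (workdir / "desc").write_text("%NAME%\nexample\n")
        existed_inside = workdir.exists()
    # On leaving the with block, the directory and its content are removed.
    return workdir, existed_inside
```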
Franck Bret
0acf5b0f4f Arch: Add throttling retry for scraping and resources download
2022-08-30 09:50:29 +02:00
Valentin Lorentz
766fbbcc91 arch: Un-nest long method
2022-08-25 09:41:44 +02:00
Valentin Lorentz
b7ec6cb120 tests: Simplify origin comparison and improve pytest diff on failure
By using a single equality instead of checking len() then zip()
to check one by one, pytest can find the common/missing elements
and print them nicely when the two lists are unequal.
2022-08-24 17:21:24 +02:00
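
The two comparison styles contrasted in this commit message can be sketched as follows. The helper names are hypothetical and the origins are reduced to plain strings for the example.

```python
from typing import List


def check_origins_elementwise(actual: List[str], expected: List[str]) -> None:
    """Older style: on failure, pytest only reports the first differing pair."""
    assert len(actual) == len(expected)
    for got, wanted in zip(actual, expected):
        assert got == wanted


def check_origins_single_equality(actual: List[str], expected: List[str]) -> None:
    """Newer style: one list equality lets pytest diff the two lists as a whole."""
    assert actual == expected
```

Both helpers accept and reject the same inputs; the difference is only in how readable the failure report is under pytest.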
Valentin Lorentz
4b511b4181 arch: Use lazy interpolation in logging statements
2022-08-23 13:43:07 +02:00
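
For readers unfamiliar with lazy interpolation in logging, here is a minimal sketch of the difference. The logger name, the Expensive class and the two helper functions are made up for the example.

```python
import logging

logger = logging.getLogger("lazy_logging_sketch")
# Debug messages are disabled, so they should cost as little as possible.
logger.setLevel(logging.WARNING)


class Expensive:
    """Object that counts how many times it gets rendered to a string."""

    def __init__(self) -> None:
        self.rendered = 0

    def __str__(self) -> str:
        self.rendered += 1
        return "expensive value"


def log_eager(value: Expensive) -> None:
    # The message is formatted before logging decides whether to emit it.
    logger.debug("Fetching %s" % value)


def log_lazy(value: Expensive) -> None:
    # Formatting is deferred to the logging module and skipped when the
    # debug level is disabled.
    logger.debug("Fetching %s", value)
```

With the debug level disabled, log_lazy never calls __str__ on its argument, while log_eager always does.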
Valentin Lorentz
dde7865ac4 arch: Fix broken ref
2022-08-19 19:07:55 +02:00
Franck Bret
7dd412e553 arch: Extra_loader_arguments consistency + documentation
Split extra_loader_arguments artifacts into artifacts and arch_metadata
Add lister documentation at module level

Related T4233
2022-08-19 15:43:58 +02:00
Franck Bret
1bf11aa26d Add arch lister module (origins from archives).
After a first attempt with D7812, this one uses a different strategy to
retrieve origins.

Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensures that we have a list of "official" packages.
Parse metadata from the 'desc' file to build origin URLs.
Scrape the origin URL to get artifact metadata listing all versions of a package.

It also fetches and extracts unofficial 'arm' packages from archlinuxarm.org, but in this case we cannot get all versions of an arm package.

Related T4233
2022-06-15 09:11:57 +02:00
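
The 'desc' parsing step mentioned in this commit message can be sketched as below. This is not the lister's parser: the function name is hypothetical, and the layout assumed here (a %FIELD% header line followed by value lines and a blank line) is the usual pacman database format, which the commit message itself does not spell out. The sample content is invented.

```python
from typing import Dict, List, Optional

# Invented sample following the assumed %FIELD% / values / blank line layout.
SAMPLE_DESC = """%FILENAME%
zlib-1.2.12-2-x86_64.pkg.tar.zst

%NAME%
zlib

%VERSION%
1:1.2.12-2

%ARCH%
x86_64
"""


def parse_desc(text: str) -> Dict[str, List[str]]:
    """Parse a 'desc' file into a mapping of field name to list of values."""
    fields: Dict[str, List[str]] = {}
    current: Optional[str] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            # A blank line closes the current field.
            current = None
        elif len(line) > 2 and line.startswith("%") and line.endswith("%"):
            # A %FIELD% header opens a new field.
            current = line[1:-1]
            fields[current] = []
        elif current is not None:
            fields[current].append(line)
    return fields
```

The parsed name, version and architecture are the kind of metadata from which an origin URL could then be built.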