958 commits

Author SHA1 Message Date
11073e0dae
add tests 2025-05-24 02:46:32 +08:00
72d0bf21bc
fix metadata key error 2025-05-24 02:36:22 +08:00
d5d56e3a16
add fdroid lister 2025-05-18 16:37:02 +08:00
Antoine Lambert
213a4a152f
crates: Bump chunk size when downloading database dump
It allows faster download of the database dump located at
https://static.crates.io/db-dump.tar.gz.
2025-04-15 12:17:57 +02:00
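Why a larger chunk size helps can be illustrated with a minimal sketch. It is not the lister's code: `copy_in_chunks` and both chunk sizes are hypothetical, and an in-memory buffer stands in for the HTTP response. Fewer, larger reads mean less per-chunk overhead for the same payload.

```python
import io
from typing import BinaryIO


def copy_in_chunks(src: BinaryIO, dst: BinaryIO, chunk_size: int) -> int:
    """Copy src to dst chunk by chunk and return the number of reads needed."""
    reads = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return reads
        dst.write(chunk)
        reads += 1


payload = b"x" * (1024 * 1024)  # 1 MiB stand-in for the database dump
small_reads = copy_in_chunks(io.BytesIO(payload), io.BytesIO(), 1024)
large_reads = copy_in_chunks(io.BytesIO(payload), io.BytesIO(), 256 * 1024)
```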
Antoine Lambert
ceb1b6450e gnu: Fix KeyError exception due to missing field in JSON data
The latest GNU JSON listing lacks the contents field for one directory,
so the lister raised a KeyError exception.
2025-04-04 12:03:10 +00:00
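A minimal sketch of this kind of fix, assuming a nested listing whose directory entries normally carry a "contents" list. The `iter_files` helper and the sample tree are hypothetical, not the lister's actual code.

```python
from typing import Any, Dict, Iterator


def iter_files(entry: Dict[str, Any]) -> Iterator[str]:
    """Yield the names of all files found under a listing entry."""
    if entry.get("type") == "file":
        yield entry["name"]
        return
    # A directory entry may lack the "contents" field: treat it as empty
    # instead of raising KeyError.
    for child in entry.get("contents", []):
        yield from iter_files(child)


tree = {
    "type": "directory",
    "name": "gnu",
    "contents": [
        {"type": "file", "name": "a.tar.gz"},
        {"type": "directory", "name": "empty"},  # no "contents" key
    ],
}
files = list(iter_files(tree))
```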
Nicolas Dandrimont
41c13438b4 Use swh-scheduler[pytest] instead of swh-scheduler[testing] 2025-03-31 18:57:11 +02:00
Pierre-Yves David
08fda328be Migration to psycopg3 2025-03-21 17:05:07 +01:00
Antoine Lambert
61cfd77da1
debian: Fix error since python-debian 1.0 release
Since the python-debian 1.0 release, calling Sources.iter_paragraphs
returns an extra paragraph that does not have the expected schema,
so make sure to ignore it.
2025-03-13 13:33:33 +01:00
Antoine Lambert
f2f9c7d19e Migrate from deprecated pkg_resources package to importlib.metadata 2025-02-26 08:59:30 +01:00
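One common step in such a migration is replacing a pkg_resources version lookup with `importlib.metadata.version`. The sketch below only illustrates that pattern; the `package_version` helper is hypothetical and the commit may have migrated other calls.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def package_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(name)
    except PackageNotFoundError:
        return None


missing = package_version("surely-not-an-installed-distribution")
```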
Antoine Lambert
6b4f84a384
packagist: Fix mypy error after typing added to grouper 2025-02-25 10:54:12 +01:00
Antoine Lambert
bde37867d8 docs: Fix broken external links
Those were spotted thanks to the sphinx linkcheck builder
2025-02-20 10:15:54 +00:00
Antoine Lambert
3771a411ae
tests: Remove no longer needed pytest custom marker named db
This was used at the time we were building debian packages for
swh components but we no longer do that.
2025-02-17 16:29:09 +01:00
Antoine Lambert
db00f23ec0
save_bulk/test_lister: Fix flake8 warnings 2025-02-17 13:44:30 +01:00
Antoine Lambert
4b3a12fe76
maven, sourceforge: Fix mypy errors 2025-02-17 13:44:30 +01:00
Antoine Lambert
edef3b850c
Apply swh-py-template v0.3.3 with copier
Bump development tools: mypy, codespell, isort, ...

Move all tools configuration in pyproject.toml.

Remove no longer needed mypy overrides.
2025-02-17 13:44:24 +01:00
Antoine Lambert
a3d66736a4
maven: Update test that is now failing since beautifulsoup4 4.13
Latest beautifulsoup4 release (4.13) seems to have fixed issues
related to unexpected encodings in XML files so a test that was
passing previously is now failing.

Update that test to check origin URL and visit type can be
successfully extracted from a POM file with unexpected encoding.
2025-02-10 14:28:33 +01:00
Antoine Lambert
4d14e8928b Remove no longer used sql directory 2025-01-22 12:55:35 +00:00
Antoine Lambert
3440881086
bitbucket: Fix request to get next page of buggy page
The bitbucket Web API to list repositories has buggy pages that
need to be skipped to continue the listing.

Previously, the request to get the next page when a buggy page
was detected was missing the after query parameter, so the request
always returned the second page of the repository listing
endpoint.

Also refine buggy page detection by considering all HTTP status
codes >= 500.
2025-01-22 12:04:09 +01:00
Antoine Lambert
fc98bc1035 cli: Replace scheduler temporary backend by memory one
Scheduler temporary backend has been removed in favor of a
more efficient memory backend.
2024-12-11 12:04:45 +01:00
Antoine Lambert
63ca5b50a0 sourceforge: Catch correct ConnectionError exception 2024-11-07 14:25:29 +01:00
Antoine Lambert
a8871bd492 save_bulk: Speed up listing process with multi-threading
Check multiple origins in parallel using the concurrent.futures
module to greatly speed up the whole listing process.

Related to #4709
2024-10-29 11:18:33 +01:00
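Checking origins in parallel with concurrent.futures can be sketched as follows. `check_origins` and `fake_check` are hypothetical stand-ins (the real checks perform network requests), and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Origin = Tuple[str, str]  # (origin URL, visit type)


def check_origins(
    origins: List[Origin],
    check: Callable[[Origin], bool],
    max_workers: int = 4,
) -> Dict[Origin, bool]:
    """Run check on every origin in parallel and map each origin to its result."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so results line up with origins.
        results = list(executor.map(check, origins))
    return dict(zip(origins, results))


def fake_check(origin: Origin) -> bool:
    # Stand-in for a network check: accept only git origins served over https.
    url, visit_type = origin
    return visit_type == "git" and url.startswith("https://")


origins = [
    ("https://example.org/repo.git", "git"),
    ("ftp://example.org/repo", "git"),
    ("https://example.org/project", "svn"),
]
statuses = check_origins(origins, fake_check)
```

Threads suit this workload because each check mostly waits on I/O.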
Antoine Lambert
88a715d0c1 github: Ensure range listers do not override shared lister state
Recent changes in the base Lister class implementation make the call to
self.scheduler.update_lister mandatory to update the last termination
date for a lister.

It has some side effects on the GitHub lister as there is one incremental
instance plus multiple range ones relisting previously discovered repos
executed in parallel.

Range GitHub listers should not override the shared incremental lister
state as StaleData exceptions might be raised otherwise, so override
the set_state_in_scheduler Lister method to ensure that.
2024-10-28 15:37:02 +00:00
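The override described above can be sketched with simplified stand-in classes. Only the method name `set_state_in_scheduler` comes from the commit message; the class names, constructor parameters and the dict standing in for the scheduler are assumptions.

```python
from typing import Dict, Optional


class ListerSketch:
    """Hypothetical base class: pushes its state to a shared store."""

    def __init__(self, shared_state: Dict[str, int]) -> None:
        self.shared_state = shared_state
        self.state: Dict[str, int] = {}

    def set_state_in_scheduler(self) -> None:
        self.shared_state.update(self.state)


class GitHubListerSketch(ListerSketch):
    """Hypothetical lister: range instances must not touch the shared state."""

    def __init__(
        self,
        shared_state: Dict[str, int],
        first_id: Optional[int] = None,
        last_id: Optional[int] = None,
    ) -> None:
        super().__init__(shared_state)
        # A lister bounded by ids relists known repos: it is a range lister.
        self.relisting = first_id is not None or last_id is not None

    def set_state_in_scheduler(self) -> None:
        if self.relisting:
            return  # leave the incremental lister state untouched
        super().set_state_in_scheduler()


shared = {"last_seen_id": 1000}
range_lister = GitHubListerSketch(shared, first_id=0, last_id=500)
range_lister.state = {"last_seen_id": 500}
range_lister.set_state_in_scheduler()
after_range = dict(shared)

incremental = GitHubListerSketch(shared)
incremental.state = {"last_seen_id": 1200}
incremental.set_state_in_scheduler()
```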
David Douard
cccb8c21ff Replace all remaining occurrences of the 'local' cls by 'postgresql'
The former has been deprecated for ages...
2024-10-28 14:35:29 +01:00
Antoine Lambert
eadb704494 pattern: Ensure termination date is set at the end of listing process
Previously it could be set by any call to the `set_state_in_scheduler`
method.

This was leading to side effects on the save bulk lister while updating
the scheduler state when encountering an invalid or not found origin,
and thus the listing failed.

Fixes #4712.
2024-10-24 12:33:40 +02:00
Antoine Lambert
99f64ddbff save_bulk: Ensure high priority scheduling for first visits of origins
Related to swh/devel/swh-scheduler#4687.
2024-10-14 15:04:02 +02:00
Antoine Lambert
0e1093e308 pattern: Add first_visits_queue_prefix parameter to Lister constructor
It allows declaring a lister whose first visits of listed origins must
be scheduled with high priority.

Related to swh/devel/swh-scheduler#4687.
2024-10-14 15:03:42 +02:00
Antoine Lambert
7609ebf7e1 pattern: Store termination date to scheduler database at end of listing
It allows tracking the last lister execution date and will be used to schedule
first visits with high priority for listed origins.

Related to swh/devel/swh-scheduler#4687.
2024-10-14 15:03:28 +02:00
Antoine Lambert
927aebbd0b sourceforge: Also skip ConnectionError when fetching project info
The sourceforge lister sends various HTTP requests to get info about a
project, for instance to get the branch name of a Bazaar project.

If HTTP errors occurred during these steps, they were discarded so that
the listing could continue, but connection errors were not, and as a
consequence the listing failed when encountering such an error.

Currently, the legacy Bazaar project hosted on sourceforge seems down and
connection errors are raised when attempting to fetch branch names, so the
lister does not process all projects as it crashes mid-flight.
2024-09-05 14:52:56 +02:00
Antoine Lambert
af24960bc2 Add save-bulk lister to check origins prior to their insertion in database
This new and special lister verifies a list of origins to archive
provided by users (for instance through the Web API).

Its purpose is to avoid polluting the scheduler database with origins that
cannot be loaded into the archive.

Each origin is identified by a URL and a visit type. For a given visit type
the lister checks whether the origin URL can be found and whether the visit
type is valid.

The supported visit types are those for VCS (bzr, cvs, hg, git and svn) plus
the one for loading a tarball content into the archive.

Accepted origins are inserted or upserted in the scheduler database.

Rejected origins are stored in the lister state.

Related to #4709
2024-09-04 10:42:23 +02:00
Antoine Lambert
6618cf341c Move tarball validation functions from nixguix to utils 2024-09-02 11:29:47 +02:00
David Douard
c0dc8edb05 Make qa tools happy again 2024-08-27 17:40:30 +02:00
David Douard
c6baacbcd7 Apply swh-py-template v0.2.3 2024-08-27 16:25:53 +02:00
Antoine Lambert
5003e6588f crates: Remove crates metadata as loader argument
Those extrinsic metadata can be fetched directly by the loader
through the crates Web API, which also provides more metadata fields.
2024-08-27 12:28:05 +02:00
Antoine Lambert
42e76ee62e crates: Speed up listing by processing crates in batch
Instead of having a single crate and its versions info per page,
prefer to have up to 1000 crates per page to significantly speed up
the listing process.
2024-08-27 12:28:05 +02:00
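Grouping items into pages of up to 1000 can be sketched as follows. The `batched` helper is hypothetical and integers stand in for crate records; the lister may rely on a different utility.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most size items."""
    iterator = iter(items)
    while True:
        page = list(islice(iterator, size))
        if not page:
            return
        yield page


pages = list(batched(range(2500), 1000))
page_sizes = [len(page) for page in pages]
```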
Antoine Lambert
c6aa490fc1 crates: Record lister state only if all crates were processed
Previously, the lister state was recorded regardless of whether errors
occurred when listing crates, as the finalize method is called even when
an exception is raised during listing.

As a consequence some crates could be missed as the incremental listing
restarts from the dump date of the last processed crate database.

So ensure all crates have been processed by the lister before recording
its state.
2024-08-27 12:28:05 +02:00
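The guard described above can be sketched as follows. Only the `finalize` method name comes from the commit message; the class, its attributes and the flag are hypothetical.

```python
from typing import Callable, List, Optional


class CratesListerSketch:
    """Hypothetical lister recording its state only after a full listing."""

    def __init__(self, crates: List[str], dump_date: str) -> None:
        self.crates = crates
        self.dump_date = dump_date
        self.all_crates_processed = False
        self.recorded_state: Optional[str] = None

    def process(self, handle: Callable[[str], None]) -> None:
        for crate in self.crates:
            handle(crate)
        # Only reached when no exception interrupted the loop.
        self.all_crates_processed = True

    def finalize(self) -> None:
        if self.all_crates_processed:
            self.recorded_state = self.dump_date

    def run(self, handle: Callable[[str], None]) -> None:
        try:
            self.process(handle)
        finally:
            self.finalize()  # called even when an exception was raised


def failing_handler(crate: str) -> None:
    raise RuntimeError(f"cannot process {crate}")


ok = CratesListerSketch(["serde", "rand"], "2024-08-27")
ok.run(lambda crate: None)

failed = CratesListerSketch(["serde", "rand"], "2024-08-27")
try:
    failed.run(failing_handler)
except RuntimeError:
    pass
```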
Antoine Lambert
aafaebd5de crates: Use looseversion.LooseVersion2 to parse crate versions
packaging.version.parse is dedicated to parsing Python package version
numbers, but crate versions do not necessarily respect Python version
number conventions and thus some crate versions cannot be parsed.

Prefer to use looseversion.LooseVersion2 instead, which is a drop-in
replacement for the deprecated distutils.version.LooseVersion and can
parse all kinds of version numbers.
2024-08-27 12:28:05 +02:00
Antoine Lambert
b2ece7ca63 crates: Bump csv field size limit
A size limit of 1000000 was not enough to properly process
all CSV crates data so bump to a higher value.
2024-08-27 12:28:05 +02:00
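Raising the limit uses the standard library's `csv.field_size_limit`. In this sketch the new limit and the sample data are assumptions; the commit message does not state the value chosen.

```python
import csv
import io

# Hypothetical new limit; csv.field_size_limit returns the previous one.
previous_limit = csv.field_size_limit(10_000_000)

# A single field larger than the former 1000000 limit.
big_field = "x" * 2_000_000
data = io.StringIO(f"name,readme\nserde,{big_field}\n")
rows = list(csv.DictReader(data))
```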
Nicolas Dandrimont
f7abfafffe GitHub: record whether the origin is a fork
For now this information is not used downstream, but it can be useful
for specific analysis or one-shot scheduling.
2024-07-18 10:45:06 +02:00
Antoine Lambert
a7607abcf9 tests: Fix mocking of sleep calls with tenacity 8.4.2
Latest tenacity release adds some internal changes that broke the
mocking of sleep calls in tests.

Fix it by directly mocking time.sleep (was not working previously).
2024-06-28 18:15:36 +02:00
Antoine Lambert
323e277482 gitea, gogs: Ensure query parameters are not duplicated in API URLs
The Gitea API returns the next pagination link with all query parameters
provided to an API request.

As we were also passing a dict of fixed query parameters to the page_request
method, some query parameters ended up having multiple instances in the URL
for fetching a new page of repositories data. So each time a new page was
requested, new instances of these parameters were appended to the URL which
could result in a really long URL if the number of pages to retrieve is high
and make the request fail.

Also remove a debug log already present in http_request method.
2024-06-05 15:27:58 +02:00
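One way to avoid duplicated query parameters is to merge the fixed parameters into the URL's existing query instead of appending them. This is a sketch only: `merge_query_params` and the example URL are hypothetical, and the actual fix may simply stop passing the fixed parameters on follow-up requests.

```python
from typing import Dict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def merge_query_params(url: str, params: Dict[str, object]) -> str:
    """Add params to url without duplicating keys already in its query."""
    parts = urlsplit(url)
    merged = dict(parse_qsl(parts.query))
    for key, value in params.items():
        # Values already present in the URL (e.g. from a next link) win.
        merged.setdefault(key, str(value))
    return urlunsplit(parts._replace(query=urlencode(merged)))


fixed_params = {"sort": "id", "order": "asc", "limit": 50}
first_url = merge_query_params(
    "https://gitea.example.org/api/v1/repos/search", fixed_params
)
next_link = (
    "https://gitea.example.org/api/v1/repos/search"
    "?limit=50&order=asc&page=2&sort=id"
)
next_url = merge_query_params(next_link, fixed_params)
```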
Antoine Lambert
aaae1a6b0b launchpad, npm: Port code to updated swh-scheduler API
The oldest part of the scheduler API was updated to use model classes
(based on attr package) instead of dictionaries in order to improve
typing.
2024-05-22 17:44:00 +02:00
Antoine Lambert
e51b808d72 nixguix: Ensure to not use a redirection URL as an origin URL
Redirection URLs can be long and quite obscure in some cases (GitHub CDN
for instance), so make sure to use the redirected URL, not the redirection
target, as the origin URL.

Related to swh/meta#5090.
2024-04-24 14:25:48 +02:00
Antoine Lambert
41407e0eff Use beautifulsoup4 CSS selectors to simplify code and type checking
Because the types-beautifulsoup4 package gets installed in the swh
virtualenv as a swh-scanner test dependency, mypy reported some errors
related to beautifulsoup4 typing.

As the returned type for the find method of bs4 is the following union:
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing which is not great.

So prefer to use the select_one method instead where a simple None check
must be done to ensure typing is correct as it is returning Optional[Tag].
In a similar manner, replace use of find_all method by select method.

It also has the advantage of simplifying the code.
2024-04-16 11:22:51 +02:00
David Douard
e6a35c55b0 Apply swh-py-template v0.2.0 2024-03-29 13:55:23 +01:00
Antoine Lambert
fdeb086f77 nixguix: Handle creation of svn-export visit types on svn sub-trees
Some Guix packages correspond to subset exports of a subversion source
tree at a given revision, typically the Tex Live ones.

In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export but also use a unique origin URL
for each package to archive as otherwise the same one would be used
and only a single package would be archived.

Related to swh/infra/sysadm-environment#5263.
2024-03-14 16:23:32 +01:00
Antoine Lambert
b083b4f1f9 pytest: Fix tests execution with pytest 8.1
Remove use of --import-mode=importlib pytest option and use
new option consider_namespace_packages to fix tests execution
with latest pytest release.
2024-03-13 10:58:03 +01:00
Antoine Lambert
329cb2e44a requirements-test: Add missing swh-scheduler[testing] dependency
It fixes installation of dependencies required by swh-scheduler pytest plugin.
2024-03-13 10:56:47 +01:00
Antoine Lambert
32be94a89b tox: Bump mypy to 1.8.0
Related to swh/meta#5075.
2024-02-05 16:14:17 +01:00
Antoine Lambert
65e51e2925 nixguix: Update heuristic checking if URL targets a tarball file
In addition to query parameters also check if any part of URL path
contains a tarball filename.

It fixes the detection of some tarball URLs provided in Guix manifest.

Related to swh/meta#3781.
2024-01-18 15:07:11 +01:00
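A heuristic of this kind can be sketched as follows. The suffix list and function names are assumptions, not the lister's actual rules.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical, non-exhaustive list of tarball filename suffixes.
TARBALL_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def looks_like_tarball(name: str) -> bool:
    return name.lower().endswith(TARBALL_SUFFIXES)


def url_targets_tarball(url: str) -> bool:
    """Check query parameter values and every path part for a tarball name."""
    parts = urlsplit(url)
    path_parts = [part for part in parts.path.split("/") if part]
    query_values = [value for _, value in parse_qsl(parts.query)]
    return any(looks_like_tarball(c) for c in path_parts + query_values)


in_query = url_targets_tarball(
    "https://example.org/download.php?file=project-1.0.tar.gz"
)
in_path = url_targets_tarball("https://example.org/project-1.0.tar.gz/download")
not_tarball = url_targets_tarball("https://example.org/project.git")
```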
David Douard
ed8de05eea Remove the outdated list of swh.lister submodules from the readme
Link to the user documentation instead.

Also add a section on required binary tools.
2024-01-17 18:05:58 +01:00