928 commits

Author SHA1 Message Date
David Douard
c0dc8edb05 Make qa tools happy again 2024-08-27 17:40:30 +02:00
David Douard
c6baacbcd7 Apply swh-py-template v0.2.3 2024-08-27 16:25:53 +02:00
Antoine Lambert
5003e6588f crates: Remove crates metadata as loader argument
Those extrinsic metadata can be directly fetched by the loader
through the crates Web API, plus it contains more metadata fields.
2024-08-27 12:28:05 +02:00
Antoine Lambert
42e76ee62e crates: Speedup listing by processing crates in batch
Instead of having a single crate and its versions info per page,
prefer to have up to 1000 crates per page to significantly speedup
the listing process.
2024-08-27 12:28:05 +02:00
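The batching idea can be sketched as follows. This is only an illustration, not the lister's actual code: the `paginate` helper and the generated crate names are hypothetical, and only the page size of 1000 comes from the commit message.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def paginate(items: Iterable[str], page_size: int = 1000) -> Iterator[List[str]]:
    """Yield successive pages holding at most page_size items each."""
    iterator = iter(items)
    while True:
        page = list(islice(iterator, page_size))
        if not page:
            return
        yield page


# Hypothetical crate names standing in for the real crates database.
crate_names = [f"crate-{i}" for i in range(2500)]
pages = list(paginate(crate_names))
page_sizes = [len(page) for page in pages]
print(page_sizes)
```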
Antoine Lambert
c6aa490fc1 crates: Record lister state only if all crates were processed
Previously, the lister state was recorded even if errors occurred
when listing crates, as the finalize method is called regardless of
any exception raised during listing.

As a consequence some crates could be missed as the incremental listing
restarts from the dump date of the last processed crate database.

So ensure all crates have been processed by the lister before recording
its state.
2024-08-27 12:28:05 +02:00
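A minimal sketch of the pattern described above, assuming a simplified lister. The `SketchLister` class, its attributes and the recorded value are all hypothetical and do not mirror the swh-lister class hierarchy; the point is only that finalize always runs but records state solely after a complete listing.

```python
class SketchLister:
    """Hypothetical lister, used only to illustrate the pattern."""

    def __init__(self, pages):
        self.pages = pages
        self.all_processed = False
        self.recorded_state = None

    def run(self):
        try:
            for page in self.pages:
                self.process(page)
            # Only reached when no exception interrupted the loop.
            self.all_processed = True
        finally:
            # Called whether or not an exception was raised.
            self.finalize()

    def process(self, page):
        if page is None:
            raise ValueError("could not process page")

    def finalize(self):
        # Record the state only if every page was processed.
        if self.all_processed:
            self.recorded_state = "dump date of last processed database"


complete = SketchLister(pages=[["a"], ["b"]])
complete.run()

interrupted = SketchLister(pages=[["a"], None])
try:
    interrupted.run()
except ValueError:
    pass  # the listing failed, so no state must have been recorded
```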
Antoine Lambert
aafaebd5de crates: Use looseversion.LooseVersion2 to parse crate versions
packaging.version.parse is dedicated to parse Python package version
numbers but crate versions do not necessarily respect Python version
number conventions and thus some crate versions cannot be parsed.

Prefer to use looseversion.LooseVersion2 instead, which is a drop-in
replacement for the deprecated distutils.version.LooseVersion and can
parse all kinds of version numbers.
2024-08-27 12:28:05 +02:00
Antoine Lambert
b2ece7ca63 crates: Bump csv field size limit
A size limit of 1000000 was not enough to properly process
all CSV crates data so bump to a higher value.
2024-08-27 12:28:05 +02:00
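Raising the limit goes through the standard library's csv.field_size_limit, which returns the previous limit when given a new one. The sketch below is illustrative only: the 10000000 value and the in-memory CSV data are assumptions, since the commit message does not state the new limit.

```python
import csv
import io

# A single field larger than the 1000000 limit mentioned in the commit message.
big_field = "x" * 2_000_000
data = io.StringIO("name,description\nsome-crate," + big_field + "\n")

# Hypothetical higher limit; the call returns the limit that was in effect before.
previous_limit = csv.field_size_limit(10_000_000)
try:
    rows = list(csv.reader(data))
finally:
    # Restore the previous limit so other code is unaffected.
    csv.field_size_limit(previous_limit)

print(len(rows), len(rows[1][1]))
```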
Nicolas Dandrimont
f7abfafffe GitHub: record whether the origin is a fork
For now this information is not used downstream, but it can be useful
for specific analysis or one-shot scheduling.
2024-07-18 10:45:06 +02:00
Antoine Lambert
a7607abcf9 tests: Fix mocking of sleep calls with tenacity 8.4.2
Latest tenacity release adds some internal changes that broke the
mocking of sleep calls in tests.

Fix it by directly mocking time.sleep (was not working previously).
2024-06-28 18:15:36 +02:00
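Mocking time.sleep directly can be sketched with the standard library alone. The example does not use tenacity: `fetch_with_retries` and `flaky_fetch` are hypothetical stand-ins for code that sleeps between retries, so this only shows the mocking approach, not the actual test fixture.

```python
import time
from unittest import mock


def fetch_with_retries(fetch, attempts=3, delay=10.0):
    """Call fetch, sleeping between failed attempts (hypothetical helper)."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)


calls = {"count": 0}


def flaky_fetch():
    # Fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Patching time.sleep keeps the test from actually waiting.
with mock.patch("time.sleep") as sleep_mock:
    result = fetch_with_retries(flaky_fetch)

print(result, sleep_mock.call_count)
```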
Antoine Lambert
323e277482 gitea, gogs: Ensure query parameters are not duplicated in API URLs
The Gitea API returns the next pagination link with all query parameters
provided to an API request.

As we were also passing a dict of fixed query parameters to the page_request
method, some query parameters ended up having multiple instances in the URL
for fetching a new page of repositories data. So each time a new page was
requested, new instances of these parameters were appended to the URL which
could result in a really long URL if the number of pages to retrieve is high
and make the request fail.

Also remove a debug log already present in http_request method.
2024-06-05 15:27:58 +02:00
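One way to avoid duplicated query parameters is to merge the fixed parameters into the ones already present in the pagination link. The sketch below uses only urllib.parse; the `merge_query_params` helper, the example URL and the parameter names are hypothetical and not taken from the lister code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def merge_query_params(url, params):
    """Add params to url unless the URL already carries them (hypothetical helper)."""
    parts = urlsplit(url)
    merged = dict(parse_qsl(parts.query))
    for key, value in params.items():
        # Parameters already in the pagination link take precedence.
        merged.setdefault(key, str(value))
    return urlunsplit(parts._replace(query=urlencode(merged)))


next_link = "https://gitea.example.org/api/v1/repos/search?limit=50&page=2&sort=id"
fixed_params = {"sort": "id", "limit": 50}
merged_url = merge_query_params(next_link, fixed_params)
print(merged_url)
```

However many pages are followed, each parameter then appears once in the request URL.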
Antoine Lambert
aaae1a6b0b launchpad, npm: Port code to updated swh-scheduler API
The oldest part of the scheduler API was updated to use model classes
(based on attr package) instead of dictionaries in order to improve
typing.
2024-05-22 17:44:00 +02:00
Antoine Lambert
e51b808d72 nixguix: Ensure to not use a redirection URL as an origin URL
Redirection URLs can be long and quite obscure in some cases (GitHub CDN
for instance) so ensure to use the redirected URL as origin URL.

Related to swh/meta#5090.
2024-04-24 14:25:48 +02:00
Antoine Lambert
41407e0eff Use beautifulsoup4 CSS selectors to simplify code and type checking
Because the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors related to
beautifulsoup4 typing were reported.

As the returned type for the find method of bs4 is the following union:
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing which is not great.

So prefer to use the select_one method instead where a simple None check
must be done to ensure typing is correct as it is returning Optional[Tag].
In a similar manner, replace use of find_all method by select method.

This also simplifies the code.
2024-04-16 11:22:51 +02:00
David Douard
e6a35c55b0 Apply swh-py-template v0.2.0 2024-03-29 13:55:23 +01:00
Antoine Lambert
fdeb086f77 nixguix: Handle creation of svn-export visit types on svn sub-trees
Some Guix packages correspond to subset exports of a subversion source
tree at a given revision, typically the TeX Live ones.

In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export but also use a unique origin URL
for each package to archive as otherwise the same one would be used
and only a single package would be archived.

Related to swh/infra/sysadm-environment#5263.
2024-03-14 16:23:32 +01:00
Antoine Lambert
b083b4f1f9 pytest: Fix tests execution with pytest 8.1
Remove use of --import-mode=importlib pytest option and use
new option consider_namespace_packages to fix tests execution
with latest pytest release.
2024-03-13 10:58:03 +01:00
Antoine Lambert
329cb2e44a requirements-test: Add missing swh-scheduler[testing] dependency
It fixes installation of dependencies required by swh-scheduler pytest plugin.
2024-03-13 10:56:47 +01:00
Antoine Lambert
32be94a89b tox: Bump mypy to 1.8.0
Related to swh/meta#5075.
2024-02-05 16:14:17 +01:00
Antoine Lambert
65e51e2925 nixguix: Update heuristic checking if URL targets a tarball file
In addition to query parameters also check if any part of URL path
contains a tarball filename.

It fixes the detection of some tarball URLs provided in Guix manifest.

Related to swh/meta#3781.
2024-01-18 15:07:11 +01:00
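A heuristic of this kind can be sketched with urllib.parse. This is an assumption-laden illustration, not the nixguix lister's implementation: the `looks_like_tarball` function, the suffix list and the example URLs are all hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical list of archive suffixes; the real lister may use a different set.
TARBALL_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")


def looks_like_tarball(url):
    """Return True if any path segment or query value ends like an archive name."""
    parts = urlsplit(url)
    candidates = [segment for segment in parts.path.split("/") if segment]
    candidates += [value for _, value in parse_qsl(parts.query)]
    return any(c.lower().endswith(TARBALL_SUFFIXES) for c in candidates)


# Archive filename in the middle of the path, not at its end.
print(looks_like_tarball("https://example.org/downloads/foo-1.0.tar.gz/download"))
# Archive filename passed as a query parameter.
print(looks_like_tarball("https://example.org/get?file=foo-1.0.tgz"))
# A plain repository URL.
print(looks_like_tarball("https://example.org/project.git"))
```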
David Douard
ed8de05eea Remove the outdated list of swh.lister submodules from the readme
Link to the user documentation instead.

Also add a section on required binary tools.
2024-01-17 18:05:58 +01:00
Jérémy Bobbio (Lunar)
d70dd84939 Fix the listing of listers
Commit c2402f405f renamed the entry points from `lister.*` without
updating the rest of the framework. Revert the changes (and sort the
list alphabetically).
2024-01-10 17:46:23 +01:00
Franck Bret
82ee095128 Elm stateful lister
Use another API endpoint that helps the lister to be stateful.
The API endpoint used needs a ``since`` value that represents a
sequential index in the history.
The ``all_packages_count`` state stores a count which will be
used as the ``since`` argument on the next run.
2024-01-09 14:05:56 +01:00
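The role of ``all_packages_count`` can be sketched as follows. Everything except the ``since`` and ``all_packages_count`` names is hypothetical: `fetch_since` stands in for the HTTP call, and a plain list plays the part of the registry history.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ElmStateSketch:
    """Hypothetical lister state holding the count seen so far."""
    all_packages_count: int = 0


def fetch_since(history: List[str], since: int) -> List[str]:
    # Stand-in for the API call: return entries after index `since`.
    return history[since:]


def list_new_packages(history: List[str], state: ElmStateSketch) -> List[str]:
    new_packages = fetch_since(history, state.all_packages_count)
    # Remember how many entries were seen, to use as `since` on the next run.
    state.all_packages_count += len(new_packages)
    return new_packages


state = ElmStateSketch()
registry = ["elm/core", "elm/html"]
first_run = list_new_packages(registry, state)

registry.append("elm/json")
second_run = list_new_packages(registry, state)
print(first_run, second_run, state.all_packages_count)
```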
Franck Bret
4b1f49ac22 Adapt and rebase
'url' and 'instance' are mandatory
Add elm lister entry to pyproject.toml
2024-01-09 14:05:56 +01:00
Franck Bret
3a1beae36e Elm Lister
The Elm Lister lists Elm package origins from the Elm
lang registry.
It uses an HTTP API endpoint to list package origins.
Origins are GitHub repositories; releases take advantage
of the GitHub release API.
2024-01-09 14:05:56 +01:00
Antoine Lambert
f814e1179d nixguix: Exploit new submodule info in sources.json from Guix
Guix now provides a "submodule" info in the sources.json file it
produces, so exploit it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
it is required.

Related to swh/devel/swh-loader-git#4751.
2024-01-08 16:11:02 +01:00
Franck Bret
99bbd9d68f Stateful Julia lister
Add a state to the lister to store the ``last_seen_commit`` as a Git
commit hash.

Use Dulwich to retrieve a Git commit walker since
``last_seen_commit`` if any.
For each commit, detect whether it is a new package or a new package
version commit and return its origin with the commit date as
last_update.
2023-12-18 16:02:22 +01:00
David Douard
053f0a93d5 Add latest blackify to git-blame-ignore-revs 2023-12-05 14:04:51 +01:00
David Douard
714fccc3c7 python: Fix black formatting after bump to 23.1.0 in pre-commit 2023-12-05 10:33:07 +01:00
David Douard
ac52cfed21 Apply swh-py-template 0.1.6 2023-12-03 17:54:52 +01:00
Antoine Lambert
e4c707d807 pytest.ini: Ensure '--import-mode importlib' option is always used
Fix hanging test when executed outside tox.
2023-12-01 14:43:03 +01:00
David Douard
c2402f405f Migrate to copier-based swh-py-template 2023-11-29 17:23:28 +01:00
David Douard
553884fa56 docs: include the README file in the main index page
Convert README from markdown to ReST to make it embeddable in
docs/index.rst
2023-11-16 16:25:56 +01:00
David Douard
a9b2980f14 Fix pygment language declaration in the README file 2023-11-15 17:35:39 +01:00
Nicolas Dandrimont
4bcf4a4147 swh-core's github extra isn't needed anymore 2023-11-14 19:25:13 +01:00
Antoine Lambert
4aee4da784 cran: Use pyreadr instead of rpy2 to read a RDS file from Python
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read a RDS file from Python instead of rpy2.

As swh-lister was still packaged for debian at the time, the choice of using
rpy2 instead was made as a debian package is available for it while it is not
for pyreadr.

Now that Debian packaging has been dropped for swh-lister, we can reinstate the
pyreadr-based implementation, which has the advantages of being faster and not
depending on the R language runtime.

Related to swh/meta#1709.
2023-11-14 17:09:42 +01:00
Antoine Lambert
42d8e24d7e arch/lister: Drop artifact size approximation from the listing
That fails the current loader ingestion as this must be an exact value (when provided,
it's checked against the download operation).

Refs. swh/infra/sysadm-environment#4746
2023-11-14 10:40:40 +01:00
Antoine Lambert
2eb3223496 cli: Print lister stats at the end of the run command
Display the number of processed pages and listed origins after the
listing process ended.
2023-11-07 19:00:53 +01:00
Antoine Lambert
7092e4e4ac cli: Use temporary scheduler as fallback when no configuration detected
In order to simplify the testing of listers, allow to call the run command
of swh-lister CLI without scheduler configuration. In that case a temporary
scheduler instance with a postgresql backend is created and used.

This makes it easy to test a lister with the following command:

$ swh -l DEBUG lister run <lister_name> url=<forge_url>
2023-11-07 19:00:53 +01:00
Jérémy Bobbio (Lunar)
7344d264e7 Ensure HTTPError.response is not None
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` property will always be set. So we need to ensure it is
not `None` before looking for the return code, for example.

This also makes mypy checks pass again, as `types-request` was updated
in 2.31.0.9 to better match this particular aspect. See:
https://github.com/python/typeshed/pull/10875
2023-10-18 10:41:57 +02:00
Franck Bret
968ddef295 Improve registry repository management
Ensure the registry path does not exist before cloning the repository.
2023-10-12 14:31:48 +02:00
Franck Bret
360fa753ef Remove useless triple single quote from bash script 2023-10-09 15:15:21 +02:00
Franck Bret
7f97c2da67 Use a temp directory instead of /tmp 2023-10-09 15:05:25 +02:00
Franck Bret
1984037fe1 Replace obsolete comment, improve docstring 2023-10-09 15:05:25 +02:00
Franck Bret
3e414c5397 url and instance are now mandatory (related #501) 2023-10-09 15:05:25 +02:00
Franck Bret
f8cfa05f3f Add Julia Lister for listing Julia Packages
This module introduces the Julia Lister.
It retrieves Julia package origins from the Julia General Registry, a Git
repository made of per-package directories with TOML definition files.
2023-10-09 15:05:25 +02:00
Antoine Lambert
7b932f46b5 gitweb: Add optional base_git_url parameter to lister
Similar to cgit, there are cases where git clone URLs for projects hosted
on a gitweb instance cannot be found when scraping project pages or cannot
be easily derived from the gitweb instance root URL.

So add an optional base_git_url parameter enabling to compute correct clone
URLs by appending project names to it.
2023-10-02 14:56:04 +02:00
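Computing a clone URL from such a base URL amounts to a careful string join. The sketch below is only an illustration: the `clone_url` helper and the example host are hypothetical, and only the base_git_url parameter name comes from the commit message.

```python
def clone_url(base_git_url, project_name):
    """Append a project name to a base URL with exactly one slash in between."""
    return base_git_url.rstrip("/") + "/" + project_name.lstrip("/")


# The result is the same whether or not the base URL ends with a slash.
with_slash = clone_url("https://git.example.org/git/", "tools/project.git")
without_slash = clone_url("https://git.example.org/git", "tools/project.git")
print(with_slash)
```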
Antoine Lambert
59a979642f gitweb: Ensure to strip any prefix before git clone URL
Some gitweb instances can have some string prefixes before the displayed
git clone URLs so ensure to strip them to properly extract URLs.

Related to swh/infra/sysadm-environment#5051.
2023-10-02 14:54:41 +02:00
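Stripping such prefixes can be sketched with a regular expression that keeps only the URL part. The pattern, the `extract_clone_url` helper and the sample strings are assumptions made for illustration; the lister may extract URLs differently.

```python
import re

# Hypothetical pattern: a URL scheme followed by any run of non-space characters.
URL_PATTERN = re.compile(r"(?:https?|git|ssh)://\S+")


def extract_clone_url(text):
    """Return the first URL found in text, ignoring anything before it."""
    match = URL_PATTERN.search(text)
    return match.group(0) if match else None


print(extract_clone_url("Clone: https://git.example.org/project.git"))
print(extract_clone_url("git://git.example.org/project.git"))
print(extract_clone_url("no URL displayed here"))
```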
Kumar Shivendu
88611642fc Introduce bioconductor lister 2023-09-28 12:54:37 +00:00
Antoine Lambert
a04975571c gitweb: Remove invalid use of str.rstrip
rstrip is not a method to remove a string suffix so use another
way to extract gitweb project name.

It fixes the computation of some gitweb origin URLs.

Related to swh/infra/sysadm-environment#5050.
2023-09-26 14:53:57 +02:00
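The pitfall is that str.rstrip treats its argument as a set of characters to strip, not as a suffix. The short example below shows the difference on a made-up project name; str.removesuffix requires Python 3.9 or later, and the commit itself may have used a different approach.

```python
project = "widget.git"

# rstrip removes every trailing character found in the set {'.', 'g', 'i', 't'},
# so it also eats the final "t" of "widget".
wrong = project.rstrip(".git")

# removesuffix removes the exact suffix once, if present.
right = project.removesuffix(".git")

print(wrong)  # widge
print(right)  # widget
```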
Antoine Lambert
aa7b3fa7d8 rpm: Add config for listing EPEL source packages
Extra Packages for Enterprise Linux is a set of additional
community-maintained packages that can be installed on many Red Hat based
distributions.
2023-09-25 11:40:47 +02:00