Commit graph

911 commits

Author SHA1 Message Date
Antoine Lambert
32be94a89b tox: Bump mypy to 1.8.0
Related to swh/meta#5075.
2024-02-05 16:14:17 +01:00
Antoine Lambert
65e51e2925 nixguix: Update heuristic checking if URL targets a tarball file
In addition to query parameters also check if any part of URL path
contains a tarball filename.

It fixes the detection of some tarball URLs provided in Guix manifest.

Related to swh/meta#3781.
2024-01-18 15:07:11 +01:00
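The heuristic described in 65e51e2925 can be sketched as follows. This is a minimal illustration, not the lister's actual code: the function names and the extension list are assumptions made for the example.

```python
from urllib.parse import parse_qs, urlparse

# Hypothetical extension list; the real lister may recognize other formats.
TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".tar", ".zip")


def looks_like_tarball(name: str) -> bool:
    """Return True if the given name ends with a known archive extension."""
    return name.lower().endswith(TARBALL_EXTENSIONS)


def url_targets_tarball(url: str) -> bool:
    """Check the query parameters and every part of the URL path."""
    parsed = urlparse(url)
    # Any path segment may hold the tarball filename, not only the last one.
    path_parts = [part for part in parsed.path.split("/") if part]
    if any(looks_like_tarball(part) for part in path_parts):
        return True
    # Fall back to the values of the query parameters.
    for values in parse_qs(parsed.query).values():
        if any(looks_like_tarball(value) for value in values):
            return True
    return False
```

Checking every path segment covers URLs where the filename is followed by extra path components, which a check on the last segment alone would miss.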
David Douard
ed8de05eea Remove the outdated list of swh.lister submodules from the readme
Link to the user documentation instead.

Also add a section on required binary tools.
2024-01-17 18:05:58 +01:00
Jérémy Bobbio (Lunar)
d70dd84939 Fix the listing of listers
Commit c2402f405f renamed the entry points from `lister.*` without
updating the rest of the framework. Revert the changes (and sort the
list alphabetically).
2024-01-10 17:46:23 +01:00
Franck Bret
82ee095128 Elm stateful lister
Use another API endpoint that helps the lister to be stateful.
The API endpoint used needs a ``since`` value that represents a
sequential index in the history.
The ``all_packages_count`` state stores a count which is then
used as the ``since`` argument on the next run.
2024-01-09 14:05:56 +01:00
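The ``since`` / ``all_packages_count`` mechanism described in 82ee095128 can be sketched as below. The class name, the function name and the in-memory stand-in for the API endpoint are all assumptions made for illustration; only ``since`` and ``all_packages_count`` come from the commit message.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ElmListerState:
    """Hypothetical state object: count of packages seen so far."""

    all_packages_count: int = 0


def list_new_packages(
    state: ElmListerState, fetch_since: Callable[[int], List[str]]
) -> List[str]:
    """Fetch packages added after the stored index, then advance the state."""
    new_packages = fetch_since(state.all_packages_count)
    # The updated count is used as the ``since`` argument on the next run.
    state.all_packages_count += len(new_packages)
    return new_packages


# Stand-in for the remote endpoint: a list indexed by ``since``.
history = ["elm/core", "elm/json"]
state = ElmListerState()
first_run = list_new_packages(state, lambda since: history[since:])
history.append("elm/http")
second_run = list_new_packages(state, lambda since: history[since:])
```

With this stand-in, the first run returns both initial packages and the second run returns only the package added in between.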
Franck Bret
4b1f49ac22 Adapt and rebase
'url' and 'instance' are mandatory
Add elm lister entry to pyproject.toml
2024-01-09 14:05:56 +01:00
Franck Bret
3a1beae36e Elm Lister
The Elm Lister lists Elm package origins from the Elm
lang registry.
It uses an HTTP API endpoint to list package origins.
Origins are GitHub repositories; releases take advantage
of the GitHub release API.
2024-01-09 14:05:56 +01:00
Antoine Lambert
f814e1179d nixguix: Exploit new submodule info in sources.json from Guix
Guix now provides a "submodule" info in the sources.json file it
produces, so exploit it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
it is required.

Related to swh/devel/swh-loader-git#4751.
2024-01-08 16:11:02 +01:00
Franck Bret
99bbd9d68f Stateful Julia lister
Add a state to the lister to store the ``last_seen_commit`` as a Git
commit hash.

Use Dulwich to retrieve a Git commit walker since
``last_seen_commit`` if any.
For each commit, detect whether it is a new package or a new package
version commit and return its origin with the commit date as
last_update.
2023-12-18 16:02:22 +01:00
David Douard
053f0a93d5 Add latest blackify to git-blame-ignore-revs 2023-12-05 14:04:51 +01:00
David Douard
714fccc3c7 python: Fix black formatting after bump to 23.1.0 in pre-commit 2023-12-05 10:33:07 +01:00
David Douard
ac52cfed21 Apply swh-py-template 0.1.6 2023-12-03 17:54:52 +01:00
Antoine Lambert
e4c707d807 pytest.ini: Ensure '--import-mode importlib' option is always used
Fix hanging test when executed outside tox.
2023-12-01 14:43:03 +01:00
David Douard
c2402f405f Migrate to copier-based swh-py-template 2023-11-29 17:23:28 +01:00
David Douard
553884fa56 docs: include the README file in the main index page
Convert README from markdown to ReST to make it embeddable in
docs/index.rst
2023-11-16 16:25:56 +01:00
David Douard
a9b2980f14 Fix pygment language declaration in the README file 2023-11-15 17:35:39 +01:00
Nicolas Dandrimont
4bcf4a4147 swh-core's github extra isn't needed anymore 2023-11-14 19:25:13 +01:00
Antoine Lambert
4aee4da784 cran: Use pyreadr instead of rpy2 to read a RDS file from Python
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read a RDS file from Python instead of rpy2.

As swh-lister was still packaged for debian at the time, the choice of using
rpy2 instead was made as a debian package is available for it while it is not
for pyreadr.

Now that debian packaging has been dropped for swh-lister, we can reinstate the
pyreadr based implementation, which has the advantages of being faster and not
depending on the R language runtime.

Related to swh/meta#1709.
2023-11-14 17:09:42 +01:00
Antoine Lambert
42d8e24d7e arch/lister: Drop artifact size approximation from the listing
That fails the current loader ingestion as this must be an exact value (when provided,
it's checked against the download operation).

Refs. swh/infra/sysadm-environment#4746
2023-11-14 10:40:40 +01:00
Antoine Lambert
2eb3223496 cli: Print lister stats at the end of the run command
Display the number of processed pages and listed origins after the
listing process ended.
2023-11-07 19:00:53 +01:00
Antoine Lambert
7092e4e4ac cli: Use temporary scheduler as fallback when no configuration detected
In order to simplify the testing of listers, allow calling the run command
of swh-lister CLI without scheduler configuration. In that case a temporary
scheduler instance with a postgresql backend is created and used.

This makes it easy to test a lister with the following command:

$ swh -l DEBUG lister run <lister_name> url=<forge_url>
2023-11-07 19:00:53 +01:00
Jérémy Bobbio (Lunar)
7344d264e7 Ensure HTTPError.response is not None
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` property will always be set. So we need to ensure it is
not `None` before looking for the return code, for example.

This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better match this particular aspect. See:
https://github.com/python/typeshed/pull/10875
2023-10-18 10:41:57 +02:00
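The guard described in 7344d264e7 can be sketched as below. To keep the example self-contained, `Response` and `HTTPError` are small stand-ins that only mimic the optional `response` attribute of their `requests` counterparts; `is_not_found` is a hypothetical helper, not a function from swh-lister.

```python
from typing import Optional


class Response:
    """Stand-in for requests.Response (only the attribute used here)."""

    def __init__(self, status_code: int) -> None:
        self.status_code = status_code


class HTTPError(Exception):
    """Stand-in for requests.HTTPError: ``response`` may be None."""

    def __init__(self, message: str, response: Optional[Response] = None) -> None:
        super().__init__(message)
        self.response = response


def is_not_found(error: HTTPError) -> bool:
    """Check the status code only when a response is attached to the error."""
    return error.response is not None and error.response.status_code == 404
```

Testing `error.response is not None` first avoids an `AttributeError` at runtime and also lets a type checker narrow the `Optional` type before the attribute access.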
Franck Bret
968ddef295 Improve registry repository management
Ensure the registry path does not exist before cloning the repository.
2023-10-12 14:31:48 +02:00
Franck Bret
360fa753ef Remove useless triple single quote from bash script 2023-10-09 15:15:21 +02:00
Franck Bret
7f97c2da67 Use a temp directory instead of /tmp 2023-10-09 15:05:25 +02:00
Franck Bret
1984037fe1 Replace obsolete comment, improve docstring 2023-10-09 15:05:25 +02:00
Franck Bret
3e414c5397 url and instance are now mandatory (related #501) 2023-10-09 15:05:25 +02:00
Franck Bret
f8cfa05f3f Add Julia Lister for listing Julia Packages
This module introduces the Julia Lister.
It retrieves Julia package origins from the Julia General Registry, a Git
repository made of per-package directories with TOML definition files.
2023-10-09 15:05:25 +02:00
Antoine Lambert
7b932f46b5 gitweb: Add optional base_git_url parameter to lister
Similar to cgit, there are cases where git clone URLs for projects hosted
on a gitweb instance cannot be found when scraping project pages or cannot
be easily derived from the gitweb instance root URL.

So add an optional base_git_url parameter enabling to compute correct clone
URLs by appending project names to it.
2023-10-02 14:56:04 +02:00
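The way 7b932f46b5 computes clone URLs from base_git_url can be illustrated with a short sketch. The function name and the slash handling are assumptions; the commit message only states that project names are appended to the base URL.

```python
def build_clone_url(base_git_url: str, project_name: str) -> str:
    """Append a gitweb project name to the base git URL."""
    # Normalize slashes so that exactly one separates the two parts.
    return base_git_url.rstrip("/") + "/" + project_name.lstrip("/")
```

For example, `build_clone_url("https://git.example.org/", "foo/bar.git")` gives `https://git.example.org/foo/bar.git`.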
Antoine Lambert
59a979642f gitweb: Ensure to strip any prefix before git clone URL
Some gitweb instances can have some string prefixes before the displayed
git clone URLs so ensure to strip them to properly extract URLs.

Related to swh/infra/sysadm-environment#5051.
2023-10-02 14:54:41 +02:00
Kumar Shivendu
88611642fc Introduce bioconductor lister 2023-09-28 12:54:37 +00:00
Antoine Lambert
a04975571c gitweb: Remove invalid use of str.rstrip
rstrip is not a method to remove a string suffix so use another
way to extract gitweb project name.

It fixes the computation of some gitweb origin URLs.

Related to swh/infra/sysadm-environment#5050.
2023-09-26 14:53:57 +02:00
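The pitfall fixed in a04975571c is that `str.rstrip` removes any trailing characters found in its argument, rather than removing the argument as a suffix. The sketch below shows the difference; `strip_suffix` is a hypothetical helper and not necessarily the approach taken in the commit.

```python
def strip_suffix(value: str, suffix: str) -> str:
    """Remove ``suffix`` from the end of ``value`` if present."""
    if suffix and value.endswith(suffix):
        return value[: -len(suffix)]
    return value


# rstrip treats ".git" as the character set {".", "g", "i", "t"} and keeps
# stripping while the last character belongs to that set.
broken = "gadget.git".rstrip(".git")
fixed = strip_suffix("gadget.git", ".git")
```

Here `broken` is `"gadge"` because the trailing `t` of the project name is also stripped, while `fixed` is `"gadget"`. On Python 3.9 and later, `str.removesuffix` provides the same behavior as the helper.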
Antoine Lambert
aa7b3fa7d8 rpm: Add config for listing EPEL source packages
Extra Packages for Enterprise Linux is a set of additional,
community-maintained packages that can be installed on many Red Hat based
distributions.
2023-09-25 11:40:47 +02:00
Franck Bret
bf806f2c7b Remove spurious space 2023-09-21 09:18:54 +02:00
Franck Bret
ebba50882f Revert "Remove spurious space"
This reverts commit c9e2339af9
2023-09-21 07:14:44 +00:00
Franck Bret
c9e2339af9 Remove spurious space 2023-09-20 17:01:35 +02:00
Franck Bret
fa09df8ba8 Merge branch 'dlang' of gitlab.softwareheritage.org:franckbret/swh-lister into dlang 2023-09-20 16:51:25 +02:00
Franck Bret
cc686268ba url and instance are now mandatory 2023-09-20 16:49:52 +02:00
Franck Bret
5b4d15090f Remove fedora entry as it has been replaced with rpm 2023-09-20 16:48:06 +02:00
Franck Bret
eb48db902e Merge branch 'dlang' of gitlab.softwareheritage.org:franckbret/swh-lister into dlang 2023-09-19 16:53:48 +02:00
Franck Bret
2793ef9aad D lang lister
Add a dlang module that retrieves origins from an HTTP API endpoint.
Each origin is a git based project url on github.com, gitlab.com or
bitbucket.com.
2023-09-19 16:08:59 +02:00
Valentin Lorentz
1c964cccd3 maven/README: Fix links 2023-09-14 12:03:12 +00:00
Antoine Lambert
6e7bc49ec7 Harmonize listers parameters and add test to check mandatory ones
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.

Add a new test checking that lister classes have mandatory parameters declared
in their constructors. The purpose is to avoid deployment issues on staging
or production environments as celery tasks can fail to be executed if mandatory
parameters are not handled by listers.

Related to swh/infra/sysadm-environment#5030.
2023-09-06 11:55:34 +02:00
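The kind of check introduced in 6e7bc49ec7 can be sketched with `inspect.signature`. The helper name and the two example classes are hypothetical; only the four parameter names come from the commit message, and the real test may be written differently.

```python
import inspect

# Mandatory constructor parameters named in the commit message.
MANDATORY_PARAMETERS = {"scheduler", "url", "instance", "credentials"}


def missing_mandatory_parameters(lister_class: type) -> set:
    """Return the mandatory parameters the constructor does not declare."""
    declared = set(inspect.signature(lister_class.__init__).parameters)
    return MANDATORY_PARAMETERS - declared


class GoodLister:
    """Hypothetical lister declaring every mandatory parameter."""

    def __init__(self, scheduler, url=None, instance=None, credentials=None):
        self.scheduler = scheduler


class BadLister:
    """Hypothetical lister missing ``instance`` and ``credentials``."""

    def __init__(self, scheduler, url=None):
        self.scheduler = scheduler
```

A test could iterate over all registered lister classes and assert that `missing_mandatory_parameters` returns an empty set for each of them.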
Antoine R. Dumont (@ardumont)
5f717e311d rpm: Adapt lister constructor to accept the credentials parameter
Refs. swh/infra/sysadm-environment#5030
2023-09-05 17:40:09 +02:00
Antoine R. Dumont (@ardumont)
a02fdbb4c8 lister.github.utils: Drop no longer used module
This got detected when working on the deployment of the new loader-git.

Refs. swh/infra/sysadm-environment#5017
2023-08-22 11:15:04 +02:00
Antoine Lambert
91e4e33dd5 cran: Improve listing of R packages
Previously, the lister was relying on the use of the CRANtools R module
but it has the drawback of only listing the latest version of each registered
package in the CRAN registry.

In order to get all possible versions for each CRAN package, prefer to exploit
the content of the weekly dump of the CRAN database in RDS format.

To read the content of the RDS file from Python, the rpy2 package is used as
it has the advantage of being packaged in debian.

Related to swh/meta#1709.
2023-08-21 16:38:08 +02:00
Antoine Lambert
3a0e8b9995 requirements.txt: Sort packages by name 2023-08-17 10:45:43 +02:00
Antoine Lambert
95714f6f37 rpm: Turn fedora lister into a generic Red Hat based distribution one
As Red Hat based linux distributions share the same type of package repository,
rework the fedora lister into a generic one to list RPM source packages and
their versions from numerous distributions.

For a given distribution, the RPM lister fetches package metadata from a
list of release identifiers and a list of software components. Source packages
are then processed and relevant info is extracted to be sent to the RPM loader.
Once all releases and components have been processed, the lister has collected
all versions for each package name and sends that info to the scheduler, which
will create RPM loading tasks afterwards.

Nevertheless, as there is no generic way to list all releases and components for
a given distribution, nor to guess the right URL to retrieve package metadata
from, that info needs to be manually provided to the lister as input parameters.
Some examples of those parameters for various distributions can be found in the
config directory of the lister.

Regarding the produced origin URLs, as there is no way to find valid HTTP ones
for all distributions, the same behavior as with the debian lister is used and
they have the following form: rpm://{instance}/packages/{package_name} where
the instance variable corresponds to the name of the listed distribution such
as Fedora, CentOS, or openSUSE.

Related to swh/meta#5011.
2023-08-16 13:25:23 +00:00
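The origin URL form given in 95714f6f37, together with the per-package collection of versions, can be sketched as below. The function names and the `(package_name, version)` input shape are assumptions made for illustration; only the `rpm://{instance}/packages/{package_name}` pattern comes from the commit message.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rpm_origin_url(instance: str, package_name: str) -> str:
    """Build an origin URL of the form rpm://{instance}/packages/{package_name}."""
    return f"rpm://{instance}/packages/{package_name}"


def group_versions_by_origin(
    instance: str, packages: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Collect every version seen for each package, keyed by origin URL."""
    versions = defaultdict(set)
    for package_name, version in packages:
        versions[rpm_origin_url(instance, package_name)].add(version)
    # Plain string sort for a stable output; not a real RPM version ordering.
    return {url: sorted(found) for url, found in versions.items()}
```

Grouping by origin URL means a package present in several releases or components yields a single origin carrying all of its versions.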
Antoine R. Dumont (@ardumont)
fcfb7004db pypi: Allow passing configuration arguments to task
The constructor allows it but not the celery task.

This also aligns the behavior with other lister tasks.
2023-08-04 15:34:02 +02:00
Antoine R. Dumont (@ardumont)
928d592e10 sourceforge: Allow passing configuration arguments to task
The constructor allows it but not the celery task.

This also aligns the behavior with other lister tasks.
2023-08-04 15:30:09 +02:00