Commit graph

704 commits

Author SHA1 Message Date
Franck Bret
1984037fe1 Replace obsolete comment, improve docstring 2023-10-09 15:05:25 +02:00
Franck Bret
3e414c5397 url and instance are now mandatory (related #501) 2023-10-09 15:05:25 +02:00
Franck Bret
f8cfa05f3f Add Julia Lister for listing Julia Packages
This module introduces the Julia lister.
It retrieves Julia package origins from the Julia General Registry, a Git
repository made up of per-package directories containing TOML definition files.
2023-10-09 15:05:25 +02:00
Antoine Lambert
7b932f46b5 gitweb: Add optional base_git_url parameter to lister
As with cgit, there are cases where git clone URLs for projects hosted
on a gitweb instance cannot be found when scraping project pages or cannot
be easily derived from the gitweb instance root URL.

So add an optional base_git_url parameter that allows computing correct clone
URLs by appending project names to it.
2023-10-02 14:56:04 +02:00
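Note on 7b932f46b5: the idea of appending project names to a base URL can be sketched as follows. This is a minimal illustration, not the lister's actual code: the helper name `build_clone_url` and the example host are assumptions.

```python
def build_clone_url(base_git_url: str, project_name: str) -> str:
    """Append a project name to a base git URL (hypothetical helper).

    Slashes are normalized so that exactly one separates both parts.
    """
    return f"{base_git_url.rstrip('/')}/{project_name.lstrip('/')}"


# Hypothetical gitweb instance and project name, for illustration only.
print(build_clone_url("https://git.example.org/git/", "tools/foo.git"))
```

Here `rstrip('/')` is safe because a single character is being stripped, not a multi-character suffix.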
Antoine Lambert
59a979642f gitweb: Ensure to strip any prefix before git clone URL
Some gitweb instances display string prefixes before the git clone URLs,
so strip them to extract the URLs properly.

Related to swh/infra/sysadm-environment#5051.
2023-10-02 14:54:41 +02:00
Kumar Shivendu
88611642fc Introduce bioconductor lister 2023-09-28 12:54:37 +00:00
Antoine Lambert
a04975571c gitweb: Remove invalid use of str.rstrip
str.rstrip does not remove a string suffix (it strips a set of trailing
characters), so use another way to extract the gitweb project name.

It fixes the computation of some gitweb origin URLs.

Related to swh/infra/sysadm-environment#5050.
2023-09-26 14:53:57 +02:00
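Note on a04975571c: the pitfall can be reproduced in a few lines. The URL below is hypothetical, and `str.removesuffix` (Python 3.9+) is shown as one possible replacement; the commit does not say which alternative it actually uses.

```python
# Hypothetical gitweb project URL, used only for illustration.
url = "https://git.example.org/projects/toolkit.git"
name = url.rsplit("/", 1)[-1]  # "toolkit.git"

# str.rstrip treats its argument as a set of characters, not as a suffix:
# it keeps removing trailing characters found in {".", "g", "i", "t"}.
wrong = name.rstrip(".git")

# str.removesuffix (Python 3.9+) removes the exact suffix once, if present.
right = name.removesuffix(".git")

print(wrong)  # toolk
print(right)  # toolkit
```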
Antoine Lambert
aa7b3fa7d8 rpm: Add config for listing EPEL source packages
Extra Packages for Enterprise Linux (EPEL) is a set of additional,
community-maintained packages that can be installed on many Red Hat based
distributions.
2023-09-25 11:40:47 +02:00
Franck Bret
bf806f2c7b Remove spurious space 2023-09-21 09:18:54 +02:00
Franck Bret
ebba50882f Revert "Remove spurious space"
This reverts commit c9e2339af9
2023-09-21 07:14:44 +00:00
Franck Bret
c9e2339af9 Remove spurious space 2023-09-20 17:01:35 +02:00
Franck Bret
cc686268ba url and instance are now mandatory 2023-09-20 16:49:52 +02:00
Franck Bret
2793ef9aad D lang lister
Add a dlang module that retrieves origins from an HTTP API endpoint.
Each origin is a git based project url on github.com, gitlab.com or
bitbucket.com.
2023-09-19 16:08:59 +02:00
Valentin Lorentz
1c964cccd3 maven/README: Fix links 2023-09-14 12:03:12 +00:00
Antoine Lambert
6e7bc49ec7 Harmonize listers parameters and add test to check mandatory ones
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.

Add a new test checking that lister classes declare the mandatory parameters
in their constructors. The purpose is to avoid deployment issues on staging
or production environments, as celery tasks can fail to execute if mandatory
parameters are not handled by listers.

Related to swh/infra/sysadm-environment#5030.
2023-09-06 11:55:34 +02:00
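Note on 6e7bc49ec7: a check of this kind can be sketched with `inspect.signature`. The two lister classes and the helper `missing_parameters` are made up for this illustration; only the four parameter names come from the commit message.

```python
import inspect

# Parameter names listed in the commit message.
MANDATORY_PARAMETERS = {"scheduler", "url", "instance", "credentials"}


class GoodLister:  # hypothetical lister declaring every mandatory parameter
    def __init__(self, scheduler, url=None, instance=None, credentials=None):
        pass


class BadLister:  # hypothetical lister missing two of them
    def __init__(self, scheduler, url=None):
        pass


def missing_parameters(lister_class) -> set:
    """Return the mandatory parameters absent from the constructor signature."""
    declared = set(inspect.signature(lister_class.__init__).parameters)
    return MANDATORY_PARAMETERS - declared


print(sorted(missing_parameters(GoodLister)))  # []
print(sorted(missing_parameters(BadLister)))   # ['credentials', 'instance']
```

A test suite could run such a helper over every registered lister class and assert that the result is empty.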
Antoine R. Dumont (@ardumont)
5f717e311d rpm: Adapt lister constructor to accept the credentials parameter
Refs. swh/infra/sysadm-environment#5030
2023-09-05 17:40:09 +02:00
Antoine R. Dumont (@ardumont)
a02fdbb4c8 lister.github.utils: Drop no longer used module
This got detected when working on the deployment of the new loader-git.

Refs. swh/infra/sysadm-environment#5017
2023-08-22 11:15:04 +02:00
Antoine Lambert
91e4e33dd5 cran: Improve listing of R packages
Previously, the lister relied on the CRANtools R module, which has the
drawback of listing only the latest version of each package registered
in the CRAN registry.

In order to get all possible versions of each CRAN package, exploit the
content of the weekly dump of the CRAN database in RDS format instead.

To read the content of the RDS file from Python, the rpy2 package is used as
it has the advantage of being packaged in Debian.

Related to swh/meta#1709.
2023-08-21 16:38:08 +02:00
Antoine Lambert
95714f6f37 rpm: Turn fedora lister into a generic Red Hat based distribution one
As Red Hat based linux distributions share the same type of package repository,
rework the fedora lister into a generic one to list RPM source packages and
their versions from numerous distributions.

For a given distribution, the RPM lister fetches package metadata for a
list of release identifiers and a list of software components. Source packages
are then processed and the relevant info is extracted to be sent to the RPM loader.
Once all releases and components have been processed, the lister has collected all
versions of each package name and sends that info to the scheduler, which will
create RPM loading tasks afterwards.

Nevertheless, as there is no generic way to list all releases and components for
a given distribution, nor to guess the right URL to retrieve package metadata
from, that info needs to be provided manually to the lister as input parameters.
Some examples of those parameters for various distributions can be found in the
config directory of the lister.

Regarding the produced origin URLs, as there is no way to find valid HTTP ones
for all distributions, the same behavior as with the debian lister is used and
they have the following form: rpm://{instance}/packages/{package_name} where
the instance variable corresponds to the name of the listed distribution such
as Fedora, CentOS, or openSUSE.

Related to swh/meta#5011.
2023-08-16 13:25:23 +00:00
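Note on 95714f6f37: the origin URL form and the per-package grouping of versions can be sketched as below. Only the `rpm://{instance}/packages/{package_name}` template comes from the commit message; the helper names and the sample data are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def rpm_origin_url(instance: str, package_name: str) -> str:
    """Build an origin URL following the template given in the commit message."""
    return f"rpm://{instance}/packages/{package_name}"


def group_versions(packages: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect all versions seen for each package name (hypothetical helper)."""
    versions: Dict[str, List[str]] = defaultdict(list)
    for name, version in packages:
        versions[name].append(version)
    return dict(versions)


# Made-up (name, version) pairs, as if gathered across several releases.
seen = [("bash", "5.1"), ("zsh", "5.9"), ("bash", "5.2")]
for name, versions in group_versions(seen).items():
    print(rpm_origin_url("Fedora", name), versions)
```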
Antoine R. Dumont (@ardumont)
fcfb7004db pypi: Allow passing configuration arguments to task
The constructor allows it but not the celery task.

This also aligns the behavior with other lister tasks.
2023-08-04 15:34:02 +02:00
Antoine R. Dumont (@ardumont)
928d592e10 sourceforge: Allow passing configuration arguments to task
The constructor allows it but not the celery task.

This also aligns the behavior with other lister tasks.
2023-08-04 15:30:09 +02:00
Antoine R. Dumont (@ardumont)
b02144b4f9 packagist: Yield pages of origins to regularly record origins
Instead of sending one page with all listed origins, which is brittle.
When something goes wrong during the listing, the lister currently records nothing.
2023-08-04 11:09:58 +02:00
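Note on b02144b4f9: yielding origins page by page, rather than as one large page, can be sketched with a small generator. The function name `paginate` and the sample origins are assumptions, not the lister's actual code.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def paginate(origins: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive pages of at most batch_size origins (hypothetical helper)."""
    iterator = iter(origins)
    while True:
        page = list(islice(iterator, batch_size))
        if not page:
            return
        # Each yielded page can be recorded before the next one is computed,
        # so a failure later in the listing does not lose earlier pages.
        yield page


pages = list(paginate(["a", "b", "c", "d", "e"], batch_size=2))
print(pages)  # [['a', 'b'], ['c', 'd'], ['e']]
```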
Antoine R. Dumont (@ardumont)
15a4c4cdb4 packagist: Skip package if unable to parse the last update date 2023-08-04 11:09:57 +02:00
Antoine R. Dumont (@ardumont)
d4f3e91466 packagist: Allow batch size records configuration in constructor
This allows configuring smaller batches when testing from docker and the cli.
2023-08-04 11:09:57 +02:00
Antoine R. Dumont (@ardumont)
f236f3d163 packagist: Continue listing when github server hangs up
With or without retry (for a future version of swh.core).

This skips the origin when this sporadically happens. It should get picked up by another
listing eventually.

The listing currently fails to finish when the github server hangs up on the
process. Adding this behavior skips the issue without breaking the listing.
2023-08-04 11:09:57 +02:00
Antoine R. Dumont (@ardumont)
203f6db8f0 packagist: Randomize the packages list
To avoid always processing the packages list in the same order when problems occurred
in a previous listing.
2023-08-04 11:09:57 +02:00
Antoine R. Dumont (@ardumont)
903ff367ec packagist: Fix json parsing which is different depending on page 2023-08-02 16:34:32 +02:00
Antoine R. Dumont (@ardumont)
f1ae6825e5 packagist: Improve extract package metadata information algorithm
The current lister implementation lists very little metadata with the hard-coded /p/ base
url (404 on almost all packages). The packagist api implementation must have evolved
since the initial implementation of the lister (and the first deployment on staging).

Following the upstream documentation [1], it's sensible to first use the /p2/ url scheme as
it's performant on the packagist api side. The lister then falls back to the /p2/+~dev url
scheme, then the /p/ scheme and finally the /packages/ base url if the previous results are
either not found or empty (which is different from no modification since the last visit).

It keeps the initial implementation behavior of stopping immediately if a 304
NotModifiedSince is returned by the server.

[1] https://repo.packagist.org/apidoc
2023-08-02 10:34:55 +02:00
Antoine R. Dumont (@ardumont)
1f27250694 lister.pattern: Make batch record parametric and test it
This adds a test around the batch recording behavior to ensure it's not dropped by
mistake.
2023-08-01 15:06:21 +02:00
Antoine R. Dumont (@ardumont)
920ed0d529 lister.pattern: Restore flushing origin batch in the scheduler
Prior to this commit, the newly introduced check on url validity was consuming the
stream of origins. In effect, this would no longer write origin records regularly.

For all listers, that would translate to flush origins only at the end of the listing
which could take a while for some (e.g. packagist lister has been running for more than
12h currently without writing anything in the scheduler).
2023-08-01 10:04:48 +02:00
Antoine R. Dumont (@ardumont)
56b4fcc760 Add stagit lister
This lister is very close to the cgit and gitweb implementations, but the dom data is
again structured differently, so this implementation stands on its own.

Refs. swh/meta#5048
2023-07-13 11:50:51 +02:00
Antoine R. Dumont (@ardumont)
3ab856288c Add gitiles lister
Gitiles instances deliberately return malformed json output (json prefixed with
``)]}'\n``) [2]. The lister deals with it to properly parse the json response
nonetheless: it drops the prefix and then parses the json.

If at some point they drop this prefix and return json directly, the lister will be able
to deal with it too. There are 2 tests, one with the 'standard' gitiles format and another
with standard json, to account for both cases.

Refs. swh/meta#5045

[2] https://github.com/google/gitiles/issues/263
2023-07-13 10:30:51 +02:00
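Note on 3ab856288c: dropping the prefix before parsing can be sketched as follows. The function name and the sample payload are assumptions; only the prefix itself comes from the commit message.

```python
import json

# Prefix that Gitiles puts in front of its json output, per the commit message.
GITILES_PREFIX = ")]}'\n"


def parse_gitiles_json(text: str):
    """Parse a response body, with or without the Gitiles prefix."""
    if text.startswith(GITILES_PREFIX):
        text = text[len(GITILES_PREFIX):]
    return json.loads(text)


# Hypothetical payload, once prefixed and once as plain json.
payload = '{"project": {"name": "project"}}'
print(parse_gitiles_json(GITILES_PREFIX + payload))
print(parse_gitiles_json(payload))
```

Checking for the prefix instead of slicing unconditionally is what lets the same function handle both cases mentioned in the commit message.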
Antoine R. Dumont (@ardumont)
573958ce64 Add Gitweb lister
Depending on the instance, specific heuristics apply. Some instances:
- have summary pages which do not list metadata_url (so some
  computation happens to list git:// origins which are cloneable)
- have summary pages which reference metadata_url as multiple comma-separated urls
- list relative urls of the repository, so we need to join them with the main instance url
  to have complete cloneable origins (or summary pages)
- list "down" http origins (cloning those won't work), so those are listed as cloneable https
  ones (when the main url is behind https).

Refs. swh/devel/swh-lister#1800
2023-07-10 16:50:41 +02:00
Antoine Lambert
8d7dccc54a opam: Fix 'opam init' error when relisting an opam instance
When relisting an opam instance and the opam root directory is already
populated, the '--set-default' parameter must be provided otherwise the
following error is reported:

No switch is currently set. Please use 'opam switch' to set or install a switch

Related to swh/infra/sysadm-environment#4971.
2023-06-29 17:49:21 +02:00
Antoine Lambert
01be6ce581 opam: Only capture stdout when calling 'opam list'
Ensure opam errors are displayed when attempting to list all packages
in order to ease debugging.

Related to swh/infra/sysadm-environment#4971.
2023-06-29 17:49:08 +02:00
Antoine Lambert
d20803ddae opam: Ensure CalledProcessError is raised when an opam command failed
Use subprocess.run instead of subprocess.call and subprocess.Popen to
call opam commands and set the check parameter to True in order to raise a
CalledProcessError exception when an opam command fails.

This should help spot issues with the opam lister.

Related to swh/infra/sysadm-environment#4971.
2023-06-29 14:02:00 +00:00
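Note on d20803ddae and 01be6ce581: the combination of `check=True` and capturing only stdout can be sketched as below. To keep the example runnable without opam, the Python interpreter stands in for the opam command; `run_command` is a hypothetical helper.

```python
import subprocess
import sys


def run_command(args) -> str:
    """Run a command, returning its stdout and raising on a non-zero exit code.

    Only stdout is captured, so anything written to stderr stays visible.
    """
    completed = subprocess.run(
        args, check=True, stdout=subprocess.PIPE, text=True
    )
    return completed.stdout


# Stand-in for a failing command: a child process exiting with status 3.
try:
    run_command([sys.executable, "-c", "import sys; sys.exit(3)"])
except subprocess.CalledProcessError as exc:
    returncode = exc.returncode
    print(f"command failed with exit code {returncode}")
```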
Antoine Lambert
b9815ed577 gogs: Ensure to list all repositories
Contrary to gitea listing, which does not require the q query parameter,
gogs listing requires it.

After reading the gogs source code, the /repos/search endpoint generates
a sql query of the form: "SELECT * FROM repos WHERE name LIKE '%{q}%'".
Setting the q parameter value to "_" makes it act as a single-character
wildcard in the LIKE clause, so all repositories are returned.

Fixes #4698.
2023-06-26 15:16:48 +00:00
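Note on b9815ed577: the effect of `_` in a LIKE pattern can be checked with an in-memory SQLite database. This only illustrates standard SQL LIKE semantics; the table layout and data are made up, and the database engine behind a given gogs instance may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE repos (name TEXT)")
conn.executemany(
    "INSERT INTO repos VALUES (?)", [("alpha",), ("b",), ("c_d",)]
)

# In a LIKE pattern, "%" matches any run of characters and "_" matches
# exactly one character, so "%_%" matches every non-empty name.
q = "_"
rows = conn.execute(
    "SELECT name FROM repos WHERE name LIKE ?", (f"%{q}%",)
).fetchall()
names = sorted(row[0] for row in rows)
print(names)  # ['alpha', 'b', 'c_d']
```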
Antoine Lambert
206ac680dc pagure/tasks: Add missing docstring for list_pagure task function
A missing docstring prevents the task type from being registered in the
scheduler database.
2023-06-23 14:29:17 +02:00
Antoine Lambert
c81c473a83 pagure: Implement lister for pagure forges
Pagure is a git-centered forge, written in Python and using pygit2.

Its REST API makes it easy to list all projects hosted on an
instance, so the lister implementation is quite simple.

Related to swh/meta#5043.
2023-06-23 09:02:49 +00:00
Nicolas Dandrimont
ad6644a663 opam: retrieve opam from $PATH with shutil.which
The default behavior of subprocess is to pull executables from a
hardcoded list, which doesn't work when opam is installed manually in
the user's home directory.
2023-06-21 14:53:17 +02:00
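Note on ad6644a663: resolving an executable from `$PATH` before running it can be sketched with `shutil.which`. The helper name and the error message are assumptions; the command name in the usage lines is deliberately bogus to show the failure path.

```python
import shutil


def resolve_executable(name: str) -> str:
    """Return the full path of an executable found on $PATH (hypothetical helper)."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"{name!r} not found in $PATH")
    return path


# A command name that is very unlikely to exist, to exercise the error case.
try:
    resolve_executable("surely-not-an-installed-command-0123")
except FileNotFoundError as exc:
    print(exc)
```

The returned path could then be passed as the first element of the argument list given to `subprocess.run`.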
Nicolas Dandrimont
b2ff630c9b debian: refactor inner loop slightly to help mypy
mypy doesn't catch that multiple uses of
`self.listed_origins[origin_url]` in the same statement should be identical.
Using a temporary local variable for it seems to help.
2023-06-21 13:57:27 +02:00
Valentin Lorentz
0e7fdf482c crates: Don't extract unused files
The files we use weigh 440MB, and there are ~600MB of files we don't use
2023-06-20 16:06:21 +02:00
Antoine R. Dumont (@ardumont)
e0bcb64e0f nixguix/lister: Rename listed origin visit type to tarball-directory
For the ones coming from a tarball. This matches the change made in the associated
directory loader.

Refs. swh/infra/sysadm-environment#4906
2023-06-08 11:24:38 +02:00
Antoine R. Dumont (@ardumont)
197fb3400b lister.nixguix: Propagate the origin reference to the loader
Without this, the loader will fail.

Refs. swh/meta#4979
2023-06-07 16:41:14 +02:00
Antoine R. Dumont (@ardumont)
0756c44ea3 Adapt directory loader visit type depending on the vcs tree to ingest
Prior to this, it was sending only 'directory' types for all vcs trees. Multiple
directory loaders now exist whose visit types currently diverge, so the scheduling
would not happen correctly without it. This commit is the required adaptation for the
scheduling to work appropriately.

Refs. swh/meta#4979
2023-06-05 13:16:52 +02:00
Antoine R. Dumont (@ardumont)
9f252fc85f nixguix/lister: Deal with directory with recursive checksums
Those will be ingested by the loader as "directory" with "nar" checksum layouts.

Refs. swh/infra/sysadm-environment#4868

Refs. swh/meta#4979
2023-05-31 14:22:44 +02:00
Antoine R. Dumont (@ardumont)
e91e0bf09c cgit: Allow url to be optional
Some cgit instances [1] are at a domain's root path so we can build their url directly from
their 'instance' parameter.

This unifies further the cli to register a lister and the cli to schedule the listed
origins from a forge.

[1]
```
https://git.kernel.org
https://source.codeaurora.org
https://git.trueelena.org
https://dev.sanctum.geek.nz
https://git.dpkg.org
https://anongit.mindrot.org
https://git.aurel32.net
https://gitweb.gentoo.org
https://git.joeyh.name
https://git.adrian.geek.nz
```

Refs. swh/devel/swh-lister#4693
2023-05-23 11:47:51 +02:00
Antoine R. Dumont (@ardumont)
19bdeefb14 lister: Allow lister to build url out of the instance parameter
This moves the rather elementary logic into the lister's scope. This will simplify
and unify cli calls between the lister and scheduler clis. This will also help reduce
erroneous operations which can happen, for example, in add-forge-now.

With this, we will only have to provide the type and the instance, then
everything will be scheduled properly.

Refs. swh/devel/swh-lister#4693
2023-05-19 15:03:49 +02:00
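Note on 19bdeefb14 and e91e0bf09c: deriving a url from the instance parameter (and the other way round) can be sketched as below. The function name and the choice of the `https` scheme are assumptions made for this illustration, not necessarily what the lister does.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse


def resolve_url_and_instance(
    url: Optional[str] = None, instance: Optional[str] = None
) -> Tuple[str, str]:
    """Return a (url, instance) pair, deriving whichever one is missing."""
    if url is None and instance is None:
        raise ValueError("Either 'url' or 'instance' must be provided")
    if url is None:
        # Assumption: the forge is served over https at the domain's root path.
        url = f"https://{instance}"
    if instance is None:
        # Use the network location (host[:port]) of the url as instance name.
        instance = urlparse(url).netloc
    return url, instance


print(resolve_url_and_instance(instance="git.kernel.org"))
print(resolve_url_and_instance(url="https://git.example.org/cgit/"))
```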
Valentin Lorentz
596e8c6c40 Fix crash of 'swh lister run' when called without -l
```
$ swh lister run
Traceback (most recent call last):
  File "/home/dev/.local/bin/swh", line 33, in <module>
    sys.exit(load_entry_point('swh.core', 'console_scripts', 'swh')())
  File "/home/dev/swh-environment/swh-core/swh/core/cli/__init__.py", line 144, in main
    return swh(auto_envvar_prefix="SWH")
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/dev/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/dev/swh-environment/swh-lister/swh/lister/cli.py", line 68, in run
    get_lister(lister, **config).run()
  File "/home/dev/swh-environment/swh-lister/swh/lister/__init__.py", line 75, in get_lister
    raise ValueError(
ValueError: Invalid lister None: only supported listers are ['arch', 'aur', 'bitbucket', 'bower', 'cgit', 'conda', 'cpan', 'cran', 'crates', 'debian', 'fedora', 'gitea', 'github', 'gitlab', 'gnu', 'gogs', 'golang', 'hackage', 'hex', 'launchpad', 'maven', 'nixguix', 'npm', 'nuget', 'opam', 'packagist', 'phabricator', 'pubdev', 'puppet', 'pypi', 'rubygems', 'sourceforge', 'tuleap']
```
2023-05-10 10:19:26 +02:00
Antoine R. Dumont (@ardumont)
5ebc57912f lister/nixguix: Make artifact nature check happen on all urls
Starting with the first url. As soon as one detection succeeds, this stops and yields
the result. Otherwise, continue with the detection on the next mirror url.

This should fix the current misbehavior [1] when multiple mirror urls are not ok but the
first one is.

[1] https://gitlab.softwareheritage.org/swh/infra/sysadm-environment/-/issues/4868#note_137483

Refs. swh/infra/sysadm-environment#4868
2023-04-27 18:16:20 +02:00
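Note on 5ebc57912f: trying each mirror url in turn and stopping at the first successful detection can be sketched as follows. The detection callable, the use of `ValueError` to signal failure, and the mirror urls are all assumptions made for this illustration.

```python
from typing import Callable, Iterable, Optional, Tuple


def detect_artifact_nature(
    urls: Iterable[str], detect: Callable[[str], str]
) -> Optional[Tuple[str, str]]:
    """Return (url, nature) for the first url whose detection succeeds."""
    for url in urls:
        try:
            nature = detect(url)
        except ValueError:
            # Detection failed for this url: continue with the next mirror.
            continue
        return url, nature
    return None


def fake_detect(url: str) -> str:
    """Stand-in detection: only recognizes urls ending in .tar.gz."""
    if url.endswith(".tar.gz"):
        return "tarball"
    raise ValueError(f"cannot detect the nature of {url}")


mirrors = [
    "https://mirror-a.example.org/pkg",
    "https://mirror-b.example.org/pkg.tar.gz",
]
result = detect_artifact_nature(mirrors, fake_detect)
print(result)
```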