Commit graph

64 commits

Author SHA1 Message Date
Antoine Lambert
af24960bc2 Add save-bulk lister to check origins prior their insertion in database
This new and special lister enables to verify a list of origins to archive
provided by users (for instance through the Web API).

Its purpose is to avoid polluting the scheduler database with origins that
cannot be loaded into the archive.

Each origin is identified by an URL and a visit type. For a given visit type
the lister is checking if the origin URL can be found and if the visit type
is valid.

The supported visit types are those for VCS (bzr, cvs, hg, git and svn) plus
the one for loading a tarball content into the archive.

Accepted origins are inserted or upserted in the scheduler database.

Rejected origins are stored in the lister state.

Related to #4709
2024-09-04 10:42:23 +02:00
Antoine Lambert
6e7bc49ec7 Harmonize listers parameters and add test to check mandatory ones
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.

Add a new test checking listers classes have mandatory parameters declared
in their constructors. The purpose is to avoid deployment issues on staging
or production environment as celery tasks can fail to be executed if mandatory
parameters are not handled by listers.

Reated to swh/infra/sysadm-environment#5030.
2023-09-06 11:55:34 +02:00
Antoine Lambert
95714f6f37 rpm: Turn fedora lister into a generic Red Hat based distribution one
As Red Hat based linux distributions share the same type of package repository,
rework the fedora lister into a generic one to list RPM source packages and
their versions from numerous distributions.

For a given distribution, the RPM lister will fetch packages metadata from a
list of release identifiers and a list of software components. Source packages
are then processed and relevant info are extracted to be sent to the RPM loader.
When all releases and components were processed, the lister collected all versions
for each package name and send those info to the scheduler that will create RPM
loading tasks afterwards.

Nevertheless, as there is no generic way to list all releases and components for
a given distribution but also to guess the right URL to retrieve packages metadata
from, those info need to be manually provided to the lister as input parameters.
Some examples of those parameters for various distributions can be found in the
config directory of the lister.

Regarding the produced origin URLs, as there is no way to find valid HTTP ones
for all distributions, the same behavior as with the debian lister is used and
they have the following form: rpm://{instance}/packages/{package_name} where
the instance variable corresponds to the name of the listed distribution such
as Fedora, CentOS, or openSUSE.

Related to swh/meta#5011.
2023-08-16 13:25:23 +00:00
Antoine R. Dumont (@ardumont)
1f27250694
lister.pattern: Make batch record parametric and test it
This adds a test around the batch recording behavior to ensure it's not dropped by
mistake.
2023-08-01 15:06:21 +02:00
Antoine R. Dumont (@ardumont)
56b4fcc760
Add stagit lister
That lister is really near the cgit & gitweb implementations. But the dom data is again
structured differently though so this implementation stands on its own.

Refs. swh/meta#5048
2023-07-13 11:50:51 +02:00
Antoine R. Dumont (@ardumont)
3ab856288c
Add gitiles lister
Gitiles instance returns voluntarily a malformed json output (json prefixed with
``)]}'\n``) [2]. The lister deals with it to properly parse the json response
nonetheless. It drops the prefix and then parses the json.

If at some point, they drop this prefix to return json directly, the lister will be able
to deal with it too. There are 2 tests one with 'standard' gitile format and another
with standard json to account for both case.

Refs. swh/meta#5045

[2] https://github.com/google/gitiles/issues/263
2023-07-13 10:30:51 +02:00
Antoine R. Dumont (@ardumont)
573958ce64
Add Gitweb lister
Depending on some instances, we have some specific heuristics, some instances:
- have summary pages which do not not list metadata_url (so some
  computation happens to list git:// origins which are cloneable)
- have summary page which reference metadata_url as a multiple comma separated urls
- lists relative urls of the repository so we need to join it with the main instance url
  to have a complete cloneable origins (or summary page)
- lists "down" http origins (cloning those won't work) so lists those as cloneable https
  ones (when the main url is behind https).

Refs. swh/devel/swh-lister#1800
2023-07-10 16:50:41 +02:00
Antoine Lambert
c81c473a83 pagure: Implement lister for pagure forges
Pagure is a git-centered forge, python based using pygit2.

Its REST API enables to easily list all projects hosted in an
instance so the lister implementation is quite simple.

Related to swh/meta#5043.
2023-06-23 09:02:49 +00:00
Antoine R. Dumont (@ardumont)
19bdeefb14
lister: Allow lister to build url out of the instance parameter
This pushes the rather elementary logic within the lister's scope. This will simplify
and unify cli call between lister and scheduler clis. This will also allow to reduce
erroneous operations which can happen for example in the add-forge-now.

With the following, we will only have to provide the type and the instance, then
everything will be scheduled properly.

Refs. swh/devel/swh-lister#4693
2023-05-19 15:03:49 +02:00
Antoine Lambert
4f57e84450 Use http_retry decorator from swh.core.retry module
The http_retry decorator has been moved to swh-core package in order
to ease its reuse across swh packages.
2023-04-13 14:19:57 +02:00
KShivendu
a452995d95 feat: Add Hex.pm lister 2023-03-14 17:59:46 +00:00
Nicolas Dandrimont
64267f8f50 Add a flag to not enable origins listed by a lister
This cuts down one more manual step in the add forge now validation
process: we can add the relevant origins to the staging scheduler
without enabling them at all.
2022-12-05 14:53:42 +01:00
Nicolas Dandrimont
b815737054 Add built-in page and origin count limit to listers
This will allow more automation of the staging add forge now process:
for known-good listers, we can limit the number of origins being
processed and reduce the amount of manual steps taken for each instance.
2022-12-05 14:53:42 +01:00
KShivendu
6ad61aec23 feat(fedora): Introduce fedora lister
Summary: Lister to ingest fedora mirrors (.rpm)

Reviewers: #reviewers, vlorentz

Subscribers: vlorentz, olasd

Maniphest Tasks: T4448

Differential Revision: https://forge.softwareheritage.org/D8386
2022-11-15 15:53:52 +05:30
Antoine R. Dumont (@ardumont)
6d2e7aa178
nixguix: Register task
Related to T3781
2022-10-03 18:26:36 +02:00
Antoine Lambert
8d85b2e4e8 pattern: Ensure accurate origin counts returned by run method
Previously, the run method was returning the total count of ListedOrigin
objects sent to scheduler database.

However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).

This changes ensures an accurate count of listed origins by maintaining
a set of origin URLs associated to the sent ListedOrigin objects.
2022-09-29 11:14:08 +02:00
Antoine Lambert
9c55acd286 Use generic HTTP retry policy by default and rename dedicated decorator
Instead of retrying HTTP requests only for 429 status code by default,
prefer to use the generic retry policy enabling to also retry for status
codes >= 500 but also on ConnectionError exceptions.

Rename throttling_retry decorator to http_retry to reflect this change.
2022-09-26 10:48:40 +02:00
KShivendu
d34a6232a6 gogs: Introduce Gogs lister 2022-08-03 16:22:06 +05:30
Antoine Lambert
d38e05cff7 python: Reformat code with black 22.3.0
Related to T3922
2022-04-08 15:15:09 +02:00
Antoine Lambert
445d539b3f Remove no longer needed tenacity workarounds
Now that we have packaged tenacity 6.2 for debian buster and use it
in production, we can remove the workarounds to support tenacity < 5.
2021-12-08 13:28:11 +01:00
Boris Baldassari
8991c625ea lister: Add new maven lister
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parse them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724
2021-11-29 17:33:13 +01:00
Antoine Lambert
6c12350863 pattern: Use URL network location as instance name when not provided
Make the instance parameter of the base pattern lister optional and set
lister name to URL network location when not provided.

It simplifies lister creation when associated forge type have a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.

Also process listers for forge with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.

Related to T3403
2021-07-13 12:33:49 +02:00
zapashcanon
fe01d08cd9
add opam lister 2021-07-06 15:19:00 +02:00
Boris Baldassari
04c0a50706 tuleap: initialise lister.
tuleap-lister: fix args in test_task.

tuleap-lister: Add rate-limiting test + fix debug and typo.

tuleap-lister: code review: fix mocker + tests/setup_cli.

tuleap-lister: code review: fix relister > lister.

tuleap-lister: code review: fix test_task kwargs.

tuleap-lister: code review: Remove authentication useless lines + fix typos.

tuleap-lister: code review: improve results_simplified for svn repos.

tuleap-lister: code review: add name to CONTRIBUTORS file.

tuleap-lister: code review: Update tutorial for misc files to edit.

tuleap-lister: code review: Update copyright to 2021 exactly.

tuleap-lister: code review: Update py files perms -X.

tuleap-lister: code review: minimise json files.

tuleap-lister: code review: fix chmod on json files.

tuleap-lister: code review: fix var names + add tests.

tuleap-lister: code review: fix useless indirection.

tuleap-lister: code review: Add empty repo test, minor typo fixes.
2021-05-26 11:09:12 +02:00
Antoine Lambert
8933544521 Remove no longer used legacy Lister API and update CLI options
Legacy Lister classes from the swh.lister.core mdule are no longer
used in swh-lister codebase so it is time to remove them.

Also remove lister CLI options related to legacy Lister API.

As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.

Closes T2442
2021-02-02 15:54:55 +01:00
Antoine Lambert
f862004700 launchpad: Reimplement lister using new Lister API
Port launchpad lister to the swh.lister.pattern.Lister API.

Last update date of each listed git repositories is now sent to the scheduler.

The lister can work in incremental mode, only modified repositories since
the last listing operation will be returned in that case.

Closes T2992
2021-01-28 15:22:40 +01:00
Antoine R. Dumont (@ardumont)
cbd2cce339
test_cli: Drop launchpad lister from the test_get_lister
Drop launchpad lister from the lister to check, its test setup is more involved than the
other listers. As its setup is not done in that test, it's actually connecting
anonymously to the launchpad server. So remove such lister from the test.

This should also fix the debian build which refuses such access [1]

[1] https://jenkins.softwareheritage.org/job/debian/job/packages/job/DLS/job/gbp-buildpackage/97/console
2021-01-27 17:18:53 +01:00
Antoine R. Dumont (@ardumont)
bea9d6d147
gitlab: make url mandatory and add type 2021-01-25 19:00:01 +01:00
Vincent SELLIER
d62e77c1b4
cgit lister: Add missing types on the init method
Related to T2984
2021-01-25 18:52:59 +01:00
Antoine Lambert
ea8ecee541 tests: Fix errors after swh-scheduler API update
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
2021-01-25 17:11:54 +01:00
tenma
7892077a89 tests.cli: add Gitea lister mandatory params 2021-01-25 15:54:06 +01:00
Vincent SELLIER
e4a590fc7f
Port cgit lister to the new lister api
Related to T2984
2021-01-25 14:57:45 +01:00
Antoine R. Dumont (@ardumont)
7f1609265f
test: Rename internal method to something public
It's used in multiple module tests now.
2021-01-25 13:39:07 +01:00
Antoine R. Dumont (@ardumont)
1390a513f2
gitlab: Port to the new lister api
Related to T2987
2021-01-25 08:51:16 +01:00
Antoine Lambert
9fd91f007d pattern: Fix and improve config overriding in from_configfile method
Fix error when a configuration value loaded from a config file is also
given as keyword parameter to the from_configfile method.

Override configuration loaded from config file only if the provided
value is not None.
2021-01-18 17:55:53 +01:00
Antoine Lambert
d1fbccd988 lister: Add utility decorator to ease HTTP requests rate limit handling
Add swh.lister.utils.throttling_retry decorator enabling to retry a
function that performs an HTTP request who can return a 429 status code.

The implementation is based on the tenacity module and it is assumed
that the requests library is used when querying an URL.

The default wait strategy is based on exponential backoff.

The default max number of attempts is set to 5, HTTPError exception
will then be reraised.

All tenacity.retry parameters can also be overridden in client code.
2021-01-18 11:28:51 +01:00
Nicolas Dandrimont
9944295729 Implement phabricator lister using the new pattern class 2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
734901747b Implement a base pattern for listers with no state storage 2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
525fc0102d Hook up listers implemented with the new pattern to the CLI
We stop depending on the ListerBase implementation. The main hoop we're jumping
through is the config override mechanism in swh.lister.get_lister, as it's
really specifc to the ListerBase `override_config` argument, which is dropped in
pattern.Lister (in favor of explicit arguments at lister instantiation).

We implement a small shim in swh.lister.pattern.Lister to give
backwards-compatibility for the new pattern to get_lister.

This generic configuration override mechanism will probably be completely
removed when the configuration mechanism is reworked. We'll see.
2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
9e083c1eea Introduce a simpler base pattern for lister implementations.
This new pattern uses the lister support features introduced in swh.scheduler to
replace the database management done in previous iterations of the listers.
2021-01-11 11:00:29 +01:00
Antoine R. Dumont (@ardumont)
30ad6200a2
Drop mock_get_scheduler which creates indirection for no good reason
This is no longer useful, as removing it and tests are still ok.
2020-10-16 14:32:46 +02:00
Antoine Lambert
22f7181294 python: Reorder imports with isort
Related to T2610
2020-09-17 17:48:27 +02:00
Antoine R. Dumont (@ardumont)
e3c856b5ee
utils.split_range: Split into not overlapping ranges
Existing listers use the `is_within_bound` [1] method from the base lister.
This method uses inclusive boundaries in all cases.

As some "range" task listers [2] [3] are using `split_range` function to create
"overlapping" ranges, this can cause concurrent insert issues down the line [4].

This commit adapts the function `split_range` to make the generated ranges no
longer overlap.

[1]
https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/core/lister_base.py$194-199

[2]
https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/gitlab/tasks.py$37-41

[3]
https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/gitea/tasks.py$36-41

Related to T2577
2020-09-10 11:01:44 +02:00
Antoine R. Dumont (@ardumont)
725c1fe4ad
test_utils: Migrate to pytest 2020-09-09 18:48:07 +02:00
Antoine R. Dumont (@ardumont)
5a5b7ef70b
tests: Separate lister instantiations
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.

The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
2020-09-02 12:49:00 +02:00
Antoine R. Dumont (@ardumont)
92422dcf75
pytest_plugin: Instantiate only lister with no particular setup
This should fix the remaining blocking problems in the jenkins build failure
report [1]

[1] https://jenkins.softwareheritage.org/view/Debian%20packages/job/debian/job/packages/job/DLS/job/gbp-buildpackage/78/consoleFull
2020-09-02 12:25:15 +02:00
Antoine R. Dumont (@ardumont)
e99d3464e4
test_cli: Exclude launchpad lister from the check
This should fix the build [1]

[1] https://jenkins.softwareheritage.org/view/Debian%20packages/job/debian/job/packages/job/DLS/job/gbp-buildpackage/77/console
2020-09-01 15:55:24 +02:00
Nicolas Dandrimont
211f4610df Move get_scheduler monkeypatching into an explicit pytest fixture
This allows us to actually run the lister instantiation code instead of relying
on the underlying structure of the lister object. In turn, this allows future
listers to use the scheduler right in their __init__.
2020-07-16 12:14:04 +02:00
Nicolas Dandrimont
c9963d4302 Use the new names for the swh.scheduler test fixtures 2020-07-09 17:06:50 +02:00
Léni Gauffier
58ef08b083 Added LaunchpadLister
Summary:
Related to T1734

From abandonned D2799

Reviewers: ardumont

Reviewed By: ardumont

Differential Revision: https://forge.softwareheritage.org/D2974
2020-04-12 01:00:12 +02:00