57 commits

Author SHA1 Message Date
Antoine Lambert
c81c473a83 pagure: Implement lister for pagure forges
Pagure is a git-centered forge, python based using pygit2.

Its REST API makes it easy to list all projects hosted in an
instance, so the lister implementation is quite simple.

Related to swh/meta#5043.
2023-06-23 09:02:49 +00:00
Antoine R. Dumont (@ardumont)
19bdeefb14
lister: Allow lister to build url out of the instance parameter
This pushes the rather elementary logic within the lister's scope. This will simplify
and unify cli calls between the lister and scheduler clis. It will also help reduce
erroneous operations, which can happen for example in add-forge-now.

With the following, we will only have to provide the type and the instance, then
everything will be scheduled properly.

Refs. swh/devel/swh-lister#4693
2023-05-19 15:03:49 +02:00
Antoine Lambert
4f57e84450 Use http_retry decorator from swh.core.retry module
The http_retry decorator has been moved to swh-core package in order
to ease its reuse across swh packages.
2023-04-13 14:19:57 +02:00
KShivendu
a452995d95 feat: Add Hex.pm lister 2023-03-14 17:59:46 +00:00
Nicolas Dandrimont
64267f8f50 Add a flag to not enable origins listed by a lister
This cuts down one more manual step in the add forge now validation
process: we can add the relevant origins to the staging scheduler
without enabling them at all.
2022-12-05 14:53:42 +01:00
Nicolas Dandrimont
b815737054 Add built-in page and origin count limit to listers
This will allow more automation of the staging add forge now process:
for known-good listers, we can limit the number of origins being
processed and reduce the amount of manual steps taken for each instance.
2022-12-05 14:53:42 +01:00
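
The page and origin limits described in b815737054 can be pictured with a small generator. This is only a sketch under assumptions: the function name `limit_pages` and the parameter names `max_pages` and `max_origins_per_page` are illustrative and not taken from the commit itself.

```python
from typing import Iterable, Iterator, List, Optional


def limit_pages(
    pages: Iterable[List[str]],
    max_pages: Optional[int] = None,
    max_origins_per_page: Optional[int] = None,
) -> Iterator[List[str]]:
    """Yield at most max_pages pages, each cut to max_origins_per_page origins.

    A limit set to None means "no limit". Names are hypothetical.
    """
    for index, page in enumerate(pages):
        if max_pages is not None and index >= max_pages:
            break  # stop listing once the page budget is used up
        if max_origins_per_page is not None:
            page = page[:max_origins_per_page]
        yield page
```

With such limits, a staging run could process only the first pages of a forge instead of the whole instance.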
KShivendu
6ad61aec23 feat(fedora): Introduce fedora lister
Summary: Lister to ingest fedora mirrors (.rpm)

Reviewers: #reviewers, vlorentz

Subscribers: vlorentz, olasd

Maniphest Tasks: T4448

Differential Revision: https://forge.softwareheritage.org/D8386
2022-11-15 15:53:52 +05:30
Antoine R. Dumont (@ardumont)
6d2e7aa178
nixguix: Register task
Related to T3781
2022-10-03 18:26:36 +02:00
Antoine Lambert
8d85b2e4e8 pattern: Ensure accurate origin counts returned by run method
Previously, the run method was returning the total count of ListedOrigin
objects sent to scheduler database.

However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).

This change ensures an accurate count of listed origins by maintaining
a set of origin URLs associated with the sent ListedOrigin objects.
2022-09-29 11:14:08 +02:00
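
The counting approach from 8d85b2e4e8 (track origin URLs in a set instead of counting every object sent) can be sketched as follows. The function name and the use of plain URL strings in place of ListedOrigin objects are simplifications made for illustration.

```python
from typing import Iterable, List, Set


def count_unique_origins(pages: Iterable[List[str]]) -> int:
    """Count distinct origin URLs across all pages.

    An origin appearing on several pages is counted once, because
    URLs are accumulated in a set rather than summed per page.
    """
    seen: Set[str] = set()
    for page in pages:
        seen.update(page)
    return len(seen)
```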
Antoine Lambert
9c55acd286 Use generic HTTP retry policy by default and rename dedicated decorator
Instead of retrying HTTP requests only for 429 status code by default,
prefer to use the generic retry policy, which also retries on status
codes >= 500 and on ConnectionError exceptions.

Rename throttling_retry decorator to http_retry to reflect this change.
2022-09-26 10:48:40 +02:00
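
The retry conditions listed in 9c55acd286 (429, any status >= 500, connection errors) amount to a simple predicate. The sketch below is an assumption-laden illustration: `is_retryable` is a hypothetical name, and Python's built-in `ConnectionError` stands in for whatever exception class the HTTP library actually raises.

```python
from typing import Optional


def is_retryable(
    status_code: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> bool:
    """Return True when a request outcome should be retried.

    Built-in ConnectionError is a stand-in for the HTTP library's own
    connection exception (an assumption made for this sketch).
    """
    if isinstance(error, ConnectionError):
        return True
    if status_code is None:
        return False
    # 429 means "too many requests"; 5xx are server-side failures.
    return status_code == 429 or status_code >= 500
```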
KShivendu
d34a6232a6 gogs: Introduce Gogs lister 2022-08-03 16:22:06 +05:30
Antoine Lambert
d38e05cff7 python: Reformat code with black 22.3.0
Related to T3922
2022-04-08 15:15:09 +02:00
Antoine Lambert
445d539b3f Remove no longer needed tenacity workarounds
Now that we have packaged tenacity 6.2 for debian buster and use it
in production, we can remove the workarounds to support tenacity < 5.
2021-12-08 13:28:11 +01:00
Boris Baldassari
8991c625ea lister: Add new maven lister
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724
2021-11-29 17:33:13 +01:00
Antoine Lambert
6c12350863 pattern: Use URL network location as instance name when not provided
Make the instance parameter of the base pattern lister optional and set
lister name to URL network location when not provided.

It simplifies lister creation when the associated forge type has a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.

Also process listers for forges with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.

Related to T3403
2021-07-13 12:33:49 +02:00
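
The fallback described in 6c12350863 can be sketched with the standard library's URL parser. The helper name `instance_name` and the example URL are hypothetical; only the idea of using the URL network location comes from the commit message.

```python
from typing import Optional
from urllib.parse import urlparse


def instance_name(url: str, instance: Optional[str] = None) -> str:
    """Return the explicit instance name, or the URL network location."""
    if instance is not None:
        return instance
    # netloc is the "host[:port]" part of the URL
    return urlparse(url).netloc
```

For example, `instance_name("https://gitlab.example.org/api/v4/")` gives `gitlab.example.org`, which tells more about the listed forge than a generic name would.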
zapashcanon
fe01d08cd9
add opam lister 2021-07-06 15:19:00 +02:00
Boris Baldassari
04c0a50706 tuleap: initialise lister.
tuleap-lister: fix args in test_task.

tuleap-lister: Add rate-limiting test + fix debug and typo.

tuleap-lister: code review: fix mocker + tests/setup_cli.

tuleap-lister: code review: fix relister > lister.

tuleap-lister: code review: fix test_task kwargs.

tuleap-lister: code review: Remove authentication useless lines + fix typos.

tuleap-lister: code review: improve results_simplified for svn repos.

tuleap-lister: code review: add name to CONTRIBUTORS file.

tuleap-lister: code review: Update tutorial for misc files to edit.

tuleap-lister: code review: Update copyright to 2021 exactly.

tuleap-lister: code review: Update py files perms -X.

tuleap-lister: code review: minimise json files.

tuleap-lister: code review: fix chmod on json files.

tuleap-lister: code review: fix var names + add tests.

tuleap-lister: code review: fix useless indirection.

tuleap-lister: code review: Add empty repo test, minor typo fixes.
2021-05-26 11:09:12 +02:00
Antoine Lambert
8933544521 Remove no longer used legacy Lister API and update CLI options
Legacy Lister classes from the swh.lister.core module are no longer
used in swh-lister codebase so it is time to remove them.

Also remove lister CLI options related to legacy Lister API.

As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.

Closes T2442
2021-02-02 15:54:55 +01:00
Antoine Lambert
f862004700 launchpad: Reimplement lister using new Lister API
Port launchpad lister to the swh.lister.pattern.Lister API.

The last update date of each listed git repository is now sent to the scheduler.

The lister can work in incremental mode; in that case, only repositories modified
since the last listing operation are returned.

Closes T2992
2021-01-28 15:22:40 +01:00
Antoine R. Dumont (@ardumont)
cbd2cce339
test_cli: Drop launchpad lister from the test_get_lister
Drop the launchpad lister from the listers to check; its test setup is more involved
than that of the other listers. As its setup is not done in that test, it's actually
connecting anonymously to the launchpad server. So remove that lister from the test.

This should also fix the debian build which refuses such access [1]

[1] https://jenkins.softwareheritage.org/job/debian/job/packages/job/DLS/job/gbp-buildpackage/97/console
2021-01-27 17:18:53 +01:00
Antoine R. Dumont (@ardumont)
bea9d6d147
gitlab: make url mandatory and add type 2021-01-25 19:00:01 +01:00
Vincent SELLIER
d62e77c1b4
cgit lister: Add missing types on the init method
Related to T2984
2021-01-25 18:52:59 +01:00
Antoine Lambert
ea8ecee541 tests: Fix errors after swh-scheduler API update
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
2021-01-25 17:11:54 +01:00
tenma
7892077a89 tests.cli: add Gitea lister mandatory params 2021-01-25 15:54:06 +01:00
Vincent SELLIER
e4a590fc7f
Port cgit lister to the new lister api
Related to T2984
2021-01-25 14:57:45 +01:00
Antoine R. Dumont (@ardumont)
7f1609265f
test: Rename internal method to something public
It's used in multiple module tests now.
2021-01-25 13:39:07 +01:00
Antoine R. Dumont (@ardumont)
1390a513f2
gitlab: Port to the new lister api
Related to T2987
2021-01-25 08:51:16 +01:00
Antoine Lambert
9fd91f007d pattern: Fix and improve config overriding in from_configfile method
Fix error when a configuration value loaded from a config file is also
given as keyword parameter to the from_configfile method.

Override configuration loaded from config file only if the provided
value is not None.
2021-01-18 17:55:53 +01:00
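
The override rule from 9fd91f007d (a keyword value replaces the file value only when it is not None) can be sketched as a dictionary merge. `merge_config` is a hypothetical helper, not the signature of from_configfile.

```python
from typing import Any, Dict


def merge_config(file_config: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Merge keyword overrides into a config loaded from a file.

    Overrides whose value is None are ignored, so the file value wins.
    """
    merged = dict(file_config)  # copy, leave the loaded config untouched
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged
```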
Antoine Lambert
d1fbccd988 lister: Add utility decorator to ease HTTP requests rate limit handling
Add swh.lister.utils.throttling_retry decorator to retry a
function performing an HTTP request that can return a 429 status code.

The implementation is based on the tenacity module and it is assumed
that the requests library is used when querying a URL.

The default wait strategy is based on exponential backoff.

The default max number of attempts is set to 5, HTTPError exception
will then be reraised.

All tenacity.retry parameters can also be overridden in client code.
2021-01-18 11:28:51 +01:00
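
The behaviour described in d1fbccd988 (exponential backoff, at most 5 attempts, then the error is reraised) can be illustrated without tenacity. The sketch below is a plain-Python stand-in, not the tenacity-based implementation: `RateLimited`, `retry_with_backoff`, `base_delay` and the injectable `sleep` parameter are all hypothetical.

```python
import functools
import time
from typing import Any, Callable


class RateLimited(Exception):
    """Hypothetical stand-in for an HTTPError carrying a 429 status code."""


def retry_with_backoff(
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], Any] = time.sleep,
):
    """Retry the wrapped function on RateLimited, doubling the wait each time."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except RateLimited:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: re-raise the last error
                    # waits of base_delay, 2*base_delay, 4*base_delay, ...
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Passing a custom `sleep` keeps the sketch testable without real waiting; client code would normally keep the default.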
Nicolas Dandrimont
9944295729 Implement phabricator lister using the new pattern class 2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
734901747b Implement a base pattern for listers with no state storage 2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
525fc0102d Hook up listers implemented with the new pattern to the CLI
We stop depending on the ListerBase implementation. The main hoop we're jumping
through is the config override mechanism in swh.lister.get_lister, as it's
really specific to the ListerBase `override_config` argument, which is dropped in
pattern.Lister (in favor of explicit arguments at lister instantiation).

We implement a small shim in swh.lister.pattern.Lister to give
backwards-compatibility for the new pattern to get_lister.

This generic configuration override mechanism will probably be completely
removed when the configuration mechanism is reworked. We'll see.
2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
9e083c1eea Introduce a simpler base pattern for lister implementations.
This new pattern uses the lister support features introduced in swh.scheduler to
replace the database management done in previous iterations of the listers.
2021-01-11 11:00:29 +01:00
Antoine R. Dumont (@ardumont)
30ad6200a2
Drop mock_get_scheduler which creates indirection for no good reason
This is no longer useful: the tests still pass after removing it.
2020-10-16 14:32:46 +02:00
Antoine Lambert
22f7181294 python: Reorder imports with isort
Related to T2610
2020-09-17 17:48:27 +02:00
Antoine R. Dumont (@ardumont)
e3c856b5ee
utils.split_range: Split into not overlapping ranges
Existing listers use the `is_within_bound` [1] method from the base lister.
This method uses inclusive boundaries in all cases.

As some "range" task listers [2] [3] are using `split_range` function to create
"overlapping" ranges, this can cause concurrent insert issues down the line [4].

This commit adapts the function `split_range` to make the generated ranges no
longer overlap.

[1]
https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/core/lister_base.py$194-199

[2]
https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/gitlab/tasks.py$37-41

[3]
https://forge.softwareheritage.org/source/swh-lister/browse/master/swh/lister/gitea/tasks.py$36-41

Related to T2577
2020-09-10 11:01:44 +02:00
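
The issue addressed in e3c856b5ee is that ranges with inclusive bounds must not share an endpoint. The sketch below shows one way to produce such ranges; it is not the actual `split_range` code, and the name `split_inclusive_ranges` and its parameters are assumptions.

```python
from typing import Iterator, Tuple


def split_inclusive_ranges(total: int, size: int) -> Iterator[Tuple[int, int]]:
    """Split indexes 0..total-1 into consecutive (start, end) ranges.

    Both bounds are inclusive, and each range starts right after the
    previous one ends, so no index belongs to two ranges.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, total, size):
        yield start, min(start + size, total) - 1
```

An overlapping variant would yield (0, 4) then (4, 8), so index 4 would be handled by two tasks at once; here the second range starts at 4 only because the first one ends at 3.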
Antoine R. Dumont (@ardumont)
725c1fe4ad
test_utils: Migrate to pytest 2020-09-09 18:48:07 +02:00
Antoine R. Dumont (@ardumont)
5a5b7ef70b
tests: Separate lister instantiations
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.

The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
2020-09-02 12:49:00 +02:00
Antoine R. Dumont (@ardumont)
92422dcf75
pytest_plugin: Instantiate only lister with no particular setup
This should fix the remaining blocking problems in the jenkins build failure
report [1]

[1] https://jenkins.softwareheritage.org/view/Debian%20packages/job/debian/job/packages/job/DLS/job/gbp-buildpackage/78/consoleFull
2020-09-02 12:25:15 +02:00
Antoine R. Dumont (@ardumont)
e99d3464e4
test_cli: Exclude launchpad lister from the check
This should fix the build [1]

[1] https://jenkins.softwareheritage.org/view/Debian%20packages/job/debian/job/packages/job/DLS/job/gbp-buildpackage/77/console
2020-09-01 15:55:24 +02:00
Nicolas Dandrimont
211f4610df Move get_scheduler monkeypatching into an explicit pytest fixture
This allows us to actually run the lister instantiation code instead of relying
on the underlying structure of the lister object. In turn, this allows future
listers to use the scheduler right in their __init__.
2020-07-16 12:14:04 +02:00
Nicolas Dandrimont
c9963d4302 Use the new names for the swh.scheduler test fixtures 2020-07-09 17:06:50 +02:00
Léni Gauffier
58ef08b083 Added LaunchpadLister
Summary:
Related to T1734

From abandoned D2799

Reviewers: ardumont

Reviewed By: ardumont

Differential Revision: https://forge.softwareheritage.org/D2974
2020-04-12 01:00:12 +02:00
David Douard
93a4d8b784 Enable black
- blackify all the python files,
- enable black in pre-commit,
- add a black tox environment.
2020-04-08 16:31:22 +02:00
Antoine R. Dumont (@ardumont)
484377cc13
lister.cli: Remove task type register cli
It's now defined in swh.scheduler
2019-11-18 10:41:46 +01:00
Antoine R. Dumont (@ardumont)
eebbc859fc
lister.cli: Clarify configuration loading step 2019-11-08 10:50:51 +01:00
David Douard
b810876ef8 tasks: normalize the url argument name of most lister
Since all the listing tasks accept a url as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.

Also kill now useless `new_lister()` functions.
2019-09-04 15:38:01 +02:00
David Douard
8d9deeb8f8 plugins: add support for scheduler's task-type declaration
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:

- only create missing task-types (do not update them), but check that the
  backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
  plugin registry field will be checked and added if needed; tasks whose names
  start with an underscore will not be added,
- added task-type will have:
  - the 'type' field is derived from the task's function name (with underscores
    replaced with dashes),
  - the description field is the first line of that function's docstring,
  - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
    a simple pattern matching to have decent default values for full/incremental
    tasks),
  - these default values can be overloaded via the 'task_type' plugin registry
    entry.

For this, we had to rename all tasks names (eg. `cran_lister` -> `list_cran`).

Comes with some tests.
2019-09-04 15:36:08 +02:00
David Douard
e3c0ea9d90 implement listers as plugins
Listers are declared as plugins via the `swh.workers` entry_point.

As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:

- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optional): hook (callable) used to initialize the lister's state
  (typically, create/initialize the database for this lister).
  If not set, the default implementation creates database tables (after
  optionally having deleted existing ones) according to models declared in
  the `models` register field.

There is no need to explicitly add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).

Also refactor the cli tools a bit:
- add support for the standard --config-file option at the 'lister' group
  level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
  initializing (especially with --drop-tables) the database for a single
  lister is unreliable, since all tables are created using a single MetaData
  (in the same namespace).
2019-09-03 15:02:24 +02:00
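
The registry function described in e3c0ea9d90 returns a plain dict. A minimal sketch of its shape follows; `ExampleLister`, the function name `register` and the module path are hypothetical, and the optional `init` hook is omitted so the default behaviour would apply.

```python
from typing import Any, Dict


class ExampleLister:
    """Hypothetical lister class, used only for illustration."""


def register() -> Dict[str, Any]:
    """Registry function that a `swh.workers` entry point would reference."""
    return {
        "lister": ExampleLister,
        "models": [],  # SQLAlchemy models used by the lister would go here
        "task_modules": ["example_lister.tasks"],  # hypothetical module path
    }
```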
David Douard
87cec2f5c3 phabricator: refactor PhabricatorLister's constructor
- use the 'standard' api_baseurl as init argument,
- make it optional, with default to forge.softwareheritage.org,
- use origin_url as id.
2019-09-02 12:29:38 +02:00