Commit graph

931 commits

Author SHA1 Message Date
Valentin Lorentz
596e8c6c40 Fix crash of 'swh lister run' when called without -l
```
$ swh lister run
Traceback (most recent call last):
  File "/home/dev/.local/bin/swh", line 33, in <module>
    sys.exit(load_entry_point('swh.core', 'console_scripts', 'swh')())
  File "/home/dev/swh-environment/swh-core/swh/core/cli/__init__.py", line 144, in main
    return swh(auto_envvar_prefix="SWH")
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/dev/.local/lib/python3.9/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/home/dev/.local/lib/python3.9/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/home/dev/swh-environment/swh-lister/swh/lister/cli.py", line 68, in run
    get_lister(lister, **config).run()
  File "/home/dev/swh-environment/swh-lister/swh/lister/__init__.py", line 75, in get_lister
    raise ValueError(
ValueError: Invalid lister None: only supported listers are ['arch', 'aur', 'bitbucket', 'bower', 'cgit', 'conda', 'cpan', 'cran', 'crates', 'debian', 'fedora', 'gitea', 'github', 'gitlab', 'gnu', 'gogs', 'golang', 'hackage', 'hex', 'launchpad', 'maven', 'nixguix', 'npm', 'nuget', 'opam', 'packagist', 'phabricator', 'pubdev', 'puppet', 'pypi', 'rubygems', 'sourceforge', 'tuleap']
```
2023-05-10 10:19:26 +02:00
Antoine R. Dumont (@ardumont)
5ebc57912f lister/nixguix: Make artifact nature check happen on all urls
Starting with the first url. As soon as one detection succeeds, this stops and yields
the result. Otherwise, continue with the detection on the next mirror url.

This should fix the current misbehavior [1] when multiple mirror urls are not ok but the
first one is.

[1] https://gitlab.softwareheritage.org/swh/infra/sysadm-environment/-/issues/4868#note_137483

Refs. swh/infra/sysadm-environment#4868
2023-04-27 18:16:20 +02:00
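
The fallback behavior described in the commit above can be sketched as follows. This is a minimal illustration, not the lister's actual code: `first_detected_nature` and `detect_by_extension` are hypothetical names, and the extension-based detector is a toy stand-in for the real artifact nature check.

```python
from typing import Callable, Iterable, Optional, Tuple


def first_detected_nature(
    urls: Iterable[str],
    detect: Callable[[str], str],
) -> Optional[Tuple[str, str]]:
    """Return (url, nature) for the first url whose detection succeeds."""
    for url in urls:
        try:
            nature = detect(url)
        except ValueError:
            # Detection failed for this url: fall through to the next mirror.
            continue
        # First success wins, remaining mirrors are not checked.
        return url, nature
    return None


def detect_by_extension(url: str) -> str:
    """Toy detector (assumption): only recognizes two tarball extensions."""
    if url.endswith((".tar.gz", ".zip")):
        return "tarball"
    raise ValueError(f"cannot detect artifact nature of {url}")
```

Returning on the first success keeps the number of checks per artifact as low as possible while still giving every mirror url a chance.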
Antoine R. Dumont (@ardumont)
bf826618b4 nixguix/lister: Rename checksums_computation to checksum_layout
Refs. swh/meta#4979
2023-04-26 11:12:13 +02:00
Antoine Lambert
4f57e84450 Use http_retry decorator from swh.core.retry module
The http_retry decorator has been moved to swh-core package in order
to ease its reuse across swh packages.
2023-04-13 14:19:57 +02:00
Kumar Shivendu
1ee549fc9d hex: Update loader arguments 2023-03-22 08:45:41 +00:00
Antoine Lambert
35871896b2 pattern: Improve handling of max_origins_per_page parameter
Instead of fully consuming the get_origins_from_page generator into
a list and truncating it, consume the generator origin by origin and
abort the process when the maximum number of origins per page is
reached.

Some non-trivial listers, like the cgit one, can perform costly
processing (HTTP requests for instance) for each origin in a page.
It is therefore better not to consume the full generator at once, to
avoid such side effects.
2023-03-21 16:56:48 +01:00
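
The difference between truncating a list and consuming a generator lazily can be sketched as below. This is an illustration under assumptions, not the code from the commit: the `get_origins_from_page` signature is simplified, `limited_origins` is a hypothetical helper, and the `log` list stands in for a costly per-origin HTTP request.

```python
from itertools import islice
from typing import Iterator, List, Optional


def get_origins_from_page(page: List[int], log: List[int]) -> Iterator[str]:
    """Yield one origin url per page item, recording each costly step."""
    for item in page:
        log.append(item)  # stands in for a costly HTTP request
        yield f"https://example.org/repo/{item}"


def limited_origins(
    page: List[int], log: List[int], max_origins_per_page: Optional[int]
) -> Iterator[str]:
    """Consume the generator origin by origin, stopping at the limit."""
    # islice with a stop of None yields everything, so no limit is a no-op.
    return islice(get_origins_from_page(page, log), max_origins_per_page)
```

With a limit of 2 on a page of five items, only two of the costly steps run, whereas `list(...)[:2]` would have run all five.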
Antoine R. Dumont (@ardumont)
45bbc29a52 cgit/tasks: Allow passing extra parameters to task
This aligns the cgit task with the other lister task modules. It also allows the cgit
task to be scheduled by the add-forge-now scheduler cli.

Refs. swh/infra/sysadm-environment#4813
2023-03-21 12:22:07 +01:00
KShivendu
571d69f965 fix(hex): Use page_size for stopping condition 2023-03-14 17:59:46 +00:00
KShivendu
6d228a8147 fix(hex): Update Hex Lister task name 2023-03-14 17:59:46 +00:00
KShivendu
9095bbec00 fix(hex): Use only updated_after for pagination 2023-03-14 17:59:46 +00:00
KShivendu
ac9993a001 test(hex): Improve comments in the lister tests 2023-03-14 17:59:46 +00:00
KShivendu
cfd9a693aa feat(hex): Use updated_after search query 2023-03-14 17:59:46 +00:00
KShivendu
a452995d95 feat: Add Hex.pm lister 2023-03-14 17:59:46 +00:00
Antoine Lambert
5d0f35aa69 bitbucket: Skip buggy page when listing
Some URLs of the repositories endpoint from BitBucket REST API 2.0
can return an error 500. In that case, skip the buggy repositories
page and fetch the next one, so that the listing continues instead of
ending prematurely.

Related to #4239
2023-03-10 15:37:44 +01:00
Antoine Lambert
7da7fa57d0 bitbucket: Prettify JSON tests data 2023-03-09 14:20:56 +01:00
Antoine Lambert
fc2bd1e937 mypy: Bump to 1.0.1 and fix new typing errors
Related to swh/meta#4960
2023-02-17 17:56:07 +01:00
Jérémy Bobbio (Lunar)
4ca84310eb Update and clean tox configuration for version 4
Related to swh/meta#4959
2023-02-16 15:43:51 +01:00
Valentin Lorentz
62b0835193 maven: Discard templated origin URLs 2023-02-10 10:43:36 +00:00
Antoine Lambert
36179044ad pre-commit: Bump isort from 5.10.1 to 5.11.5
This fixes Python 3.7 support, which broke because poetry, a dependency
of isort, removed support for that Python version in a recent release.
2023-02-02 11:07:28 +01:00
Antoine Lambert
bcf30aba90 github: Fix fixtures use in tests
requests_ratelimited fixture from swh-core was renamed to
github_requests_ratelimited.

remaining_requests parameter was added to the github_response_callback
function from swh-core, making it no longer compatible with requests_mock
callback for json responses.
2023-01-02 18:06:26 +01:00
Antoine Lambert
e218fbfef6 github: Fix test error with latest pytest release 2023-01-02 14:10:03 +01:00
Antoine Lambert
4cec345028 docs: Include module indices only when building standalone package doc
In order to remove warnings about /apidoc/*.rst files being included
multiple times in toc when building full swh documentation, prefer to
include module indices only when building standalone package documentation.

Also include them the proper sphinx way.

Related to T4496
2022-12-19 13:46:51 +01:00
Antoine R. Dumont (@ardumont)
b3b5639e9a gogs, gitea: Fix task execution to pass along extra kwargs
Related to https://gitlab.softwareheritage.org/infra/sysadm-environment/-/issues/4684
2022-12-14 16:09:56 +01:00
Nicolas Dandrimont
e785e67315 Hook up recently introduced options to all listers
Hopefully one day we'll be able to replace all of this mess with PEP692
TypedDict kwargs, but that's only on track for Python 3.12.
2022-12-05 16:33:45 +01:00
Nicolas Dandrimont
5ea79ee3e0 gitlab: allow ignoring projects with certain path prefixes
Some GitLab instances use specific namespaces for transient repositories
that it doesn't make sense to archive (for example, gitlab.org has a set
of QA namespaces used for integration testing of their production
deployments; drupal has an `issues/` namespace with forks of repos that
are only used for collaboration on merge requests, and aren't that
useful to be archived).
2022-12-05 15:36:40 +01:00
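
Prefix-based filtering of this kind can be sketched as follows. The function name `filter_ignored` and the sample paths are assumptions made for illustration; the actual lister option and its handling may differ.

```python
from typing import Iterable, Iterator, Sequence


def filter_ignored(
    paths: Iterable[str], ignored_prefixes: Sequence[str]
) -> Iterator[str]:
    """Yield only the project paths that match none of the ignored prefixes."""
    prefixes = tuple(ignored_prefixes)
    for path in paths:
        # str.startswith accepts a tuple and matches any of its elements.
        if path.startswith(prefixes):
            continue
        yield path
```

Keeping the trailing slash in each prefix (for example `issues/`) avoids accidentally dropping a project whose name merely starts with the same letters.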
Nicolas Dandrimont
64267f8f50 Add a flag to not enable origins listed by a lister
This cuts down one more manual step in the add forge now validation
process: we can add the relevant origins to the staging scheduler
without enabling them at all.
2022-12-05 14:53:42 +01:00
Nicolas Dandrimont
b815737054 Add built-in page and origin count limit to listers
This will allow more automation of the staging add forge now process:
for known-good listers, we can limit the number of origins being
processed and reduce the amount of manual steps taken for each instance.
2022-12-05 14:53:42 +01:00
Nicolas Dandrimont
a66e24bfa2 Ignore psqlrc when loading the rubygems database dump
The SQL dump contains ownership instructions that can't be run if you
don't have the right users in your database clusters. When someone has a
psqlrc with ON_ERROR_STOP, this fails the load of the dump.

Use the opportunity to trigger an exception when psql returns a non-zero
exit code, rather than continue with an empty/inconsistent database.
2022-12-05 13:52:23 +01:00
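
A minimal sketch of the two changes described above, assuming the dump is loaded by invoking `psql` as a subprocess. The helper names `psql_load_command` and `load_dump` are hypothetical; `--no-psqlrc`, `--dbname` and `--file` are standard psql options.

```python
import subprocess
from typing import List


def psql_load_command(dump_path: str, dbname: str) -> List[str]:
    """Build a psql command line that ignores the user's psqlrc."""
    return ["psql", "--no-psqlrc", "--dbname", dbname, "--file", dump_path]


def load_dump(dump_path: str, dbname: str) -> None:
    """Load the dump, raising CalledProcessError on a non-zero exit code."""
    # check=True turns a failed psql run into an exception instead of
    # letting the caller continue with an empty or inconsistent database.
    subprocess.run(psql_load_command(dump_path, dbname), check=True)
```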
Antoine Lambert
f4aafe026b fedora: Update versions in packages dict provided as loader argument
In a similar way to the debian lister, use the following versions in the
packages dictionary provided to the generic rpm loader:

- dict keys are package versions prefixed by the fedora release and
  edition in which they were found (fedora{release}/{edition}/{version}),
  they will be used as branch names targeting releases in the snapshot
  created by the rpm loader

- version fields in dict values are the package intrinsic versions parsed
  from package repository metadata, excluding any ".fcXY" suffixes to
  prevent the loader from creating multiple releases targeting the same
  directory, they will be used as release names in the snapshot created
  by the rpm loader

Related to T4448
2022-11-21 14:14:17 +01:00
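
The two naming rules above can be sketched as follows. This is an illustration only: `branch_name` and `intrinsic_version` are hypothetical helpers, and the regular expression assumes the suffix has the form `.fc` followed by digits at the end of the version string.

```python
import re

# Assumption: the ".fcXY" suffix is ".fc" plus digits at the end of the string.
_FC_SUFFIX = re.compile(r"\.fc\d+$")


def branch_name(release: str, edition: str, version: str) -> str:
    """Build a packages dict key following fedora{release}/{edition}/{version}."""
    return f"fedora{release}/{edition}/{version}"


def intrinsic_version(version: str) -> str:
    """Strip a trailing ".fcXY" suffix so rebuilds share one release name."""
    return _FC_SUFFIX.sub("", version)
```

Under these assumptions, the same package version found in two fedora releases gets two distinct branch names but a single release name.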
Franck Bret
065b3f81a1 Hackage: Implement incremental mode
Use http api lastUpload argument in search query to retrieve new or
updated origins since last run

Related to T4597
2022-11-18 13:48:45 +01:00
KShivendu
6ad61aec23 feat(fedora): Introduce fedora lister
Summary: Lister to ingest fedora mirrors (.rpm)

Reviewers: #reviewers, vlorentz

Subscribers: vlorentz, olasd

Maniphest Tasks: T4448

Differential Revision: https://forge.softwareheritage.org/D8386
2022-11-15 15:53:52 +05:30
Franck Bret
ea146ce297 Nuget: Implement incremental listing
The lister is incremental and based on the value of ``commitTimeStamp`` retrieved from the index http api endpoint.

Related T1718
2022-11-14 09:30:54 +01:00
Franck Bret
e1f3f87c73 Puppet: Lister implements incremental mode
Use with_release_since api argument to retrieve modules that have been
updated since the lister was last executed.

Related T4519
2022-11-08 14:29:07 +01:00
Valentin Lorentz
e8699422d7 nixguix: Reject Git SSH URLs and pseudo-URLs
For consistency with Maven and Packagist listers
2022-11-04 15:58:50 +01:00
Valentin Lorentz
8ea4200909 Validate origin URLs before sending to the scheduler 2022-11-04 15:58:45 +01:00
Antoine Lambert
60707a45dd pubdev: Update outdated lister documentation 2022-10-28 11:22:15 +02:00
Antoine R. Dumont (@ardumont)
92d494261f lister: Make sure lister that requires github tokens can use it
Deploying the nixguix lister, I realized that even though the credentials configuration
is properly set for all listers, the listers actually requiring github origin
canonicalization do not have access to the github credentials. They are lost in the
constructor, which only keeps the lister's own credentials. This currently translates
to listers being rate-limited.

This commit fixes it by moving the self.github_session instantiation into the constructor
when the lister explicitly requires the github session, hence lifting the rate limit
for maven, packagist, nixguix, and github listers.

Related to infra/sysadm-environment#4655
2022-10-26 17:23:40 +02:00
Antoine R. Dumont (@ardumont)
81688ca17e nixguix: Use content-disposition from http head request if provided
As a last fallback after the content-type check, instead of raising immediately.

Related to T3781
2022-10-26 11:58:54 +02:00
Antoine R. Dumont (@ardumont)
026fea21da nixguix: Deal with edge case url with version instead of extension
Prior to this, some urls were detected as files because their version names were wrongly
detected as extensions, hence not matching tarball extensions.

Related to T3781
2022-10-26 10:06:16 +02:00
Franck Bret
8355fee25f Puppet: Switch artifacts from dict to list 2022-10-25 14:49:09 +02:00
Antoine R. Dumont (@ardumont)
ca4ab7f277 nixguix: Allow lister to ignore specific extensions
Those extensions can be extended through configuration. They default to some binary
formats already encountered during docker runs.

Related to T3781
2022-10-25 12:09:01 +02:00
Antoine R. Dumont (@ardumont)
d96a39d5b0 nixguix/test: Add all supported tarball extensions to test manifest
The next step is to add some extension filtering, so we might as well harden the test
dataset first.

Related to T3781
2022-10-25 11:28:56 +02:00
Antoine Lambert
4f6b3f3f09 conda: Yield listed origins after all artifacts in a page are processed
swh-scheduler will deduplicate listed origins according to their URL
and visit type but not according to their extra loader arguments.

Previously, listed origins were yielded after each processed artifact
in a page so we could lose some package version info due to the
deduplication process.

So make sure to yield listed origins only once all artifacts in a page
have been processed.
2022-10-25 10:49:52 +02:00
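
Why yielding per artifact loses information can be sketched as below. This is a simplified model, not the conda lister's code: `origins_for_page`, the `(url, version)` artifact tuples and the `versions` argument name are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Iterator, List, Tuple


def origins_for_page(
    artifacts: Iterable[Tuple[str, str]]
) -> Iterator[Dict[str, Any]]:
    """Yield one origin per url, after all artifacts of the page are seen."""
    versions_by_url: Dict[str, List[str]] = defaultdict(list)
    for url, version in artifacts:
        # Accumulate first: nothing is yielded while the page is processed.
        versions_by_url[url].append(version)
    for url, versions in versions_by_url.items():
        # Each (url, visit_type) pair is yielded once, with every version,
        # so deduplication on that pair cannot drop version information.
        yield {
            "url": url,
            "visit_type": "conda",
            "extra_loader_arguments": {"versions": versions},
        }
```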
Antoine R. Dumont (@ardumont)
31eb5f637f Add support for more tarball recognition based on extensions
This requires those extensions to be supported by loaders too (in
swh.core.tarball).

Related to T3781
2022-10-25 09:50:31 +02:00
Antoine R. Dumont (@ardumont)
8a82bbf95f gogs/lister: Allow public gogs instance listing
Prior to this commit, the lister assumed authentication was required. There are public
gogs instances which do not require it.

This also updates documentation to mention the usual api location. This is useful when
people want to actually trigger a listing as a pre-flight check.

This drops repetitive instruction in the gitea lister as well.

Co-authored with Antoine Lambert (@anlambert) <anlambert@softwareheritage.org>.

Related to infra/sysadm-environment#4644
2022-10-21 18:21:18 +02:00
Antoine Lambert
0baaf68cff nixguix: Fix typo detected by codespell 2022-10-19 14:47:36 +02:00
David Douard
8778b9cdbf pre-commit, tox: Bump pre-commit, codespell, black and flake8
- pre-commit from 4.1.0 to 4.3.0,
- codespell from 2.2.1 to 2.2.2,
- black from 22.3.0 to 22.10.0 and
- flake8 from 4.0.1 to 5.0.4. Also freeze flake8 dependencies.

Also change flake8's repo config to github (the gitlab mirror
being outdated).
2022-10-18 18:53:29 +02:00
Valentin Lorentz
db2f2f8265 maven: Use real data from github API + rely on requests_mock_datadir 2022-10-13 18:28:17 +02:00
Valentin Lorentz
f7ac524a55 maven: Use requests_mock_datadir to simplify mocking. 2022-10-13 17:57:55 +02:00
Valentin Lorentz
3dbe77156c maven: Make assertions more useful
By using set equality, pytest can diff both operands; whereas equality
comparison failures are harder to read.
2022-10-13 17:41:11 +02:00
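
The assertion style described above can be sketched as follows. The helper `listed_urls` and the sample origins are hypothetical and are not taken from the maven tests.

```python
from typing import Dict, Iterable, Set


def listed_urls(origins: Iterable[Dict[str, str]]) -> Set[str]:
    """Collect the urls of listed origins into a set for comparison."""
    return {origin["url"] for origin in origins}


# Hypothetical listed origins, in arbitrary order.
origins = [
    {"url": "https://example.org/repo/b", "visit_type": "git"},
    {"url": "https://example.org/repo/a", "visit_type": "git"},
]
expected = {"https://example.org/repo/a", "https://example.org/repo/b"}

# Comparing two sets is order-independent, and on failure pytest reports
# which items appear only on the left or only on the right.
assert listed_urls(origins) == expected
```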