Commit graph

902 commits

Author SHA1 Message Date
Boris Baldassari
8991c625ea lister: Add new maven lister
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724
2021-11-29 17:33:13 +01:00
Antoine R. Dumont (@ardumont)
3ffea8f525
lister: Fix type
This fixes the master build [1]

[1] https://jenkins.softwareheritage.org/view/swh-draft/job/DLS/job/tests/1625/console
2021-11-23 10:13:19 +01:00
Antoine R. Dumont (@ardumont)
97553d8984
opam: Stop leaking temporary folders on machine 2021-11-10 16:58:35 +01:00
Valentin Lorentz
6243f800b4 cran: Pass the package name to the loader
It will be used to create a synthetic release message that contains
the package's name, like the Debian loader does.
2021-11-09 15:04:01 +01:00
Antoine Lambert
24bc671679 cgit: Enable retrying of throttled HTTP requests
Related to T3645
2021-10-22 15:15:05 +02:00
Antoine Lambert
20232cc36e cran: Fix ListedOrigin visit type
CRAN origins must be loaded with the cran visit type and not the tar one.

Related to T3675
2021-10-22 14:42:32 +02:00
Antoine R. Dumont (@ardumont)
5bba1a783a
Let sourceforge origins be enabled by default
Related to T3470
2021-10-11 13:03:40 +02:00
Antoine R. Dumont (@ardumont)
04dc628091
docs: Explain task type registering to complete the save forge doc
Related to T3629
2021-10-08 16:07:41 +02:00
Antoine R. Dumont (@ardumont)
1a9c08c93f
docs: Add a save forge documentation
This does not yet cover the registration of a new lister.

Related to T3629
2021-10-08 16:07:40 +02:00
Antoine R. Dumont (@ardumont)
e7716c0122
opam: Share opam root directory even on multiple instances
That avoids having multiple distinct opam root directories per opam lister instance. The
current opam commands used by the lister are actually listing specifically per instance.

Related to P1171
2021-09-24 11:55:07 +02:00
Antoine R. Dumont (@ardumont)
5ab6b00408
gnu: Respect the pattern docstring about state initialization
Any extra state initialization (outside the scheduler scope) is to happen in the
get_pages method.
2021-09-21 11:17:16 +02:00
Antoine R. Dumont (@ardumont)
332ed8e543
opam: Allow defining where to actually install the opam_root folder
Related to T3590
2021-09-21 11:17:16 +02:00
Antoine R. Dumont (@ardumont)
ff5e86ff48
opam: Make the instance optional and derived from the url
This matches how it's done for all other multi instances listers.

Related to T3590
2021-09-21 11:17:16 +02:00
Antoine R. Dumont (@ardumont)
b69b0b7fd6
opam: Move the state initialization into the get_pages method
We should avoid side-effects in the constructor as much as possible. That avoids
surprising behavior at object instantiation time. The state, if needed, must be
initialized in the `swh.lister.pattern.Lister.get_pages` method, as recommended in the
class docstring.

This also fixes the current test that actually bootstraps a real opam local "clone" in
/tmp.

Related to T3590
2021-09-21 11:17:16 +02:00
Antoine R. Dumont (@ardumont)
c803fc2b59
Allow gitlab lister's name to be overridden by task arguments
This will allow the heptapod instances to have their own dedicated stats.

Related to T3581
2021-09-17 14:27:16 +02:00
Antoine R. Dumont (@ardumont)
fdb420238c
gitlab: Allow ingestion of hg_git origins as hg ones
Related to T3581#70593
2021-09-17 12:17:11 +02:00
Antoine R. Dumont (@ardumont)
4e4edee478
gitlab: Allow listing of instances providing multiple vcs_type
This will allow listing the foss.heptapod.net instance, for example.

Related to T3581
2021-09-16 18:36:25 +02:00
Antoine Lambert
e904f4760e gitlab: Handle HTTP status code 500 when listing projects
GitLab API can return errors 500 when listing projects
(see https://gitlab.com/gitlab-org/gitlab/-/issues/262629).

To avoid ending the listing prematurely, skip buggy URLs and move
to next pages.

Related to T3442
2021-07-23 15:07:16 +02:00
Antoine Lambert
52c3150155 gitlab: Update requests query parameters
Increase number of origins per page to the maximum value allowed
by GitLab API (100) to send fewer requests.

Ask for simple responses to reduce size of JSON data.
2021-07-23 14:05:38 +02:00
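As an illustration of the query parameters described in the commit above, here is a minimal sketch that builds a GitLab project listing URL. `per_page` (maximum 100) and `simple` are documented GitLab API parameters. The helper name `projects_page_url` and the use of a plain `page` parameter are assumptions made for this example and may not match how the lister actually paginates.

```python
from urllib.parse import urlencode


def projects_page_url(api_base_url: str, page: int = 1) -> str:
    """Build a project listing URL (hypothetical helper, not the lister's code)."""
    params = {
        "page": page,        # assumption: offset-based pagination
        "per_page": 100,     # maximum page size allowed by the GitLab API
        "simple": "true",    # ask for reduced JSON payloads
    }
    return f"{api_base_url.rstrip('/')}/projects?{urlencode(params)}"


if __name__ == "__main__":
    print(projects_page_url("https://gitlab.com/api/v4/"))
```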
Antoine Lambert
73f85c0b8a gitlab: Adapt requests retry policy to consider HTTP 50x status codes
Temporary server failures can happen when listing a GitLab instance;
HTTP status codes 502, 503 or 520 are returned in that case.

So adapt lister requests retry policy to execute requests again when
such errors are encountered.

Related to T3442
2021-07-23 13:51:17 +02:00
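The retry policy described in the commit above can be sketched roughly as follows. This is not the lister's actual retry machinery: the function name `fetch_with_retry`, the exponential backoff, and the default attempt count are assumptions chosen for illustration. Only the status codes 502, 503 and 520 come from the commit message.

```python
import time
from typing import Callable, Tuple

# Status codes named in the commit message as temporary server failures.
RETRYABLE_STATUS_CODES = (502, 503, 520)


def fetch_with_retry(
    fetch: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call fetch() again while it returns a retryable status code.

    fetch is any callable returning (status_code, body). The last response
    is returned once it is not retryable or max_attempts has been reached.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status, body = fetch()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUS_CODES:
            break
        # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
        status, body = fetch()
    return status, body
```

Passing the `sleep` function as a parameter keeps the sketch easy to test without real waiting, which is the same concern the later test-related commits address by mocking the sleep function.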
Antoine R. Dumont (@ardumont)
f00d41d0cd
opam: Directly use the --root flag instead of using an env variable
This aligns the behavior with the opam loader

Related to T3358
2021-07-20 16:46:10 +02:00
Antoine Lambert
6c12350863 pattern: Use URL network location as instance name when not provided
Make the instance parameter of the base pattern lister optional and set
lister name to URL network location when not provided.

It simplifies lister creation when the associated forge type has a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.

Also process listers for forges with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.

Related to T3403
2021-07-13 12:33:49 +02:00
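The fallback described in the commit above (use the URL network location when no instance name is given) can be sketched with the standard library. The function name `resolve_instance_name` is hypothetical and does not reflect the actual code in the base pattern lister.

```python
from typing import Optional
from urllib.parse import urlparse


def resolve_instance_name(url: str, instance: Optional[str] = None) -> str:
    """Return the explicit instance name, else the URL's network location."""
    if instance is not None:
        return instance
    # netloc is the "host[:port]" part of the URL
    return urlparse(url).netloc


if __name__ == "__main__":
    print(resolve_instance_name("https://foss.heptapod.net/api/v4/"))
```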
Antoine R. Dumont (@ardumont)
df46b22098
Make PyPI lister incremental and complete in regards to last_update
This rewrites the current implementation to use PyPI's XML-RPC API, which allows the
listing to be incremental. It also allows fetching the last release date per package. This
last part makes it possible to update the "last_update" entry in the ListedOrigin
model.

Related to T3399
2021-07-09 12:51:58 +02:00
Antoine R. Dumont (@ardumont)
698be475e9
test_tasks: Align test consistently with others using mocker 2021-07-09 10:13:21 +02:00
zapashcanon
fe01d08cd9
add opam lister 2021-07-06 15:19:00 +02:00
Antoine Lambert
bf7d44db3c mypy: Fix errors with release >= v0.900 2021-06-09 14:02:23 +02:00
David Douard
de23a2219e Merge branch 'T3334_tuleap_lister' 2021-06-09 12:29:39 +02:00
Raphaël Gomès
e8f966de59 sourceforge: use http:// for Mercurial
See inline comment as to why.
This change also adds a Mercurial repo to the test data.
2021-06-04 10:51:56 +02:00
Raphaël Gomès
2e0c951be0 sourceforge: set the protocol for origin urls
I previously forgot to add the `https://` prefix to the cloning URL.
Whoops.
2021-06-03 10:01:33 +02:00
Antoine R. Dumont (@ardumont)
3a375d5bcc
Disable the sourceforge lister origins
This is a temporary workaround while we make a first pass on those repositories.

Related to T3350
2021-05-31 15:59:49 +02:00
Antoine R. Dumont (@ardumont)
729e76168f
cgit/lister: Fix error when a missing version is not provided
Related to [1]

[1] https://sentry.softwareheritage.org/share/issue/afe7279f9f2d4bdc86f4b1b068a281a5/
2021-05-28 12:09:56 +02:00
Raphaël Gomès
9ca5295a40 sourceforge: retry for all retryable exceptions
Since this lister is doing a lot more requests than most others, it makes
sense that issues would arise more often. We want the lister to continue
even if the website is having issues and not break on the first 500 or
closed connection it encounters.

This change introduces a mechanism to retry all exceptions worth
retrying and uses it for the SourceForge lister. Other listers might
benefit from this, but this is out of scope here.

Tests had to be adjusted to stub the sleep function since retries happened
way more often.
2021-05-26 12:05:39 +02:00
Boris Baldassari
04c0a50706 tuleap: initialise lister.
tuleap-lister: fix args in test_task.

tuleap-lister: Add rate-limiting test + fix debug and typo.

tuleap-lister: code review: fix mocker + tests/setup_cli.

tuleap-lister: code review: fix relister > lister.

tuleap-lister: code review: fix test_task kwargs.

tuleap-lister: code review: Remove authentication useless lines + fix typos.

tuleap-lister: code review: improve results_simplified for svn repos.

tuleap-lister: code review: add name to CONTRIBUTORS file.

tuleap-lister: code review: Update tutorial for misc files to edit.

tuleap-lister: code review: Update copyright to 2021 exactly.

tuleap-lister: code review: Update py files perms -X.

tuleap-lister: code review: minimise json files.

tuleap-lister: code review: fix chmod on json files.

tuleap-lister: code review: fix var names + add tests.

tuleap-lister: code review: fix useless indirection.

tuleap-lister: code review: Add empty repo test, minor typo fixes.
2021-05-26 11:09:12 +02:00
Raphaël Gomès
8f3bbacd5e sourceforge: don't abort on error for project
It's suboptimal to say the least to stop the entire lister process
if a single project page is somehow broken (404, most likely). This
change logs the issue as a warning and carries on, as well as some
minor logging changes and comments touch ups.
2021-05-12 15:54:53 +02:00
Antoine R. Dumont (@ardumont)
2ff549e125
sourceforge/tasks: Allow incremental listing
Related to T3310
2021-05-07 17:04:24 +02:00
Antoine R. Dumont (@ardumont)
7282647bb2
sourceforge/lister: Add credentials parameter
The credentials parameter is not optional due to the instance constructor logic. Even if
unused, this must be provided to the lister (from the task standpoint).

Related to T3310#64801
2021-05-07 16:46:09 +02:00
Antoine Lambert
3167a6dcb7 sourceforge/tests: Ensure correct sleep function gets mocked
This ensures the mocked sleep will work with all tenacity versions.

Related to T3310
2021-05-07 14:40:13 +02:00
Antoine Lambert
1284eb1587 sourceforge/tests: Fix failing test with tenacity < 5.1
It fixes debian package build of swh-lister on buster.
2021-05-07 14:05:36 +02:00
Raphaël Gomès
3baf1d0999 Make the SourceForge lister incremental
SourceForge's sitemaps (1 main one + many sharded) give us a "last
modified" date for every subsitemap and project, allowing us to perform
an incremental listing.

We store the subsitemaps' "last modified" dates in the lister state, as
well as those of the empty projects (projects which don't have any VCS
registered), and the rest comes from the already visited origins from
the database.

The tests try to cover the possible cases of a subsitemap that has
changed, one that hasn't, a project that has changed, one that hasn't,
and same for an empty project.
2021-05-06 10:28:27 +02:00
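The incremental step described in the commit above boils down to comparing each sub-sitemap's "last modified" date against the date stored in the lister state. The sketch below shows only that comparison. The function name, the dictionary-based state, and the example.org URLs are assumptions for illustration, not the lister's actual data model.

```python
from datetime import date
from typing import Dict, List


def subsitemaps_to_visit(
    listed: Dict[str, date], state: Dict[str, date]
) -> List[str]:
    """Return sub-sitemap URLs that are new or modified since the last run.

    listed maps each sub-sitemap URL to the "last modified" date found in
    the main sitemap; state maps URLs to the dates stored on the last run.
    """
    return [
        url
        for url, last_modified in listed.items()
        if url not in state or last_modified > state[url]
    ]


if __name__ == "__main__":
    # Hypothetical URLs and dates, for illustration only.
    state = {"https://example.org/sitemap-0.xml": date(2021, 3, 1)}
    listed = {
        "https://example.org/sitemap-0.xml": date(2021, 3, 1),
        "https://example.org/sitemap-1.xml": date(2021, 4, 2),
    }
    print(subsitemaps_to_visit(listed, state))
```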
Antoine Lambert
6f8dd5d3f2 tox: Add sphinx environments to check sane doc build
Make it possible to check that the package documentation can be built
without producing sphinx warnings.

The sphinx environment is designed to be used in continuous integration
in order to prevent breaking documentation build when committing changes.

The sphinx-dev environment is designed to be used inside a full swh
development environment.

Related to T3258
2021-04-28 14:05:20 +02:00
Valentin Lorentz
18b68bd8c7 s/REST( API)?/API/
Bitbucket's API kind of supports REST workflows, but they clearly use it
like an RPC API (the hardcoded schema in `PROJECT_API_URL_FORMAT`
makes it particularly clear)
2021-04-27 18:13:13 +02:00
Valentin Lorentz
40e1916510 Fix various Sphinx warnings 2021-04-13 21:56:08 +02:00
Valentin Lorentz
465506a0ce Remove old lister tutorial.
Sphinx complains because it's an orphan document.
2021-04-13 18:31:31 +02:00
Hezekiah Maina
d5d7830b64 Added Hezekiah Maina as a contributor 2021-04-04 23:33:22 +03:00
Hezekiah Maina
7124627400 Added the right location on linter.yml file 2021-04-04 23:22:46 +03:00
Raphaël Gomès
f7b27c6930 Add a non-incremental sourceforge lister
Following zack's work on T735, this change introduces an actual SWH lister for
SourceForge.

SourceForge provides a main sitemap that lists sharded sitemaps, which
themselves list pages. Each page belongs to a project (or sub-project,
though those are rare), information about which can be found by querying
a REST API, which gives us the list of any and all VCS used for said
project. Both sitemaps and pages have a "last modified" timestamp that
will be used in a future patch to implement incremental listing.

More precise information can be found as inline comments or docstrings.
2021-03-23 18:40:21 +01:00
Nicolas Dandrimont
879170a57d GitHub: handle edge cases with empty responses 2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
c375a61b16 GitHub: handle Server Errors
These errors happen, sometimes, when requesting large pages of results.
2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
4a215e68e0 GitHub: Move rate-limit reset logic to RateLimited exception
This makes the logic easier to test.
2021-03-19 16:52:46 +01:00
Nicolas Dandrimont
cfd4169bd8 Retry GitHub requests on ChunkEncodingErrors
These happen, sometimes, when the connection to the GitHub server
resets, e.g. because of congestion on a slow link.
2021-03-19 16:52:46 +01:00