The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parse them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.
Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.
Related to T1724
That avoids having multiple distinct opam root directories per opam lister instance. The
current opam commands used by the lister are actually listing specifically per instance.
Related to P1171
We should avoid side-effects in the constructor as much as possible. That avoids
surprising behavior at object instantiation time. The state if needed must be
initialized into the `swh.lister.pattern.Lister.get_pages` method, as preconized in the
class docstring.
This also fixes the current test that actually bootstrap a real opam local "clone" in
/tmp.
Related to T3590
GitLab API can return errors 500 when listing projects
(see https://gitlab.com/gitlab-org/gitlab/-/issues/262629).
To avoid ending the listing prematurely, skip buggy URLs and move
to next pages.
Related to T3442
Increase number of origins per page to the maximum value allowed
by GitLab API (100) to send less requests.
Ask for simple responses to reduce size of JSON data.
Temporarily server failures can happen when listing a GitLab instance,
HTTP status codes 502, 503 or 520 are returned in that case.
So adapt lister requests retry policy to execute requests again when
such errors are encountered.
Related to T3442
Make the instance parameter of the base pattern lister optional and set
lister name to URL network location when not provided.
It simplifies lister creation when associated forge type have a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.
Also process listers for forge with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.
Related to T3403
This rewrote the current implementation to actually use pypi's xml-rpc api which allows
to be incremental. It also allows to fetch the last release date per package. This last
part actually make it possible to update the "last_update" entry in the ListedOrigin
model.
Related to T3399
Since this lister is doing a lot more requests than most other, it makes
sense that issues would arise more often. We want the lister to continue
even if the website is having issues and not break on the first 500 or
closed connection it encounters.
This change introduces a mechanism to retry all exceptions worth
retrying and uses it for the SourceForge lister. Other listers might
benefit from this, but this is out of scope here.
Tests had to be adjusted to stub the sleep function since retries happened
way more often.
It's suboptimal to say the least to stop the entire lister process
if a single project page is somehow broken (404, most likely). This
change logs the issue as a warning and carries on, as well as some
minor logging changes and comments touch ups.
The credentials parameter is not optional due to the instance constructor logic. Even if
unused, this must be provided to the lister (from the task standpoint).
Related to T3310#64801
SourceForge's sitemaps (1 main one + many sharded) give us a "last
modified" date for every subsitemap and project, allowing us to perform
an incremental listing.
We store the subsitemaps' "last modified" dates in the lister state, as
well as those of the empty projects (projects which don't have any VCS
registered), and the rest comes from the already visited origins from
the database.
The tests try to cover the possible cases of a subsitemap that has
changed, one that hasn't, a project that has change, one that hasn't,
and same for an empty project.
Enable to check package documentation can be built without producing
sphinx warnings.
The sphinx environment is designed to be used in continuous integration
in order to prevent breaking documentation build when committing changes.
The sphinx-dev environment is designed to be used inside a full swh
development environment.
Related to T3258
Bitbucket's API kind of supports REST workflows, but the clearly use it
like an RPC API (the hardcoded schema in `PROJECT_API_URL_FORMAT`
make it particularly clear)
Following zack's work on T735, this change introduces an actual SWH lister for
SourceForge.
SourceForge provides a main sitemap that lists sharded sitemaps, which
themselves list pages. Each page belongs to a project (or sub-project,
though those are rare), information about which can be found by querying
a REST API, which gives us the list of any and all VCS used for said
project. Both sitemaps and pages have a "last modified" timestamp that
will be used in a future patch to implement incremental listing.
More precise information can be found as inline comments or docstrings.