Since this lister makes many more requests than most others, it makes
sense that issues would arise more often. We want the lister to continue
even if the website is having issues and not break on the first 500 or
closed connection it encounters.
This change introduces a mechanism to retry all exceptions worth
retrying and uses it for the SourceForge lister. Other listers might
benefit from this, but this is out of scope here.
Tests had to be adjusted to stub the sleep function since retries happened
way more often.
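
A minimal sketch of the idea, not the code introduced by this change: the
names `TransientError` and `call_with_retries` are hypothetical, and the
exponential backoff policy is an assumption. Passing the sleep function as
a parameter lets tests stub it out.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for an error worth retrying (HTTP 500, closed connection)."""


def call_with_retries(func, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Usage: fail twice, then succeed. The sleep function is stubbed with
# list.append, so the delays are recorded and nothing actually waits.
calls = {"count": 0}
delays = []


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("HTTP 500")
    return "ok"


result = call_with_retries(flaky_fetch, sleep=delays.append)
```

Here `result` is the value of the third call, and `delays` holds the two
waits that would have happened between attempts.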
It's suboptimal, to say the least, to stop the entire lister process
if a single project page is somehow broken (404, most likely). This
change logs the issue as a warning and carries on. It also includes
some minor logging changes and comment touch-ups.
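
The "warn and carry on" behavior can be sketched as follows. This is an
illustration only: `PageError`, `list_projects` and the `fetch` callable are
hypothetical names, not the lister's actual API.

```python
import logging

logger = logging.getLogger(__name__)


class PageError(Exception):
    """Hypothetical error raised when a project page cannot be fetched."""


def list_projects(project_urls, fetch):
    """Yield (url, data) for each project, skipping the ones that fail."""
    for url in project_urls:
        try:
            data = fetch(url)
        except PageError as exc:
            # One broken page must not stop the whole listing.
            logger.warning("Skipping %s: %s", url, exc)
            continue
        yield url, data


# Usage with a fake fetch function in which one page is broken.
def fake_fetch(url):
    if url.endswith("/broken"):
        raise PageError("404 Not Found")
    return {"url": url}


listed = list(
    list_projects(
        ["https://example.org/a", "https://example.org/broken", "https://example.org/b"],
        fake_fetch,
    )
)
```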
The credentials parameter is not optional due to the instance constructor logic. Even if
unused, this must be provided to the lister (from the task standpoint).
Related to T3310#64801
SourceForge's sitemaps (1 main one + many sharded) give us a "last
modified" date for every subsitemap and project, allowing us to perform
an incremental listing.
We store the subsitemaps' "last modified" dates in the lister state, as
well as those of the empty projects (projects which don't have any VCS
registered), and the rest comes from the already visited origins from
the database.
The tests try to cover the possible cases of a subsitemap that has
changed, one that hasn't, a project that has changed, one that hasn't,
and same for an empty project.
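
As a rough sketch of the subsitemap part of that logic, assuming the state
is a plain mapping from subsitemap URL to its "last modified" date (the
function name and the strict "newer than" comparison are assumptions, not
taken from the lister):

```python
from datetime import date


def subsitemaps_to_visit(current, previous):
    """Return the subsitemap URLs that are new or whose date has advanced."""
    return [
        url
        for url, last_modified in current.items()
        if url not in previous or last_modified > previous[url]
    ]


# State saved by the previous listing.
previous = {
    "sitemap-0.xml": date(2021, 5, 1),
    "sitemap-1.xml": date(2021, 5, 1),
}
# Dates read from the main sitemap during the current listing.
current = {
    "sitemap-0.xml": date(2021, 5, 1),  # unchanged, skipped
    "sitemap-1.xml": date(2021, 5, 3),  # changed, visited
    "sitemap-2.xml": date(2021, 5, 2),  # never seen before, visited
}
to_visit = subsitemaps_to_visit(current, previous)
```

The same comparison would apply to individual projects, with the previous
date coming either from the lister state (empty projects) or from the
database (already visited origins).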
Make it possible to check that the package documentation can be built
without producing sphinx warnings.
The sphinx environment is designed to be used in continuous integration
in order to prevent breaking documentation build when committing changes.
The sphinx-dev environment is designed to be used inside a full swh
development environment.
Related to T3258
Bitbucket's API kind of supports REST workflows, but they clearly use it
like an RPC API (the hardcoded schema in `PROJECT_API_URL_FORMAT`
makes it particularly clear).
Following zack's work on T735, this change introduces an actual SWH lister for
SourceForge.
SourceForge provides a main sitemap that lists sharded sitemaps, which
themselves list pages. Each page belongs to a project (or sub-project,
though those are rare), information about which can be found by querying
a REST API, which gives us the list of any and all VCS used for said
project. Both sitemaps and pages have a "last modified" timestamp that
will be used in a future patch to implement incremental listing.
More precise information can be found as inline comments or docstrings.
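
To illustrate the first step, here is a small sketch that reads a sitemap
index using only the standard library. The XML namespace is the one defined
by the sitemap protocol; the function name and the sample URLs are made up
for the example, and the actual lister may parse sitemaps differently.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_index(xml_text):
    """Return a list of (location, last modified) pairs for each sub-sitemap."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:sitemap", NS):
        loc = node.findtext("sm:loc", namespaces=NS)
        lastmod = node.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, lastmod))
    return entries


# Made-up sample index with two sharded sitemaps.
SAMPLE = """
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.org/sitemap-0.xml</loc>
    <lastmod>2021-03-18</lastmod>
  </sitemap>
  <sitemap>
    <loc>https://example.org/sitemap-1.xml</loc>
    <lastmod>2021-03-19</lastmod>
  </sitemap>
</sitemapindex>
"""

entries = parse_sitemap_index(SAMPLE.strip())
```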
This adds a new tutorial which details how to write new listers today (both
incremental and stateless). It also proposes a python template file to start a new lister.
Finally, this renames the previous tutorial to tutorial-2017.
Related to T3073
Some distributions (e.g. debian-security) have a slightly different URL
for retrieving source packages metadata.
So add a new URL template to process when trying to download such data.
Related to T3032#58239
A CRAN package can appear twice in the JSON list returned by the
list_all_packages.R script, with the most recent version of the package
appearing first.
So handle that edge case to avoid an error when sending origins to
the scheduler.
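
One way to handle it, sketched under the assumption that each entry is a
dict with "Package" and "Version" keys (those key names and the function
name are assumptions for this example): keep only the first occurrence of
each package name, which is the most recent version.

```python
def deduplicate_packages(packages):
    """Keep the first entry seen for each package name, preserving order."""
    seen = set()
    unique = []
    for package in packages:
        name = package["Package"]
        if name in seen:
            continue  # older version of a package already kept
        seen.add(name)
        unique.append(package)
    return unique


# "foo" appears twice; the most recent version comes first in the list.
packages = [
    {"Package": "foo", "Version": "2.0"},
    {"Package": "bar", "Version": "1.3"},
    {"Package": "foo", "Version": "1.0"},
]
unique_packages = deduplicate_packages(packages)
```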
xmltodict now raises an error while trying to parse the HTML content
of https://pypi.org/simple/ page.
So use BeautifulSoup HTML parser instead as it is already a requirement
of swh-lister and it does not fail parsing the PyPI HTML page.
Also drop no longer used xmltodict in requirements.
Legacy Lister classes from the swh.lister.core module are no longer
used in swh-lister codebase so it is time to remove them.
Also remove lister CLI options related to legacy Lister API.
As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.
Closes T2442
The previous implementation was generating tasks for a non implemented
Packagist loader.
The new implementation extracts the source repository URL, VCS type and
last update date for each package referenced by Packagist and sends
that information to the scheduler.
Package metadata is retrieved using Packagist API endpoints whose
responses are served from static files, which are guaranteed to be
efficient on the Packagist side (no dynamic queries).
Furthermore, subsequent listing will send the "If-Modified-Since" HTTP
header to only retrieve packages metadata updated since the previous
listing operation in order to save bandwidth and return only origins
which might have new released versions.
Closes T2991
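
A small sketch of how such a header can be built with the standard library.
The function name is hypothetical and this is not necessarily how the
lister builds it; the header value uses the HTTP date format in GMT.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def conditional_headers(last_listing):
    """Return HTTP headers asking only for content modified since last_listing."""
    if last_listing is None:
        return {}  # first listing: fetch everything
    as_utc = last_listing.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


first_run = conditional_headers(None)
next_run = conditional_headers(datetime(2021, 3, 1, 12, 0, tzinfo=timezone.utc))
```

A server honoring the header answers "304 Not Modified" with an empty body
when nothing changed, so the package can be skipped without downloading its
metadata again.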
This adds a second behavior to the cgit lister to actually compute origin urls instead
of parsing them out of another http request on git detailed page.
This new behavior is expected to be the default behavior.
The old behavior is kept for now and is expected to be used as a fallback if too many
false negatives are returned.
Related to T2999
As origins is a generator, the previous behavior would try to consume the entire
generator before sending the records.
This groups origins into batches of 100 and sends each batch to the scheduler for writing.
Related to T3003
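
The batching can be sketched as below. The helper name `batched` is an
assumption for this example; only the batch size of 100 comes from the
change itself.

```python
from itertools import islice


def batched(iterable, size=100):
    """Yield lists of at most `size` items, consuming the input lazily."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Usage: 250 origins produced by a generator are split into three batches.
origins = (f"https://example.org/repo/{i}" for i in range(250))
batch_sizes = [len(batch) for batch in batched(origins)]
```

Only one batch is held in memory at a time, so the scheduler starts
receiving origins before the generator is exhausted.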
Port launchpad lister to the swh.lister.pattern.Lister API.
The last update date of each listed git repository is now sent to the scheduler.
The lister can work in incremental mode: only repositories modified since
the last listing operation will be returned in that case.
Closes T2992
Drop the launchpad lister from the listers to check, as its test setup is more involved
than that of the other listers. As its setup is not done in that test, it's actually connecting
anonymously to the launchpad server. So remove that lister from the test.
This should also fix the debian build, which refuses such access [1]
[1] https://jenkins.softwareheritage.org/job/debian/job/packages/job/DLS/job/gbp-buildpackage/97/console