829 commits

Author SHA1 Message Date
Antoine R. Dumont (@ardumont)
698be475e9
test_tasks: Align test with the other tests by using mocker 2021-07-09 10:13:21 +02:00
zapashcanon
fe01d08cd9
add opam lister 2021-07-06 15:19:00 +02:00
Antoine Lambert
bf7d44db3c mypy: Fix errors with release >= v0.900 2021-06-09 14:02:23 +02:00
David Douard
de23a2219e Merge branch 'T3334_tuleap_lister' 2021-06-09 12:29:39 +02:00
Raphaël Gomès
e8f966de59 sourceforge: use http:// for Mercurial
See inline comment as to why.
This change also adds a Mercurial repo to the test data.
2021-06-04 10:51:56 +02:00
Raphaël Gomès
2e0c951be0 sourceforge: set the protocol for origin urls
I previously forgot to add the `https://` prefix to the cloning URL.
Whoops.
2021-06-03 10:01:33 +02:00
Antoine R. Dumont (@ardumont)
3a375d5bcc
Disable the sourceforge lister origins
This is a temporary workaround until we make a first pass on those repositories.

Related to T3350
2021-05-31 15:59:49 +02:00
Antoine R. Dumont (@ardumont)
729e76168f
cgit/lister: Fix error when a missing version is not provided
Related to [1]

[1] https://sentry.softwareheritage.org/share/issue/afe7279f9f2d4bdc86f4b1b068a281a5/
2021-05-28 12:09:56 +02:00
Raphaël Gomès
9ca5295a40 sourceforge: retry for all retryable exceptions
Since this lister is doing a lot more requests than most others, it makes
sense that issues would arise more often. We want the lister to continue
even if the website is having issues and not break on the first 500 or
closed connection it encounters.

This change introduces a mechanism to retry all exceptions worth
retrying and uses it for the SourceForge lister. Other listers might
benefit from this, but this is out of scope here.

Tests had to be adjusted to stub the sleep function since retries happened
way more often.
2021-05-26 12:05:39 +02:00
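The retry idea described in this commit can be sketched as follows. This is a minimal, standard-library-only illustration and not the lister's actual code: the `RETRYABLE` tuple, the `call_with_retries` name, and the backoff parameters are all assumptions made for the example. The `sleep` argument is injectable so that tests can stub it, as the commit message mentions.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")

# Hypothetical choice of what counts as "retryable"; the real lister may differ.
RETRYABLE: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError)


def call_with_retries(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on retryable exceptions with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except RETRYABLE:
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the error
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable when max_attempts >= 1")
```

Passing a stub such as `sleep=lambda _: None` keeps tests fast even when many retries are triggered.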
Boris Baldassari
04c0a50706 tuleap: initialise lister.
tuleap-lister: fix args in test_task.

tuleap-lister: Add rate-limiting test + fix debug and typo.

tuleap-lister: code review: fix mocker + tests/setup_cli.

tuleap-lister: code review: fix relister > lister.

tuleap-lister: code review: fix test_task kwargs.

tuleap-lister: code review: Remove authentication useless lines + fix typos.

tuleap-lister: code review: improve results_simplified for svn repos.

tuleap-lister: code review: add name to CONTRIBUTORS file.

tuleap-lister: code review: Update tutorial for misc files to edit.

tuleap-lister: code review: Update copyright to 2021 exactly.

tuleap-lister: code review: Update py files perms -X.

tuleap-lister: code review: minimise json files.

tuleap-lister: code review: fix chmod on json files.

tuleap-lister: code review: fix var names + add tests.

tuleap-lister: code review: fix useless indirection.

tuleap-lister: code review: Add empty repo test, minor typo fixes.
2021-05-26 11:09:12 +02:00
Raphaël Gomès
8f3bbacd5e sourceforge: don't abort on error for project
It's suboptimal to say the least to stop the entire lister process
if a single project page is somehow broken (404, most likely). This
change logs the issue as a warning and carries on, as well as some
minor logging changes and comments touch ups.
2021-05-12 15:54:53 +02:00
Antoine R. Dumont (@ardumont)
2ff549e125
sourceforge/tasks: Allow incremental listing
Related to T3310
2021-05-07 17:04:24 +02:00
Antoine R. Dumont (@ardumont)
7282647bb2
sourceforge/lister: Add credentials parameter
The credentials parameter is not optional due to the instance constructor logic. Even if
unused, this must be provided to the lister (from the task standpoint).

Related to T3310#64801
2021-05-07 16:46:09 +02:00
Antoine Lambert
3167a6dcb7 sourceforge/tests: Ensure correct sleep function gets mocked
This ensures the mocked sleep will work with all tenacity versions.

Related to T3310
2021-05-07 14:40:13 +02:00
Antoine Lambert
1284eb1587 sourceforge/tests: Fix failing test with tenacity < 5.1
It fixes debian package build of swh-lister on buster.
2021-05-07 14:05:36 +02:00
Raphaël Gomès
3baf1d0999 Make the SourceForge lister incremental
SourceForge's sitemaps (1 main one + many sharded) give us a "last
modified" date for every subsitemap and project, allowing us to perform
an incremental listing.

We store the subsitemaps' "last modified" dates in the lister state, as
well as those of the empty projects (projects which don't have any VCS
registered), and the rest comes from the already visited origins from
the database.

The tests try to cover the possible cases of a subsitemap that has
changed, one that hasn't, a project that has changed, one that hasn't,
and same for an empty project.
2021-05-06 10:28:27 +02:00
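The incremental approach described in this commit can be illustrated with a small sketch. It only covers the sub-sitemap part and assumes a hypothetical state layout (`ListerState.subsitemap_last_modified`) and helper name (`changed_subsitemaps`); the actual lister state and sitemap parsing may look different.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Iterable, List, Tuple


@dataclass
class ListerState:
    # Hypothetical layout: sub-sitemap URL -> last seen "last modified" date.
    subsitemap_last_modified: Dict[str, date] = field(default_factory=dict)


def changed_subsitemaps(
    state: ListerState, sitemap_entries: Iterable[Tuple[str, date]]
) -> List[str]:
    """Return the sub-sitemap URLs that are new or were modified since the
    date stored in the state, and record the new dates in the state."""
    to_visit = []
    for url, last_modified in sitemap_entries:
        previous = state.subsitemap_last_modified.get(url)
        if previous is None or last_modified > previous:
            to_visit.append(url)
            state.subsitemap_last_modified[url] = last_modified
    return to_visit
```

A second call with the same entries returns an empty list, which is what makes a repeated listing cheap.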
Antoine Lambert
6f8dd5d3f2 tox: Add sphinx environments to check sane doc build
Make it possible to check that the package documentation can be built
without producing sphinx warnings.

The sphinx environment is designed to be used in continuous integration
in order to prevent breaking documentation build when committing changes.

The sphinx-dev environment is designed to be used inside a full swh
development environment.

Related to T3258
2021-04-28 14:05:20 +02:00
Valentin Lorentz
18b68bd8c7 s/REST( API)?/API/
Bitbucket's API kind of supports REST workflows, but they clearly use it
like an RPC API (the hardcoded schema in `PROJECT_API_URL_FORMAT`
makes it particularly clear)
2021-04-27 18:13:13 +02:00
Valentin Lorentz
40e1916510 Fix various Sphinx warnings 2021-04-13 21:56:08 +02:00
Valentin Lorentz
465506a0ce Remove old lister tutorial.
Sphinx complains because it's an orphan document.
2021-04-13 18:31:31 +02:00
Hezekiah Maina
d5d7830b64 Added Hezekiah Maina as a contributor 2021-04-04 23:33:22 +03:00
Hezekiah Maina
7124627400 Added the right location on linter.yml file 2021-04-04 23:22:46 +03:00
Raphaël Gomès
f7b27c6930 Add a non-incremental sourceforge lister
Following zack's work on T735, this change introduces an actual SWH lister for
SourceForge.

SourceForge provides a main sitemap that lists sharded sitemaps, which
themselves list pages. Each page belongs to a project (or sub-project,
though those are rare), information about which can be found by querying
a REST API, which gives us the list of any and all VCS used for said
project. Both sitemaps and pages have a "last modified" timestamp that
will be used in a future patch to implement incremental listing.

More precise information can be found as inline comments or docstrings.
2021-03-23 18:40:21 +01:00
Nicolas Dandrimont
879170a57d GitHub: handle edge cases with empty responses 2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
c375a61b16 GitHub: handle Server Errors
These errors happen, sometimes, when requesting large pages of results.
2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
4a215e68e0 GitHub: Move rate-limit reset logic to RateLimited exception
This makes the logic easier to test.
2021-03-19 16:52:46 +01:00
Nicolas Dandrimont
cfd4169bd8 Retry GitHub requests on ChunkEncodingErrors
These happen, sometimes, when the connection to the GitHub server
resets, e.g. because of congestion on a slow link.
2021-03-19 16:52:46 +01:00
Nicolas Dandrimont
61c1d444c5 GitHub: Move rate limit handling to the request function 2021-03-19 15:58:01 +01:00
Nicolas Dandrimont
03b10e5c83 GitHub: Start moving the request logic to a separate function 2021-03-19 15:58:01 +01:00
Nicolas Dandrimont
8f7dbb7488 GitHub: Use function for requests.Session initialization
This will help us break out the retry logic for the listing requests
themselves into a separate function too.
2021-03-19 15:58:01 +01:00
Valentin Lorentz
df73073a67 docs: Fix title syntax 2021-03-19 09:58:38 +01:00
tenma
2e17729e97
docs: Add new "howto write a lister tutorial" with unified lister api
This adds a new tutorial which details how to write new listers (either incremental
or stateless). It also proposes a Python template file for starting a new lister.

Finally, this renames the previous tutorial to tutorial-2017.

Related to T3073
2021-02-26 16:05:18 +01:00
Antoine Lambert
5b4dc289b7 debian: Update archive mirror URL templates to process
Some distributions (e.g. debian-security) have a slightly different URL
for retrieving source packages metadata.

So add a new URL template to process when trying to download such data.

Related to T3032#58239
2021-02-08 14:01:59 +01:00
Antoine Lambert
e72c15e97a docs: Update listers execution instructions
Remove outdated part about listers database and use swh CLI in
README for executing a lister instead of raw Python code.
2021-02-05 14:51:24 +01:00
Antoine Lambert
1803b707e4 cran: Prevent multiple listing of an origin
A CRAN package can appear twice in the JSON list returned by the
list_all_packages.R script, most recent version of the package
appearing first.

So handle that edge case to avoid error when sending origins to
the scheduler.
2021-02-05 14:34:37 +01:00
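The edge case handled by this commit amounts to keeping only the first occurrence of each package, since the most recent version appears first. A minimal sketch, assuming each entry is a dict with a hypothetical `"Package"` key (the real JSON field names may differ):

```python
from typing import Dict, Iterable, Iterator


def unique_packages(packages: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each package once, keeping the first (most recent) occurrence."""
    seen = set()
    for package in packages:
        name = package["Package"]  # hypothetical key name
        if name in seen:
            continue  # an older duplicate of a package already yielded
        seen.add(name)
        yield package
```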
Antoine Lambert
b4c4c20bb9 cran: Add support for parsing date with milliseconds 2021-02-05 14:32:49 +01:00
Antoine Lambert
2461c97bbb pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
xmltodict now raises an error while trying to parse the HTML content
of https://pypi.org/simple/ page.

So use BeautifulSoup HTML parser instead as it is already a requirement
of swh-lister and it does not fail parsing the PyPI HTML page.

Also drop no longer used xmltodict in requirements.
2021-02-05 14:23:11 +01:00
Antoine Lambert
4245c5046f Remove no longer used models field in dict returned by register 2021-02-02 16:33:52 +01:00
Antoine Lambert
8933544521 Remove no longer used legacy Lister API and update CLI options
Legacy Lister classes from the swh.lister.core module are no longer
used in swh-lister codebase so it is time to remove them.

Also remove lister CLI options related to legacy Lister API.

As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.

Closes T2442
2021-02-02 15:54:55 +01:00
Antoine Lambert
ff05191b7d packagist: Reimplement lister using new Lister API
The previous implementation was generating tasks for a non implemented
Packagist loader.

The new implementation extracts the source repository URL, VCS type and
last update date for each package referenced by Packagist and sends
that information to the scheduler.

Package metadata is retrieved using Packagist API endpoints whose
responses are served from static files, which are guaranteed to be
efficient on the Packagist side (no dynamic queries).
Furthermore, subsequent listings will send the "If-Modified-Since" HTTP
header to only retrieve package metadata updated since the previous
listing operation, in order to save bandwidth and return only origins
which might have new released versions.

Closes T2991
2021-02-02 14:48:47 +01:00
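The conditional request mentioned in this commit relies on the standard "If-Modified-Since" HTTP header, whose value is an HTTP date in GMT. The sketch below only builds the header; the function name `conditional_headers` is an assumption for illustration, and how the real lister stores the previous listing date is not shown here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(last_listing: Optional[datetime]) -> Dict[str, str]:
    """Build request headers for a listing; the first listing sends none.

    last_listing is expected to be a timezone-aware datetime.
    """
    if last_listing is None:
        return {}
    # HTTP dates are expressed in GMT, so normalise to UTC before formatting.
    as_utc = last_listing.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}
```

A server that honours the header answers 304 Not Modified when nothing changed, so the client can skip that package.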
Antoine Lambert
82ab96ad06 gnu: Remove dependency on pytz
The UTC timezone can be obtained from the datetime.timezone class
of the Python standard library, so remove the dependency on the external
pytz module.
2021-02-02 13:19:04 +01:00
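As a short illustration of the replacement, `datetime.timezone.utc` from the standard library is enough to build timezone-aware UTC datetimes. The helper name `timestamp_to_utc` is made up for this example and is not taken from the lister.

```python
from datetime import datetime, timezone


def timestamp_to_utc(timestamp: float) -> datetime:
    """Convert a POSIX timestamp to a timezone-aware UTC datetime."""
    # timezone.utc plays the role that pytz.utc would otherwise play.
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```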
Vincent SELLIER
8e4dd178f1
cgit: remove the repository urls' trailing /
Ensure the behavior is the same when a base url is provided or not

Related to T3013#57810
2021-02-01 17:31:08 +01:00
Antoine R. Dumont (@ardumont)
003cf5491f
pattern: Bump packet split to chunk of 1000 records
Listers like github and bitbucket should not be impacted as they already list 1000
records per page.
2021-01-29 16:55:29 +01:00
Antoine R. Dumont (@ardumont)
2e22073558
cgit: Compute origin urls out of a base git url when provided.
This adds a second behavior to the cgit lister to actually compute origin urls instead
of parsing them out of another http request on git detailed page.

This new behavior is expected to be the default behavior.

The old behavior is kept for now and is expected to be used as fallback if too many
false negatives are returned.

Related to T2999
2021-01-29 15:33:24 +01:00
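Computing an origin URL from a base git URL, rather than fetching it from the repository's detail page, could look roughly like this. The function name, parameter names, and exact joining rule are assumptions for illustration only; the lister's real logic may differ.

```python
def compute_origin_url(base_git_url: str, repository_name: str) -> str:
    """Join a base git URL and a repository name with exactly one slash."""
    # Strip slashes on both sides so the result is the same whether or not
    # the base URL was given with a trailing slash.
    return f"{base_git_url.rstrip('/')}/{repository_name.strip('/')}"
```

This avoids one HTTP request per repository, at the cost of possibly wrong URLs when an instance does not follow the assumed layout.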
Antoine Lambert
4cf0c7f765 gnu: Reimplement lister using new Lister API
Port the stateless GNU lister to the new swh.lister.pattern.Lister API
with identical functionality.

Closes T2990
2021-01-29 14:39:36 +01:00
Antoine Lambert
5aa7c8f2b2 launchpad: Remove call to dataclasses.asdict on lister state
This generates an error due to the datetime type field, so manually build
the dict instead.

Related to T3003#57551
2021-01-28 19:17:58 +01:00
Antoine Lambert
46f5a50099 launchpad: Prevent error due to origin listed twice
launchpadlib can list the last modified repository twice, so make sure to yield
a single ListedOrigin model for that special case.

Related to T3003#57551
2021-01-28 19:09:44 +01:00
Antoine R. Dumont (@ardumont)
130ad7d73e
Make debian lister constructors compatible with credentials
In effect, it just allows adding credentials to the cgit, cran and pypi listers.

This fixes instances of error [1]

[1] https://sentry.softwareheritage.org/share/issue/a5fb50f8e43e4b328c4917771576c6b0/

Related to T2998
2021-01-28 18:46:52 +01:00
Antoine Lambert
e8725eb247 launchpad/tasks: Fix ping task function name
An exception is raised when registering task types in scheduler database otherwise.
2021-01-28 17:35:40 +01:00
Antoine R. Dumont (@ardumont)
0ad37740d9
pattern: Make lister flush regularly origins to scheduler
As origins is a generator, the previous behavior would try to consume the whole
generator before sending the records.

This groups origins and sends them to the scheduler in batches of 100 for writing.

Related to T3003
2021-01-28 16:52:03 +01:00
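Flushing a generator in fixed-size groups, instead of consuming it entirely first, can be sketched with `itertools.islice`. The `batched` helper below is hypothetical and is not claimed to be the lister's own utility; the default size of 100 mirrors the figure given in the commit message.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int = 100) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        # Pull only the next `size` items; the rest stays unconsumed.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each batch can be sent to the scheduler as soon as it is produced, so memory use stays bounded by the batch size.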