658 commits

Author SHA1 Message Date
Valentin Lorentz
18b68bd8c7 s/REST( API)?/API/
Bitbucket's API kind of supports REST workflows, but they clearly use it
like an RPC API (the hardcoded schema in `PROJECT_API_URL_FORMAT`
makes it particularly clear)
2021-04-27 18:13:13 +02:00
Valentin Lorentz
40e1916510 Fix various Sphinx warnings 2021-04-13 21:56:08 +02:00
Raphaël Gomès
f7b27c6930 Add a non-incremental sourceforge lister
Following zack's work on T735, this change introduces an actual SWH lister for
SourceForge.

SourceForge provides a main sitemap that lists sharded sitemaps, which
themselves list pages. Each page belongs to a project (or sub-project,
though those are rare), information about which can be found by querying
a REST API, which gives us the list of any and all VCS used for said
project. Both sitemaps and pages have a "last modified" timestamp that
will be used in a future patch to implement incremental listing.

More precise information can be found as inline comments or docstrings.
2021-03-23 18:40:21 +01:00
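The sitemap traversal described in the commit above can be sketched as follows. This is a minimal illustration, not the lister's actual code: the helper name `iter_locations` and the example.org URLs are hypothetical, and it assumes the sitemaps follow the standard sitemaps.org XML schema.

```python
import xml.etree.ElementTree as ET
from typing import Iterator, Tuple

# Namespace of the standard sitemaps.org schema (assumed to be what the
# listed site serves).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def iter_locations(xml_text: str, tag: str) -> Iterator[Tuple[str, str]]:
    """Yield (loc, lastmod) pairs for each <sitemap> or <url> entry."""
    root = ET.fromstring(xml_text)
    for entry in root.findall(f"sm:{tag}", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS)
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS)
        yield loc.strip(), lastmod.strip()


# Hypothetical main sitemap listing one sharded sitemap.
MAIN_SITEMAP = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.org/sitemap-0.xml</loc>"
    "<lastmod>2021-03-18</lastmod></sitemap>"
    "</sitemapindex>"
)

# Hypothetical sharded sitemap listing one project page.
SHARD_SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.org/projects/demo/</loc>"
    "<lastmod>2021-03-17</lastmod></url>"
    "</urlset>"
)

shards = list(iter_locations(MAIN_SITEMAP, "sitemap"))
pages = list(iter_locations(SHARD_SITEMAP, "url"))
```

The "last modified" values are returned as plain strings here; an incremental lister would presumably parse and compare them against its stored state.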
Nicolas Dandrimont
879170a57d GitHub: handle edge cases with empty responses 2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
c375a61b16 GitHub: handle Server Errors
These errors happen, sometimes, when requesting large pages of results.
2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
4a215e68e0 GitHub: Move rate-limit reset logic to RateLimited exception
This makes the logic easier to test.
2021-03-19 16:52:46 +01:00
Nicolas Dandrimont
cfd4169bd8 Retry GitHub requests on ChunkedEncodingError
These errors happen, sometimes, when the connection to the GitHub server
resets, e.g. because of congestion on a slow link.
2021-03-19 16:52:46 +01:00
Nicolas Dandrimont
61c1d444c5 GitHub: Move rate limit handling to the request function 2021-03-19 15:58:01 +01:00
Nicolas Dandrimont
03b10e5c83 GitHub: Start moving the request logic to a separate function 2021-03-19 15:58:01 +01:00
Nicolas Dandrimont
8f7dbb7488 GitHub: Use function for requests.Session initialization
This will help us break out the retry logic for the listing requests
themselves into a separate function too.
2021-03-19 15:58:01 +01:00
Antoine Lambert
5b4dc289b7 debian: Update archive mirror URL templates to process
Some distributions (e.g. debian-security) have a slightly different URL
for retrieving source packages metadata.

So add a new URL template to process when trying to download such data.

Related to T3032#58239
2021-02-08 14:01:59 +01:00
Antoine Lambert
1803b707e4 cran: Prevent multiple listing of an origin
A CRAN package can appear twice in the JSON list returned by the
list_all_packages.R script, with the most recent version of the package
appearing first.

So handle that edge case to avoid an error when sending origins to
the scheduler.
2021-02-05 14:34:37 +01:00
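A minimal sketch of the deduplication described in the commit above. The function name `unique_packages` and the `Package`/`Version` keys are assumptions made for illustration. Because the most recent version is listed first, keeping the first occurrence of each name is enough.

```python
from typing import Dict, Iterable, Iterator


def unique_packages(packages: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each package once, keeping the first (most recent) entry."""
    seen = set()
    for package in packages:
        name = package["Package"]  # assumed key name
        if name in seen:
            # An entry for this package was already yielded: skip the older one.
            continue
        seen.add(name)
        yield package


listed = [
    {"Package": "demo", "Version": "2.0"},
    {"Package": "other", "Version": "0.1"},
    {"Package": "demo", "Version": "1.0"},
]
deduplicated = list(unique_packages(listed))
```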
Antoine Lambert
b4c4c20bb9 cran: Add support for parsing date with milliseconds 2021-02-05 14:32:49 +01:00
Antoine Lambert
2461c97bbb pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
xmltodict now raises an error while trying to parse the HTML content
of https://pypi.org/simple/ page.

So use the BeautifulSoup HTML parser instead, as it is already a requirement
of swh-lister and it does not fail when parsing the PyPI HTML page.

Also drop no longer used xmltodict in requirements.
2021-02-05 14:23:11 +01:00
Antoine Lambert
4245c5046f Remove no longer used models field in dict returned by register 2021-02-02 16:33:52 +01:00
Antoine Lambert
8933544521 Remove no longer used legacy Lister API and update CLI options
Legacy Lister classes from the swh.lister.core module are no longer
used in the swh-lister codebase, so it is time to remove them.

Also remove lister CLI options related to legacy Lister API.

As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.

Closes T2442
2021-02-02 15:54:55 +01:00
Antoine Lambert
ff05191b7d packagist: Reimplement lister using new Lister API
The previous implementation was generating tasks for a non-implemented
Packagist loader.

The new implementation extracts the source repository URL, VCS type and
last update date for each package referenced by Packagist and sends
that information to the scheduler.

Package metadata is retrieved using Packagist API endpoints whose
responses are served from static files, which are guaranteed to be
efficient on the Packagist side (no dynamic queries).
Furthermore, subsequent listing will send the "If-Modified-Since" HTTP
header to only retrieve packages metadata updated since the previous
listing operation in order to save bandwidth and return only origins
which might have new released versions.

Closes T2991
2021-02-02 14:48:47 +01:00
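A minimal sketch of how the "If-Modified-Since" header mentioned in the commit above could be built from the date of the previous listing. The helper name `conditional_headers` is hypothetical and the actual lister may build the header differently. A server honoring the header replies with HTTP 304 and no body when nothing changed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(last_listing: Optional[datetime]) -> Dict[str, str]:
    """Return the HTTP headers for a conditional request, if any."""
    if last_listing is None:
        # First listing: no condition, fetch everything.
        return {}
    # HTTP dates are expressed in GMT; a naive datetime would be
    # interpreted as local time by astimezone().
    as_utc = last_listing.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


headers = conditional_headers(datetime(2021, 2, 2, 13, 48, 47, tzinfo=timezone.utc))
```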
Antoine Lambert
82ab96ad06 gnu: Remove dependency on pytz
The UTC timezone is available as datetime.timezone.utc in the Python
standard library, so remove the dependency on the external
pytz module.
2021-02-02 13:19:04 +01:00
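For illustration, a timezone-aware UTC datetime can be built with the standard library alone. The helper name `utc_from_timestamp` is hypothetical and not taken from the lister.

```python
from datetime import datetime, timezone


def utc_from_timestamp(timestamp: int) -> datetime:
    """Convert a POSIX timestamp to an aware datetime in UTC, without pytz."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


epoch = utc_from_timestamp(0)
```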
Vincent SELLIER
8e4dd178f1
cgit: remove the repository URLs' trailing /
Ensure the behavior is the same when a base url is provided or not

Related to T3013#57810
2021-02-01 17:31:08 +01:00
Antoine R. Dumont (@ardumont)
003cf5491f
pattern: Bump packet split to chunk of 1000 records
Listers like github and bitbucket should not be impacted as they already list 1000
records per page.
2021-01-29 16:55:29 +01:00
Antoine R. Dumont (@ardumont)
2e22073558
cgit: Compute origin urls out of a base git url when provided.
This adds a second behavior to the cgit lister to actually compute origin urls instead
of parsing them from an additional http request to each repository's detail page.

This new behavior is expected to be the default behavior.

The old behavior is kept for now and is expected to be used as a fallback if too many
false negatives are returned.

Related to T2999
2021-01-29 15:33:24 +01:00
Antoine Lambert
4cf0c7f765 gnu: Reimplement lister using new Lister API
Feature-equivalent port of the stateless GNU lister to the new
swh.lister.pattern.Lister API.

Closes T2990
2021-01-29 14:39:36 +01:00
Antoine Lambert
5aa7c8f2b2 launchpad: Remove call to dataclasses.asdict on lister state
This generates an error due to the datetime type field, so manually build
the dict instead.

Related to T3003#57551
2021-01-28 19:17:58 +01:00
Antoine Lambert
46f5a50099 launchpad: Prevent error due to origin listed twice
launchpadlib can list the last modified repository twice, so make sure to yield
a single ListedOrigin model in that special case.

Related to T3003#57551
2021-01-28 19:09:44 +01:00
Antoine R. Dumont (@ardumont)
130ad7d73e
Make debian lister constructors compatible with credentials
In effect, it just allows adding credentials to cgit, cran and pypi listers.

This fixes instances of error [1]

[1] https://sentry.softwareheritage.org/share/issue/a5fb50f8e43e4b328c4917771576c6b0/

Related to T2998
2021-01-28 18:46:52 +01:00
Antoine Lambert
e8725eb247 launchpad/tasks: Fix ping task function name
An exception is raised when registering task types in scheduler database otherwise.
2021-01-28 17:35:40 +01:00
Antoine R. Dumont (@ardumont)
0ad37740d9
pattern: Make lister flush regularly origins to scheduler
As origins is a generator, the previous behavior would try to consume the whole
generator before sending the records.

This groups origins and sends them in batches of 100 to the scheduler for writing.

Related to T3003
2021-01-28 16:52:03 +01:00
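A minimal sketch of the batching described in the commit above, assuming a hypothetical helper named `batched`. Each batch is produced as soon as enough items are available, so the generator is never consumed in full before the first records are sent.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    iterator = iter(items)
    while True:
        # islice pulls only `size` items from the underlying generator.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Stand-in for a generator of origins.
origins = (f"https://example.org/repo/{i}" for i in range(250))
batch_sizes = [len(batch) for batch in batched(origins, 100)]
```

In a lister, each yielded batch would be passed to the scheduler in turn; the call doing so is left out here because its exact signature is not given in the commit message.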
Antoine Lambert
f862004700 launchpad: Reimplement lister using new Lister API
Port launchpad lister to the swh.lister.pattern.Lister API.

The last update date of each listed git repository is now sent to the scheduler.

The lister can work in incremental mode; in that case, only repositories modified
since the last listing operation will be returned.

Closes T2992
2021-01-28 15:22:40 +01:00
Antoine R. Dumont (@ardumont)
ae17b6b9a0
Make stateless lister constructors compatible with credentials
In effect, it just allows adding credentials to cgit, cran and pypi listers.

This fixes instances of error [1]

[1] https://sentry.softwareheritage.org/share/issue/2c35a9f129cf4982a2dd003a232d507a/

Related to T2998
2021-01-28 14:42:56 +01:00
Antoine R. Dumont (@ardumont)
72be074a79
gitlab: Deal with missing or trailing / in url input 2021-01-28 10:46:58 +01:00
Antoine R. Dumont (@ardumont)
17b0e7af26
cli: Make cli work with new lister
while allowing legacy lister to still run (with --legacy)
2021-01-28 09:12:56 +01:00
Antoine R. Dumont (@ardumont)
cbd2cce339
test_cli: Drop launchpad lister from the test_get_lister
Drop the launchpad lister from the listers to check, as its test setup is more involved
than that of the other listers. As its setup is not done in that test, the test actually
connects anonymously to the launchpad server. So remove that lister from the test.

This should also fix the debian build which refuses such access [1]

[1] https://jenkins.softwareheritage.org/job/debian/job/packages/job/DLS/job/gbp-buildpackage/97/console
2021-01-27 17:18:53 +01:00
Antoine R. Dumont (@ardumont)
b11b4d1001
launchpad: Actually mock the anonymous login to launchpad
The current test was failing in a debian chroot with a connection error.
2021-01-27 16:14:01 +01:00
Antoine R. Dumont (@ardumont)
461bf09973
Drop no longer used swh.lister.core.{indexing,page_by_page}_lister
The listers depending on them were ported to the new lister api.
2021-01-27 15:42:57 +01:00
Antoine R. Dumont (@ardumont)
e09ad272d7
tests: Drop unneeded reset instruction
Moreover, that instruction is not correct in the most recent requests_mock version
(it was failing the debian build)
2021-01-27 15:42:57 +01:00
Vincent SELLIER
f6f9f1ca28
cgit: Don't stop the listing when a repository page is not available
Related to T2988
2021-01-27 14:52:04 +01:00
Vincent SELLIER
91fcde8341
cgit: Add support for last_update information during listing
Related to T2988
2021-01-27 14:17:17 +01:00
Antoine Lambert
bb0184c004 debian: Reimplement lister using new Lister API
Port debian lister to `swh.lister.pattern.Lister` API.

The new implementation will produce one instance of ListedOrigin model
per package, notably containing the set of parameters expected by the
debian loader.

The lister is also stateful, meaning only new packages and those with
newly found versions since the last listing will be returned.

Closes T2979
2021-01-26 17:20:22 +01:00
tenma
6cd31769c1 tests: Remove no longer used conftest files
All the fixtures declared in them are not used anymore in the
tests of the listers ported to the new Lister API.
2021-01-26 17:09:04 +01:00
Antoine R. Dumont (@ardumont)
97254a19f2
gitlab: Implement keyset-based pagination listing
The previous pagination implementation hits a hard-coded server-side limit [1]

[1]

```
{"error":"Offset pagination has a maximum allowed offset of 50000 for requests that return objects of type Project. Remaining records can be retrieved using keyset pagination."}
```

Related to T2994
2021-01-26 16:54:14 +01:00
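With keyset pagination, the client follows the "next" URL advertised by the server instead of computing an offset. The sketch below assumes that URL is exposed in the HTTP `Link` response header. The function name `next_page_url` and the example URL are hypothetical, and the regular expression only covers the simple `<url>; rel="name"` form.

```python
import re
from typing import Optional

# Matches entries of the form: <https://...>; rel="next"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: str) -> Optional[str]:
    """Return the URL tagged rel="next" in a Link header, if any."""
    for url, rel in _LINK_RE.findall(link_header):
        if rel == "next":
            return url
    # No "next" relation: this was the last page.
    return None


example_header = (
    "<https://gitlab.example.org/api/v4/projects"
    '?id_after=42&pagination=keyset&per_page=20>; rel="next"'
)
following = next_page_url(example_header)
```

Listing then amounts to requesting the first page and following `next_page_url` until it returns `None`.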
Antoine Lambert
22eeb0956e cran: Retrieve last update date for each listed package
R package last update date can be found in the "Packaged" field of
package info returned by tools::CRAN_package_db().

So retrieve it and parse it as a datetime to provide as last_update
parameter value in ListedOrigin model.

Closes T2989
2021-01-26 15:14:32 +01:00
Antoine Lambert
6f40ab4c57 cran: Reimplement lister using new Lister API
Related to T2989
2021-01-26 15:14:32 +01:00
Antoine R. Dumont (@ardumont)
aefb260f76
gitlab: Add support for last_update information during listing 2021-01-26 14:03:20 +01:00
Antoine R. Dumont (@ardumont)
1a19b2c747
gitlab: Support authentication
Related to T2987
2021-01-26 14:03:20 +01:00
Antoine R. Dumont (@ardumont)
bea9d6d147
gitlab: make url mandatory and add type 2021-01-25 19:00:01 +01:00
Vincent SELLIER
d62e77c1b4
cgit lister: Add missing types on the init method
Related to T2984
2021-01-25 18:52:59 +01:00
Antoine Lambert
ea8ecee541 tests: Fix errors after swh-scheduler API update
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
2021-01-25 17:11:54 +01:00
tenma
b6a69b2ed9 gitea.lister: improve handling of credentials 2021-01-25 15:54:06 +01:00
tenma
c220d7d299 gitea.tests: split and make them more thorough 2021-01-25 15:54:06 +01:00
tenma
c780ad4b44 Reimplement Gitea lister using new Lister API
The lister is stateless and has full listing capability.
It can request the Gitea API using HTTP token authentication.
Rate-limiting was not encountered but is handled generically.
Added support for getting repo last update date through API.
2021-01-25 15:54:06 +01:00