Commit graph

781 commits

Author SHA1 Message Date
Antoine Lambert
e8725eb247 launchpad/tasks: Fix ping task function name
An exception is raised when registering task types in scheduler database otherwise.
2021-01-28 17:35:40 +01:00
Antoine R. Dumont (@ardumont)
0ad37740d9
pattern: Make lister flush regularly origins to scheduler
As origins is a generator, the previous behavior would try to consume the whole
generator before sending the records.

This groups origins into batches of 100 and sends each batch to the scheduler for writing.

Related to T3003
2021-01-28 16:52:03 +01:00
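The batching described in the commit above can be sketched as follows. This is a minimal illustration, not the code from swh.lister: `grouper`, `send_origins` and `record_batch` are hypothetical names, and only the batch size of 100 comes from the commit message.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouper(iterable: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield lists of at most n items, pulling from the iterable lazily."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


def send_origins(
    origins: Iterable[T],
    record_batch: Callable[[List[T]], None],  # hypothetical scheduler write call
    batch_size: int = 100,
) -> int:
    """Send origins batch by batch and return how many were sent."""
    count = 0
    for batch in grouper(origins, batch_size):
        record_batch(batch)
        count += len(batch)
    return count
```

Because each batch is written as soon as it is full, a long-running generator never has to be held in memory in its entirety.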
Antoine Lambert
f862004700 launchpad: Reimplement lister using new Lister API
Port launchpad lister to the swh.lister.pattern.Lister API.

Last update date of each listed git repository is now sent to the scheduler.

The lister can work in incremental mode; in that case, only repositories modified
since the last listing operation will be returned.

Closes T2992
2021-01-28 15:22:40 +01:00
Antoine R. Dumont (@ardumont)
ae17b6b9a0
Make stateless lister constructors compatible with credentials
In effect, it just allows adding credentials to cgit, cran and pypi listers.

This fixes instances of error [1]

[1] https://sentry.softwareheritage.org/share/issue/2c35a9f129cf4982a2dd003a232d507a/

Related to T2998
2021-01-28 14:42:56 +01:00
Antoine R. Dumont (@ardumont)
72be074a79
gitlab: Deal with missing or trailing / in url input 2021-01-28 10:46:58 +01:00
Antoine R. Dumont (@ardumont)
3bede83d0f
tox.ini: Work around build failure due to upstream release
This fixes the master build [1]

[1] https://jenkins.softwareheritage.org/job/DLS/job/tests/1210/console
2021-01-28 10:46:27 +01:00
Antoine R. Dumont (@ardumont)
17b0e7af26
cli: Make cli work with new lister
while allowing legacy lister to still run (with --legacy)
2021-01-28 09:12:56 +01:00
Antoine R. Dumont (@ardumont)
cbd2cce339
test_cli: Drop launchpad lister from the test_get_lister
Drop launchpad lister from the listers to check, as its test setup is more involved than
that of the other listers. As its setup is not done in that test, it's actually connecting
anonymously to the launchpad server. So remove that lister from the test.

This should also fix the debian build which refuses such access [1]

[1] https://jenkins.softwareheritage.org/job/debian/job/packages/job/DLS/job/gbp-buildpackage/97/console
2021-01-27 17:18:53 +01:00
Antoine R. Dumont (@ardumont)
b11b4d1001
launchpad: Actually mock the anonymous login to launchpad
This test was failing in a debian chroot with a connection error.
2021-01-27 16:14:01 +01:00
Antoine R. Dumont (@ardumont)
461bf09973
Drop no longer used swh.lister.core.{indexing,page_by_page}_lister
The listers depending on them got ported to the new lister api.
2021-01-27 15:42:57 +01:00
Antoine R. Dumont (@ardumont)
e09ad272d7
tests: Drop unneeded reset instruction
Plus that instruction is not correct in the most recent requests_mock version (failing the
debian build)
2021-01-27 15:42:57 +01:00
Vincent SELLIER
f6f9f1ca28
cgit: Don't stop the listing when a repository page is not available
Related to T2988
2021-01-27 14:52:04 +01:00
Vincent SELLIER
91fcde8341
cgit: Add support for last_update information during listing
Related to T2988
2021-01-27 14:17:17 +01:00
Antoine Lambert
bb0184c004 debian: Reimplement lister using new Lister API
Port debian lister to `swh.lister.pattern.Lister` API.

The new implementation will produce one instance of ListedOrigin model
per package, notably containing the set of parameters expected by the
debian loader.

The lister is also stateful, meaning only new packages and those with
newly found versions since the last listing will be returned.

Closes T2979
2021-01-26 17:20:22 +01:00
tenma
6cd31769c1 tests: Remove no longer used conftest files
All the fixtures declared in them are not used anymore in the
tests of the listers ported to the new Lister API.
2021-01-26 17:09:04 +01:00
Antoine R. Dumont (@ardumont)
97254a19f2
gitlab: Implement keyset-based pagination listing
The previous pagination implementation has a hard-coded limit server side [1]

[1]

```
{"error":"Offset pagination has a maximum allowed offset of 50000 for requests that return objects of type Project. Remaining records can be retrieved using keyset pagination."}
```

Related to T2994
2021-01-26 16:54:14 +01:00
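The general idea behind keyset pagination, as opposed to offset pagination, can be sketched as follows. This is an illustration under stated assumptions, not the lister's code: `fetch_page(id_after, per_page)` is a hypothetical callable standing in for the HTTP request, and results are assumed to be sorted by ascending `id`.

```python
from typing import Callable, Dict, Iterator, List

Project = Dict[str, int]


def list_projects_keyset(
    fetch_page: Callable[[int, int], List[Project]],  # hypothetical API call
    per_page: int = 20,
) -> Iterator[Project]:
    """Iterate over all projects by asking for items after the last id seen."""
    last_id = 0
    while True:
        page = fetch_page(last_id, per_page)
        if not page:
            return
        yield from page
        # The key of the last item replaces the offset, so no server-side
        # offset limit applies however deep the listing goes.
        last_id = page[-1]["id"]
```

Each request depends only on the last key seen, which is why the offset cap quoted in the error message above does not come into play.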
Antoine Lambert
22eeb0956e cran: Retrieve last update date for each listed package
R package last update date can be found in the "Packaged" field of
package info returned by tools::CRAN_package_db().

So retrieve it and parse it as a datetime to provide as last_update
parameter value in ListedOrigin model.

Closes T2989
2021-01-26 15:14:32 +01:00
Antoine Lambert
6f40ab4c57 cran: Reimplement lister using new Lister API
Related to T2989
2021-01-26 15:14:32 +01:00
Antoine R. Dumont (@ardumont)
aefb260f76
gitlab: Add support for last_update information during listing 2021-01-26 14:03:20 +01:00
Antoine R. Dumont (@ardumont)
1a19b2c747
gitlab: Support authentication
Related to T2987
2021-01-26 14:03:20 +01:00
Antoine R. Dumont (@ardumont)
bea9d6d147
gitlab: make url mandatory and add type 2021-01-25 19:00:01 +01:00
Vincent SELLIER
d62e77c1b4
cgit lister: Add missing types on the init method
Related to T2984
2021-01-25 18:52:59 +01:00
Antoine Lambert
ea8ecee541 tests: Fix errors after swh-scheduler API update
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
2021-01-25 17:11:54 +01:00
tenma
b6a69b2ed9 gitea.lister: improve handling of credentials 2021-01-25 15:54:06 +01:00
tenma
c220d7d299 gitea.tests: split and make them more thorough 2021-01-25 15:54:06 +01:00
tenma
c780ad4b44 Reimplement Gitea lister using new Lister API
The lister is stateless and has full listing capability.
It can request the Gitea API using HTTP token authentication.
Rate-limiting was not encountered but is handled generically.
Added support for getting repo last update date through API.
2021-01-25 15:54:06 +01:00
tenma
7892077a89 tests.cli: add Gitea lister mandatory params 2021-01-25 15:54:06 +01:00
Antoine R. Dumont (@ardumont)
02871f16c9
gitlab: Adapt celery task implementations to the new lister api
Related to T2987
2021-01-25 15:08:31 +01:00
Vincent SELLIER
e4a590fc7f
Port cgit lister to the new lister api
Related to T2984
2021-01-25 14:57:45 +01:00
Antoine Lambert
59c9abb916 bitbucket: Pick random credentials in configuration and improve logging
Use random credentials from the list in configuration and improve related
logging messages.
2021-01-25 14:34:22 +01:00
Antoine R. Dumont (@ardumont)
ce87a8f7b2
gitlab: Let the lister compute the internal project listing page
Related to T2987
2021-01-25 14:05:34 +01:00
Antoine R. Dumont (@ardumont)
7f1609265f
test: Rename internal method to something public
It's used in multiple module tests now.
2021-01-25 13:39:07 +01:00
Antoine R. Dumont (@ardumont)
d3fe3d5747
gitlab: Fix mypy issue 2021-01-25 13:39:07 +01:00
Antoine R. Dumont (@ardumont)
2246d28606
gitlab: Document the lister constructor parameters
Related to T2987
2021-01-25 13:30:45 +01:00
Antoine R. Dumont (@ardumont)
b352b8e11e
gitlab: Add test on rate-limit support
Related to T2987
2021-01-25 09:23:22 +01:00
Antoine R. Dumont (@ardumont)
1f911401a1
gitlab: Add test on incremental implementation
Note that the current implementation will restart the new visit from the last
next_page link seen (that's what is stored in the lister state to avoid recomputing
the url). This means that this page will be seen at least twice, on the first visit
and on the next. This should not pose any problems as the listing is idempotent.

Related to T2987
2021-01-25 08:51:23 +01:00
Antoine R. Dumont (@ardumont)
84dd616ab6
gitlab: Add test on pagination
Related to T2987
2021-01-25 08:51:23 +01:00
Antoine R. Dumont (@ardumont)
1390a513f2
gitlab: Port to the new lister api
Related to T2987
2021-01-25 08:51:16 +01:00
Antoine Lambert
ff232f0d91 npm: Reimplement lister using new Lister API
Port npm lister to `swh.lister.pattern.Lister` API.

As before, the lister can be run in full or incremental mode.
When using incremental mode, only packages created or modified since
the last incremental listing will be returned.
Otherwise, all packages will be listed in lexicographical order.

One major improvement: the latest package update date is now
retrieved when available and sent to the scheduler database.

Closes T2972
2021-01-22 10:58:52 +01:00
tenma
5411141e3a gitlab.tests: fix erroneous import from gitea module 2021-01-21 16:17:36 +01:00
tenma
6d22007946 pypi.tests: simplify requests_mock invocation 2021-01-21 16:17:29 +01:00
tenma
62c825b8cb Reimplement PyPI lister using new Lister API
The new lister has only full listing capability.
It scrapes pypi.org list of packages.
Rate-limiting was not encountered but is handled generically.
2021-01-20 15:45:16 +01:00
tenma
565e7423e3 Reimplement Bitbucket lister using new Lister API
The new lister has incremental and full listing capability.
It can request the Bitbucket API in anonymous and HTTP basic authentication
modes. Rate-limiting is not aggressive and is handled.
2021-01-20 15:28:34 +01:00
Antoine Lambert
9fd91f007d pattern: Fix and improve config overriding in from_configfile method
Fix error when a configuration value loaded from a config file is also
given as keyword parameter to the from_configfile method.

Override configuration loaded from config file only if the provided
value is not None.
2021-01-18 17:55:53 +01:00
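The overriding rule stated in the commit above can be sketched like this. It is a minimal illustration, assuming the configuration is a plain dict; `merge_config` is a hypothetical helper, not the actual `from_configfile` implementation.

```python
from typing import Any, Dict


def merge_config(file_config: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Return the file configuration, overridden only by non-None keyword values."""
    merged = dict(file_config)  # copy, so the loaded configuration is left untouched
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged
```

Merging into a single dict before building the lister also means a key present both in the file and in the keyword arguments is passed only once.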
Antoine Lambert
a41c03e4c8 phabricator: Ensure request errors are raised as exceptions
This ensures that a celery task will be marked as failed if a request error
happens when listing origins.
2021-01-18 12:11:26 +01:00
Antoine Lambert
b743c36496 phabricator: Add test for new lister implementation
Also remove no longer used JSON files.
2021-01-18 12:11:26 +01:00
Antoine Lambert
d691c04eb8 phabricator: Allow passing forge base URL as lister parameter 2021-01-18 12:11:26 +01:00
Antoine Lambert
d1fbccd988 lister: Add utility decorator to ease HTTP requests rate limit handling
Add swh.lister.utils.throttling_retry decorator, which retries a
function performing an HTTP request that can return a 429 status code.

The implementation is based on the tenacity module and it is assumed
that the requests library is used when querying a URL.

The default wait strategy is based on exponential backoff.

The default maximum number of attempts is 5; after that, the HTTPError
exception is reraised.

All tenacity.retry parameters can also be overridden in client code.
2021-01-18 11:28:51 +01:00
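The retry behavior described above (exponential backoff, at most 5 attempts, then reraise) can be illustrated with a dependency-free sketch. Note that this deliberately does not use tenacity or requests, unlike the decorator in the commit: `retry_on_429` and `TooManyRequests` are hypothetical names used only for illustration.

```python
import functools
import time


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP error carrying a 429 status code."""


def retry_on_429(max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry the decorated function on TooManyRequests with exponential backoff."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TooManyRequests:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: let the error propagate
                    # Wait base_delay, then 2x, 4x, ... before the next attempt.
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

The `sleep` parameter is injectable so the waiting can be replaced in tests; that is a choice made for this sketch only.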
Antoine Lambert
c782275296 phabricator/tasks: Fix task function return type
Previously, the following error was raised when the task has finished
its execution: "Object of type ListerStats is not JSON serializable".

So ensure ListerStats object gets converted to dict before returning it.

Also add missing test for task function.
2021-01-11 17:59:24 +01:00
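The serialization problem and its fix can be reproduced with a small sketch. The `ListerStats` dataclass below is a simplified stand-in with assumed fields, and `list_task` is a hypothetical task function; neither is the actual swh.lister code.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ListerStats:
    """Simplified stand-in for the lister statistics object (fields are assumed)."""

    pages: int = 0
    origins: int = 0


def list_task() -> dict:
    """Hypothetical task function: return the stats as a plain dict."""
    stats = ListerStats(pages=2, origins=150)
    # A dataclass instance is not JSON serializable, but its dict form is.
    return asdict(stats)


print(json.dumps(list_task()))
```

Returning the dataclass instance itself would make `json.dumps` raise a `TypeError`, which matches the error message quoted in the commit.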
Antoine Lambert
b48f71ff93 phabricator/tasks: Allow passing api_token as optional parameter
This is useful when one wants to test the lister in docker environment.
2021-01-11 17:53:11 +01:00