A CRAN package can appear twice in the JSON list returned by the
list_all_packages.R script, with the most recent version of the
package appearing first.
So handle that edge case to avoid an error when sending origins to
the scheduler.
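A minimal sketch of one way to handle the duplicate entries, keeping
only the first (most recent) occurrence of each package. The function
name and the "Package" key are assumptions made for illustration, not
names taken from the lister code.

```python
from typing import Any, Dict, Iterable, Iterator


def drop_older_duplicates(
    packages: Iterable[Dict[str, Any]]
) -> Iterator[Dict[str, Any]]:
    """Yield each package once, keeping the first entry seen.

    Assumes the most recent version of a package comes first in the
    input and that the package name is stored under the "Package" key
    (hypothetical key name).
    """
    seen = set()
    for package in packages:
        name = package["Package"]
        if name in seen:
            # An older version of an already listed package: skip it.
            continue
        seen.add(name)
        yield package
```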
xmltodict now raises an error while trying to parse the HTML content
of https://pypi.org/simple/ page.
So use the BeautifulSoup HTML parser instead, as it is already a
requirement of swh-lister and it does not fail when parsing the PyPI
HTML page.
Also drop the no longer used xmltodict from the requirements.
Legacy Lister classes from the swh.lister.core module are no longer
used in the swh-lister codebase, so it is time to remove them.
Also remove the lister CLI options related to the legacy Lister API.
As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.
Closes T2442
The previous implementation was generating tasks for an unimplemented
Packagist loader.
The new implementation extracts the source repository URL, VCS type and
last update date for each package referenced by Packagist and sends
that information to the scheduler.
Package metadata is retrieved using Packagist API endpoints whose
responses are served from static files, which are guaranteed to be
efficient on the Packagist side (no dynamic queries).
Furthermore, subsequent listings will send the "If-Modified-Since" HTTP
header to retrieve only the package metadata updated since the previous
listing operation, in order to save bandwidth and return only origins
which might have newly released versions.
Closes T2991
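A minimal sketch of how the "If-Modified-Since" header mentioned above
could be built from the date of the previous listing. The function name
is hypothetical; only standard library calls are used. A server that
honours the header answers with HTTP status 304 when nothing changed.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Optional


def conditional_headers(last_listing: Optional[datetime]) -> Dict[str, str]:
    """Return the HTTP headers for a conditional metadata request.

    `last_listing` is assumed to be a timezone-aware datetime, or None
    for a first (full) listing.
    """
    if last_listing is None:
        # No previous listing: request the metadata unconditionally.
        return {}
    # HTTP dates are expressed in GMT, e.g. "Fri, 15 Jan 2021 12:00:00 GMT".
    as_utc = last_listing.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}
```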
This adds a second behavior to the cgit lister, which computes origin URLs instead
of parsing them out of an additional HTTP request to each repository's detail page.
This new behavior is expected to be the default behavior.
The old behavior is kept for now and is expected to be used as a fallback if too many
false negatives are returned.
Related to T2999
As origins is a generator, the previous behavior would try to consume the whole
generator before sending the records.
This groups origins into batches of 100 and sends each batch to the scheduler for writing.
Related to T3003
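The batching described above can be sketched as follows. This is an
illustration under assumptions, not the lister's actual code: the
helper name `batched` is hypothetical and the scheduler call is left
out.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int = 100) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without consuming the whole
    input up front."""
    iterator = iter(items)
    while True:
        # Pull at most `size` items from the underlying generator.
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch
```

Each yielded batch could then be passed to the scheduler in one call,
so memory use stays bounded by the batch size.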
Port the launchpad lister to the swh.lister.pattern.Lister API.
The last update date of each listed git repository is now sent to the scheduler.
The lister can work in incremental mode; in that case, only repositories
modified since the last listing operation will be returned.
Closes T2992
Drop the launchpad lister from the listers to check, as its test setup is more involved
than that of the other listers. As its setup is not done in that test, it is actually
connecting anonymously to the launchpad server. So remove that lister from the test.
This should also fix the debian build, which refuses such access [1]
[1] https://jenkins.softwareheritage.org/job/debian/job/packages/job/DLS/job/gbp-buildpackage/97/console
Port the debian lister to the `swh.lister.pattern.Lister` API.
The new implementation will produce one instance of the ListedOrigin
model per package, notably containing the set of parameters expected by
the debian loader.
The lister is also stateful, meaning only new packages and those with
newly found versions since the last listing will be returned.
Closes T2979
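A rough sketch of the stateful behavior described above, assuming the
state is a mapping from package name to the set of versions already
seen. The function name and the state layout are hypothetical and may
differ from what the lister actually stores.

```python
from typing import Dict, Iterable, List, Set, Tuple


def packages_to_return(
    state: Dict[str, Set[str]], listed: Iterable[Tuple[str, str]]
) -> List[str]:
    """Return names of packages that are new or have unseen versions.

    `listed` yields (package name, version) pairs; `state` is updated
    in place so that the next listing only reports later changes.
    """
    updated: List[str] = []
    for name, version in listed:
        known_versions = state.setdefault(name, set())
        if version not in known_versions:
            known_versions.add(version)
            if name not in updated:
                updated.append(name)
    return updated
```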
The previous offset-based pagination implementation runs into a hard-coded server-side limit [1]
[1]
```
{"error":"Offset pagination has a maximum allowed offset of 50000 for requests that return objects of type Project. Remaining records can be retrieved using keyset pagination."}
```
Related to T2994
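The error above points to keyset pagination as the way past the offset
limit. A minimal sketch, assuming the server advertises the next page
in a `Link` response header with `rel="next"`; the function name and
the example URL in the test are hypothetical.

```python
import re
from typing import Optional

# Matches entries such as: <https://host/path?x=1>; rel="next"
LINK_PATTERN = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header: Optional[str]) -> Optional[str]:
    """Extract the URL of the next page from a `Link` header value.

    Returns None when the header is missing or has no rel="next"
    entry, which signals that the last page has been reached.
    """
    if not link_header:
        return None
    for url, rel in LINK_PATTERN.findall(link_header):
        if rel == "next":
            return url
    return None
```

The listing loop would then keep requesting `next_page_url(...)` until
it returns None, instead of incrementing a page offset.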
The last update date of an R package can be found in the "Packaged"
field of the package info returned by tools::CRAN_package_db().
So retrieve it and parse it as a datetime to provide as the last_update
parameter value in the ListedOrigin model.
Closes T2989
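A minimal sketch of the parsing step, assuming the "Packaged" value
looks like "2020-11-18 10:24:19 UTC; username". That format and the
function name are assumptions, so unparsable values are mapped to None
rather than raising.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_packaged_date(packaged: Optional[str]) -> Optional[datetime]:
    """Parse the timestamp part of a "Packaged" field value.

    Returns a timezone-aware UTC datetime, or None when the field is
    missing or does not match the assumed format.
    """
    if not packaged:
        return None
    # Keep only the part before the optional "; username" suffix.
    timestamp = packaged.split(";", 1)[0].strip()
    try:
        parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S UTC")
    except ValueError:
        return None
    return parsed.replace(tzinfo=timezone.utc)
```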
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
The lister is stateless and has full listing capability.
It can request the Gitea API using HTTP token authentication.
Rate-limiting was not encountered but is handled generically.
Added support for getting repo last update date through API.