Commit graph

37 commits

Author SHA1 Message Date
Antoine Lambert
a7607abcf9 tests: Fix mocking of sleep calls with tenacity 8.4.2
Latest tenacity release adds some internal changes that broke the
mocking of sleep calls in tests.

Fix it by directly mocking time.sleep (was not working previously).
2024-06-28 18:15:36 +02:00
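The approach named in a7607abcf9, mocking time.sleep directly, can be sketched with the standard library alone. This is a minimal illustration, not the project's test code: `fetch_with_retry` and `run_mocked_retry` are hypothetical names, and the retry loop stands in for whatever tenacity does internally.

```python
import time
from unittest import mock


def fetch_with_retry(fetch, attempts=3, delay=10.0):
    """Hypothetical helper: call fetch() until it succeeds, sleeping between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except ConnectionError:
            if attempt == attempts:
                raise
            time.sleep(delay)  # would block for real without the mock


def run_mocked_retry():
    # The fake fetch fails twice, then succeeds on the third call.
    fetch = mock.Mock(side_effect=[ConnectionError(), ConnectionError(), "ok"])
    # Patch time.sleep itself rather than an attribute internal to a library.
    with mock.patch("time.sleep") as mock_sleep:
        result = fetch_with_retry(fetch)
    return result, mock_sleep.call_count


if __name__ == "__main__":
    print(run_mocked_retry())
```

Patching the function on the `time` module keeps the test independent of how the retry library stores its own reference to sleep, as long as that library looks up `time.sleep` at call time.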
Antoine Lambert
4aee4da784 cran: Use pyreadr instead of rpy2 to read a RDS file from Python
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read an RDS file from Python instead of rpy2.

As swh-lister was still packaged for Debian at the time, rpy2 was chosen
instead because a Debian package is available for it, while none exists
for pyreadr.

Now that Debian packaging has been dropped for swh-lister, we can reinstate
the pyreadr-based implementation, which has the advantages of being faster
and not depending on the R language runtime.

Related to swh/meta#1709.
2023-11-14 17:09:42 +01:00
Franck Bret
f8cfa05f3f Add Julia Lister for listing Julia Packages
This module introduces the Julia lister.
It retrieves Julia package origins from the Julia General Registry, a Git
repository made of per-package directories with TOML definition files.
2023-10-09 15:05:25 +02:00
Antoine Lambert
91e4e33dd5 cran: Improve listing of R packages
Previously, the lister relied on the CRANtools R module, which has the
drawback of only listing the latest version of each package registered
in the CRAN registry.

In order to get all possible versions for each CRAN package, prefer to exploit
the content of the weekly dump of the CRAN database in RDS format.

To read the content of the RDS file from Python, the rpy2 package is used as
it has the advantage of being packaged in Debian.

Related to swh/meta#1709.
2023-08-21 16:38:08 +02:00
Antoine Lambert
3a0e8b9995 requirements.txt: Sort packages by name 2023-08-17 10:45:43 +02:00
Antoine R. Dumont (@ardumont)
573958ce64 Add Gitweb lister
Some instances need specific heuristics. Some instances:
- have summary pages which do not list metadata_url (so some
  computation happens to list git:// origins which are cloneable)
- have summary pages which reference metadata_url as multiple comma-separated URLs
- list relative URLs of the repository, so we need to join them with the main instance URL
  to get complete cloneable origins (or summary pages)
- list "down" http origins (cloning those won't work), so list those as cloneable https
  ones (when the main URL is behind https).

Refs. swh/devel/swh-lister#1800
2023-07-10 16:50:41 +02:00
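Three of the heuristics listed in 573958ce64 come down to small URL manipulations. The sketch below shows one way to express them with urllib.parse. The function names and example URLs are hypothetical and are not taken from the lister's code.

```python
from urllib.parse import urljoin, urlparse


def split_metadata_urls(value):
    """Split a comma-separated metadata_url field into individual URLs."""
    return [url.strip() for url in value.split(",") if url.strip()]


def absolute_url(instance_url, repo_url):
    """Join a possibly relative repository URL with the main instance URL."""
    return urljoin(instance_url, repo_url)


def prefer_https(instance_url, origin_url):
    """Rewrite an http origin to https when the instance itself is behind https."""
    if urlparse(instance_url).scheme == "https" and origin_url.startswith("http://"):
        return "https://" + origin_url[len("http://"):]
    return origin_url
```

For instance, `absolute_url("https://git.example.org/", "repo.git")` yields `https://git.example.org/repo.git`, while an origin that is already absolute passes through unchanged.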
KShivendu
6ad61aec23 feat(fedora): Introduce fedora lister
Summary: Lister to ingest fedora mirrors (.rpm)

Reviewers: #reviewers, vlorentz

Subscribers: vlorentz, olasd

Maniphest Tasks: T4448

Differential Revision: https://forge.softwareheritage.org/D8386
2022-11-15 15:53:52 +05:30
Antoine Lambert
108816f232 rubygems: Use gems database dump to improve listing output
Instead of using an undocumented rubygems HTTP endpoint that only
gives us the names of the gems, prefer to exploit the daily PostgreSQL
dump of the rubygems.org database.

This makes it possible to list all gems, as well as all versions of a gem
and its release artifacts. For each release artifact, the following info is
extracted: version, download URL, sha256 checksum, release date
plus a couple of extra metadata.

The lister will now set list of artifacts and list of metadata as extra
loader arguments when sending a listed origin to the scheduler database.
A last_update date is also computed, which should ensure loading tasks
for rubygems are scheduled only when new releases have become available
since the last loading.

Note that the lister spawns a temporary postgres instance, so this
requires the initdb executable from a postgres server installation to be
available in the execution environment.

Related to T1777
2022-10-07 16:54:48 +02:00
Antoine Lambert
cee6bcb514 maven: Use BeautifulSoup instead of xmltodict for parsing pom files
xmltodict cannot parse POM files with multi-byte encoding so prefer to
use the XML parser of BeautifulSoup based on lxml instead.

Also drop xmltodict requirement as it is no longer used in swh-lister
codebase.
2022-08-09 11:11:45 +02:00
Franck Bret
a6f796b268 crates.lister: Implement incremental mode:
Add incremental mode support based on a 'last_commit' state, used to get
new package versions from git diff range of commits.
2022-08-05 13:41:57 +02:00
Antoine Lambert
2fa9f0abd2 sourceforge: Fix listing of bzr projects
Fix sourceforge origin URL for bzr projects,
http://project.bzr.sourceforge.net/bzrroot/project
redirects to http://project.bzr.sourceforge.net/bzr/project.

Handle bzr projects with multiple branches, one listed origin
must be created per branch.

Discard bzr projects that no longer exist from listing.
2022-04-21 18:19:07 +02:00
Antoine Lambert
445d539b3f Remove no longer needed tenacity workarounds
Now that we have packaged tenacity 6.2 for debian buster and use it
in production, we can remove the workarounds to support tenacity < 5.
2021-12-08 13:28:11 +01:00
Boris Baldassari
8991c625ea lister: Add new maven lister
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724
2021-11-29 17:33:13 +01:00
Antoine Lambert
2461c97bbb pypi: Use BeautifulSoup for parsing HTML instead of xmltodict
xmltodict now raises an error while trying to parse the HTML content
of https://pypi.org/simple/ page.

So use BeautifulSoup HTML parser instead as it is already a requirement
of swh-lister and it does not fail parsing the PyPI HTML page.

Also drop no longer used xmltodict in requirements.
2021-02-05 14:23:11 +01:00
Antoine Lambert
8933544521 Remove no longer used legacy Lister API and update CLI options
Legacy Lister classes from the swh.lister.core module are no longer
used in swh-lister codebase so it is time to remove them.

Also remove lister CLI options related to legacy Lister API.

As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.

Closes T2442
2021-02-02 15:54:55 +01:00
Antoine Lambert
82ab96ad06 gnu: Remove dependency on pytz
The UTC timezone can be obtained from datetime.timezone in the Python
standard library, so remove the dependency on the external pytz module.
2021-02-02 13:19:04 +01:00
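The replacement described in 82ab96ad06 relies on `datetime.timezone.utc`, which gives a timezone-aware UTC value without any third-party package. A minimal sketch follows. The helper name `epoch_to_utc` is hypothetical and is not taken from the lister.

```python
from datetime import datetime, timezone


def epoch_to_utc(timestamp):
    """Convert a POSIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)


if __name__ == "__main__":
    print(epoch_to_utc(0).isoformat())
```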
Antoine Lambert
d1fbccd988 lister: Add utility decorator to ease HTTP requests rate limit handling
Add swh.lister.utils.throttling_retry decorator enabling to retry a
function that performs an HTTP request that can return a 429 status code.

The implementation is based on the tenacity module and it is assumed
that the requests library is used when querying a URL.

The default wait strategy is based on exponential backoff.

The default max number of attempts is set to 5, HTTPError exception
will then be reraised.

All tenacity.retry parameters can also be overridden in client code.
2021-01-18 11:28:51 +01:00
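The behaviour described in d1fbccd988 (retry on rate limiting, exponential backoff, five attempts, then re-raise) can be illustrated without tenacity or requests. The sketch below is a standard-library stand-in, not the swh.lister.utils implementation: `TooManyRequests`, `retry_on_429`, and the `base_delay` and `sleep` parameters are all hypothetical.

```python
import functools
import time


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP error carrying a 429 status code."""


def retry_on_429(max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry the wrapped function with exponential backoff on TooManyRequests."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TooManyRequests:
                    if attempt == max_attempts:
                        raise  # attempts exhausted: re-raise the last error
                    # Wait base_delay, then 2x, 4x, ... before the next try.
                    sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator
```

Passing the `sleep` callable as a parameter is a design choice made here to keep the sketch easy to test. A tenacity-based decorator exposes its wait and stop strategies through tenacity's own parameters instead.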
Léni Gauffier
58ef08b083 Added LaunchpadLister
Summary:
Related to T1734

From abandoned D2799

Reviewers: ardumont

Reviewed By: ardumont

Differential Revision: https://forge.softwareheritage.org/D2974
2020-04-12 01:00:12 +02:00
Antoine R. Dumont (@ardumont)
c6372eea7e gnu.lister: Unify timestamp formats to isoformat date in model
Related T2023
2019-11-04 10:08:01 +01:00
Archit Agrawal
b972a2a88d swh.lister.cgit
Implemented a lister to list the repos for a given CGit instance.

Closes T1659
2019-06-28 19:27:25 +05:30
David Douard
c2c26d7e46 Fix the bitbucket lister; handle properly the date-like bounds 2019-02-01 15:38:11 +01:00
Nicolas Dandrimont
2922b68570 Clean up dependencies to enable tests on build 2017-10-30 17:04:49 +01:00
Stefano Zacchiroli
164922afe2 requirements.txt: add missing dependency on "arrow" 2017-09-05 10:54:54 +02:00
Antoine Pietri
c81c7de88c requirements: remove celery (already required by swh.scheduler) 2017-04-12 15:21:22 +02:00
Avi Kelman (fiendish)
68d77fd43f Refactor lister code
Streamline production of new listers by aggressively moving core
functionality into progressively inherited (A->B->C) base classes
with the transport layer abstracted.
This should make common individual forge listers straightforward to
produce with minimal customization. Github and Bitbucket listers
can be used as examples of the indexing type.
2017-03-06 12:35:49 +01:00
Antoine Pietri
ede9e5048c requirements: split internal and external requirements in two separate files 2017-02-09 14:32:02 +01:00
Antoine R. Dumont (@ardumont)
b217f55cfe Update storage configuration reading
Related T613
2016-12-15 19:07:02 +01:00
Nicolas Dandrimont
d2483e7893 requirements.txt: use proper syntax 2016-10-20 17:27:26 +02:00
Nicolas Dandrimont
9ba8fedc4c base: add implementation for adding origins 2016-10-19 16:53:32 +02:00
Nicolas Dandrimont
ca4d346451 requirements.txt: Add inter-swh dependencies 2016-10-19 15:41:26 +02:00
Nicolas Dandrimont
2a62db6827 Revert to the pre-qless refactoring version 2016-09-13 14:57:26 +02:00
Nicolas Dandrimont
52f9fd157e sync packaging metadata 2016-03-17 19:01:10 +01:00
Nicolas Dandrimont
6fbabbe586 req_queue: use qless instead of a handmade queue 2016-03-17 17:50:03 +01:00
Nicolas Dandrimont
c7871e44e8 requirements.txt: add redis 2016-03-17 17:45:59 +01:00
Nicolas Dandrimont
533f6fa1a3 swh.lister.github: Refactor to use swh.storage instead of sqlalchemy 2016-03-09 19:03:35 +01:00
Stefano Zacchiroli
ecca87dccf requirements.txt: add dependency on requests 2015-09-21 21:13:04 +02:00
Stefano Zacchiroli
376141397d add requirements.txt, listing sqlalchemy as dependency 2015-09-21 21:11:45 +02:00