The latest beautifulsoup4 release (4.13) seems to have fixed issues
related to unexpected encodings in XML files, so a test that was
previously passing is now failing.
Update that test to check that origin URL and visit type can be
successfully extracted from a POM file with an unexpected encoding.
This new and special lister makes it possible to verify a list of origins
to archive provided by users (for instance through the Web API).
Its purpose is to avoid polluting the scheduler database with origins
that cannot be loaded into the archive.
Each origin is identified by a URL and a visit type. For a given visit
type, the lister checks whether the origin URL can be found and whether
the visit type is valid.
The supported visit types are those for VCS (bzr, cvs, hg, git and svn)
plus the one for loading tarball content into the archive.
Accepted origins are inserted or upserted in the scheduler database.
Rejected origins are stored in the lister state.
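As an illustration, here is a minimal sketch of what such a check could
look like for two visit types; the function name and the tarball visit
type label are hypothetical, not the lister's actual API:

    import subprocess

    import requests


    def check_origin(origin_url: str, visit_type: str) -> bool:
        """Return True if the origin looks loadable for the visit type."""
        if visit_type == "git":
            # a git origin is considered valid if its refs can be listed
            result = subprocess.run(
                ["git", "ls-remote", origin_url],
                capture_output=True,
                timeout=30,
            )
            return result.returncode == 0
        if visit_type == "tarball":  # hypothetical label for that loader
            # a tarball origin is considered valid if its URL is reachable
            response = requests.head(
                origin_url, allow_redirects=True, timeout=30
            )
            return response.ok
        return False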
Related to #4709
packaging.version.parse is dedicated to parsing Python package version
numbers, but crate versions do not necessarily respect Python version
number conventions, so some crate versions cannot be parsed.
Prefer to use looseversion.LooseVersion2 instead, which is a drop-in
replacement for the deprecated distutils.version.LooseVersion and can
parse all kinds of version numbers.
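A small sketch contrasting the two parsers (the crate version string
below is hypothetical):

    from looseversion import LooseVersion2
    from packaging.version import InvalidVersion, Version

    crate_version = "0.1.0-ac.1"  # hypothetical crate version string

    try:
        Version(crate_version)
    except InvalidVersion:
        pass  # "ac" is not a valid PEP 440 pre-release marker

    # LooseVersion2 parses it fine and keeps versions comparable
    assert LooseVersion2(crate_version) > LooseVersion2("0.0.9")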
The latest tenacity release introduced internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep (an approach that was not working
previously).
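A minimal sketch of the approach, assuming pytest-mock provides the
mocker fixture (the decorated function is a stand-in, not lister code):

    from tenacity import retry, stop_after_attempt, wait_fixed


    @retry(stop=stop_after_attempt(3), wait=wait_fixed(10), reraise=True)
    def flaky():
        raise ValueError("transient failure")


    def test_sleeps_are_mocked(mocker):
        # patching time.sleep directly keeps the test instantaneous,
        # whatever sleep helper tenacity wraps around it internally
        mocked_sleep = mocker.patch("time.sleep")
        try:
            flaky()
        except ValueError:
            pass
        assert mocked_sleep.call_count == 2  # two waits, three attempts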
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read an RDS file from Python instead of rpy2.
As swh-lister was still packaged for debian at the time, the choice was
made to use rpy2 instead, as a debian package is available for it while
none exists for pyreadr.
Now that debian packaging has been dropped for swh-lister, we can
reinstate the pyreadr-based implementation, which has the advantages of
being faster and of not depending on the R language runtime.
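A minimal sketch of the pyreadr-based reading, assuming a local copy of
the dump (filename hypothetical):

    import pyreadr

    # read_r returns an ordered dict mapping R object names to pandas
    # DataFrames; an RDS file holds a single unnamed object, keyed by None
    result = pyreadr.read_r("packages.rds")
    packages_df = result[None]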
Related to swh/meta#1709.
This module introduces the Julia lister.
It retrieves Julia package origins from the Julia General Registry, a Git
repository made of per-package directories containing TOML definition
files.
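For instance, the Git repository URL of a package can be read from its
Package.toml file with the standard tomllib module (the path below is an
example):

    import tomllib  # standard library since Python 3.11

    # each package directory in the General registry holds a Package.toml
    # with at least "name", "uuid" and "repo" entries
    with open("General/A/AbstractTrees/Package.toml", "rb") as f:
        package = tomllib.load(f)

    origin_url = package["repo"]  # Git repository URL of the package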
Previously, the lister relied on the CRANtools R module, but it has the
drawback of only listing the latest version of each package registered
in the CRAN registry.
In order to get all possible versions of each CRAN package, prefer to
exploit the content of the weekly dump of the CRAN database in RDS
format.
To read the content of the RDS file from Python, the rpy2 package is
used, as it has the advantage of being packaged in debian.
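A minimal sketch of reading the dump through rpy2 (path hypothetical):

    from rpy2 import robjects

    # call R's readRDS from Python; the result is an R data frame that
    # can then be converted (e.g. with rpy2.robjects.pandas2ri)
    read_rds = robjects.r["readRDS"]
    cran_db = read_rds("packages.rds")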
Related to swh/meta#1709.
Depending on the instance, some specific heuristics are needed; some
instances:
- have summary pages which do not list metadata_url (so some computation
  happens to list git:// origins which are cloneable)
- have summary pages which reference metadata_url as multiple
  comma-separated URLs
- list relative URLs for repositories, so we need to join them with the
  main instance URL to obtain complete cloneable origins (or summary
  pages)
- list "down" http origins (cloning those won't work), so we list those
  as cloneable https ones (when the main URL is behind https), as
  sketched below
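A rough sketch of the last two heuristics (the helper name is
hypothetical):

    from urllib.parse import urljoin, urlparse


    def fix_origin_url(instance_url: str, repo_url: str) -> str:
        # relative repository URL: join it with the main instance URL
        if not urlparse(repo_url).scheme:
            return urljoin(instance_url, repo_url)
        # http origin on an https instance: assume the https variant works
        if repo_url.startswith("http://") and instance_url.startswith(
            "https://"
        ):
            return "https://" + repo_url[len("http://"):]
        return repo_url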
Refs. swh/devel/swh-lister#1800
Instead of using an undocumented rubygems HTTP endpoint that only
gives us the names of the gems, prefer to exploit the daily PostgreSQL
dump of the rubygems.org database.
It makes it possible to list all gems, but also all versions of a gem
and its release artifacts. For each release artifact, the following
information is extracted: version, download URL, sha256 checksum,
release date, plus a couple of extra metadata.
The lister will now set the list of artifacts and the list of metadata
as extra loader arguments when sending a listed origin to the scheduler
database.
A last_update date is also computed, which should ensure loading tasks
for rubygems will be scheduled only when new releases are available
since the last loading.
To be noted, the lister will spawn a temporary postgres instance, so
this requires the initdb executable from a postgres server installation
to be available in the execution environment.
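A rough sketch of the temporary postgres setup (the restore step is
elided):

    import subprocess
    import tempfile

    datadir = tempfile.mkdtemp()

    # initialize a throwaway cluster; initdb must come from a postgres
    # server installation available in the execution environment
    subprocess.run(["initdb", "-D", datadir, "--no-sync"], check=True)

    # start it on a unix socket in the data directory only, no TCP port
    subprocess.run(
        ["pg_ctl", "-D", datadir, "-o", f"-k {datadir} -h ''", "-w", "start"],
        check=True,
    )

    # ... restore the rubygems.org dump, run the listing queries ...

    subprocess.run(["pg_ctl", "-D", datadir, "-w", "stop"], check=True)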
Related to T1777
xmltodict cannot parse POM files with multi-byte encoding, so prefer to
use the XML parser of BeautifulSoup, based on lxml, instead.
Also drop the xmltodict requirement as it is no longer used in the
swh-lister codebase.
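A minimal sketch of the parsing, assuming lxml is installed (filename
hypothetical):

    from bs4 import BeautifulSoup

    # features="xml" selects lxml's XML parser, which handles multi-byte
    # encodings declared in the XML prolog
    with open("project.pom", "rb") as f:
        pom = BeautifulSoup(f.read(), features="xml")

    artifact_id = pom.find("artifactId")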
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all source archives
and pom files in the maven repository. The pom files are then downloaded
and analysed to find and yield any scm reference.
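A rough sketch of that analysis step, using xmltodict as the parser
(filename hypothetical):

    import xmltodict

    with open("project.pom", "rb") as f:
        project = xmltodict.parse(f.read()).get("project", {})

    # e.g. "scm:git:https://github.com/example/project.git"
    scm_connection = (project.get("scm") or {}).get("connection")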
Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.
Related to T1724
xmltodict now raises an error while trying to parse the HTML content
of the https://pypi.org/simple/ page.
So use the BeautifulSoup HTML parser instead, as it is already a
requirement of swh-lister and it does not fail to parse the PyPI HTML
page.
Also drop the no longer used xmltodict from requirements.
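A minimal sketch of the parsing (the html.parser backend here is an
assumption, any BeautifulSoup backend works):

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://pypi.org/simple/")
    page = BeautifulSoup(response.text, features="html.parser")

    # each <a> tag of the simple index links to a project page
    project_names = [a.text for a in page.find_all("a")]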
Legacy Lister classes from the swh.lister.core module are no longer
used in the swh-lister codebase, so it is time to remove them.
Also remove the lister CLI options related to the legacy Lister API.
As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.
Closes T2442
Add a swh.lister.utils.throttling_retry decorator enabling to retry a
function that performs an HTTP request which can return a 429 status
code.
The implementation is based on the tenacity module, and it is assumed
that the requests library is used when querying a URL.
The default wait strategy is based on exponential backoff.
The default maximum number of attempts is set to 5, after which the
HTTPError exception will be reraised.
All tenacity.retry parameters can also be overridden in client code.
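A minimal sketch of how such a decorator can be assembled with tenacity
(not the exact swh.lister.utils implementation):

    import requests
    from tenacity import (
        retry,
        retry_if_exception,
        stop_after_attempt,
        wait_exponential,
    )


    def is_throttling_exception(exc: BaseException) -> bool:
        return (
            isinstance(exc, requests.exceptions.HTTPError)
            and exc.response is not None
            and exc.response.status_code == 429
        )


    throttling_retry = retry(
        retry=retry_if_exception(is_throttling_exception),
        wait=wait_exponential(),
        stop=stop_after_attempt(5),
        reraise=True,  # reraise HTTPError once attempts are exhausted
    )


    @throttling_retry
    def get(url: str) -> requests.Response:
        response = requests.get(url)
        response.raise_for_status()
        return response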
Streamline the production of new listers by aggressively moving core
functionality into progressively inherited (A->B->C) base classes with
the transport layer abstracted.
This should make common individual forge listers straightforward to
produce with minimal customization. The GitHub and Bitbucket listers
can be used as examples of the indexing type.