The latest beautifulsoup4 release (4.13) seems to have fixed issues
related to unexpected encodings in XML files, so a test that was
previously passing is now failing.
Update that test to check that origin URL and visit type can be
successfully extracted from a POM file with an unexpected encoding.
This new and special lister makes it possible to verify a list of origins
to archive provided by users (for instance through the Web API).
Its purpose is to avoid polluting the scheduler database with origins
that cannot be loaded into the archive.
Each origin is identified by a URL and a visit type. For a given visit
type, the lister checks whether the origin URL can be found and whether
the visit type is valid.
The supported visit types are those for VCS (bzr, cvs, hg, git and svn)
plus the one for loading tarball content into the archive.
Accepted origins are inserted or upserted in the scheduler database.
Rejected origins are stored in the lister state.
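As an illustration, here is a minimal sketch of what such a check could
look like for two visit types; the function name and the tarball visit
type label are hypothetical, not the lister's actual API:

    import subprocess

    import requests


    def check_origin(origin_url: str, visit_type: str) -> bool:
        """Return True if the origin looks loadable for the visit type."""
        if visit_type == "git":
            # a git origin is considered valid if its refs can be listed
            result = subprocess.run(
                ["git", "ls-remote", origin_url],
                capture_output=True,
                timeout=30,
            )
            return result.returncode == 0
        if visit_type == "tarball":  # hypothetical label for that loader
            # a tarball origin is considered valid if its URL is reachable
            response = requests.head(
                origin_url, allow_redirects=True, timeout=30
            )
            return response.ok
        return False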
Related to #4709
packaging.version.parse is dedicated to parsing Python package version
numbers, but crate versions do not necessarily respect Python version
number conventions, so some crate versions cannot be parsed.
Prefer to use looseversion.LooseVersion2 instead, which is a drop-in
replacement for the deprecated distutils.version.LooseVersion and can
parse all kinds of version numbers.
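A small sketch contrasting the two parsers (the crate version string
below is hypothetical):

    from looseversion import LooseVersion2
    from packaging.version import InvalidVersion, Version

    crate_version = "0.1.0-ac.1"  # hypothetical crate version string

    try:
        Version(crate_version)
    except InvalidVersion:
        pass  # "ac" is not a valid PEP 440 pre-release marker

    # LooseVersion2 parses it fine and keeps versions comparable
    assert LooseVersion2(crate_version) > LooseVersion2("0.0.9")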
The latest tenacity release introduced internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep (an approach that was not working
previously).
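A minimal sketch of the approach, assuming pytest-mock provides the
mocker fixture (the decorated function is a stand-in, not lister code):

    from tenacity import retry, stop_after_attempt, wait_fixed


    @retry(stop=stop_after_attempt(3), wait=wait_fixed(10), reraise=True)
    def flaky():
        raise ValueError("transient failure")


    def test_sleeps_are_mocked(mocker):
        # patching time.sleep directly keeps the test instantaneous,
        # whatever sleep helper tenacity wraps around it internally
        mocked_sleep = mocker.patch("time.sleep")
        try:
            flaky()
        except ValueError:
            pass
        assert mocked_sleep.call_count == 2  # two waits, three attempts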
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read an RDS file from Python instead of rpy2.
As swh-lister was still packaged for debian at the time, the choice was
made to use rpy2 instead, as a debian package is available for it while
none exists for pyreadr.
Now that debian packaging has been dropped for swh-lister, we can
reinstate the pyreadr-based implementation, which has the advantages of
being faster and of not depending on the R language runtime.
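A minimal sketch of the pyreadr-based reading, assuming a local copy of
the dump (filename hypothetical):

    import pyreadr

    # read_r returns an ordered dict mapping R object names to pandas
    # DataFrames; an RDS file holds a single unnamed object, keyed by None
    result = pyreadr.read_r("packages.rds")
    packages_df = result[None]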
Related to swh/meta#1709.
This module introduces the Julia lister.
It retrieves Julia package origins from the Julia General Registry, a Git
repository made of per-package directories containing TOML definition
files.
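For instance, the Git repository URL of a package can be read from its
Package.toml file with the standard tomllib module (the path below is an
example):

    import tomllib  # standard library since Python 3.11

    # each package directory in the General registry holds a Package.toml
    # with at least "name", "uuid" and "repo" entries
    with open("General/A/AbstractTrees/Package.toml", "rb") as f:
        package = tomllib.load(f)

    origin_url = package["repo"]  # Git repository URL of the package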
Previously, the lister relied on the CRANtools R module, but it has the
drawback of only listing the latest version of each package registered
in the CRAN registry.
In order to get all possible versions of each CRAN package, prefer to
exploit the content of the weekly dump of the CRAN database in RDS
format.
To read the content of the RDS file from Python, the rpy2 package is
used, as it has the advantage of being packaged in debian.
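A minimal sketch of reading the dump through rpy2 (path hypothetical):

    from rpy2 import robjects

    # call R's readRDS from Python; the result is an R data frame that
    # can then be converted (e.g. with rpy2.robjects.pandas2ri)
    read_rds = robjects.r["readRDS"]
    cran_db = read_rds("packages.rds")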
Related to swh/meta#1709.
Depending on the instance, some specific heuristics are needed; some
instances:
- have summary pages which do not list metadata_url (so some computation
  happens to list git:// origins which are cloneable)
- have summary pages which reference metadata_url as multiple
  comma-separated URLs
- list relative URLs for repositories, so we need to join them with the
  main instance URL to obtain complete cloneable origins (or summary
  pages)
- list "down" http origins (cloning those won't work), so we list those
  as cloneable https ones (when the main URL is behind https), as
  sketched below
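A rough sketch of the last two heuristics (the helper name is
hypothetical):

    from urllib.parse import urljoin, urlparse


    def fix_origin_url(instance_url: str, repo_url: str) -> str:
        # relative repository URL: join it with the main instance URL
        if not urlparse(repo_url).scheme:
            return urljoin(instance_url, repo_url)
        # http origin on an https instance: assume the https variant works
        if repo_url.startswith("http://") and instance_url.startswith(
            "https://"
        ):
            return "https://" + repo_url[len("http://"):]
        return repo_url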
Refs. swh/devel/swh-lister#1800
Instead of using an undocumented rubygems HTTP endpoint that only
gives us the names of the gems, prefer to exploit the daily PostgreSQL
dump of the rubygems.org database.
It makes it possible to list all gems, but also all versions of a gem
and its release artifacts. For each release artifact, the following
information is extracted: version, download URL, sha256 checksum,
release date, plus a couple of extra metadata.
The lister will now set the list of artifacts and the list of metadata
as extra loader arguments when sending a listed origin to the scheduler
database.
A last_update date is also computed, which should ensure loading tasks
for rubygems will be scheduled only when new releases are available
since the last loading.
To be noted, the lister will spawn a temporary postgres instance, so
this requires the initdb executable from a postgres server installation
to be available in the execution environment.
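A rough sketch of the temporary postgres setup (the restore step is
elided):

    import subprocess
    import tempfile

    datadir = tempfile.mkdtemp()

    # initialize a throwaway cluster; initdb must come from a postgres
    # server installation available in the execution environment
    subprocess.run(["initdb", "-D", datadir, "--no-sync"], check=True)

    # start it on a unix socket in the data directory only, no TCP port
    subprocess.run(
        ["pg_ctl", "-D", datadir, "-o", f"-k {datadir} -h ''", "-w", "start"],
        check=True,
    )

    # ... restore the rubygems.org dump, run the listing queries ...

    subprocess.run(["pg_ctl", "-D", datadir, "-w", "stop"], check=True)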
Related to T1777
xmltodict cannot parse POM files with multi-byte encoding, so prefer to
use the XML parser of BeautifulSoup, based on lxml, instead.
Also drop the xmltodict requirement as it is no longer used in the
swh-lister codebase.
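A minimal sketch of the parsing, assuming lxml is installed (filename
hypothetical):

    from bs4 import BeautifulSoup

    # features="xml" selects lxml's XML parser, which handles multi-byte
    # encodings declared in the XML prolog
    with open("project.pom", "rb") as f:
        pom = BeautifulSoup(f.read(), features="xml")

    artifact_id = pom.find("artifactId")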
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all source archives
and pom files in the maven repository. The pom files are then downloaded
and analysed to find and yield any scm reference.
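A rough sketch of that analysis step, using xmltodict as the parser
(filename hypothetical):

    import xmltodict

    with open("project.pom", "rb") as f:
        project = xmltodict.parse(f.read()).get("project", {})

    # e.g. "scm:git:https://github.com/example/project.git"
    scm_connection = (project.get("scm") or {}).get("connection")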
Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.
Related to T1724
xmltodict now raises an error while trying to parse the HTML content
of the https://pypi.org/simple/ page.
So use the BeautifulSoup HTML parser instead, as it is already a
requirement of swh-lister and it does not fail to parse the PyPI HTML
page.
Also drop the no longer used xmltodict from requirements.
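A minimal sketch of the parsing (the html.parser backend here is an
assumption, any BeautifulSoup backend works):

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://pypi.org/simple/")
    page = BeautifulSoup(response.text, features="html.parser")

    # each <a> tag of the simple index links to a project page
    project_names = [a.text for a in page.find_all("a")]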
Legacy Lister classes from the swh.lister.core module are no longer
used in the swh-lister codebase, so it is time to remove them.
Also remove the lister CLI options related to the legacy Lister API.
As a consequence, the following requirements are no longer needed:
arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql.
Closes T2442
Add a swh.lister.utils.throttling_retry decorator enabling to retry a
function that performs an HTTP request which can return a 429 status
code.
The implementation is based on the tenacity module, and it is assumed
that the requests library is used when querying a URL.
The default wait strategy is based on exponential backoff.
The default maximum number of attempts is set to 5, after which the
HTTPError exception will be reraised.
All tenacity.retry parameters can also be overridden in client code.
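A minimal sketch of how such a decorator can be assembled with tenacity
(not the exact swh.lister.utils implementation):

    import requests
    from tenacity import (
        retry,
        retry_if_exception,
        stop_after_attempt,
        wait_exponential,
    )


    def is_throttling_exception(exc: BaseException) -> bool:
        return (
            isinstance(exc, requests.exceptions.HTTPError)
            and exc.response is not None
            and exc.response.status_code == 429
        )


    throttling_retry = retry(
        retry=retry_if_exception(is_throttling_exception),
        wait=wait_exponential(),
        stop=stop_after_attempt(5),
        reraise=True,  # reraise HTTPError once attempts are exhausted
    )


    @throttling_retry
    def get(url: str) -> requests.Response:
        response = requests.get(url)
        response.raise_for_status()
        return response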
Streamline the production of new listers by aggressively moving core
functionality into progressively inherited (A->B->C) base classes with
the transport layer abstracted.
This should make common individual forge listers straightforward to
produce with minimal customization. The GitHub and Bitbucket listers
can be used as examples of the indexing type.