As the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors were reported
related to beautifulsoup4 typing.
As the return type of the bs4 find method is the union
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing, which is not great.
So prefer the select_one method instead, which returns Optional[Tag]:
a simple None check is enough to ensure typing is correct.
In a similar manner, replace uses of the find_all method by the select
method. This also has the advantage of simplifying the code.
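A minimal sketch of the difference, with an illustrative HTML snippet
(not taken from the lister code):

    from bs4 import BeautifulSoup, Tag

    soup = BeautifulSoup('<a class="pkg" href="/foo">foo</a>', "html.parser")

    # With find, the result type is Tag | NavigableString | None, so an
    # isinstance call is needed before accessing Tag attributes:
    link = soup.find("a", attrs={"class": "pkg"})
    if isinstance(link, Tag):
        print(link["href"])

    # With select_one, the result type is Optional[Tag]: a simple None
    # check is enough.
    link = soup.select_one("a.pkg")
    if link is not None:
        print(link["href"])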
Some Guix packages correspond to subset exports of a subversion source
tree at a given revision, typically the TeX Live ones.
In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export, but also use a unique origin URL
for each package to archive: otherwise the same URL would be shared
and only a single package would be archived.
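A hypothetical sketch of the resulting listed origins; the parameter
name, URL scheme and package paths below are illustrative assumptions,
not the exact loader API:

    svn_url = "svn://example.org/texlive/tags/texlive-2023.0"
    packages = {
        "texlive-amsmath": ["/Master/texmf-dist/tex/latex/amsmath"],
        "texlive-geometry": ["/Master/texmf-dist/tex/latex/geometry"],
    }

    for name, paths in packages.items():
        listed_origin = {
            # A unique origin URL per package: reusing the bare svn_url
            # would collapse all packages into a single archived origin.
            "url": f"{svn_url}?package={name}",
            "visit_type": "svn-export",
            "extra_loader_arguments": {
                # Sub-paths of the source tree to export (assumed name).
                "svn_paths": paths,
            },
        }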
Related to swh/infra/sysadm-environment#5263.
In addition to query parameters, also check if any part of the URL path
contains a tarball filename.
This fixes the detection of some tarball URLs provided in the Guix manifest.
Related to swh/meta#3781.
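A minimal sketch of the idea; the helper name and extension list are
illustrative:

    from urllib.parse import urlparse

    TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")

    def is_tarball_url(url: str) -> bool:
        parsed = urlparse(url)
        # Check query parameter values, as before...
        candidates = [v for _, _, v in
                      (p.partition("=") for p in parsed.query.split("&"))]
        # ...but also every component of the URL path.
        candidates += parsed.path.split("/")
        return any(c.endswith(TARBALL_EXTENSIONS) for c in candidates)

    # A tarball filename buried in the path, not in a query parameter:
    assert is_tarball_url("https://example.org/dl/foo-1.0.tar.gz/download")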
Use another API endpoint that helps the lister to be stateful.
The endpoint needs a ``since`` value that represents a sequential
index in the history.
The ``all_packages_count`` state stores a count which is used as the
``since`` argument on the next run.
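A minimal sketch of the stateful pagination; the endpoint URL and HTTP
method below are assumptions about the registry API, for illustration
only:

    import requests

    all_packages_count = 42  # persisted in the lister state from last run

    # Ask only for packages published after that sequential index.
    response = requests.post(
        f"https://package.elm-lang.org/all-packages/since/{all_packages_count}"
    )
    new_packages = response.json()

    # The updated count is stored back and becomes ``since`` next run.
    all_packages_count += len(new_packages)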
The Elm lister lists Elm package origins from the Elm lang registry.
It uses an HTTP API endpoint to list package origins.
Origins are GitHub repositories; releases take advantage of the GitHub
release API.
Guix now provides a "submodule" info in the sources.json file it
produces, so exploit it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
required.
Related to swh/devel/swh-loader-git#4751.
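A sketch of how the flag might be forwarded to the loader; apart from
the "submodule"/"submodules" names quoted above, the field names are
assumptions:

    source = {
        "type": "git",
        "git_url": "https://example.org/repo.git",
        "submodule": True,  # new info provided in sources.json
    }

    listed_origin = {
        "url": source["git_url"],
        "visit_type": "git-checkout",
        "extra_loader_arguments": {
            # Only fetch submodules when the manifest says they exist.
            "submodules": source["submodule"],
        },
    }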
Add a state to the lister to store the ``last_seen_commit`` as a Git
commit hash.
Use Dulwich to build a Git commit walker starting from
``last_seen_commit`` if any.
For each commit, detect whether it is a new package or a new package
version commit, and return its origin with the commit date as
last_update.
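A minimal sketch of the incremental walk with Dulwich; the repository
path and state handling are illustrative, not the actual lister code:

    from dulwich.repo import Repo

    repo = Repo("General")  # local clone of the registry
    last_seen_commit = None  # or the hash persisted from the previous run

    # Exclude everything reachable from last_seen_commit so that only
    # new commits are walked, oldest first.
    exclude = [last_seen_commit] if last_seen_commit else None
    walker = repo.get_walker(exclude=exclude, reverse=True)

    for entry in walker:
        commit = entry.commit
        # entry.changes() tells which files were touched, from which a
        # new package or a new package version can be detected;
        # commit.commit_time becomes the origin's last_update.
        print(commit.id, commit.commit_time)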
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read an RDS file from Python instead of rpy2.
As swh-lister was still packaged for debian at the time, the choice was
made to use rpy2 instead, as a debian package is available for it but not
for pyreadr.
Now that debian packaging has been dropped for swh-lister, we can reinstate
the pyreadr based implementation, which has the advantages of being faster
and not depending on the R language runtime.
Related to swh/meta#1709.
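A minimal sketch of the pyreadr based reading; the file and column
names are illustrative:

    import pyreadr

    # read_r returns an OrderedDict of pandas dataframes; an RDS file
    # holds a single unnamed object, stored under the None key.
    result = pyreadr.read_r("packages.rds")
    df = result[None]

    for _, row in df.iterrows():
        print(row["Package"], row["Version"])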
That fails the current loader ingestion, as this must be an exact value
(when provided, it is checked against the download operation).
Refs. swh/infra/sysadm-environment#4746
In order to simplify the testing of listers, allow calling the run command
of the swh-lister CLI without scheduler configuration. In that case a
temporary scheduler instance with a postgresql backend is created and used.
This makes it easy to test a lister with the following command:
$ swh -l DEBUG lister run <lister_name> url=<forge_url>
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` property will always be set. So we need to ensure it is
not `None` before looking for the status code, for example.
This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better match this particular aspect. See:
https://github.com/python/typeshed/pull/10875
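A minimal sketch of the defensive check:

    import requests

    try:
        response = requests.get("https://example.org/api")
        response.raise_for_status()
    except requests.HTTPError as e:
        # e.response is typed Optional[Response]: guard before use.
        if e.response is not None and e.response.status_code == 404:
            pass  # e.g. skip this origin
        else:
            raise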
This module introduces the Julia lister.
It retrieves Julia package origins from the Julia General registry, a Git
repository made of per-package directories with TOML definition files.
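An illustrative excerpt of that layout (the General registry shards
packages under first-letter directories):

    General/
      J/
        JSON/
          Package.toml    # name, uuid and repository URL (the origin)
          Versions.toml   # one entry per released version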
Similar to cgit, there exist cases where git clone URLs for projects hosted
on a gitweb instance cannot be found when scraping project pages, or cannot
be easily derived from the gitweb instance root URL.
So add an optional base_git_url parameter enabling the computation of
correct clone URLs by appending project names to it.
Some gitweb instances can also display string prefixes before the git
clone URLs, so ensure to strip them to properly extract the URLs.
Related to swh/infra/sysadm-environment#5051.
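A sketch of the resulting fallback, with illustrative names:

    from typing import Optional

    base_git_url = "https://git.example.org/git"

    def clone_url(project_name: str, scraped_url: Optional[str]) -> str:
        if scraped_url:
            # Strip a display prefix such as "git clone " that some
            # instances show before the URL.
            return scraped_url.split()[-1]
        # Otherwise derive the clone URL from the configured base.
        return f"{base_git_url.rstrip('/')}/{project_name}"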
rstrip does not remove a string suffix (it strips any trailing characters
drawn from a given set), so use another way to extract the gitweb project
name.
This fixes the computation of some gitweb origin URLs.
Related to swh/infra/sysadm-environment#5050.
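The pitfall, illustrated:

    # rstrip removes any trailing characters drawn from the given set,
    # not the literal suffix:
    "frotz.git".rstrip(".git")   # -> "frotz", looks fine, but...
    "config.git".rstrip(".git")  # -> "conf": trailing "g" and "i" eaten too

    # A safe way to drop the suffix (str.removesuffix needs Python >= 3.9):
    def strip_suffix(name: str, suffix: str = ".git") -> str:
        return name[: -len(suffix)] if name.endswith(suffix) else name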
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.
Add a new test checking that lister classes have the mandatory parameters
declared in their constructors. The purpose is to avoid deployment issues
on staging or production environments, as celery tasks can fail to be
executed if mandatory parameters are not handled by listers.
Related to swh/infra/sysadm-environment#5030.
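A minimal sketch of such a test using introspection; the discovery of
the lister classes is elided and the names are illustrative:

    import inspect

    MANDATORY_PARAMS = {"scheduler", "url", "instance", "credentials"}

    def check_lister_constructor(lister_class) -> None:
        params = set(inspect.signature(lister_class.__init__).parameters)
        missing = MANDATORY_PARAMS - params
        assert not missing, (
            f"{lister_class.__name__} constructor is missing {missing}"
        )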
Previously, the lister relied on the CRANtools R module, but it has the
drawback of only listing the latest version of each registered package
in the CRAN registry.
In order to get all possible versions for each CRAN package, prefer to
exploit the content of the weekly dump of the CRAN database in RDS format.
To read the content of the RDS file from Python, the rpy2 package is used,
as it has the advantage of being packaged in debian.
Related to swh/meta#1709.
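A minimal sketch of the rpy2 based reading; the file and column names
are illustrative:

    import rpy2.robjects as robjects

    readRDS = robjects.r["readRDS"]
    db = readRDS("packages.rds")  # the weekly dump is an R data frame

    # Column access goes through the R object interface:
    package_names = list(db.rx2("Package"))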
As Red Hat based Linux distributions share the same type of package
repository, rework the fedora lister into a generic one that lists RPM
source packages and their versions from numerous distributions.
For a given distribution, the RPM lister fetches package metadata from a
list of release identifiers and a list of software components. Source
packages are then processed and relevant information is extracted to be
sent to the RPM loader.
When all releases and components have been processed, the lister collects
all versions for each package name and sends that information to the
scheduler, which will create RPM loading tasks afterwards.
Nevertheless, as there is no generic way to list all releases and components
for a given distribution, nor to guess the right URL to retrieve package
metadata from, this information needs to be manually provided to the lister
as input parameters. Some examples of those parameters for various
distributions can be found in the config directory of the lister (see the
sketch below).
Regarding the produced origin URLs, as there is no way to find valid HTTP
ones for all distributions, the same behavior as with the debian lister is
used: they have the form rpm://{instance}/packages/{package_name}, where
the instance variable corresponds to the name of the listed distribution,
such as Fedora, CentOS, or openSUSE.
Related to swh/meta#5011.
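An illustrative shape of those input parameters, expressed as a Python
dict; the exact keys are assumptions (see the lister's config directory
for the real examples):

    rpm_lister_params = {
        "url": "https://archives.fedoraproject.org/pub/archive/fedora/linux/",
        "instance": "Fedora",
        "releases": ["36", "37"],
        "components": ["Everything"],
    }
    # Produced origin URLs then look like:
    #   rpm://Fedora/packages/{package_name}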
Instead of sending one page with all origins listed, which is brittle.
When something goes wrong during the listing, the lister currently records
nothing, with or without retry (retry support being a matter for a future
version of swh.core).
This change skips the origin when such an error sporadically happens; it
should get picked up by another listing eventually.
The listing currently fails to finish when the GitHub server hangs up on
the process. Adding this behavior allows skipping the issue without
breaking the listing.
The current lister implementation retrieves very little metadata with the
hard-coded /p/ base URL (404 on almost all packages). The packagist API
implementation must have evolved since the initial implementation of the
lister (and the first deployment on staging).
Following the upstream documentation [1], it is sensible to first use the
/p2/ scheme, as it is the most performant on the packagist API side. The
lister then falls back to the /p2/ + ~dev URL scheme, then the /p/ scheme,
and finally the /packages/ base URL, whenever the previous result is either
not found or empty (which is different from "no modification since the last
visit").
It keeps the initial implementation behavior of stopping immediately if a
304 Not Modified is returned by the server.
[1] https://repo.packagist.org/apidoc
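A sketch of the fallback chain; the endpoint templates follow the
apidoc, the helper itself is illustrative (package names include the
vendor, e.g. "monolog/monolog"):

    import requests

    ENDPOINTS = [
        "https://repo.packagist.org/p2/{name}.json",
        "https://repo.packagist.org/p2/{name}~dev.json",
        "https://repo.packagist.org/p/{name}.json",
        "https://repo.packagist.org/packages/{name}.json",
    ]

    def fetch_package_metadata(name: str, last_visit: str):
        headers = {"If-Modified-Since": last_visit}
        for template in ENDPOINTS:
            response = requests.get(template.format(name=name), headers=headers)
            if response.status_code == 304:
                # No modification since the last visit: stop immediately.
                return None
            if response.ok and response.json():
                return response.json()
        return None  # not found or empty everywhere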
Prior to this commit, the newly introduced check on URL validity was
consuming the stream of origins. In effect, origin records were no longer
written regularly: for all listers, origins were flushed only at the end of
the listing, which could take a while for some (e.g. the packagist lister
has currently been running for more than 12h without writing anything in
the scheduler).
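The pitfall, sketched with hypothetical stand-ins for the real lister
helpers:

    def is_valid(url: str) -> bool:  # hypothetical stand-in
        return url.startswith("https://")

    urls = ["https://example.org/a", "not a url"]

    # Eager validation consumes the whole lazy stream up front, so no
    # origin record gets written until the very end:
    eager = [u for u in (url for url in urls) if is_valid(u)]

    # Streaming-friendly alternative: filter lazily so that records
    # keep being flushed page by page as the generator is iterated.
    def valid_origins(origins):
        for origin in origins:
            if is_valid(origin):
                yield origin

    for origin in valid_origins(url for url in urls):
        print(origin)  # written as soon as it is produced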
That lister is really close to the cgit and gitweb implementations, but the
DOM data is again structured differently, so this implementation stands on
its own.
Refs. swh/meta#5048
Gitiles instances deliberately return a malformed JSON output (JSON prefixed
with ``)]}'\n``) [2]. The lister deals with it to properly parse the JSON
response nonetheless: it drops the prefix and then parses the JSON.
If at some point upstream drops this prefix to return JSON directly, the
lister will be able to deal with that too. There are two tests, one with
the 'standard' gitiles format and another with standard JSON, to account
for both cases.
Refs. swh/meta#5045
[2] https://github.com/google/gitiles/issues/263
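A minimal sketch of the prefix handling (the constant matches the
anti-XSSI prefix quoted above):

    import json

    GITILES_PREFIX = ")]}'\n"

    def parse_gitiles_json(text: str):
        # Handle both the prefixed output and plain JSON, in case
        # upstream ever drops the prefix.
        if text.startswith(GITILES_PREFIX):
            text = text[len(GITILES_PREFIX):]
        return json.loads(text)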