Previously it could be set by any call to the `set_state_in_scheduler`
method.
This led to side effects on the save-bulk lister: its scheduler state was
updated when an invalid or not-found origin was encountered, and thus the
listing failed.
Fixes #4712.
It enables tracking the last lister execution date and will be used to schedule
first visits with high priority for listed origins.
Related to swh/devel/swh-scheduler#4687.
The sourceforge lister sends various HTTP requests to get info about a
project, for instance to get the branch name of a Bazaar project.
HTTP errors occurring during these steps were discarded so the listing could
continue, but connection errors were not, and as a consequence the listing
failed when such an error was encountered.
Currently, the legacy Bazaar project hosted on SourceForge seems to be down
and connection errors are raised when attempting to fetch branch names, so the
lister does not process all projects as it crashes mid-flight.
This new, special-purpose lister verifies a list of origins to archive
provided by users (for instance through the Web API).
Its purpose is to avoid polluting the scheduler database with origins that
cannot be loaded into the archive.
Each origin is identified by a URL and a visit type. For a given visit type,
the lister checks whether the origin URL can be found and whether the visit
type is valid.
The supported visit types are those for VCSs (bzr, cvs, hg, git and svn) plus
the one for loading tarball content into the archive.
Accepted origins are inserted or upserted in the scheduler database.
Rejected origins are stored in the lister state.
Related to #4709.
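A hedged sketch of the origin check described in this entry (the helper and
the tarball visit type name are assumptions, and the real verification is more
thorough)::

    import requests

    # "tarball-directory" as the tarball visit type name is an assumption
    SUPPORTED_VISIT_TYPES = {"bzr", "cvs", "hg", "git", "svn", "tarball-directory"}

    def check_origin(url: str, visit_type: str) -> bool:
        """Return True if the visit type is supported and the origin URL is reachable."""
        if visit_type not in SUPPORTED_VISIT_TYPES:
            return False
        try:
            response = requests.head(url, allow_redirects=True, timeout=30)
            return response.ok
        except requests.RequestException:
            return False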
Instead of having a single crate and its versions info per page,
prefer to have up to 1000 crates per page to significantly speed up
the listing process.
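For illustration, batching crates into pages of up to 1000 entries could be
done along these lines (a sketch, not the lister's code)::

    from itertools import islice
    from typing import Iterable, Iterator, List

    def pages_of(crates: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
        """Group crates into pages of up to ``size`` entries."""
        iterator = iter(crates)
        while page := list(islice(iterator, size)):
            yield page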
Previously, the lister state was recorded regardless of whether errors
occurred when listing crates, as the finalize method is called even if an
exception is raised during listing.
As a consequence, some crates could be missed since the incremental listing
restarts from the dump date of the last processed crate database.
So ensure all crates have been processed by the lister before recording
its state.
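A minimal sketch of that guard (attribute and helper names are illustrative of
the base lister pattern, not taken from the code)::

    class CratesLister:
        listing_finished = False

        def get_pages(self):
            for page in self.db_dump_pages():  # hypothetical helper
                yield page
            # only reached once every page has been yielded without error
            self.listing_finished = True

        def finalize(self):
            # finalize() runs even when get_pages() raised, so guard the update
            if self.listing_finished:
                self.updated = True  # triggers recording of the lister state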
packaging.version.parse is dedicated to parsing Python package version
numbers, but crate versions do not necessarily follow Python version
number conventions and thus some crate versions cannot be parsed.
Prefer to use looseversion.LooseVersion2 instead, which is a drop-in
replacement for the deprecated distutils.version.LooseVersion and can
parse all kinds of version numbers.
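For example (a hedged illustration; the crate version strings are made up)::

    from looseversion import LooseVersion2
    from packaging.version import InvalidVersion, parse

    versions = ["1.0.0", "1.0.0-x.7.z.92", "2.0.0-rc.1"]

    # packaging rejects SemVer pre-release identifiers such as "x.7.z.92"
    try:
        parse("1.0.0-x.7.z.92")
    except InvalidVersion:
        pass

    # LooseVersion2 accepts them, so picking the latest version keeps working
    latest = max(versions, key=LooseVersion2)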
The latest tenacity release introduces internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep (which was not working previously).
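A minimal sketch of such a test fixture, assuming pytest-mock is available::

    import pytest

    @pytest.fixture
    def mocked_sleep(mocker):
        # patch time.sleep itself rather than tenacity internals,
        # which change between releases
        return mocker.patch("time.sleep")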
The Gitea API returns the next pagination link with all query parameters
provided to an API request.
As we were also passing a dict of fixed query parameters to the page_request
method, some query parameters ended up having multiple instances in the URL
used to fetch a new page of repositories data. So each time a new page was
requested, new instances of these parameters were appended to the URL, which
could result in a really long URL if the number of pages to retrieve is high
and make the request fail.
Also remove a debug log already present in the http_request method.
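A hedged sketch of the fix (simplified; only the method names mentioned above
are from the lister)::

    from urllib.parse import urlparse

    class GiteaLister:  # simplified sketch
        def page_request(self, url, params):
            if urlparse(url).query:
                # the "next" link already embeds every query parameter of the
                # previous request, so do not pass the fixed parameters again
                params = {}
            response = self.http_request(url, params=params)
            return response.json(), response.links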
Redirection URLs can be long and quite obscure in some cases (GitHub CDN
for instance), so make sure to use the redirected URL as the origin URL.
Related to swh/meta#5090.
As the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors related to
beautifulsoup4 typing were reported.
As the return type of the find method of bs4 is the union
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing, which is not great.
So prefer to use the select_one method instead, where a simple None check
is enough to ensure correct typing as it returns Optional[Tag].
In a similar manner, replace uses of the find_all method with the select
method.
This also has the advantage of simplifying the code.
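For illustration (a standalone snippet, not taken from the lister)::

    from bs4 import BeautifulSoup, Tag

    soup = BeautifulSoup('<a class="pkg" href="/p/foo">foo</a>', "html.parser")

    # find() returns Tag | NavigableString | None: an isinstance check is needed
    link = soup.find("a", attrs={"class": "pkg"})
    if isinstance(link, Tag):
        href = link["href"]

    # select_one() returns Optional[Tag]: a simple None check is enough
    link = soup.select_one("a.pkg")
    if link is not None:
        href = link["href"]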
Some Guix packages correspond to subset exports of a Subversion source
tree at a given revision, typically the TeX Live ones.
In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export, but also use a unique origin URL
for each package to archive, as otherwise the same one would be used
and only a single package would be archived.
Related to swh/infra/sysadm-environment#5263.
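A hedged sketch of what such a listed origin could look like; the svn_paths
parameter name and the way the origin URL is made unique are assumptions based
on the description above, not the actual code::

    from swh.scheduler.model import ListedOrigin

    listed_origin = ListedOrigin(
        lister_id=lister.lister_obj.id,  # placeholder for the lister's id
        visit_type="svn-export",
        # a unique origin URL per package, derived from the exported sub-path
        url="https://svn.example.org/texlive/trunk?p=texmf-dist/tex/latex/foo",
        extra_loader_arguments={
            "svn_paths": ["texmf-dist/tex/latex/foo"],  # assumed parameter name
        },
    )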
In addition to query parameters, also check whether any part of the URL path
contains a tarball filename.
This fixes the detection of some tarball URLs provided in the Guix manifest.
Related to swh/meta#3781.
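An illustrative version of that check (a sketch; the real implementation
handles more cases)::

    from urllib.parse import urlparse

    TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")

    def url_contains_tarball_filename(url: str) -> bool:
        parsed = urlparse(url)
        # look at every path segment and query parameter, not only the last one
        candidates = parsed.path.split("/") + parsed.query.split("&")
        return any(part.endswith(TARBALL_EXTENSIONS) for part in candidates)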
Commit c2402f405f renamed the entry points from `lister.*` without
updating the rest of the framework. Revert the changes (and sort the
list alphabetically).
Use another API endpoint that helps the lister to be stateful.
The API endpoint used needs a ``since`` value that represents a
sequential index in the history.
The ``all_packages_count`` state stores a count which will be
used as the ``since`` argument on the next run.
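A minimal sketch of such a state object (the dataclass shape follows the usual
lister state pattern; details may differ)::

    from dataclasses import dataclass

    @dataclass
    class ListerState:
        # number of packages seen so far; reused as the ``since`` index next run
        all_packages_count: int = 0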
The Elm lister lists Elm package origins from the Elm lang registry.
It uses an HTTP API endpoint to list package origins.
Origins are GitHub repositories; releases take advantage of the
GitHub release API.
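For illustration, listing package origins from the registry could look roughly
like this (the endpoint and response shape are assumptions about the public
Elm package registry, not taken from the lister)::

    import requests

    response = requests.get("https://package.elm-lang.org/all-packages", timeout=30)
    response.raise_for_status()
    for package_name in response.json():  # names look like "<github-user>/<repo>"
        origin_url = f"https://github.com/{package_name}"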
Guix now provides a "submodule" info in the sources.json file it
produces, so exploit it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
required.
Related to swh/devel/swh-loader-git#4751.
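A hedged sketch of how that flag could be propagated (names other than
"submodule" and "submodules" are illustrative)::

    def loader_arguments(source: dict) -> dict:
        # ``source`` is one "git" entry of the sources.json manifest
        args = {"ref": source.get("git_ref")}  # illustrative names
        if source.get("submodule"):
            # only ask the git-checkout loader to fetch submodules when needed
            args["submodules"] = True
        return args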
Add a state to the lister to store the ``last_seen_commit`` as a Git
commit hash.
Use Dulwich to get a walker over the Git commits made since
``last_seen_commit``, if any.
For each commit, detect whether it is a new package or a new package
version commit and return its origin with the commit date as
last_update.
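A hedged sketch of that walk with Dulwich (repository handling and state
access are simplified)::

    from datetime import datetime, timezone
    from typing import Iterator, Optional, Tuple

    from dulwich.repo import Repo

    def walk_new_commits(
        repo_path: str, last_seen_commit: Optional[str]
    ) -> Iterator[Tuple[str, datetime]]:
        repo = Repo(repo_path)  # assumed local clone of the registry repository
        exclude = [last_seen_commit.encode()] if last_seen_commit else None
        # oldest-first walk over the commits not seen by a previous run
        for entry in repo.get_walker(exclude=exclude, reverse=True):
            commit = entry.commit
            last_update = datetime.fromtimestamp(commit.commit_time, tz=timezone.utc)
            yield commit.id.decode(), last_update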
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read an RDS file from Python instead of rpy2.
As swh-lister was still packaged for Debian at the time, the choice was made
to use rpy2 instead, as a Debian package is available for it while there is
none for pyreadr.
Now that Debian packaging has been dropped for swh-lister, we can reinstate
the pyreadr-based implementation, which has the advantages of being faster
and not depending on the R language runtime.
Related to swh/meta#1709.
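For illustration, reading such an RDS dump with pyreadr looks roughly like
this (the file path is illustrative)::

    import pyreadr

    result = pyreadr.read_r("/tmp/packages.rds")  # returns an OrderedDict
    # an RDS file holds a single unnamed R object, here a data frame
    packages = result[None]
    for package in packages.itertuples():
        ...  # build an origin per CRAN package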
That makes the current loader ingestion fail, as this must be an exact value
(when provided, it is checked against the download operation).
Refs. swh/infra/sysadm-environment#4746
In order to simplify the testing of listers, allow calling the run command
of the swh-lister CLI without a scheduler configuration. In that case, a
temporary scheduler instance with a postgresql backend is created and used.
It enables easily testing a lister with the following command:
$ swh -l DEBUG lister run <lister_name> url=<forge_url>
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` property will always be set. So we need to ensure it is
not `None` before looking for the return code, for example.
This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better match this particular aspect. See:
https://github.com/python/typeshed/pull/10875
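For example, a defensive check along these lines::

    import requests

    def status_code_of(url: str) -> int:
        """Return the HTTP status code, tolerating error responses (illustrative)."""
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.status_code
        except requests.HTTPError as http_error:
            # the ``response`` attribute may be None, so check it before use
            if http_error.response is not None:
                return http_error.response.status_code
            raise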