Previously it could be set by any call to the `set_state_in_scheduler`
method.
This was leading to side effects on the save bulk lister while updating
the scheduler state when encountering an invalid or not found origin,
and thus the listing failed.
Fixes#4712.
It enables to track last lister execution date and will be used to schedule
first visits with high priority for listed origins.
Related to swh/devel/swh-scheduler#4687.
Instead of sending one page with all origins listed which is britle.
When something goes wrong during the listing, the lister currently records nothing.
Prior to this commit, the newly introduced check on url validity was consuming the
stream of origins. In effect, this would no longer write origin records regularly.
For all listers, that would translate to flush origins only at the end of the listing
which could take a while for some (e.g. packagist lister has been running for more than
12h currently without writing anything in the scheduler).
This pushes the rather elementary logic within the lister's scope. This will simplify
and unify cli call between lister and scheduler clis. This will also allow to reduce
erroneous operations which can happen for example in the add-forge-now.
With the following, we will only have to provide the type and the instance, then
everything will be scheduled properly.
Refs. swh/devel/swh-lister#4693
Instead of fully consuming the get_origins_from_page generator into
a list and truncate it, prefer to consume the generator origin per
origin and abort the process when the max number of origin per page
is reached.
Indeed some non trivial listers like the cgit one can perform costly
processing, HTTP request for instance, for each origin in a page.
So better not consuming the full generator in a row to avoid such
side effects.
This cuts down one more manual step in the add forge now validation
process: we can add the relevant origins to the staging scheduler
without enabling them at all.
This will allow more automation of the staging add forge now process:
for known-good listers, we can limit the number of origins being
processed and reduce the amount of manual steps taken for each instance.
Deploying the nixguix lister, I realized that even though the credentials configuration
is properly set for all listers, the listers actually requiring github origin
canonicalization do not have access to the github credentials. It's lost during the
constructor to only focus on the lister's credentials. Which currently translates to
listers being rate-limited.
This commit fixes it by pushing the self.github_session instantiation in the constructor
when the lister explicitely requires the github session. Hence lifting the rate limit
for maven, packagist, nixguix, and github listers.
Related to infra/sysadm-environment#4655
Previously, the run method was returning the total count of ListedOrigin
objects sent to scheduler database.
However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).
This changes ensures an accurate count of listed origins by maintaining
a set of origin URLs associated to the sent ListedOrigin objects.
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.
The following template is used to generate it:
"Software Heritage <lister_name> lister v<swh-lister version>
(+https://www.softwareheritage.org/contact)"
Numerous listers were using the same page_request method or equivalent
in their implementation so prefer to deduplicate that code by adding
an http_request method in base lister class: swh.lister.pattern.Lister.
That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting, also an HTTPError
will be raised if a request ends up with an error.
All listers using that new method now benefit of requests retry when
an HTTP error occurs thanks to the use of the http_retry decorator.
Make the instance parameter of the base pattern lister optional and set
lister name to URL network location when not provided.
It simplifies lister creation when associated forge type have a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.
Also process listers for forge with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.
Related to T3403
As origins is a generator, the previous behavior would try to consume the overall
generator to send the records.
This groups and sends batch of 100 origins to the scheduler for writing.
Related to T3003
Fix error when a configuration value loaded from a config file is also
given as keyword parameter to the from_configfile method.
Override configuration loaded from config file only if the provided
value is not None.
We stop depending on the ListerBase implementation. The main hoop we're jumping
through is the config override mechanism in swh.lister.get_lister, as it's
really specifc to the ListerBase `override_config` argument, which is dropped in
pattern.Lister (in favor of explicit arguments at lister instantiation).
We implement a small shim in swh.lister.pattern.Lister to give
backwards-compatibility for the new pattern to get_lister.
This generic configuration override mechanism will probably be completely
removed when the configuration mechanism is reworked. We'll see.
This new pattern uses the lister support features introduced in swh.scheduler to
replace the database management done in previous iterations of the listers.