Commit graph

26 commits

Author SHA1 Message Date
Antoine Lambert
eadb704494 pattern: Ensure termination date is set at the end of listing process
Previously it could be set by any call to the `set_state_in_scheduler`
method.

This was leading to side effects on the save bulk lister while updating
the scheduler state when encountering an invalid or not found origin,
and thus the listing failed.

Fixes #4712.
2024-10-24 12:33:40 +02:00
Antoine Lambert
0e1093e308 pattern: Add first_visits_queue_prefix parameter to Lister constructor
It makes it possible to declare a lister whose first visits of listed origins must
be scheduled with high priority.

Related to swh/devel/swh-scheduler#4687.
2024-10-14 15:03:42 +02:00
Antoine Lambert
7609ebf7e1 pattern: Store termination date to scheduler database at end of listing
It makes it possible to track the last lister execution date and will be used to schedule
first visits with high priority for listed origins.

Related to swh/devel/swh-scheduler#4687.
2024-10-14 15:03:28 +02:00
Antoine R. Dumont (@ardumont)
b02144b4f9
packagist: Yield pages of origins to regularly record origins
This replaces sending one page with all listed origins, which is brittle:
when something goes wrong during the listing, the lister currently records nothing.
2023-08-04 11:09:58 +02:00
Antoine R. Dumont (@ardumont)
1f27250694
lister.pattern: Make batch record parametric and test it
This adds a test around the batch recording behavior to ensure it's not dropped by
mistake.
2023-08-01 15:06:21 +02:00
Antoine R. Dumont (@ardumont)
920ed0d529
lister.pattern: Restore flushing origin batch in the scheduler
Prior to this commit, the newly introduced check on url validity was consuming the
stream of origins. In effect, this would no longer write origin records regularly.

For all listers, this meant origins were flushed only at the end of the listing,
which could take a while for some (e.g. the packagist lister has currently been running
for more than 12h without writing anything in the scheduler).
2023-08-01 10:04:48 +02:00
Antoine R. Dumont (@ardumont)
e91e0bf09c
cgit: Allow url to be optional
Some cgit instances are at a domain's root path so we can build their url directly from
their 'instance' parameter.

This further unifies the cli to register a lister and the cli to schedule the listed
origins from a forge.

[1]
```
https://git.kernel.org
https://source.codeaurora.org
https://git.trueelena.org
https://dev.sanctum.geek.nz
https://git.dpkg.org
https://anongit.mindrot.org
https://git.aurel32.net
https://gitweb.gentoo.org
https://git.joeyh.name
https://git.adrian.geek.nz
```

Refs. swh/devel/swh-lister#4693
2023-05-23 11:47:51 +02:00
Antoine R. Dumont (@ardumont)
19bdeefb14
lister: Allow lister to build url out of the instance parameter
This pushes the rather elementary logic within the lister's scope. This will simplify
and unify cli calls between the lister and scheduler clis. It will also reduce
erroneous operations, which can happen for example in the add-forge-now.

With the following, we will only have to provide the type and the instance, then
everything will be scheduled properly.

Refs. swh/devel/swh-lister#4693
2023-05-19 15:03:49 +02:00
Antoine Lambert
4f57e84450 Use http_retry decorator from swh.core.retry module
The http_retry decorator has been moved to swh-core package in order
to ease its reuse across swh packages.
2023-04-13 14:19:57 +02:00
Antoine Lambert
35871896b2 pattern: Improve handling of max_origins_per_page parameter
Instead of fully consuming the get_origins_from_page generator into
a list and truncating it, consume the generator origin by
origin and abort the process when the maximum number of origins per page
is reached.

Indeed, some non-trivial listers like the cgit one can perform costly
processing, an HTTP request for instance, for each origin in a page.
It is therefore better not to consume the full generator at once, to
avoid such side effects.
2023-03-21 16:56:48 +01:00
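A minimal sketch of the idea described in commit 35871896b2 above, not the actual swh-lister code. The `get_origins_from_page` generator here is a stand-in, and `take_at_most` and the `processed` list are hypothetical names used for illustration. It shows that consuming the generator lazily with `itertools.islice` stops before any costly per-origin work is done past the limit.

```python
from itertools import islice
from typing import Iterator, List, Optional

# Records which items triggered the (simulated) costly per-origin work.
processed: List[int] = []


def get_origins_from_page(page: List[int]) -> Iterator[str]:
    """Stand-in for a lister method doing costly work for each origin."""
    for item in page:
        processed.append(item)  # e.g. an HTTP request in a real lister
        yield f"https://example.org/repo{item}"


def take_at_most(origins: Iterator[str], max_origins: Optional[int]) -> List[str]:
    """Consume origins one by one and stop once max_origins is reached.

    Passing None means no limit (hypothetical convention for this sketch).
    """
    return list(islice(origins, max_origins))


limited = take_at_most(get_origins_from_page(list(range(10))), 3)
```

With a limit of 3, only the first three items of the page are ever processed; building a full list first and truncating it would have processed all ten.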
Nicolas Dandrimont
64267f8f50 Add a flag to not enable origins listed by a lister
This cuts down one more manual step in the add forge now validation
process: we can add the relevant origins to the staging scheduler
without enabling them at all.
2022-12-05 14:53:42 +01:00
Nicolas Dandrimont
b815737054 Add built-in page and origin count limit to listers
This will allow more automation of the staging add forge now process:
for known-good listers, we can limit the number of origins being
processed and reduce the amount of manual steps taken for each instance.
2022-12-05 14:53:42 +01:00
Valentin Lorentz
8ea4200909 Validate origin URLs before sending to the scheduler
2022-11-04 15:58:45 +01:00
Antoine R. Dumont (@ardumont)
92d494261f
lister: Make sure lister that requires github tokens can use it
Deploying the nixguix lister, I realized that even though the credentials configuration
is properly set for all listers, the listers actually requiring github origin
canonicalization do not have access to the github credentials. They are lost in the
constructor, which only keeps the lister's own credentials. This currently results in
listers being rate-limited.

This commit fixes it by pushing the self.github_session instantiation in the constructor
when the lister explicitly requires the github session. Hence lifting the rate limit
for maven, packagist, nixguix, and github listers.

Related to infra/sysadm-environment#4655
2022-10-26 17:23:40 +02:00
Antoine Lambert
8d85b2e4e8 pattern: Ensure accurate origin counts returned by run method
Previously, the run method was returning the total count of ListedOrigin
objects sent to scheduler database.

However, some listers can send multiple ListedOrigin objects for a given
origin URL during the listing process, for instance when an origin is
contained in multiple pages (e.g. gogs listing) or when the listing
is gathering multiple versions of an origin spread across multiple
pages (e.g. maven listing).

This change ensures an accurate count of listed origins by maintaining
a set of origin URLs associated with the sent ListedOrigin objects.
2022-09-29 11:14:08 +02:00
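A small sketch of the counting approach described in commit 8d85b2e4e8 above, under simplifying assumptions: origins are plain URL strings rather than ListedOrigin objects, and `count_unique_origins` is a hypothetical helper, not a function from swh-lister.

```python
from typing import Iterable, List, Set


def count_unique_origins(pages: Iterable[List[str]]) -> int:
    """Count distinct origin URLs seen across all pages."""
    seen: Set[str] = set()
    for page in pages:
        seen.update(page)  # duplicates across pages collapse in the set
    return len(seen)


pages = [
    ["https://example.org/a", "https://example.org/b"],
    ["https://example.org/b", "https://example.org/c"],  # "b" appears twice
]
total_sent = sum(len(page) for page in pages)
unique = count_unique_origins(pages)
```

Here four entries are sent in total, but only three distinct origins are counted.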
Antoine Lambert
d5c30a3ce3 Update value of User-Agent HTTP request header used by listers
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.

The following template is used to generate it:

"Software Heritage <lister_name> lister v<swh-lister version>
 (+https://www.softwareheritage.org/contact)"
2022-09-26 10:48:40 +02:00
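As an illustration of the template quoted in commit d5c30a3ce3 above, the header value could be built as follows. This is a sketch only: `build_user_agent` is a hypothetical name, and it assumes the two wrapped lines of the template are joined by a single space.

```python
# Template from the commit message; joining the wrapped lines with a single
# space is an assumption of this sketch.
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version} "
    "(+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name: str, version: str) -> str:
    """Fill the template with a lister name and a swh-lister version."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)


user_agent = build_user_agent("cgit", "1.2.3")
```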
Antoine Lambert
db6ce12e9e Refactor and deduplicate HTTP requests code in listers
Numerous listers were using the same page_request method or equivalent
in their implementation so prefer to deduplicate that code by adding
an http_request method in base lister class: swh.lister.pattern.Lister.

That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting; an HTTPError
is also raised if a request ends up with an error.

All listers using that new method now benefit from request retries when
an HTTP error occurs thanks to the use of the http_retry decorator.
2022-09-26 10:48:40 +02:00
Antoine Lambert
6c12350863 pattern: Use URL network location as instance name when not provided
Make the instance parameter of the base pattern lister optional and set
lister name to URL network location when not provided.

It simplifies lister creation when the associated forge type has a lot of
instances in the wild (e.g. gitlab or cgit) while giving more details
about the listed forge instance.

Also process listers for forges with multiple instances (cgit, gitea,
gitlab, phabricator and tuleap) to ensure URL network location will be
used when instance parameter is not provided.

Related to T3403
2021-07-13 12:33:49 +02:00
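The fallback described in commit 6c12350863 above can be sketched with the standard library. `resolve_instance` is a hypothetical helper name; the actual logic lives in the lister constructor and may differ in detail.

```python
from typing import Optional
from urllib.parse import urlparse


def resolve_instance(url: str, instance: Optional[str] = None) -> str:
    """Return the given instance name, or the URL network location if absent."""
    if instance is not None:
        return instance
    return urlparse(url).netloc


default_name = resolve_instance("https://git.kernel.org/")
explicit_name = resolve_instance("https://git.kernel.org/", instance="kernel")
```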
Valentin Lorentz
40e1916510 Fix various Sphinx warnings
2021-04-13 21:56:08 +02:00
Antoine R. Dumont (@ardumont)
003cf5491f
pattern: Bump packet split to chunk of 1000 records
Listers like github and bitbucket should not be impacted as they already list 1000
records per page.
2021-01-29 16:55:29 +01:00
Antoine R. Dumont (@ardumont)
0ad37740d9
pattern: Make lister flush regularly origins to scheduler
Since origins is a generator, the previous behavior consumed the whole
generator before sending the records.

This groups origins and sends them in batches of 100 to the scheduler for writing.

Related to T3003
2021-01-28 16:52:03 +01:00
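A minimal sketch of the batching behavior described in commit 0ad37740d9 above. The `grouper` helper and the `recorded` list are illustrative assumptions: a real lister would write each batch to the scheduler instead of appending it to a list.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def grouper(iterable: Iterable[T], n: int) -> Iterator[List[T]]:
    """Yield successive lists of at most n items without reading ahead."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, n))
        if not batch:
            return
        yield batch


recorded: List[List[int]] = []
for batch in grouper(range(250), 100):
    # Stand-in for a write to the scheduler: each batch is flushed as soon
    # as it is complete, so partial progress survives a later failure.
    recorded.append(batch)

batch_sizes = [len(batch) for batch in recorded]
```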
Antoine Lambert
9fd91f007d pattern: Fix and improve config overriding in from_configfile method
Fix error when a configuration value loaded from a config file is also
given as keyword parameter to the from_configfile method.

Override configuration loaded from config file only if the provided
value is not None.
2021-01-18 17:55:53 +01:00
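The override rule from commit 9fd91f007d above, sketched with plain dictionaries. `merge_config` is a hypothetical helper, not the signature of `from_configfile`; it only illustrates that keyword values replace file values when they are not None.

```python
from typing import Any, Dict


def merge_config(file_config: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Overlay keyword overrides on file_config, ignoring None values."""
    merged = dict(file_config)  # copy, so the loaded config stays untouched
    merged.update({key: value for key, value in overrides.items() if value is not None})
    return merged


file_config = {"url": "https://example.org", "instance": "from-file"}
merged = merge_config(file_config, url=None, instance="from-kwargs")
```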
Nicolas Dandrimont
734901747b Implement a base pattern for listers with no state storage
2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
f1eabc5283 Add a helper to instantiate a new-style lister from a config file
This helper will be used in the task entry points.
2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
525fc0102d Hook up listers implemented with the new pattern to the CLI
We stop depending on the ListerBase implementation. The main hoop we're jumping
through is the config override mechanism in swh.lister.get_lister, as it's
really specific to the ListerBase `override_config` argument, which is dropped in
pattern.Lister (in favor of explicit arguments at lister instantiation).

We implement a small shim in swh.lister.pattern.Lister to give
backwards-compatibility for the new pattern to get_lister.

This generic configuration override mechanism will probably be completely
removed when the configuration mechanism is reworked. We'll see.
2021-01-11 11:00:29 +01:00
Nicolas Dandrimont
9e083c1eea Introduce a simpler base pattern for lister implementations.
This new pattern uses the lister support features introduced in swh.scheduler to
replace the database management done in previous iterations of the listers.
2021-01-11 11:00:29 +01:00