swh-lister

Author	SHA1	Message	Date
Antoine Lambert	eadb704494	pattern: Ensure termination date is set at the end of listing process Previously it could be set by any call to the `set_state_in_scheduler` method. This was leading to side effects on the save bulk lister while updating the scheduler state when encountering an invalid or not found origin, and thus the listing failed. Fixes #4712.	2024-10-24 12:33:40 +02:00
Antoine Lambert	0e1093e308	pattern: Add first_visits_queue_prefix parameter to Lister constructor It enables to declare a lister whose first visits of listed origins must be scheduled with high priority. Related to swh/devel/swh-scheduler#4687.	2024-10-14 15:03:42 +02:00
Antoine Lambert	7609ebf7e1	pattern: Store termination date to scheduler database at end of listing It enables to track last lister execution date and will be used to schedule first visits with high priority for listed origins. Related to swh/devel/swh-scheduler#4687.	2024-10-14 15:03:28 +02:00
Antoine R. Dumont (@ardumont)	b02144b4f9	packagist: Yield pages of origins to regularly record origins Instead of sending one page with all origins listed, which is brittle. When something goes wrong during the listing, the lister currently records nothing.	2023-08-04 11:09:58 +02:00
Antoine R. Dumont (@ardumont)	1f27250694	lister.pattern: Make batch record parametric and test it This adds a test around the batch recording behavior to ensure it's not dropped by mistake.	2023-08-01 15:06:21 +02:00
Antoine R. Dumont (@ardumont)	920ed0d529	lister.pattern: Restore flushing origin batch in the scheduler Prior to this commit, the newly introduced check on url validity was consuming the stream of origins. In effect, this would no longer write origin records regularly. For all listers, that would translate to flush origins only at the end of the listing which could take a while for some (e.g. packagist lister has been running for more than 12h currently without writing anything in the scheduler).	2023-08-01 10:04:48 +02:00
Antoine R. Dumont (@ardumont)	e91e0bf09c	cgit: Allow url to be optional Some cgit instances are at a domain's root path so we can build their url directly from their 'instance' parameter. This unifies further the cli to register a lister and the cli to schedule the listed origins from a forge. [1] ``` https://git.kernel.org https://source.codeaurora.org https://git.trueelena.org https://dev.sanctum.geek.nz https://git.trueelena.org https://git.dpkg.org https://anongit.mindrot.org https://git.aurel32.net https://gitweb.gentoo.org https://git.joeyh.name https://git.adrian.geek.nz ``` Refs. swh/devel/swh-lister#4693	2023-05-23 11:47:51 +02:00
Antoine R. Dumont (@ardumont)	19bdeefb14	lister: Allow lister to build url out of the instance parameter This pushes the rather elementary logic within the lister's scope. This will simplify and unify cli call between lister and scheduler clis. This will also allow to reduce erroneous operations which can happen for example in the add-forge-now. With the following, we will only have to provide the type and the instance, then everything will be scheduled properly. Refs. swh/devel/swh-lister#4693	2023-05-19 15:03:49 +02:00
Antoine Lambert	4f57e84450	Use http_retry decorator from swh.core.retry module The http_retry decorator has been moved to swh-core package in order to ease its reuse across swh packages.	2023-04-13 14:19:57 +02:00
Antoine Lambert	35871896b2	pattern: Improve handling of max_origins_per_page parameter Instead of fully consuming the get_origins_from_page generator into a list and truncate it, prefer to consume the generator origin per origin and abort the process when the max number of origin per page is reached. Indeed some non trivial listers like the cgit one can perform costly processing, HTTP request for instance, for each origin in a page. So better not consuming the full generator in a row to avoid such side effects.	2023-03-21 16:56:48 +01:00
Nicolas Dandrimont	64267f8f50	Add a flag to not enable origins listed by a lister This cuts down one more manual step in the add forge now validation process: we can add the relevant origins to the staging scheduler without enabling them at all.	2022-12-05 14:53:42 +01:00
Nicolas Dandrimont	b815737054	Add built-in page and origin count limit to listers This will allow more automation of the staging add forge now process: for known-good listers, we can limit the number of origins being processed and reduce the amount of manual steps taken for each instance.	2022-12-05 14:53:42 +01:00
Valentin Lorentz	8ea4200909	Validate origin URLs before sending to the scheduler	2022-11-04 15:58:45 +01:00
Antoine R. Dumont (@ardumont)	92d494261f	lister: Make sure lister that requires github tokens can use it Deploying the nixguix lister, I realized that even though the credentials configuration is properly set for all listers, the listers actually requiring github origin canonicalization do not have access to the github credentials. It's lost during the constructor to only focus on the lister's credentials. Which currently translates to listers being rate-limited. This commit fixes it by pushing the self.github_session instantiation in the constructor when the lister explicitly requires the github session. Hence lifting the rate limit for maven, packagist, nixguix, and github listers. Related to infra/sysadm-environment#4655	2022-10-26 17:23:40 +02:00
Antoine Lambert	8d85b2e4e8	pattern: Ensure accurate origin counts returned by run method Previously, the run method was returning the total count of ListedOrigin objects sent to scheduler database. However, some listers can send multiple ListedOrigin objects for a given origin URL during the listing process, for instance when an origin is contained in multiple pages (e.g. gogs listing) or when the listing is gathering multiple versions of an origin spread across multiple pages (e.g. maven listing). This change ensures an accurate count of listed origins by maintaining a set of origin URLs associated to the sent ListedOrigin objects.	2022-09-29 11:14:08 +02:00
Antoine Lambert	d5c30a3ce3	Update value of User-Agent HTTP request header used by listers That HTTP header value will now contain the lister name but also a link to our contact form in order for sysadmins to easily reach us if needed. The following template is used to generate it: "Software Heritage <lister_name> lister v<swh-lister version> (+https://www.softwareheritage.org/contact)"	2022-09-26 10:48:40 +02:00
Antoine Lambert	db6ce12e9e	Refactor and deduplicate HTTP requests code in listers Numerous listers were using the same page_request method or equivalent in their implementation so prefer to deduplicate that code by adding an http_request method in base lister class: swh.lister.pattern.Lister. That method simply wraps a call to requests.Session.request and logs some useful info for debugging and error reporting, also an HTTPError will be raised if a request ends up with an error. All listers using that new method now benefit of requests retry when an HTTP error occurs thanks to the use of the http_retry decorator.	2022-09-26 10:48:40 +02:00
Antoine Lambert	6c12350863	pattern: Use URL network location as instance name when not provided Make the instance parameter of the base pattern lister optional and set lister name to URL network location when not provided. It simplifies lister creation when associated forge type have a lot of instances in the wild (e.g. gitlab or cgit) while giving more details about the listed forge instance. Also process listers for forge with multiple instances (cgit, gitea, gitlab, phabricator and tuleap) to ensure URL network location will be used when instance parameter is not provided. Related to T3403	2021-07-13 12:33:49 +02:00
Valentin Lorentz	40e1916510	Fix various Sphinx warnings	2021-04-13 21:56:08 +02:00
Antoine R. Dumont (@ardumont)	003cf5491f	pattern: Bump packet split to chunk of 1000 records Listers like github and bitbucket should not be impacted as they already list 1000 records per page.	2021-01-29 16:55:29 +01:00
Antoine R. Dumont (@ardumont)	0ad37740d9	pattern: Make lister flush regularly origins to scheduler As origins is a generator, the previous behavior would try to consume the overall generator to send the records. This groups and sends batch of 100 origins to the scheduler for writing. Related to T3003	2021-01-28 16:52:03 +01:00
Antoine Lambert	9fd91f007d	pattern: Fix and improve config overriding in from_configfile method Fix error when a configuration value loaded from a config file is also given as keyword parameter to the from_configfile method. Override configuration loaded from config file only if the provided value is not None.	2021-01-18 17:55:53 +01:00
Nicolas Dandrimont	734901747b	Implement a base pattern for listers with no state storage	2021-01-11 11:00:29 +01:00
Nicolas Dandrimont	f1eabc5283	Add a helper to instantiate a new-style lister from a config file This helper will be used in the task entry points.	2021-01-11 11:00:29 +01:00
Nicolas Dandrimont	525fc0102d	Hook up listers implemented with the new pattern to the CLI We stop depending on the ListerBase implementation. The main hoop we're jumping through is the config override mechanism in swh.lister.get_lister, as it's really specific to the ListerBase `override_config` argument, which is dropped in pattern.Lister (in favor of explicit arguments at lister instantiation). We implement a small shim in swh.lister.pattern.Lister to give backwards-compatibility for the new pattern to get_lister. This generic configuration override mechanism will probably be completely removed when the configuration mechanism is reworked. We'll see.	2021-01-11 11:00:29 +01:00
Nicolas Dandrimont	9e083c1eea	Introduce a simpler base pattern for lister implementations. This new pattern uses the lister support features introduced in swh.scheduler to replace the database management done in previous iterations of the listers.	2021-01-11 11:00:29 +01:00

26 commits
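
Several of the commits above (0ad37740d9, 003cf5491f, 920ed0d529, 35871896b2 and 8d85b2e4e8) revolve around one idea: consume the origin generator lazily, flush it to the scheduler in fixed-size batches, and count unique origin URLs rather than records sent. The following is only a minimal sketch of that idea, not the actual swh.lister.pattern code. The names `grouper`, `limit_origins` and `record_in_batches` are hypothetical, origins are represented as plain URL strings, and the `record` callback stands in for whatever call writes a batch to the scheduler.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Set


def grouper(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Lazily yield lists of at most ``batch_size`` items (hypothetical helper)."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def limit_origins(origins: Iterable[str], max_origins: int) -> Iterator[str]:
    """Yield at most ``max_origins`` items without consuming the rest.

    Stopping early matters when producing each origin is costly,
    for example when it triggers an HTTP request.
    """
    return islice(origins, max_origins)


def record_in_batches(
    origins: Iterable[str],
    record: Callable[[List[str]], None],
    batch_size: int = 1000,
) -> int:
    """Send origins to ``record`` batch by batch; return the unique URL count."""
    seen_urls: Set[str] = set()
    for batch in grouper(origins, batch_size):
        # Flush each batch as soon as it is complete, so that a failure
        # late in the listing does not lose everything listed so far.
        record(batch)
        seen_urls.update(batch)
    return len(seen_urls)
```

With this shape, a listing that fails midway has already recorded every completed batch, and an origin that appears in several pages is counted once.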