Previously it could be set by any call to the `set_state_in_scheduler`
method.
This led to side effects on the save-bulk lister: its scheduler state was
updated when an invalid or not-found origin was encountered, and thus the
listing failed.
Fixes #4712.
It enables tracking the last lister execution date and will be used to schedule
first visits with high priority for listed origins.
Related to swh/devel/swh-scheduler#4687.
The sourceforge lister sends various HTTP requests to get info about a
project, for instance to get the branch name of a Bazaar project.
HTTP errors occurring during these steps were discarded so the listing could
continue, but connection errors were not, and as a consequence the listing
failed when such an error was encountered.
Currently, the legacy Bazaar project hosted on SourceForge seems to be down
and connection errors are raised when attempting to fetch branch names, so the
lister does not process all projects as it crashes mid-flight.
This new, special-purpose lister verifies a list of origins to archive
provided by users (for instance through the Web API).
Its purpose is to avoid polluting the scheduler database with origins that
cannot be loaded into the archive.
Each origin is identified by a URL and a visit type. For a given visit type,
the lister checks whether the origin URL can be found and whether the visit
type is valid.
The supported visit types are those for VCSs (bzr, cvs, hg, git and svn) plus
the one for loading tarball content into the archive.
Accepted origins are inserted or upserted in the scheduler database.
Rejected origins are stored in the lister state.
Related to #4709.
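A hedged sketch of the origin check described in this entry (the helper and
the tarball visit type name are assumptions, and the real verification is more
thorough)::

    import requests

    # "tarball-directory" as the tarball visit type name is an assumption
    SUPPORTED_VISIT_TYPES = {"bzr", "cvs", "hg", "git", "svn", "tarball-directory"}

    def check_origin(url: str, visit_type: str) -> bool:
        """Return True if the visit type is supported and the origin URL is reachable."""
        if visit_type not in SUPPORTED_VISIT_TYPES:
            return False
        try:
            response = requests.head(url, allow_redirects=True, timeout=30)
            return response.ok
        except requests.RequestException:
            return False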
Instead of having a single crate and its versions info per page,
prefer to have up to 1000 crates per page to significantly speed up
the listing process.
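For illustration, batching crates into pages of up to 1000 entries could be
done along these lines (a sketch, not the lister's code)::

    from itertools import islice
    from typing import Iterable, Iterator, List

    def pages_of(crates: Iterable[dict], size: int = 1000) -> Iterator[List[dict]]:
        """Group crates into pages of up to ``size`` entries."""
        iterator = iter(crates)
        while page := list(islice(iterator, size)):
            yield page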
Previously, the lister state was recorded regardless of whether errors
occurred when listing crates, as the finalize method is called even if an
exception is raised during listing.
As a consequence, some crates could be missed since the incremental listing
restarts from the dump date of the last processed crate database.
So ensure all crates have been processed by the lister before recording
its state.
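A minimal sketch of that guard (attribute and helper names are illustrative of
the base lister pattern, not taken from the code)::

    class CratesLister:
        listing_finished = False

        def get_pages(self):
            for page in self.db_dump_pages():  # hypothetical helper
                yield page
            # only reached once every page has been yielded without error
            self.listing_finished = True

        def finalize(self):
            # finalize() runs even when get_pages() raised, so guard the update
            if self.listing_finished:
                self.updated = True  # triggers recording of the lister state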
packaging.version.parse is dedicated to parsing Python package version
numbers, but crate versions do not necessarily follow Python version
number conventions and thus some crate versions cannot be parsed.
Prefer to use looseversion.LooseVersion2 instead, which is a drop-in
replacement for the deprecated distutils.version.LooseVersion and can
parse all kinds of version numbers.
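For example (a hedged illustration; the crate version strings are made up)::

    from looseversion import LooseVersion2
    from packaging.version import InvalidVersion, parse

    versions = ["1.0.0", "1.0.0-x.7.z.92", "2.0.0-rc.1"]

    # packaging rejects SemVer pre-release identifiers such as "x.7.z.92"
    try:
        parse("1.0.0-x.7.z.92")
    except InvalidVersion:
        pass

    # LooseVersion2 accepts them, so picking the latest version keeps working
    latest = max(versions, key=LooseVersion2)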
The latest tenacity release introduces internal changes that broke the
mocking of sleep calls in tests.
Fix it by directly mocking time.sleep (which was not working previously).
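A minimal sketch of such a test fixture, assuming pytest-mock is available::

    import pytest

    @pytest.fixture
    def mocked_sleep(mocker):
        # patch time.sleep itself rather than tenacity internals,
        # which change between releases
        return mocker.patch("time.sleep")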
The Gitea API returns the next pagination link with all query parameters
provided to an API request.
As we were also passing a dict of fixed query parameters to the page_request
method, some query parameters ended up having multiple instances in the URL
used to fetch a new page of repositories data. So each time a new page was
requested, new instances of these parameters were appended to the URL, which
could result in a really long URL if the number of pages to retrieve is high
and make the request fail.
Also remove a debug log already present in the http_request method.
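A hedged sketch of the fix (simplified; only the method names mentioned above
are from the lister)::

    from urllib.parse import urlparse

    class GiteaLister:  # simplified sketch
        def page_request(self, url, params):
            if urlparse(url).query:
                # the "next" link already embeds every query parameter of the
                # previous request, so do not pass the fixed parameters again
                params = {}
            response = self.http_request(url, params=params)
            return response.json(), response.links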
Redirection URLs can be long and quite obscure in some cases (GitHub CDN
for instance), so make sure to use the redirected URL as the origin URL.
Related to swh/meta#5090.
As the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors related to
beautifulsoup4 typing were reported.
As the return type of the find method of bs4 is the union
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing, which is not great.
So prefer to use the select_one method instead, where a simple None check
is enough to ensure correct typing as it returns Optional[Tag].
In a similar manner, replace uses of the find_all method with the select
method.
This also has the advantage of simplifying the code.
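For illustration (a standalone snippet, not taken from the lister)::

    from bs4 import BeautifulSoup, Tag

    soup = BeautifulSoup('<a class="pkg" href="/p/foo">foo</a>', "html.parser")

    # find() returns Tag | NavigableString | None: an isinstance check is needed
    link = soup.find("a", attrs={"class": "pkg"})
    if isinstance(link, Tag):
        href = link["href"]

    # select_one() returns Optional[Tag]: a simple None check is enough
    link = soup.select_one("a.pkg")
    if link is not None:
        href = link["href"]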
Some Guix packages correspond to subset exports of a Subversion source
tree at a given revision, typically the TeX Live ones.
In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export, but also use a unique origin URL
for each package to archive, as otherwise the same one would be used
and only a single package would be archived.
Related to swh/infra/sysadm-environment#5263.
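A hedged sketch of what such a listed origin could look like; the svn_paths
parameter name and the way the origin URL is made unique are assumptions based
on the description above, not the actual code::

    from swh.scheduler.model import ListedOrigin

    listed_origin = ListedOrigin(
        lister_id=lister.lister_obj.id,  # placeholder for the lister's id
        visit_type="svn-export",
        # a unique origin URL per package, derived from the exported sub-path
        url="https://svn.example.org/texlive/trunk?p=texmf-dist/tex/latex/foo",
        extra_loader_arguments={
            "svn_paths": ["texmf-dist/tex/latex/foo"],  # assumed parameter name
        },
    )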
In addition to query parameters, also check whether any part of the URL path
contains a tarball filename.
This fixes the detection of some tarball URLs provided in the Guix manifest.
Related to swh/meta#3781.
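An illustrative version of that check (a sketch; the real implementation
handles more cases)::

    from urllib.parse import urlparse

    TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")

    def url_contains_tarball_filename(url: str) -> bool:
        parsed = urlparse(url)
        # look at every path segment and query parameter, not only the last one
        candidates = parsed.path.split("/") + parsed.query.split("&")
        return any(part.endswith(TARBALL_EXTENSIONS) for part in candidates)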
Commit c2402f405f renamed the entry points from `lister.*` without
updating the rest of the framework. Revert the changes (and sort the
list alphabetically).
Use another API endpoint that helps the lister to be stateful.
The API endpoint used needs a ``since`` value that represents a
sequential index in the history.
The ``all_packages_count`` state stores a count which will be
used as the ``since`` argument on the next run.
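A minimal sketch of such a state object (the dataclass shape follows the usual
lister state pattern; details may differ)::

    from dataclasses import dataclass

    @dataclass
    class ListerState:
        # number of packages seen so far; reused as the ``since`` index next run
        all_packages_count: int = 0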
The Elm lister lists Elm package origins from the Elm lang registry.
It uses an HTTP API endpoint to list package origins.
Origins are GitHub repositories; releases take advantage of the
GitHub release API.
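For illustration, listing package origins from the registry could look roughly
like this (the endpoint and response shape are assumptions about the public
Elm package registry, not taken from the lister)::

    import requests

    response = requests.get("https://package.elm-lang.org/all-packages", timeout=30)
    response.raise_for_status()
    for package_name in response.json():  # names look like "<github-user>/<repo>"
        origin_url = f"https://github.com/{package_name}"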
Guix now provides a "submodule" info in the sources.json file it
produces, so exploit it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
required.
Related to swh/devel/swh-loader-git#4751.
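A hedged sketch of how that flag could be propagated (names other than
"submodule" and "submodules" are illustrative)::

    def loader_arguments(source: dict) -> dict:
        # ``source`` is one "git" entry of the sources.json manifest
        args = {"ref": source.get("git_ref")}  # illustrative names
        if source.get("submodule"):
            # only ask the git-checkout loader to fetch submodules when needed
            args["submodules"] = True
        return args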
Add a state to the lister to store the ``last_seen_commit`` as a Git
commit hash.
Use Dulwich to get a walker over the Git commits made since
``last_seen_commit``, if any.
For each commit, detect whether it is a new package or a new package
version commit and return its origin with the commit date as
last_update.
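A hedged sketch of that walk with Dulwich (repository handling and state
access are simplified)::

    from datetime import datetime, timezone
    from typing import Iterator, Optional, Tuple

    from dulwich.repo import Repo

    def walk_new_commits(
        repo_path: str, last_seen_commit: Optional[str]
    ) -> Iterator[Tuple[str, datetime]]:
        repo = Repo(repo_path)  # assumed local clone of the registry repository
        exclude = [last_seen_commit.encode()] if last_seen_commit else None
        # oldest-first walk over the commits not seen by a previous run
        for entry in repo.get_walker(exclude=exclude, reverse=True):
            commit = entry.commit
            last_update = datetime.fromtimestamp(commit.commit_time, tz=timezone.utc)
            yield commit.id.decode(), last_update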
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read an RDS file from Python instead of rpy2.
As swh-lister was still packaged for Debian at the time, the choice was made
to use rpy2 instead, as a Debian package is available for it while there is
none for pyreadr.
Now that Debian packaging has been dropped for swh-lister, we can reinstate
the pyreadr-based implementation, which has the advantages of being faster
and not depending on the R language runtime.
Related to swh/meta#1709.
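For illustration, reading such an RDS dump with pyreadr looks roughly like
this (the file path is illustrative)::

    import pyreadr

    result = pyreadr.read_r("/tmp/packages.rds")  # returns an OrderedDict
    # an RDS file holds a single unnamed R object, here a data frame
    packages = result[None]
    for package in packages.itertuples():
        ...  # build an origin per CRAN package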
That makes the current loader ingestion fail, as this must be an exact value
(when provided, it is checked against the download operation).
Refs. swh/infra/sysadm-environment#4746
In order to simplify the testing of listers, allow calling the run command
of the swh-lister CLI without a scheduler configuration. In that case, a
temporary scheduler instance with a postgresql backend is created and used.
It enables easily testing a lister with the following command:
$ swh -l DEBUG lister run <lister_name> url=<forge_url>
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` property will always be set. So we need to ensure it is
not `None` before looking for the return code, for example.
This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better match this particular aspect. See:
https://github.com/python/typeshed/pull/10875
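For example, a defensive check along these lines::

    import requests

    def status_code_of(url: str) -> int:
        """Return the HTTP status code, tolerating error responses (illustrative)."""
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.status_code
        except requests.HTTPError as http_error:
            # the ``response`` attribute may be None, so check it before use
            if http_error.response is not None:
                return http_error.response.status_code
            raise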