As the types-beautifulsoup4 package gets installed in the swh virtualenv
(it is a swh-scanner test dependency), some mypy errors were reported
related to beautifulsoup4 typing.
As the return type of the bs4 find method is the union
Tag | NavigableString | None, isinstance calls must be used to ensure
proper typing, which is not great.
So prefer the select_one method instead, which returns Optional[Tag]:
a simple None check is enough to ensure typing is correct.
In a similar manner, replace uses of the find_all method by the select
method. This also has the advantage of simplifying the code.
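A minimal sketch of the difference, with an illustrative HTML snippet
(not taken from the lister code):

    from bs4 import BeautifulSoup, Tag

    soup = BeautifulSoup('<a class="pkg" href="/foo">foo</a>', "html.parser")

    # With find, the result type is Tag | NavigableString | None, so an
    # isinstance call is needed before accessing Tag attributes:
    link = soup.find("a", attrs={"class": "pkg"})
    if isinstance(link, Tag):
        print(link["href"])

    # With select_one, the result type is Optional[Tag]: a simple None
    # check is enough.
    link = soup.select_one("a.pkg")
    if link is not None:
        print(link["href"])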
Some Guix packages correspond to subset exports of a subversion source
tree at a given revision, typically the TeX Live ones.
In that case, we must pass an extra parameter to the svn-export loader
to specify the sub-paths to export, but also use a unique origin URL
for each package to archive: otherwise the same URL would be shared
and only a single package would be archived.
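A hypothetical sketch of the resulting listed origins; the parameter
name, URL scheme and package paths below are illustrative assumptions,
not the exact loader API:

    svn_url = "svn://example.org/texlive/tags/texlive-2023.0"
    packages = {
        "texlive-amsmath": ["/Master/texmf-dist/tex/latex/amsmath"],
        "texlive-geometry": ["/Master/texmf-dist/tex/latex/geometry"],
    }

    for name, paths in packages.items():
        listed_origin = {
            # A unique origin URL per package: reusing the bare svn_url
            # would collapse all packages into a single archived origin.
            "url": f"{svn_url}?package={name}",
            "visit_type": "svn-export",
            "extra_loader_arguments": {
                # Sub-paths of the source tree to export (assumed name).
                "svn_paths": paths,
            },
        }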
Related to swh/infra/sysadm-environment#5263.
In addition to query parameters, also check if any part of the URL path
contains a tarball filename.
This fixes the detection of some tarball URLs provided in the Guix manifest.
Related to swh/meta#3781.
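A minimal sketch of the idea; the helper name and extension list are
illustrative:

    from urllib.parse import urlparse

    TARBALL_EXTENSIONS = (".tar.gz", ".tgz", ".tar.bz2", ".tar.xz", ".zip")

    def is_tarball_url(url: str) -> bool:
        parsed = urlparse(url)
        # Check query parameter values, as before...
        candidates = [v for _, _, v in
                      (p.partition("=") for p in parsed.query.split("&"))]
        # ...but also every component of the URL path.
        candidates += parsed.path.split("/")
        return any(c.endswith(TARBALL_EXTENSIONS) for c in candidates)

    # A tarball filename buried in the path, not in a query parameter:
    assert is_tarball_url("https://example.org/dl/foo-1.0.tar.gz/download")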
Use another API endpoint that helps the lister to be stateful.
The endpoint needs a ``since`` value that represents a sequential
index in the history.
The ``all_packages_count`` state stores a count which is used as the
``since`` argument on the next run.
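A minimal sketch of the stateful pagination; the endpoint URL and HTTP
method below are assumptions about the registry API, for illustration
only:

    import requests

    all_packages_count = 42  # persisted in the lister state from last run

    # Ask only for packages published after that sequential index.
    response = requests.post(
        f"https://package.elm-lang.org/all-packages/since/{all_packages_count}"
    )
    new_packages = response.json()

    # The updated count is stored back and becomes ``since`` next run.
    all_packages_count += len(new_packages)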
The Elm lister lists Elm package origins from the Elm lang registry.
It uses an HTTP API endpoint to list package origins.
Origins are GitHub repositories; releases take advantage of the GitHub
release API.
Guix now provides a "submodule" info in the sources.json file it
produces, so exploit it to set the new "submodules" parameter of
the git-checkout loader in order to retrieve submodules only when
required.
Related to swh/devel/swh-loader-git#4751.
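A sketch of how the flag might be forwarded to the loader; apart from
the "submodule"/"submodules" names quoted above, the field names are
assumptions:

    source = {
        "type": "git",
        "git_url": "https://example.org/repo.git",
        "submodule": True,  # new info provided in sources.json
    }

    listed_origin = {
        "url": source["git_url"],
        "visit_type": "git-checkout",
        "extra_loader_arguments": {
            # Only fetch submodules when the manifest says they exist.
            "submodules": source["submodule"],
        },
    }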
Add a state to the lister to store the ``last_seen_commit`` as a Git
commit hash.
Use Dulwich to build a Git commit walker starting from
``last_seen_commit`` if any.
For each commit, detect whether it is a new package or a new package
version commit, and return its origin with the commit date as
last_update.
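A minimal sketch of the incremental walk with Dulwich; the repository
path and state handling are illustrative, not the actual lister code:

    from dulwich.repo import Repo

    repo = Repo("General")  # local clone of the registry
    last_seen_commit = None  # or the hash persisted from the previous run

    # Exclude everything reachable from last_seen_commit so that only
    # new commits are walked, oldest first.
    exclude = [last_seen_commit] if last_seen_commit else None
    walker = repo.get_walker(exclude=exclude, reverse=True)

    for entry in walker:
        commit = entry.commit
        # entry.changes() tells which files were touched, from which a
        # new package or a new package version can be detected;
        # commit.commit_time becomes the origin's last_update.
        print(commit.id, commit.commit_time)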
The CRAN lister improvements introduced in 91e4e33 originally used pyreadr
to read an RDS file from Python instead of rpy2.
As swh-lister was still packaged for debian at the time, the choice was
made to use rpy2 instead, as a debian package is available for it but not
for pyreadr.
Now that debian packaging has been dropped for swh-lister, we can reinstate
the pyreadr based implementation, which has the advantages of being faster
and not depending on the R language runtime.
Related to swh/meta#1709.
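A minimal sketch of the pyreadr based reading; the file and column
names are illustrative:

    import pyreadr

    # read_r returns an OrderedDict of pandas dataframes; an RDS file
    # holds a single unnamed object, stored under the None key.
    result = pyreadr.read_r("packages.rds")
    df = result[None]

    for _, row in df.iterrows():
        print(row["Package"], row["Version"])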
That fails the current loader ingestion, as this must be an exact value
(when provided, it is checked against the download operation).
Refs. swh/infra/sysadm-environment#4746
In order to simplify the testing of listers, allow calling the run command
of the swh-lister CLI without scheduler configuration. In that case a
temporary scheduler instance with a postgresql backend is created and used.
This makes it easy to test a lister with the following command:
$ swh -l DEBUG lister run <lister_name> url=<forge_url>
The implementation of `HTTPError` in `requests` does not guarantee that
the `response` property will always be set. So we need to ensure it is
not `None` before looking for the status code, for example.
This also makes mypy checks pass again, as `types-requests` was updated
in 2.31.0.9 to better match this particular aspect. See:
https://github.com/python/typeshed/pull/10875
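A minimal sketch of the defensive check:

    import requests

    try:
        response = requests.get("https://example.org/api")
        response.raise_for_status()
    except requests.HTTPError as e:
        # e.response is typed Optional[Response]: guard before use.
        if e.response is not None and e.response.status_code == 404:
            pass  # e.g. skip this origin
        else:
            raise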
This module introduces the Julia lister.
It retrieves Julia package origins from the Julia General registry, a Git
repository made of per-package directories with TOML definition files.
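An illustrative excerpt of that layout (the General registry shards
packages under first-letter directories):

    General/
      J/
        JSON/
          Package.toml    # name, uuid and repository URL (the origin)
          Versions.toml   # one entry per released version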
Similar to cgit, there exist cases where git clone URLs for projects hosted
on a gitweb instance cannot be found when scraping project pages, or cannot
be easily derived from the gitweb instance root URL.
So add an optional base_git_url parameter enabling the computation of
correct clone URLs by appending project names to it.
Some gitweb instances can also display string prefixes before the git
clone URLs, so ensure to strip them to properly extract the URLs.
Related to swh/infra/sysadm-environment#5051.
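A sketch of the resulting fallback, with illustrative names:

    from typing import Optional

    base_git_url = "https://git.example.org/git"

    def clone_url(project_name: str, scraped_url: Optional[str]) -> str:
        if scraped_url:
            # Strip a display prefix such as "git clone " that some
            # instances show before the URL.
            return scraped_url.split()[-1]
        # Otherwise derive the clone URL from the configured base.
        return f"{base_git_url.rstrip('/')}/{project_name}"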
rstrip does not remove a string suffix (it strips any trailing characters
drawn from a given set), so use another way to extract the gitweb project
name.
This fixes the computation of some gitweb origin URLs.
Related to swh/infra/sysadm-environment#5050.
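The pitfall, illustrated:

    # rstrip removes any trailing characters drawn from the given set,
    # not the literal suffix:
    "frotz.git".rstrip(".git")   # -> "frotz", looks fine, but...
    "config.git".rstrip(".git")  # -> "conf": trailing "g" and "i" eaten too

    # A safe way to drop the suffix (str.removesuffix needs Python >= 3.9):
    def strip_suffix(name: str, suffix: str = ".git") -> str:
        return name[: -len(suffix)] if name.endswith(suffix) else name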
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.
Add a new test checking that lister classes have the mandatory parameters
declared in their constructors. The purpose is to avoid deployment issues
on staging or production environments, as celery tasks can fail to be
executed if mandatory parameters are not handled by listers.
Related to swh/infra/sysadm-environment#5030.
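A minimal sketch of such a test using introspection; the discovery of
the lister classes is elided and the names are illustrative:

    import inspect

    MANDATORY_PARAMS = {"scheduler", "url", "instance", "credentials"}

    def check_lister_constructor(lister_class) -> None:
        params = set(inspect.signature(lister_class.__init__).parameters)
        missing = MANDATORY_PARAMS - params
        assert not missing, (
            f"{lister_class.__name__} constructor is missing {missing}"
        )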
Previously, the lister relied on the CRANtools R module, but it has the
drawback of only listing the latest version of each registered package
in the CRAN registry.
In order to get all possible versions for each CRAN package, prefer to
exploit the content of the weekly dump of the CRAN database in RDS format.
To read the content of the RDS file from Python, the rpy2 package is used,
as it has the advantage of being packaged in debian.
Related to swh/meta#1709.
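A minimal sketch of the rpy2 based reading; the file and column names
are illustrative:

    import rpy2.robjects as robjects

    readRDS = robjects.r["readRDS"]
    db = readRDS("packages.rds")  # the weekly dump is an R data frame

    # Column access goes through the R object interface:
    package_names = list(db.rx2("Package"))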
As Red Hat based Linux distributions share the same type of package
repository, rework the fedora lister into a generic one that lists RPM
source packages and their versions from numerous distributions.
For a given distribution, the RPM lister fetches package metadata from a
list of release identifiers and a list of software components. Source
packages are then processed and relevant information is extracted to be
sent to the RPM loader.
When all releases and components have been processed, the lister collects
all versions for each package name and sends that information to the
scheduler, which will create RPM loading tasks afterwards.
Nevertheless, as there is no generic way to list all releases and components
for a given distribution, nor to guess the right URL to retrieve package
metadata from, this information needs to be manually provided to the lister
as input parameters. Some examples of those parameters for various
distributions can be found in the config directory of the lister (see the
sketch below).
Regarding the produced origin URLs, as there is no way to find valid HTTP
ones for all distributions, the same behavior as with the debian lister is
used: they have the form rpm://{instance}/packages/{package_name}, where
the instance variable corresponds to the name of the listed distribution,
such as Fedora, CentOS, or openSUSE.
Related to swh/meta#5011.
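An illustrative shape of those input parameters, expressed as a Python
dict; the exact keys are assumptions (see the lister's config directory
for the real examples):

    rpm_lister_params = {
        "url": "https://archives.fedoraproject.org/pub/archive/fedora/linux/",
        "instance": "Fedora",
        "releases": ["36", "37"],
        "components": ["Everything"],
    }
    # Produced origin URLs then look like:
    #   rpm://Fedora/packages/{package_name}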
Instead of sending one page with all origins listed, which is brittle.
When something goes wrong during the listing, the lister currently records
nothing, with or without retry (retry support being a matter for a future
version of swh.core).
This change skips the origin when such an error sporadically happens; it
should get picked up by another listing eventually.
The listing currently fails to finish when the GitHub server hangs up on
the process. Adding this behavior allows skipping the issue without
breaking the listing.
The current lister implementation retrieves very little metadata with the
hard-coded /p/ base URL (404 on almost all packages). The packagist API
implementation must have evolved since the initial implementation of the
lister (and the first deployment on staging).
Following the upstream documentation [1], it is sensible to first use the
/p2/ scheme, as it is the most performant on the packagist API side. The
lister then falls back to the /p2/ + ~dev URL scheme, then the /p/ scheme,
and finally the /packages/ base URL, whenever the previous result is either
not found or empty (which is different from "no modification since the last
visit").
It keeps the initial implementation behavior of stopping immediately if a
304 Not Modified is returned by the server.
[1] https://repo.packagist.org/apidoc
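A sketch of the fallback chain; the endpoint templates follow the
apidoc, the helper itself is illustrative (package names include the
vendor, e.g. "monolog/monolog"):

    import requests

    ENDPOINTS = [
        "https://repo.packagist.org/p2/{name}.json",
        "https://repo.packagist.org/p2/{name}~dev.json",
        "https://repo.packagist.org/p/{name}.json",
        "https://repo.packagist.org/packages/{name}.json",
    ]

    def fetch_package_metadata(name: str, last_visit: str):
        headers = {"If-Modified-Since": last_visit}
        for template in ENDPOINTS:
            response = requests.get(template.format(name=name), headers=headers)
            if response.status_code == 304:
                # No modification since the last visit: stop immediately.
                return None
            if response.ok and response.json():
                return response.json()
        return None  # not found or empty everywhere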
Prior to this commit, the newly introduced check on URL validity was
consuming the stream of origins. In effect, origin records were no longer
written regularly: for all listers, origins were flushed only at the end of
the listing, which could take a while for some (e.g. the packagist lister
has currently been running for more than 12h without writing anything in
the scheduler).
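The pitfall, sketched with hypothetical stand-ins for the real lister
helpers:

    def is_valid(url: str) -> bool:  # hypothetical stand-in
        return url.startswith("https://")

    urls = ["https://example.org/a", "not a url"]

    # Eager validation consumes the whole lazy stream up front, so no
    # origin record gets written until the very end:
    eager = [u for u in (url for url in urls) if is_valid(u)]

    # Streaming-friendly alternative: filter lazily so that records
    # keep being flushed page by page as the generator is iterated.
    def valid_origins(origins):
        for origin in origins:
            if is_valid(origin):
                yield origin

    for origin in valid_origins(url for url in urls):
        print(origin)  # written as soon as it is produced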
That lister is really close to the cgit and gitweb implementations, but the
DOM data is again structured differently, so this implementation stands on
its own.
Refs. swh/meta#5048
Gitiles instances deliberately return a malformed JSON output (JSON prefixed
with ``)]}'\n``) [2]. The lister deals with it to properly parse the JSON
response nonetheless: it drops the prefix and then parses the JSON.
If at some point upstream drops this prefix to return JSON directly, the
lister will be able to deal with that too. There are two tests, one with
the 'standard' gitiles format and another with standard JSON, to account
for both cases.
Refs. swh/meta#5045
[2] https://github.com/google/gitiles/issues/263
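A minimal sketch of the prefix handling (the constant matches the
anti-XSSI prefix quoted above):

    import json

    GITILES_PREFIX = ")]}'\n"

    def parse_gitiles_json(text: str):
        # Handle both the prefixed output and plain JSON, in case
        # upstream ever drops the prefix.
        if text.startswith(GITILES_PREFIX):
            text = text[len(GITILES_PREFIX):]
        return json.loads(text)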