swh-lister

Author	SHA1	Message	Date
Antoine R. Dumont (@ardumont)	fdb420238c	gitlab: Allow ingestion of hg_git origins as hg ones Related to T3581#70593	2021-09-17 12:17:11 +02:00
Antoine R. Dumont (@ardumont)	4e4edee478	gitlab: Allow listing of instances providing multiple vcs_type This will allow to list the foss.heptapod.net instance for example. Related to T3581	2021-09-16 18:36:25 +02:00
Antoine Lambert	e904f4760e	gitlab: Handle HTTP status code 500 when listing projects GitLab API can return errors 500 when listing projects (see https://gitlab.com/gitlab-org/gitlab/-/issues/262629). To avoid ending the listing prematurely, skip buggy URLs and move to next pages. Related to T3442	2021-07-23 15:07:16 +02:00
Antoine Lambert	52c3150155	gitlab: Update requests query parameters Increase number of origins per page to the maximum value allowed by GitLab API (100) to send less requests. Ask for simple responses to reduce size of JSON data.	2021-07-23 14:05:38 +02:00
Antoine Lambert	73f85c0b8a	gitlab: Adapt requests retry policy to consider HTTP 50x status codes Temporarily server failures can happen when listing a GitLab instance, HTTP status codes 502, 503 or 520 are returned in that case. So adapt lister requests retry policy to execute requests again when such errors are encountered. Related to T3442	2021-07-23 13:51:17 +02:00
Antoine R. Dumont (@ardumont)	f00d41d0cd	opam: Directly use the --root flag instead of using an env variable This aligns the behavior with the opam loader Related to T3358	2021-07-20 16:46:10 +02:00
Antoine Lambert	6c12350863	pattern: Use URL network location as instance name when not provided Make the instance parameter of the base pattern lister optional and set lister name to URL network location when not provided. It simplifies lister creation when associated forge type have a lot of instances in the wild (e.g. gitlab or cgit) while giving more details about the listed forge instance. Also process listers for forge with multiple instances (cgit, gitea, gitlab, phabricator and tuleap) to ensure URL network location will be used when instance parameter is not provided. Related to T3403	2021-07-13 12:33:49 +02:00
Antoine R. Dumont (@ardumont)	df46b22098	Make PyPI lister incremental and complete in regards to last_update This rewrote the current implementation to actually use pypi's xml-rpc api which allows to be incremental. It also allows to fetch the last release date per package. This last part actually make it possible to update the "last_update" entry in the ListedOrigin model. Related to T3399	2021-07-09 12:51:58 +02:00
Antoine R. Dumont (@ardumont)	698be475e9	test_tasks: Align test consistently with other using mocker	2021-07-09 10:13:21 +02:00
zapashcanon	fe01d08cd9	add opam lister	2021-07-06 15:19:00 +02:00
Antoine Lambert	bf7d44db3c	mypy: Fix errors with release >= v0.900	2021-06-09 14:02:23 +02:00
David Douard	de23a2219e	Merge branch 'T3334_tuleap_lister'	2021-06-09 12:29:39 +02:00
Raphaël Gomès	e8f966de59	sourceforge: use http:// for Mercurial See inline comment as to why. This change also adds a Mercurial repo to the test data.	2021-06-04 10:51:56 +02:00
Raphaël Gomès	2e0c951be0	sourceforge: set the protocol for origin urls I previously forgot to add the `https://` prefix to the cloning URL. Whoops.	2021-06-03 10:01:33 +02:00
Antoine R. Dumont (@ardumont)	3a375d5bcc	Disable the sourceforge lister origins This is a temporary workaround the time we make a first pass on those repositories. Related to T3350	2021-05-31 15:59:49 +02:00
Antoine R. Dumont (@ardumont)	729e76168f	cgit/lister: Fix error when a missing version is not provided Related to [1] [1] https://sentry.softwareheritage.org/share/issue/afe7279f9f2d4bdc86f4b1b068a281a5/	2021-05-28 12:09:56 +02:00
Raphaël Gomès	9ca5295a40	sourceforge: retry for all retryable exceptions Since this lister is doing a lot more requests than most other, it makes sense that issues would arise more often. We want the lister to continue even if the website is having issues and not break on the first 500 or closed connection it encounters. This change introduces a mechanism to retry all exceptions worth retrying and uses it for the SourceForge lister. Other listers might benefit from this, but this is out of scope here. Tests had to be adjusted to stub the sleep function since retries happened way more often.	2021-05-26 12:05:39 +02:00
Boris Baldassari	04c0a50706	tuleap: initialise lister. tuleap-lister: fix args in test_task. tuleap-lister: Add rate-limiting test + fix debug and typo. tuleap-lister: code review: fix mocker + tests/setup_cli. tuleap-lister: code review: fix relister > lister. tuleap-lister: code review: fix test_task kwargs. tuleap-lister: code review: Remove authentication useless lines + fix typos. tuleap-lister: code review: improve results_simplified for svn repos. tuleap-lister: code review: add name to CONTRIBUTORS file. tuleap-lister: code review: Update tutorial for misc files to edit. tuleap-lister: code review: Update copyright to 2021 exactly. tuleap-lister: code review: Update py files perms -X. tuleap-lister: code review: minimise json files. tuleap-lister: code review: fix chmod on json files. tuleap-lister: code review: fix var names + add tests. tuleap-lister: code review: fix useless indirection. tuleap-lister: code review: Add empty repo test, minor typo fixes.	2021-05-26 11:09:12 +02:00
Raphaël Gomès	8f3bbacd5e	sourceforge: don't abort on error for project It's suboptimal to say the least to stop the entire lister process if a single project page is somehow broken (404, most likely). This change logs the issue as a warning and carries on, as well as some minor logging changes and comments touch ups.	2021-05-12 15:54:53 +02:00
Antoine R. Dumont (@ardumont)	2ff549e125	sourceforge/tasks: Allow incremental listing Related to T3310	2021-05-07 17:04:24 +02:00
Antoine R. Dumont (@ardumont)	7282647bb2	sourceforge/lister: Add credentials parameter The credentials parameter is not optional due to the instance constructor logic. Even if unused, this must be provided to the lister (from the task standpoint). Related to T3310#64801	2021-05-07 16:46:09 +02:00
Antoine Lambert	3167a6dcb7	sourceforge/tests: Ensure correct sleep function gets mocked This ensures the mocked sleep will work with all tenacity versions. Related to T3310	2021-05-07 14:40:13 +02:00
Antoine Lambert	1284eb1587	sourceforge/tests: Fix failing test with tenacity < 5.1 It fixes debian package build of swh-lister on buster.	2021-05-07 14:05:36 +02:00
Raphaël Gomès	3baf1d0999	Make the SourceForge lister incremental SourceForge's sitemaps (1 main one + many sharded) give us a "last modified" date for every subsitemap and project, allowing us to perform an incremental listing. We store the subsitemaps' "last modified" dates in the lister state, as well as those of the empty projects (projects which don't have any VCS registered), and the rest comes from the already visited origins from the database. The tests try to cover the possible cases of a subsitemap that has changed, one that hasn't, a project that has change, one that hasn't, and same for an empty project.	2021-05-06 10:28:27 +02:00
Antoine Lambert	6f8dd5d3f2	tox: Add sphinx environments to check sane doc build Enable to check package documentation can be built without producing sphinx warnings. The sphinx environment is designed to be used in continuous integration in order to prevent breaking documentation build when committing changes. The sphinx-dev environment is designed to be used inside a full swh development environment. Related to T3258	2021-04-28 14:05:20 +02:00
Valentin Lorentz	18b68bd8c7	s/REST( API)?/API/ Bitbucket's API kind of supports REST workflows, but the clearly use it like an RPC API (the hardcoded schema in `PROJECT_API_URL_FORMAT` make it particularly clear)	2021-04-27 18:13:13 +02:00
Valentin Lorentz	40e1916510	Fix various Sphinx warnings	2021-04-13 21:56:08 +02:00
Valentin Lorentz	465506a0ce	Remove old lister tutorial. Sphinx complains because it's an orphan document.	2021-04-13 18:31:31 +02:00
Hezekiah Maina	d5d7830b64	Added Hezekiah Maina as a contributor	2021-04-04 23:33:22 +03:00
Hezekiah Maina	7124627400	Added the right location on linter.yml file	2021-04-04 23:22:46 +03:00
Raphaël Gomès	f7b27c6930	Add a non-incremental sourceforge lister Following zack's work on T735, this change introduces an actual SWH lister for SourceForge. SourceForge provides a main sitemap that lists sharded sitemaps, which themselves list pages. Each page belongs to a project (or sub-project, though those are rare), information about which can be found by querying a REST API, which gives us the list of any and all VCS used for said project. Both sitemaps and pages have a "last modified" timestamp that will be used in a future patch to implement incremental listing. More precise information can be found as inline comments or docstrings.	2021-03-23 18:40:21 +01:00
Nicolas Dandrimont	879170a57d	GitHub: handle edge cases with empty responses	2021-03-19 16:53:52 +01:00
Nicolas Dandrimont	c375a61b16	GitHub: handle Server Errors These errors happen, sometimes, when requesting large pages of results.	2021-03-19 16:53:52 +01:00
Nicolas Dandrimont	4a215e68e0	GitHub: Move rate-limit reset logic to RateLimited exception This makes the logic easier to test.	2021-03-19 16:52:46 +01:00
Nicolas Dandrimont	cfd4169bd8	Retry GitHub requests on ChunkEncodingErrors These happen, sometimes, when the connection to the GitHub server resets, e.g. because of congestion on a slow link.	2021-03-19 16:52:46 +01:00
Nicolas Dandrimont	61c1d444c5	GitHub: Move rate limit handling to the request function	2021-03-19 15:58:01 +01:00
Nicolas Dandrimont	03b10e5c83	GitHub: Start moving the request logic to a separate function	2021-03-19 15:58:01 +01:00
Nicolas Dandrimont	8f7dbb7488	GitHub: Use function for requests.Session initialization This will help us to break the retry logic for the listing requests themselves to a separate function too.	2021-03-19 15:58:01 +01:00
Valentin Lorentz	df73073a67	docs: Fix title syntax	2021-03-19 09:58:38 +01:00
tenma	2e17729e97	docs: Add new "howto write a lister tutorial" with unified lister api This adds a new tutorial which details how to currently write the new listers (both incremental or stateless). This proposes a python template file to start a new lister. At last, this renames the previous tutorial into tutorial-2017. Related to T3073	2021-02-26 16:05:18 +01:00
Antoine Lambert	5b4dc289b7	debian: Update archive mirror URL templates to process Some distributions (e.g. debian-security) have a slightly different URL for retrieving source packages metadata. So add a new URL template to process when trying to download such data. Related to T3032#58239	2021-02-08 14:01:59 +01:00
Antoine Lambert	e72c15e97a	docs: Update listers execution instructions Remove outdated part about listers database and use swh CLI in README for executing a lister instead of raw Python code.	2021-02-05 14:51:24 +01:00
Antoine Lambert	1803b707e4	cran: Prevent multiple listing of an origin A CRAN package can appear twice in the JSON list returned by the list_all_packages.R script, most recent version of the package appearing first. So handle that edge case to avoid error when sending origins to the scheduler.	2021-02-05 14:34:37 +01:00
Antoine Lambert	b4c4c20bb9	cran: Add support for parsing date with milliseconds	2021-02-05 14:32:49 +01:00
Antoine Lambert	2461c97bbb	pypi: Use BeautifulSoup for parsing HTML instead of xmltodict xmltodict now raises an error while trying to parse the HTML content of https://pypi.org/simple/ page. So use BeautifulSoup HTML parser instead as it is aleady a requirement of swh-lister and it does not fail parsing the PyPI HTML page. Also drop no longer used xmltodict in requirements.	2021-02-05 14:23:11 +01:00
Antoine Lambert	4245c5046f	Remove no longer used models field in dict returned by register	2021-02-02 16:33:52 +01:00
Antoine Lambert	8933544521	Remove no longer used legacy Lister API and update CLI options Legacy Lister classes from the swh.lister.core mdule are no longer used in swh-lister codebase so it is time to remove them. Also remove lister CLI options related to legacy Lister API. As a consequence, the following requirements are no longer needed: arrow, SQLAlchemy, sqlalchemy-stubs and testing.postgresql. Closes T2442	2021-02-02 15:54:55 +01:00
Antoine Lambert	ff05191b7d	packagist: Reimplement lister using new Lister API The previous implementation was generating tasks for a non implemented Packagist loader. The new implementation extracts source repository URL, VCS type and last update date for each package referenced by Packagist and send those info to the scheduler. Packages metadata are retrieved using Packagist API endpoints whose responses are served from static files, which are guaranteed to be efficient on the Packagist side (no dymamic queries). Furthermore, subsequent listing will send the "If-Modified-Since" HTTP header to only retrieve packages metadata updated since the previous listing operation in order to save bandwidth and return only origins which might have new released versions. Closes T2991	2021-02-02 14:48:47 +01:00
Antoine Lambert	82ab96ad06	gnu: Remove dependency on pytz UTC timezone settings can be obtained from the datetime.timezone module from Python standard library so remove dependency on external pytz module.	2021-02-02 13:19:04 +01:00
Vincent SELLIER	8e4dd178f1	cgit: remove the repository urls's trailing / Ensure the behavior is the same when a base url is provided or not Related to T3013#57810	2021-02-01 17:31:08 +01:00

... 2 3 4 5 6 ...

787 commits