831 commits

Author SHA1 Message Date
Antoine Lambert
ff0035a60b pytest: Exclude build directory from test discovery
Because setuptools copies test modules into subdirectories of the
build directory, pytest fails with ImportPathMismatchError exceptions
when invoked from the root directory of the module.

So ignore the build folder during test discovery.
2022-03-22 11:56:35 +01:00
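One way to get the effect described in ff0035a60b is pytest's `collect_ignore` variable in a root `conftest.py`. This is a minimal sketch under that assumption; the commit itself may rely on a different mechanism, such as the `norecursedirs` option in the pytest configuration.

```python
# conftest.py placed at the root of the module (illustrative placement).
# pytest reads this module-level list and skips the listed paths,
# relative to this file's directory, during test collection.
collect_ignore = ["build"]
```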
Antoine Lambert
fd03941c5f sourceforge: Fix incremental listing since CVS origin URLs modification
Commit 6a7479553e modified the origin URLs for CVS projects
hosted on SourceForge but it also broke incremental listing
due to a no longer valid assertion, so fix that issue.
2022-03-11 11:48:15 +01:00
Antoine R. Dumont (@ardumont)
2568ecc7c2
launchpad: Ignore erratic page and continue listing next page
The decorator is dropped on `get_origins_from_page` as we cannot retry an iterator
consumption anyway.

Related to T3948
2022-02-18 10:36:54 +01:00
Antoine Lambert
6a7479553e sourceforge: Fix origin URLs for CVS projects
CVS projects differ from other VCS ones: they use the rsync
protocol, a list of modules needs to be fetched from an info page,
and multiple origin URLs can be produced for the same project.

Related to T3789
2022-02-17 13:51:52 +01:00
Antoine R. Dumont (@ardumont)
4265e5dd77
launchpad: Drop extra filtering step which is no longer necessary
as the scheduler is now able to deduplicate origins when recording them.

Related to T3945
2022-02-17 12:11:25 +01:00
Antoine R. Dumont (@ardumont)
c86e4b43f4
launchpad: Use tuple instead of list
Related to T3945
2022-02-17 12:11:24 +01:00
Antoine R. Dumont (@ardumont)
fc2edd24aa
launchpad: Manage unhandled exceptions when listing
Prior to this commit, the listing could fail when reading either a page or the page of
results (the launchpad API raises RestfulError). The lister now retries when those kinds of
exceptions happen. If the error persists (after multiple retries with exponential
backoff), the listing continues nonetheless (with warning logs).

Note that if the page ends up being empty, it's no longer accounted for.

This actually allows the listing to finish in case of issues.

Related to T3945
2022-02-17 12:11:09 +01:00
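The behaviour described in fc2edd24aa (retry with exponential backoff, then keep listing) could look roughly like the sketch below. It uses only the standard library; `fetch_with_retry`, `list_pages` and their parameters are hypothetical names, and the actual lister may use a retry library and catch more specific exceptions.

```python
import logging
import time
from typing import Callable, Iterable, Iterator, List, Optional

logger = logging.getLogger(__name__)


def fetch_with_retry(
    fetch: Callable[[], List[str]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[List[str]]:
    """Call fetch(), retrying with exponential backoff; return None on give-up."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except Exception as exc:  # a real lister would target specific exceptions
            logger.warning("Attempt %d failed: %s", attempt + 1, exc)
            if attempt + 1 < max_attempts:
                sleep(base_delay * 2**attempt)  # 1s, 2s, 4s, ...
    return None


def list_pages(
    fetchers: Iterable[Callable[[], List[str]]], **kwargs
) -> Iterator[List[str]]:
    """Yield non-empty pages, skipping the ones that keep failing."""
    for fetch in fetchers:
        page = fetch_with_retry(fetch, **kwargs)
        if page:  # pages that failed (None) or are empty are not accounted for
            yield page
```

Passing `sleep` as a parameter keeps the sketch testable without real waiting.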
Antoine R. Dumont (@ardumont)
262f9369c8
launchpad: Allow bzr origins listing
Related to T3945
2022-02-16 17:56:13 +01:00
Raphaël Gomès
31b4429ced sourceforge: fix support for listing bzr origins
Bazaar support was removed a long time ago and predates a lot of the new
mechanisms in place in the API. Unfortunately, it looks like a lot of
the URLs are offline now, but there are still a few projects that can be
listed, and this is pretty low-effort.
2022-02-14 14:56:38 +01:00
Antoine Lambert
b7524bbae0 debian/test_lister: Fix typo detected with codespell 2022-02-10 16:25:42 +01:00
Antoine Lambert
c17a932ccf pre-commit: Bump hooks and add new one to check commit message spelling
To install the new hook:

  $ pre-commit install -t commit-msg
2022-02-10 16:24:19 +01:00
Antoine R. Dumont (@ardumont)
7ff1390378
maven: Fix last update datetime
We need to avoid using naive datetime as this fails during conversion.

Related to T3746
Related to P1280
2022-02-09 16:59:40 +01:00
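The difference between naive and timezone-aware datetimes mentioned in 7ff1390378 can be illustrated as follows. The timestamp format and the choice of UTC are assumptions made for the example; the commit only states that naive datetimes fail during conversion.

```python
from datetime import datetime, timezone

# A naive datetime carries no timezone information.
naive = datetime.strptime("2022-02-09 16:59:40", "%Y-%m-%d %H:%M:%S")

# Attaching an explicit timezone (UTC is an assumption here) makes it aware.
aware = naive.replace(tzinfo=timezone.utc)

print(naive.tzinfo)       # None
print(aware.isoformat())  # 2022-02-09T16:59:40+00:00
```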
Boris Baldassari
d4e1e8212a maven: Fix undef last_update in ListedOrigins. 2022-02-08 07:51:01 +01:00
Boris Baldassari
24eeabfade maven: dismiss origins if they are malformed - e.g. wrong pom scm format, add test. 2022-02-08 07:51:01 +01:00
Antoine R. Dumont (@ardumont)
a1000dfeb7
requirements-test: Pin pytest to < 7.0.0
Related to T3916
2022-02-07 16:10:49 +01:00
Antoine R. Dumont (@ardumont)
a599493b48
maven: Let logging instruction do the formatting 2022-01-25 11:32:23 +01:00
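"Letting the logging instruction do the formatting" (a599493b48) usually means passing arguments to the logger instead of pre-formatting the message. A small sketch, with a hypothetical logger name and placeholder URL:

```python
import logging

logger = logging.getLogger("maven-example")  # hypothetical logger name
url = "https://example.org/some.pom"  # placeholder value

# Eager: the message string is built even when DEBUG is disabled.
logger.debug("Fetching %s" % url)

# Deferred: logging interpolates the arguments only if the record is emitted.
logger.debug("Fetching %s", url)
```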
Antoine R. Dumont (@ardumont)
8667b04abc
maven: Add more debug logging instructions
And log the metadata dictionary.
2022-01-25 11:32:23 +01:00
Valentin Lorentz
f6ca1dc3dc docs: Fix ReST syntax 2022-01-24 16:57:38 +01:00
Valentin Lorentz
771c406710 Fix ReST syntax 2022-01-21 11:04:51 +01:00
Antoine R. Dumont (@ardumont)
40e227efac
docs: Fix sphinx warning
Related to D6967
2022-01-19 10:24:26 +01:00
Antoine R. Dumont (@ardumont)
ec7838123b
Pin mypy and drop type annotations which make mypy unhappy
This also drops spurious copyright headers from those files if present.

Related to T3812
2021-12-16 16:10:20 +01:00
Antoine Lambert
445d539b3f Remove no longer needed tenacity workarounds
Now that we have packaged tenacity 6.2 for debian buster and use it
in production, we can remove the workarounds to support tenacity < 5.
2021-12-08 13:28:11 +01:00
Valentin Lorentz
fa7ecc8fbd maven: Pass the base URL of the Maven instance to the loader
I would like to use it as the metadata authority URI in the loader,
instead of '{p_url.scheme}://{p_url.netloc}/', which I do not think
is accurate, as it is possible to have multiple Maven instances at
the same netloc.
2021-12-07 13:51:00 +01:00
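The concern raised in fa7ecc8fbd can be seen with two Maven instances hosted under the same netloc. The URLs below are made up for illustration:

```python
from urllib.parse import urlparse

# Two hypothetical Maven instances served from the same host.
base_urls = [
    "https://repo.example.org/maven-a/",
    "https://repo.example.org/maven-b/",
]

# Scheme + netloc alone cannot tell the two instances apart.
authorities = set()
for base_url in base_urls:
    p_url = urlparse(base_url)
    authorities.add(f"{p_url.scheme}://{p_url.netloc}/")

print(authorities)  # a single entry, although there are two instances
```

Using the full base URL as the authority keeps the two instances distinct.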
Antoine Lambert
15fa84cf7e debian: Update last_update for a package when required
A debian package can have sources coming from multiple suites,
so we need to make sure to update the last_update field in the
ListedOrigin model if the currently processed suite has a more recent
modification time for its sources index.

Related to T2400
2021-12-06 10:43:28 +01:00
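The update rule from 15fa84cf7e amounts to keeping the most recent modification time seen across suites. A sketch with a hypothetical in-memory mapping standing in for the ListedOrigin models (the origin URL is also invented):

```python
from datetime import datetime, timezone
from typing import Dict

# Hypothetical view: origin URL -> latest known last_update.
last_updates: Dict[str, datetime] = {}


def record(url: str, suite_last_modified: datetime) -> None:
    """Keep the greatest modification time seen for a package origin."""
    current = last_updates.get(url)
    if current is None or suite_last_modified > current:
        last_updates[url] = suite_last_modified


origin = "deb://example/packages/hello"  # placeholder origin URL
record(origin, datetime(2021, 12, 1, tzinfo=timezone.utc))  # first suite
record(origin, datetime(2021, 12, 5, tzinfo=timezone.utc))  # newer suite wins
record(origin, datetime(2021, 11, 20, tzinfo=timezone.utc))  # older suite ignored
```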
Antoine Lambert
93f17d4d9c debian: Provide last_update to produced ListedOrigin models
Use the value of the "Last-Modified" header from the HTTP response
to the debian sources index HTTP request.

It prevents the creation of loading tasks for debian packages with no
changes since the last listing.

Related to T2400
2021-12-03 16:09:44 +01:00
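A "Last-Modified" header value, as used in 93f17d4d9c, can be turned into a datetime with the standard library. The header value below is an invented example, and the lister may parse it differently:

```python
from email.utils import parsedate_to_datetime

# Example header value; in the lister it would come from the HTTP response
# to the sources index request.
headers = {"Last-Modified": "Fri, 03 Dec 2021 14:09:44 GMT"}

# HTTP dates use the same textual format that email.utils can parse.
last_update = parsedate_to_datetime(headers["Last-Modified"])
print(last_update.isoformat())
```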
Antoine Lambert
605b13a676 debian: Do not raise when a component cannot be found for a suite
Not all debian suites necessarily have the same set of components.

So prefer to log that a component is missing for a suite instead of
raising an exception that will stop the listing.
2021-12-03 14:29:15 +01:00
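The approach in 605b13a676 (log and continue rather than raise) might be sketched like this. The `available` mapping and `iter_indexes` are hypothetical stand-ins for whatever the lister actually queries:

```python
import logging
from typing import Dict, Iterable, Iterator, Set, Tuple

logger = logging.getLogger(__name__)

# Hypothetical data: which components each suite actually provides.
available: Dict[str, Set[str]] = {
    "stretch": {"main", "contrib"},
    "experimental": {"main"},
}


def iter_indexes(
    suites: Iterable[str], components: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (suite, component) pairs, logging the missing ones."""
    components = list(components)
    for suite in suites:
        for component in components:
            if component not in available.get(suite, set()):
                logger.warning(
                    "Component %s missing for suite %s", component, suite
                )
                continue  # keep listing instead of raising
            yield suite, component
```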
Antoine Lambert
4ff3e44643 debian: Update extra_loader_arguments dict of produced ListedOrigin models
Remove no longer used date parameter in extra_loader_arguments.

Related to T2400
2021-12-03 10:51:30 +01:00
Antoine Lambert
46425917c2 debian: Add missing file URIs in lister output
For a given package, the debian lister generates a dictionary mapping
distribution and version to a list of files to be processed by the
debian loader.

For each file to process, the debian loader expects to find a URI
in order to download it and then use its content to ingest package
source code into the archive.

However, it turns out these URIs were not computed by the lister
in its current implementation, making any debian loading task fail
due to this missing information.

So add the computation of these URIs and ensure they will be provided
in the debian loader input parameters.

Related to T2400
2021-12-02 17:30:50 +01:00
Nicolas Dandrimont
5f567b3c34 Deduplicate origins in the GitHub lister
In some circumstances, GitHub will return two separate repos with the
same html_url in the same page. This makes the lister fail with a
cardinality error.
2021-12-01 16:00:14 +01:00
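Deduplicating on `html_url` within a page, as described in 5f567b3c34, can be sketched as below. The function name is hypothetical and the real lister may deduplicate at a different point:

```python
from typing import Dict, Iterable, Iterator


def deduplicate(repos: Iterable[Dict[str, str]]) -> Iterator[Dict[str, str]]:
    """Yield each repository once, keyed by its html_url."""
    seen = set()
    for repo in repos:
        url = repo["html_url"]
        if url in seen:
            continue  # same origin already yielded for this page
        seen.add(url)
        yield repo
```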
Boris Baldassari
8991c625ea lister: Add new maven lister
The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all src archives and
pom files in the maven repository. Then the pom files are downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724
2021-11-29 17:33:13 +01:00
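The last step described in 8991c625ea, finding an scm reference in a pom file, could be approximated with the standard XML parser. This is a sketch only: `scm_url` is a hypothetical helper, and the lister may use a different parser or inspect other scm fields. Malformed documents are dismissed by returning None.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace used by Maven POM 4.0.0 documents.
POM_NS = {"pom": "http://maven.apache.org/POM/4.0.0"}


def scm_url(pom_text: str) -> Optional[str]:
    """Return the <scm><url> value of a pom, or None if absent or malformed."""
    try:
        root = ET.fromstring(pom_text)
    except ET.ParseError:
        return None  # malformed XML: dismiss this pom
    node = root.find("pom:scm/pom:url", POM_NS)
    if node is None or not node.text:
        return None
    return node.text.strip()
```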
Antoine R. Dumont (@ardumont)
3ffea8f525
lister: Fix type
This fixes the master build [1]

[1] https://jenkins.softwareheritage.org/view/swh-draft/job/DLS/job/tests/1625/console
2021-11-23 10:13:19 +01:00
Antoine R. Dumont (@ardumont)
97553d8984
opam: Stop leaking temporary folders on machine 2021-11-10 16:58:35 +01:00
Valentin Lorentz
6243f800b4 cran: Pass the package name to the loader
It will be used to create a synthetic release message that contains
the package's name, like the Debian loader does.
2021-11-09 15:04:01 +01:00
Antoine Lambert
24bc671679 cgit: Enable retrying of throttled HTTP requests
Related to T3645
2021-10-22 15:15:05 +02:00
Antoine Lambert
20232cc36e cran: Fix ListedOrigin visit type
CRAN origins must be loaded with the cran visit type and not the tar one.

Related to T3675
2021-10-22 14:42:32 +02:00
Antoine R. Dumont (@ardumont)
5bba1a783a
Let sourceforge origins be enabled by default
Related to T3470
2021-10-11 13:03:40 +02:00
Antoine R. Dumont (@ardumont)
04dc628091
docs: Explain task type registering to complete the save forge doc
Related to T3629
2021-10-08 16:07:41 +02:00
Antoine R. Dumont (@ardumont)
1a9c08c93f
docs: Add a save forge documentation
This does not yet cover the registration of a new lister.

Related to T3629
2021-10-08 16:07:40 +02:00
Antoine R. Dumont (@ardumont)
e7716c0122
opam: Share opam root directory even on multiple instances
That avoids having multiple distinct opam root directories per opam lister instance. The
current opam commands used by the lister are actually listing specifically per instance.

Related to P1171
2021-09-24 11:55:07 +02:00
Antoine R. Dumont (@ardumont)
5ab6b00408
gnu: Respect the pattern docstring about state initialization
Any extra state initialization (outside the scheduler scope) is to happen in the
get_pages method.
2021-09-21 11:17:16 +02:00
Antoine R. Dumont (@ardumont)
332ed8e543
opam: Allow defining where to actually install the opam_root folder
Related to T3590
2021-09-21 11:17:16 +02:00
Antoine R. Dumont (@ardumont)
ff5e86ff48
opam: Make the instance optional and derived from the url
This matches how it's done for all other multi instances listers.

Related to T3590
2021-09-21 11:17:16 +02:00
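Deriving an optional instance name from the URL, as in ff5e86ff48, might look like the following. Using the netloc as the default is an assumption for this sketch, and `instance_name` is a hypothetical helper:

```python
from typing import Optional
from urllib.parse import urlparse


def instance_name(url: str, instance: Optional[str] = None) -> str:
    """Use the explicit instance if given, else fall back to the URL's netloc."""
    return instance if instance is not None else urlparse(url).netloc
```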
Antoine R. Dumont (@ardumont)
b69b0b7fd6
opam: Move the state initialization into the get_pages method
We should avoid side-effects in the constructor as much as possible. That avoids
surprising behavior at object instantiation time. The state if needed must be
initialized in the `swh.lister.pattern.Lister.get_pages` method, as recommended in the
class docstring.

This also fixes the current test that actually bootstraps a real opam local "clone" in
/tmp.

Related to T3590
2021-09-21 11:17:16 +02:00
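The pattern from b69b0b7fd6 (no side effects in the constructor, state set up when listing starts) can be illustrated with a stand-alone class. `ExampleLister` is hypothetical and is not the `swh.lister.pattern.Lister` class:

```python
from typing import Iterator, List


class ExampleLister:
    """Hypothetical lister used only to illustrate deferred initialization."""

    def __init__(self, url: str) -> None:
        self.url = url
        self.initialized = False  # nothing expensive happens at construction

    def _init_state(self) -> None:
        # Expensive or side-effecting setup would go here.
        self.initialized = True

    def get_pages(self) -> Iterator[List[str]]:
        self._init_state()  # runs only once iteration actually starts
        yield [self.url]
```

Because `get_pages` is a generator, the setup is deferred until the first page is requested.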
Antoine R. Dumont (@ardumont)
c803fc2b59
Allow gitlab lister's name to be overridden by task arguments
This will allow the heptapod instances to have their own dedicated stats.

Related to T3581
2021-09-17 14:27:16 +02:00
Antoine R. Dumont (@ardumont)
fdb420238c
gitlab: Allow ingestion of hg_git origins as hg ones
Related to T3581#70593
2021-09-17 12:17:11 +02:00
Antoine R. Dumont (@ardumont)
4e4edee478
gitlab: Allow listing of instances providing multiple vcs_type
This will allow to list the foss.heptapod.net instance for example.

Related to T3581
2021-09-16 18:36:25 +02:00
Antoine Lambert
e904f4760e gitlab: Handle HTTP status code 500 when listing projects
GitLab API can return 500 errors when listing projects
(see https://gitlab.com/gitlab-org/gitlab/-/issues/262629).

To avoid ending the listing prematurely, skip buggy URLs and move
on to the next pages.

Related to T3442
2021-07-23 15:07:16 +02:00
Antoine Lambert
52c3150155 gitlab: Update requests query parameters
Increase the number of origins per page to the maximum value allowed
by GitLab API (100) to send fewer requests.

Ask for simple responses to reduce size of JSON data.
2021-07-23 14:05:38 +02:00
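The query parameters described in 52c3150155 correspond to `per_page` and `simple` in the GitLab projects API. A sketch of building such a URL, with a placeholder host; the lister likely passes additional parameters not shown here:

```python
from urllib.parse import urlencode

# per_page=100 is the maximum page size; simple=true asks for reduced
# project representations, which shrinks the JSON payload.
params = {"per_page": 100, "simple": "true"}

# Placeholder instance URL, for illustration only.
url = "https://gitlab.example.org/api/v4/projects?" + urlencode(params)
print(url)
```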
Antoine Lambert
73f85c0b8a gitlab: Adapt requests retry policy to consider HTTP 50x status codes
Temporary server failures can happen when listing a GitLab instance;
HTTP status codes 502, 503 or 520 are returned in that case.

So adapt lister requests retry policy to execute requests again when
such errors are encountered.

Related to T3442
2021-07-23 13:51:17 +02:00
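The retry policy change in 73f85c0b8a boils down to a predicate on the HTTP status code. The names below are hypothetical, and the real policy may also cover other cases such as throttling:

```python
# Status codes named in the commit message as temporary server failures.
RETRYABLE_STATUS_CODES = frozenset({502, 503, 520})


def should_retry(status_code: int) -> bool:
    """Return True when a request with this status code should be re-executed."""
    return status_code in RETRYABLE_STATUS_CODES
```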
Antoine R. Dumont (@ardumont)
f00d41d0cd
opam: Directly use the --root flag instead of using an env variable
This aligns the behavior with the opam loader

Related to T3358
2021-07-20 16:46:10 +02:00