Commit graph

958 commits

Author SHA1 Message Date
KShivendu
d34a6232a6 gogs: Introduce Gogs lister 2022-08-03 16:22:06 +05:30
Franck Bret
1bf11aa26d Add arch lister module (origins from archives).
After a first attempt with D7812, this one uses a different strategy to
retrieve origins.

Fetch and extract "core.files.tar.gz", "extra.files.tar.gz" and "community.files.tar.gz" from archives.archlinux.org. That step ensures that we have a list of "official" packages.
Parse metadata from the 'desc' file to build origin URLs.
Scrape the origin URL to get artifact metadata listing all versions of a package.

It also fetches and extracts unofficial 'arm' packages from archlinuxarm.org, but in this case we cannot get all versions of an arm package.

Related T4233
2022-06-15 09:11:57 +02:00
Antoine R. Dumont (@ardumont)
263db667d0
Adapt maven lister to list canonical gh urls if any
That means detected github urls {https,git,http}://github.com/${user_repo}(.git) are
canonicalized to https://github.com/${user_repo} format.

This avoids duplication of origins.

Related to T4232
2022-05-23 14:47:11 +02:00
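A minimal sketch of the canonicalization described in commit 263db667d0, assuming a single regular expression is enough to recognise GitHub URLs. The function name `canonicalize_github_url` and the pattern are illustrative only and are not taken from the lister's code.

```python
import re
from typing import Optional

# Hypothetical pattern: {https,git,http}://github.com/<user>/<repo>(.git)
GITHUB_URL_RE = re.compile(
    r"^(?:https|git|http)://github\.com/(?P<user_repo>[^/]+/[^/]+?)(?:\.git)?/?$"
)


def canonicalize_github_url(url: str) -> Optional[str]:
    """Return https://github.com/<user>/<repo>, or None if url is not a GitHub repo URL."""
    match = GITHUB_URL_RE.match(url.strip())
    if match is None:
        return None
    return f"https://github.com/{match.group('user_repo')}"
```

Because every variant maps to the same string, two pom files pointing at `git://...repo.git` and `https://...repo` would produce a single origin in this sketch.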
Antoine R. Dumont (@ardumont)
2ffe9c2aea
Use swh.core.github.pytest_plugin in github tests
Related to T4232
2022-05-20 16:06:11 +02:00
Pratyush Desai
aa8c8cb3bc add strict asyncio_mode in pytest.ini 2022-05-09 12:13:28 +02:00
Antoine Lambert
3f6c7edc24 maven: Prevent UnicodeDecodeError when processing pom file
Pass the raw bytes of pom file content in xmltodict.parse and let
it do the string decoding based on the encoding declared in pom file.

If the string decoding failed due to an invalid declared encoding,
xml.parsers.expat.ExpatError will be raised and will be caught by
the lister, ignoring the pom file and continuing listing.

Related to T3874
2022-05-02 14:01:58 +02:00
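The idea in commit 3f6c7edc24 can be sketched with the standard library alone. This example calls `xml.parsers.expat` directly (the parser xmltodict builds on) rather than `xmltodict.parse`, so that it runs without third-party packages; the helper name `pom_text_or_none` is hypothetical. It shows that feeding raw bytes lets the parser honour the encoding declared in the document, and that a decoding failure surfaces as `ExpatError`, which the caller can catch to skip the file.

```python
from typing import Optional
from xml.parsers import expat


def pom_text_or_none(raw: bytes) -> Optional[str]:
    """Parse raw XML bytes; return the character data, or None on ExpatError."""
    chunks = []
    parser = expat.ParserCreate()  # no encoding override: the XML declaration decides
    parser.CharacterDataHandler = chunks.append
    try:
        parser.Parse(raw, True)
    except expat.ExpatError:
        # Same spirit as the commit: ignore this file and carry on listing.
        return None
    return "".join(chunks)
```

Here a document whose bytes do not match its declared encoding, or that is otherwise not well-formed, yields `None` instead of aborting the run.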
Antoine Lambert
0222a8f5c4 maven: Handle null mtime value in index for jar archive
There are cases where the modification time for a jar archive in
a maven index is null, which was leading to a processing error
by the lister.

So handle that case to avoid premature exit of the listing process.

Related to T3874
2022-04-29 13:59:17 +02:00
Antoine Lambert
378613ad82 maven: Remove extraction of groupId and artifactId from pom files
When parsing pom files, we are only interested in extracting a VCS URL
(git, hg, svn) in order to create associated loading tasks.

In that case, the groupId and artifactId are not used by the lister,
so it is better to remove their extraction; it will also prevent errors when
that info is missing in pom files.
2022-04-29 11:15:03 +02:00
Antoine Lambert
22bcd9deb2 maven: Create one origin per package instead of one per package version
Previously the maven lister was creating an origin for each source
archive (jar, zip) it discovered during the listing process.

This is not the way Software Heritage decided to archive sources
coming from package managers. Instead one origin should be created
per package and all its versions should be found as releases in the
snapshot produced by the package loader.

So modify the maven lister in order to create one origin per package
grouping all its versions.

This change also modifies the way incremental listing is handled,
ListedOrigin instances will be yielded only if we discovered new
versions of a package since the last listing.

Tests have been updated to reflect these changes.

Related to T3874
2022-04-29 10:57:04 +02:00
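The incremental behaviour described in commit 22bcd9deb2 can be illustrated roughly as follows. This is a sketch under assumptions: the `(name, version, mtime)` record shape and the function name `origins_to_yield` are hypothetical, and the real lister yields ListedOrigin instances rather than tuples.

```python
from datetime import datetime
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical record shape: (package name, version, last modification time)
Artifact = Tuple[str, str, datetime]


def origins_to_yield(
    artifacts: List[Artifact], last_listing: Optional[datetime]
) -> Iterator[Tuple[str, List[str]]]:
    """Group versions per package; yield a package only if it gained a version."""
    versions: Dict[str, List[str]] = {}
    updated = set()
    for name, version, mtime in artifacts:
        versions.setdefault(name, []).append(version)
        # A first (full) listing has no previous date, so everything is new.
        if last_listing is None or mtime > last_listing:
            updated.add(name)
    for name, package_versions in versions.items():
        if name in updated:
            yield name, package_versions
```

A package with no version newer than the previous listing is skipped entirely, while an updated package is yielded once with all of its versions.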
Franck Bret
985b71e80c crates: Create one origin per package instead of per version
Previously we had as many origins as versions of a crate package; the url was a link
to a specific crate version package.

Refactor to have one origin per package name and add an 'artifacts' entry to
extra_loader_arguments that lists all versions, package url and checksum.
The origin url is now a link to the related http api endpoint for a package name.

Related to T4104
2022-04-28 16:10:33 +02:00
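A rough sketch of the grouping described in commit 985b71e80c, using plain dictionaries instead of the lister's own model classes. The `API_URL_TEMPLATE` value, the record shape and the function name `group_by_crate` are assumptions made for illustration; only the `artifacts` key inside `extra_loader_arguments` comes from the commit message.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical endpoint template; the actual origin URL format is not shown in the log.
API_URL_TEMPLATE = "https://crates.io/api/v1/crates/{name}"


def group_by_crate(
    records: Iterable[Tuple[str, str, str, str]]
) -> List[Dict[str, object]]:
    """Turn (name, version, url, checksum) records into one origin per crate."""
    artifacts: Dict[str, List[Dict[str, str]]] = {}
    for name, version, url, checksum in records:
        artifacts.setdefault(name, []).append(
            {"version": version, "url": url, "checksum": checksum}
        )
    return [
        {
            "url": API_URL_TEMPLATE.format(name=name),
            "extra_loader_arguments": {"artifacts": crate_artifacts},
        }
        for name, crate_artifacts in artifacts.items()
    ]
```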
Valentin Lorentz
c251594a1f Bump mypy to v0.942 2022-04-26 13:05:44 +02:00
Valentin Lorentz
d715aaf903 Make user_agent a parameter of GitHubSession
So it can be set when used by other packages
2022-04-26 11:08:53 +02:00
Valentin Lorentz
2d04244cc9 Move GitHubSession from github/lister.py to github/utils.py
So it can be reused by other packages without importing lister.py itself
2022-04-26 11:08:49 +02:00
Valentin Lorentz
9ee4a99f15 github: Refactor rate-limiting out of the GitHubLister class
This will allow the GitHub Metadata Fetcher to reuse the logic
by importing the GitHubSession class.
2022-04-26 11:08:45 +02:00
Antoine Lambert
334c54091e maven: Remove duplicated code related to setting instance from netloc
That processing is already handled in the base Lister class constructor.
2022-04-25 17:31:02 +02:00
Valentin Lorentz
d0924f39d0 github: Remove dead code
Authentication is handled directly in the session
2022-04-21 20:32:45 +02:00
Antoine Lambert
2fa9f0abd2 sourceforge: Fix listing of bzr projects
Fix sourceforge origin URL for bzr projects,
http://project.bzr.sourceforge.net/bzrroot/project
redirects to http://project.bzr.sourceforge.net/bzr/project.

Handle bzr projects with multiple branches, one listed origin
must be created per branch.

Discard bzr projects that no longer exist from listing.
2022-04-21 18:19:07 +02:00
Antoine Lambert
63a744559f sourceforge: Do not consider Attic as a valid CVS module
The Attic folder that can sometimes be found in a CVS repository
is a special one used by CVS to store RCS files and should not be
considered as a valid module name when listing CVS projects.
2022-04-21 16:08:16 +02:00
Antoine Lambert
20c1351aa0 pre-commit: Remove codespell commit-msg hook
That hook can be frustrating as it can discard a long commit message
if it finds a typo in it, so it is better to remove it.
2022-04-21 13:39:42 +02:00
Antoine R. Dumont (@ardumont)
10bb8db345
maven: Fix argument of type 'NoneType' is not iterable
Related to T3874
2022-04-14 15:33:24 +02:00
Antoine R. Dumont (@ardumont)
7c8428d01c
maven: Continue listing if unable to retrieve pom information
This aligns the behavior with other listers (e.g. sourceforge, ...) to continue listing
if some information is not retrievable at all.

Related to T3874
2022-04-13 17:59:20 +02:00
Antoine R. Dumont (@ardumont)
e4b27a1e98
maven: log error message when not able to retrieve the index to read
Without this, the lister legitimately cannot list anything.
2022-04-13 17:41:44 +02:00
Antoine Lambert
0e0901acdc Add .git-blame-ignore-revs file with automatic reformatting commits 2022-04-08 15:15:26 +02:00
Antoine Lambert
d38e05cff7 python: Reformat code with black 22.3.0
Related to T3922
2022-04-08 15:15:09 +02:00
Antoine Lambert
00f1b99ad9 pre-commit, tox: Bump black from 19.10b0 to 22.3.0
black is considered stable since release 22.1.0 and the version
we are currently using is quite outdated and not compatible with
click 8.1.0, so it is time to bump it to its latest stable release.

Please note that the E501 pycodestyle warning related to line length
is replaced by the B950 one from flake8-bugbear, as recommended by black.
https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html#line-length

Related to T3922
2022-04-08 15:13:41 +02:00
Antoine Lambert
b766b64088 requirements-test: Remove pytest pinning to < 7
pytest-postgresql 3.1.3 and pytest-redis 2.4.0 added support for
pytest >= 7 so we can now drop the pytest pinning.
2022-04-06 17:14:45 +02:00
Franck Bret
fea6fc04aa lister: Add new rust crates lister
The Crates lister retrieves crate packages for the Rust language.

It basically fetches https://github.com/rust-lang/crates.io-index.git
to a temp directory and then walks through each file to get the
crate's info.
2022-03-28 08:42:31 +02:00
Antoine Lambert
ff0035a60b pytest: Exclude build directory for tests discovery
Because setuptools copies test modules into subdirectories of the
build directory, pytest fails with ImportPathMismatchError
exceptions when invoked from the root directory of the module.

So ignore the build folder during test discovery.
2022-03-22 11:56:35 +01:00
Antoine Lambert
fd03941c5f sourceforge: Fix incremental listing since CVS origin URLs modification
Commit 6a7479553e modified the origin URLs for CVS projects
hosted on SourceForge but it also broke incremental listing
due to a no longer valid assertion, so fix that issue.
2022-03-11 11:48:15 +01:00
Antoine R. Dumont (@ardumont)
2568ecc7c2
launchpad: Ignore erratic page and continue listing next page
The decorator is dropped on `get_origins_from_page` as we cannot retry an iterator
consumption anyway.

Related to T3948
2022-02-18 10:36:54 +01:00
Antoine Lambert
6a7479553e sourceforge: Fix origin URLs for CVS projects
CVS projects are different from other VCS ones, they use the rsync
protocol, a list of modules needs to be fetched from an info page
and multiple origin URLs can be produced for the same project.

Related to T3789
2022-02-17 13:51:52 +01:00
Antoine R. Dumont (@ardumont)
4265e5dd77
launchpad: Drop extra filtering step which is no longer necessary
as the scheduler is now able to deduplicate it when recording listed origins.

Related to T3945
2022-02-17 12:11:25 +01:00
Antoine R. Dumont (@ardumont)
c86e4b43f4
launchpad: Use tuple instead of list
Related to T3945
2022-02-17 12:11:24 +01:00
Antoine R. Dumont (@ardumont)
fc2edd24aa
launchpad: Manage unhandled exceptions when listing
Prior to this commit, the listing could fail when reading either a page or the page of
results (the launchpad api raises RestfulError). This now retries when those kinds of
exceptions happen. If the error persists (after multiple attempts with exponential
backoff), the listing continues nonetheless (with warning logs).

Note that if the page ends up being empty, it's no longer accounted for.

This actually allows the listing to finish in case of issues.

Related to T3945
2022-02-17 12:11:09 +01:00
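The retry-then-continue behaviour described in commit fc2edd24aa might look roughly like this. It is a hand-rolled sketch, not the lister's code: the helper names `fetch_with_retry` and `list_pages` are hypothetical, and it catches `Exception` broadly where the real lister targets specific errors such as RestfulError. The `sleep` parameter exists only so the delay can be replaced in tests.

```python
import logging
import time
from typing import Callable, Iterable, Iterator, List, TypeVar

logger = logging.getLogger(__name__)
T = TypeVar("T")


def fetch_with_retry(
    fetch: Callable[[], List[T]],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call fetch(); retry with exponential backoff, then give up with an empty page."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # assumption: the real code narrows this
            if attempt == attempts - 1:
                logger.warning("Giving up after %d attempts: %s", attempts, exc)
                return []
            sleep(base_delay * 2 ** attempt)  # delay doubles on each retry
    return []


def list_pages(
    fetchers: Iterable[Callable[[], List[T]]], **kwargs
) -> Iterator[List[T]]:
    """Yield every non-empty page; a page that keeps failing does not stop the listing."""
    for fetch in fetchers:
        page = fetch_with_retry(fetch, **kwargs)
        if page:  # empty pages are not accounted for
            yield page
```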
Antoine R. Dumont (@ardumont)
262f9369c8
launchpad: Allow bzr origins listing
Related to T3945
2022-02-16 17:56:13 +01:00
Raphaël Gomès
31b4429ced sourceforge: fix support for listing bzr origins
Bazaar support was removed a long time ago and predates a lot of the new
mechanisms in place in the API. Unfortunately, it looks like a lot of
the URLs are offline now, but there are still a few projects that can be
listed; this is pretty low-effort.
2022-02-14 14:56:38 +01:00
Antoine Lambert
b7524bbae0 debian/test_lister: Fix typo detected with codespell 2022-02-10 16:25:42 +01:00
Antoine Lambert
c17a932ccf pre-commit: Bump hooks and add new one to check commit message spelling
To install the new hook:

  $ pre-commit install -t commit-msg
2022-02-10 16:24:19 +01:00
Antoine R. Dumont (@ardumont)
7ff1390378
maven: Fix last update datetime
We need to avoid using naive datetime as this fails during conversion.

Related to T3746
Related to P1280
2022-02-09 16:59:40 +01:00
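To illustrate the naive-datetime issue mentioned in commit 7ff1390378, here is a small sketch using only the standard library. The helper names are hypothetical, and treating naive values as UTC is an assumption made for the example, not something stated in the commit.

```python
from datetime import datetime, timezone


def ensure_aware(dt: datetime) -> datetime:
    """Return dt unchanged if it is timezone-aware, else interpret it as UTC."""
    if dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None:
        return dt.replace(tzinfo=timezone.utc)  # assumption: naive values are UTC
    return dt


def from_epoch(seconds: float) -> datetime:
    """Build an aware datetime directly instead of a naive one."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```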
Boris Baldassari
d4e1e8212a maven: Fix undef last_update in ListedOrigins. 2022-02-08 07:51:01 +01:00
Boris Baldassari
24eeabfade maven: dismiss origins if they are malformed - e.g. wrong pom scm format, add test. 2022-02-08 07:51:01 +01:00
Antoine R. Dumont (@ardumont)
a1000dfeb7
requirements-test: Pin pytest to < 7.0.0
Related to T3916
2022-02-07 16:10:49 +01:00
Antoine R. Dumont (@ardumont)
a599493b48
maven: Let logging instruction do the formatting 2022-01-25 11:32:23 +01:00
Antoine R. Dumont (@ardumont)
8667b04abc
maven: Add more debug logging instructions
And log the metadata dictionary.
2022-01-25 11:32:23 +01:00
Valentin Lorentz
f6ca1dc3dc docs: Fix ReST syntax 2022-01-24 16:57:38 +01:00
Valentin Lorentz
771c406710 Fix ReST syntax 2022-01-21 11:04:51 +01:00
Antoine R. Dumont (@ardumont)
40e227efac
docs: Fix sphinx warning
Related to D6967
2022-01-19 10:24:26 +01:00
Antoine R. Dumont (@ardumont)
ec7838123b
Pin mypy and drop type annotations which make mypy unhappy
This also drops spurious copyright headers to those files if present.

Related to T3812
2021-12-16 16:10:20 +01:00
Antoine Lambert
445d539b3f Remove no longer needed tenacity workarounds
Now that we have packaged tenacity 6.2 for debian buster and use it
in production, we can remove the workarounds to support tenacity < 5.
2021-12-08 13:28:11 +01:00
Valentin Lorentz
fa7ecc8fbd maven: Pass the base URL of the Maven instance to the loader
I would like to use it as the metadata authority URI in the loader,
instead of '{p_url.scheme}://{p_url.netloc}/', which I do not think
is accurate, as it is possible to have multiple Maven instances at
the same netloc.
2021-12-07 13:51:00 +01:00