lister: Add new maven lister

The Maven lister retrieves the maven central indexes, exports them in a
convenient text format, and parses them to identify all src archives and
pom files in the maven repository. The pom files are then downloaded and
analysed to find and yield any scm reference.

Note: This is a new version of the maven lister diff D6133 which takes
into account the initial round of reviews.

Related to T1724
Boris Baldassari 2021-10-02 21:20:08 +02:00
parent 3ffea8f525
commit 8991c625ea
17 changed files with 1459 additions and 1 deletions

View file

@ -18,6 +18,7 @@ following Python modules:
- `swh.lister.gitlab`
- `swh.lister.gnu`
- `swh.lister.launchpad`
- `swh.lister.maven`
- `swh.lister.npm`
- `swh.lister.packagist`
- `swh.lister.phabricator`
@ -36,7 +37,7 @@ Local deployment
## lister configuration
Each lister implemented so far by Software Heritage (`bitbucket`, `cgit`, `cran`, `debian`,
`gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`)
`gitea`, `github`, `gitlab`, `gnu`, `launchpad`, `npm`, `packagist`, `phabricator`, `pypi`, `tuleap`, `maven`)
must be configured by following the instructions below (please note that you have to replace
`<lister_name>` by one of the lister name introduced above).

View file

@ -36,3 +36,7 @@ ignore_missing_imports = True
[mypy-urllib3.util.*]
ignore_missing_imports = True
[mypy-xmltodict.*]
ignore_missing_imports = True

View file

@ -5,3 +5,4 @@ iso8601
beautifulsoup4
launchpadlib
tenacity
xmltodict

View file

@ -71,6 +71,7 @@ setup(
lister.pypi=swh.lister.pypi:register
lister.sourceforge=swh.lister.sourceforge:register
lister.tuleap=swh.lister.tuleap:register
lister.maven=swh.lister.maven:register
""",
classifiers=[
"Programming Language :: Python :: 3",

swh/lister/maven/README.md Normal file
View file

@ -0,0 +1,142 @@
## The Maven lister
This readme describes the design decisions made during development.
More information can be found on the Software Heritage forge at [https://forge.softwareheritage.org/T1724](https://forge.softwareheritage.org/T1724) and in the diff of the lister at [https://forge.softwareheritage.org/D6133](https://forge.softwareheritage.org/D6133).
## Execution sequence (TL;DR)
The complete sequence of actions to list the source artifacts and scm urls is as follows:
On the `index_exporter` server (asynchronously):
* Check the list of remote indexes, and compare it to the list of local index files.
* Retrieve the missing Maven Indexer indexes from the remote repository. \
Example of index from Maven Central: [https://repo1.maven.org/maven2/.index/](https://repo1.maven.org/maven2/.index/)
* Start execution of the Docker container:
* If the `indexes` directory doesn't exist, unpack the Lucene indexes from the Maven Indexer indexes using `indexer-cli`.\
This generates a set of binary files as shown below:
```
boris@castalia:maven$ ls -lh /media/home2/work/indexes/
total 5,2G
-rw-r--r-- 1 root root 500M juil. 7 22:06 _4m.fdt
-rw-r--r-- 1 root root 339K juil. 7 22:06 _4m.fdx
-rw-r--r-- 1 root root 2,2K juil. 7 22:07 _4m.fnm
-rw-r--r-- 1 root root 166M juil. 7 22:07 _4m_Lucene50_0.doc
-rw-r--r-- 1 root root 147M juil. 7 22:07 _4m_Lucene50_0.pos
-rw-r--r-- 1 root root 290M juil. 7 22:07 _4m_Lucene50_0.time
-rw-r--r-- 1 root root 3,1M juil. 7 22:07 _4m_Lucene50_0.tip
[SNIP]
-rw-r--r-- 1 root root 363 juil. 7 22:06 _e0.si
-rw-r--r-- 1 root root 1,7K juil. 7 22:07 segments_2
-rw-r--r-- 1 root root 8 juil. 7 21:54 timestamp
-rw-r--r-- 1 root root 0 juil. 7 21:54 write.lock
```
* If the `export` directory doesn't exist, export the Lucene documents from the Lucene indexes using `clue`.\
This generates a set of text files as shown below:
```
boris@castalia:~$ ls -lh /work/export/
total 49G
-rw-r--r-- 1 root root 13G juil. 7 22:12 _p.fld
-rw-r--r-- 1 root root 7,0K juil. 7 22:21 _p.inf
-rw-r--r-- 1 root root 2,9G juil. 7 22:21 _p.len
-rw-r--r-- 1 root root 33G juil. 7 22:20 _p.pst
-rw-r--r-- 1 root root 799 juil. 7 22:21 _p.si
-rw-r--r-- 1 root root 138 juil. 7 22:21 segments_1
-rw-r--r-- 1 root root 0 juil. 7 22:07 write.lock
```
* On the host, copy export files to `/var/www/html/` to make them available on the network.
On the lister side:
* Get the exports from the above local index server.
* Extract the list of all pom and source artefacts from the Lucene export.
* Yield the list of source artefacts to the Maven Loader as they are found.
* Download all poms from the above list.
* Parse all poms to extract the scm attribute, and yield the list of scm urls towards the classic loaders (git, svn, hg..).
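The last step of the sequence above (reading the scm attribute of a pom and turning it into something a loader can use) can be sketched as follows. This is a minimal illustration, not the lister's code: it uses the standard library `xml.etree.ElementTree` instead of `xmltodict`, the helper names `extract_scm_connection` and `split_scm_url` are hypothetical, and the sample pom is a trimmed-down example that assumes the usual POM 4.0.0 namespace.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional, Tuple

# XML namespace declared by POM 4.0.0 files (assumed to be present).
POM_NS = {"pom": "http://maven.apache.org/POM/4.0.0"}

# Trimmed-down pom, for illustration only.
SAMPLE_POM = """<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <groupId>al.aldi</groupId>
  <artifactId>sprova4j</artifactId>
  <version>0.1.0</version>
  <scm>
    <connection>scm:git:git://github.com/aldialimucaj/sprova4j.git</connection>
  </scm>
</project>
"""


def extract_scm_connection(pom_text: str) -> Optional[str]:
    """Return the text of <scm><connection>, or None when it is absent."""
    root = ET.fromstring(pom_text.encode("utf-8"))
    node = root.find("pom:scm/pom:connection", POM_NS)
    if node is None or node.text is None:
        return None
    return node.text.strip()


def split_scm_url(connection: str) -> Optional[Tuple[str, str]]:
    """Split 'scm:<type>:<url>' into (type, url).

    Falls back to 'git' for bare URLs ending in '.git', and returns None
    when the type cannot be guessed.
    """
    m_scm = re.match(r"^scm:(?P<type>[^:]+):(?P<url>.*)$", connection)
    if m_scm is not None:
        return m_scm.group("type"), m_scm.group("url")
    if connection.endswith(".git"):
        return "git", connection
    return None


connection = extract_scm_connection(SAMPLE_POM)
if connection is not None:
    print(split_scm_url(connection))
```

Poms that do not declare the POM namespace would need a lookup without the `pom:` prefix; the sketch ignores that case.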
The process has been optimised as far as we could, going from 140 GB on disk / 60 GB RAM / 90 min execution time down to 60 GB on disk / 2 GB RAM (excl. Docker) / 32 min execution time.
For the long read about how we got here, please continue.
## About the Maven ecosystem
Maven repositories are a loose, decentralised network of HTTP servers with a well-defined hosted structure. They are used according to the Maven dependency resolver[i](#sdendnote1sym), an inheritance-based mechanism used to identify and locate artefacts required in Maven builds.
There is no uniform, standardised way to list the contents of maven repositories, since consumers are supposed to know what artefacts they need. Instead, Maven repository owners usually set up a Maven Indexer[ii](#sdendnote2sym) to enable source code identification and listing in IDEs. For this reason, source jars usually don't have build files and information, only providing pure sources.
Maven Indexer is not a mandatory part of the maven repository stack, but it is the *de facto* standard for maven repositories indexing and querying. All major Maven repositories we have seen so far use it. Most artefacts are located in the main central repository: Maven Central[iii](#sdendnote3sym), hosted and run by Sonatype[iv](#sdendnote4sym). Other well-known repositories are listed on MVN Repository[v](#sdendnote5sym).
Maven repositories are mainly used for binary content (e.g. class jars), but the following sources of information are relevant to our goal in the maven repositories/ecosystem:
* SCM attributes in pom XML files contain the **scm URL** of the associated source code. They can be fed to standard Git/SVN/others loaders.
* **Source artefacts** contain pure source code (i.e. no build files) associated with the artefact. There are two main naming conventions for them, although not always enforced:
* ${artifactId}-${version}-source-release.zip
* ${artifactId}-${version}-src.zip
They come in various archiving formats (jar, zip, tar.bz2, tar.gz) and require a specific loader to attach the artefact metadata.
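To make the naming conventions above concrete, here is a small sketch of how a download URL can be derived from Maven coordinates under the standard repository layout (`groupId` dots become path segments, then `artifactId/version/artifactId-version[-classifier].extension`). The function name `artifact_url` is hypothetical, and treating `NA` as "no classifier" is an assumption based on how the index export marks pom entries.

```python
from typing import Optional
from urllib.parse import urljoin


def artifact_url(
    base_url: str,
    group_id: str,
    artifact_id: str,
    version: str,
    classifier: Optional[str],
    extension: str,
) -> str:
    """Build the URL of an artefact from its Maven coordinates."""
    # "org.example.foo" is stored under "org/example/foo/".
    path = "/".join(group_id.split("."))
    # Assumption: "NA" (as seen in index exports) means there is no classifier.
    suffix = "" if classifier in (None, "", "NA") else f"-{classifier}"
    filename = f"{artifact_id}-{version}{suffix}.{extension}"
    return urljoin(base_url, f"{path}/{artifact_id}/{version}/{filename}")


base = "https://repo1.maven.org/maven2/"
print(artifact_url(base, "al.aldi", "sprova4j", "0.1.0", "sources", "jar"))
print(artifact_url(base, "al.aldi", "sprova4j", "0.1.0", "NA", "pom"))
```

Note that `urljoin` only appends the path as expected when `base_url` ends with a slash.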
[i](#sdendnote1anc)Maven dependency resolver: [https://maven.apache.org/resolver/index.html](https://maven.apache.org/resolver/index.html)
[ii](#sdendnote2anc)Maven Indexer: [https://maven.apache.org/maven-indexer/](https://maven.apache.org/maven-indexer/)
[iii](#sdendnote3anc)Maven Central: [https://search.maven.org/](https://search.maven.org/)
[iv](#sdendnote4anc)Sonatype Company: [https://www.sonatype.com/](https://www.sonatype.com/)
[v](#sdendnote5anc)MVN Repository: [https://mvnrepository.com/repos](https://mvnrepository.com/repos)
## Preliminary research
Listing the full content of a Maven repository is very unusual, and the whole system has not been built for this purpose. Instead, tools and build systems can easily fetch individual artefacts according to their Maven coordinates (groupId, artifactId, version, classifier, extension). Usual listing means (e.g. scraping) are highly discouraged and will easily get the client banned. There is no common API defined either.
Once we have the artifactId/group we can easily get the list of versions (e.g. for updates) by reading the [maven-metadata.xml file at the package level](https://repo1.maven.org/maven2/ant/ant/maven-metadata.xml), although this is not always reliable. The various options that were investigated to get the interesting artefacts are:
* **Scraping** could work but is explicitly forbidden[i](#sdendnote1sym). Pages could easily be parsed, and it would make it possible to identify \*all\* artifacts.
* Using **Maven indexes** is the "official" way to retrieve information from a maven repository and most repositories provide this feature. It would also enable a smart incremental listing. The Maven Indexer data format however is not well documented. Under the hood it relies on an old version (Lucene54) of the Lucene index format, and the only libraries that can access it are written in Java. This implies a dedicated Docker container with a JVM and some specific tools (maven indexer and luke for the lucene index), and thus would bring some complexity to the docker & prod setups.
* A third path could be to **parse all the pom.xml's** that we find and follow all artifactId's recursively, building a graph of dependencies and parent poms. This is more of an incomplete heuristic, and we would miss leaf nodes (i.e. artifacts that are not used by others), but it could help set up a basic list.
* It should be noted also that there are two main implementations of maven repositories: Nexus and Artifactory. By being more specific we could use the respective APIs of these products to get information. But getting the full list of artefacts is still not straightforward, and we'd lose any generic treatment doing so.
The best option in our opinion is to go with the Maven Indexer, for it is the most complete listing available (notably for the biggest repository by far: maven central).
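As a side note on the `maven-metadata.xml` file mentioned above, the list of versions it advertises can be read with a few lines of Python. The sketch below is only an illustration: the function name `list_versions` is hypothetical and the sample document is made up (it follows the usual `metadata/versioning/versions/version` structure, but the version numbers are placeholders, not real data).

```python
import xml.etree.ElementTree as ET
from typing import List

# Made-up metadata document, for illustration only.
SAMPLE_METADATA = """<metadata>
  <groupId>org.example</groupId>
  <artifactId>demo</artifactId>
  <versioning>
    <versions>
      <version>1.0</version>
      <version>1.1</version>
    </versions>
  </versioning>
</metadata>
"""


def list_versions(metadata_xml: str) -> List[str]:
    """Return the versions listed in a maven-metadata.xml document."""
    root = ET.fromstring(metadata_xml)
    return [
        node.text.strip()
        for node in root.findall("./versioning/versions/version")
        if node.text
    ]


print(list_versions(SAMPLE_METADATA))
```

Since these files are not always reliable, such a list is best treated as a hint rather than as the authoritative set of published versions.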
[i](#sdendnote1anc)Maven repository's Terms of Service: [https://repo1.maven.org/terms.html](https://repo1.maven.org/terms.html)
## Maven indexes conversion
[Maven-Indexer](https://maven.apache.org/maven-indexer/) is a (thick) wrapper around lucene. It parses the repository and stores documents, fields and terms in an index. One can extract the lucene index from a maven index using the command: `java -jar indexer-cli-5.1.1.jar --unpack nexus-maven-repository-index.gz --destination test --type full`. Note however that 5.1.1 is an old version of maven indexer; newer versions of the maven indexer won't work on the central indexes.
[Clue](https://maven.apache.org/maven-indexer/) is a CLI tool to read lucene indexes, and version 6.2.0 works with our maven indexes. One can use the following command to export the index to text: `java -jar clue-6.2.0-1.0.0.jar maven/central-lucene-index/ export central_export text`.
The exported text file looks like this:
```
doc 0
  field 0
    name u
    type string
    value com.redhat.rhevm.api|rhevm-api-powershell-jaxrs|1.0-rc1.16|javadoc|jar
  field 1
    name m
    type string
    value 1321264789727
  field 2
    name i
    type string
    value jar|1320743675000|768291|2|2|1|jar
  field 10
    name n
    type string
    value RHEV-M API Powershell Wrapper Implementation JAX-RS
  field 13
    name 1
    type string
    value 454eb6762e5bb14a75a21ae611ce2048dd548550
```
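The export format shown above is line-oriented, so it can be read as a stream without loading the whole file in memory. The following sketch groups lines into one dictionary per Lucene document. It is an illustration under stated assumptions rather than the lister's parser: the function name `iter_docs` is hypothetical, lines are stripped so the indentation does not matter, and the sample text is a shortened two-document export.

```python
import io
from typing import Any, Dict, Iterable, Iterator

# Shortened export, for illustration only.
SAMPLE_EXPORT = """doc 0
  field 0
    name u
    type string
    value al.aldi|sprova4j|0.1.0|sources|jar
  field 2
    name i
    type string
    value jar|1626109619335|14316|2|2|0|jar
doc 1
  field 0
    name u
    type string
    value al.aldi|sprova4j|0.1.0|NA|pom
END
"""


def iter_docs(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield {"doc": <id>, "fields": {<name>: <value>}} for each document."""
    doc = None
    name = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("doc "):
            # A new document starts: emit the previous one, if any.
            if doc is not None:
                yield doc
            doc = {"doc": int(line[4:]), "fields": {}}
            name = None
        elif doc is None:
            continue
        elif line.startswith("name "):
            name = line[5:]
        elif line.startswith("value ") and name is not None:
            doc["fields"][name] = line[6:]
            name = None
    if doc is not None:
        yield doc


for document in iter_docs(io.StringIO(SAMPLE_EXPORT)):
    # The "u" field holds groupId|artifactId|version|classifier|extension.
    gid, aid, version, classifier, ext = document["fields"]["u"].split("|")
    print(document["doc"], gid, aid, version, classifier, ext)
```

Lines that are neither `doc`, `name` nor `value` entries (such as `field`, `type`, `END` or `checksum`) are simply skipped in this sketch.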
Running these two jars requires a Java virtual machine -- Java code cannot be executed from Python without a JVM. Docker is a good way to run both tools and generate the exports independently, rather than adding a JVM to the existing production environment.
We decided (2021-08-25) to install and run a Docker container on a separate server, so that the lister simply has to fetch the export over the network and parse it (the latter part in pure Python).

View file

@ -0,0 +1,12 @@
# Copyright (C) 2021 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information


def register():
    from .lister import MavenLister

    return {
        "lister": MavenLister,
        "task_modules": ["%s.tasks" % __name__],
    }

swh/lister/maven/lister.py Normal file
View file

@ -0,0 +1,341 @@
# Copyright (C) 2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from dataclasses import asdict, dataclass
import logging
import re
from typing import Any, Dict, Iterator, Optional
from urllib.parse import urljoin
import requests
from tenacity.before_sleep import before_sleep_log
from urllib3.util import parse_url
import xmltodict
from swh.lister.utils import throttling_retry
from swh.scheduler.interface import SchedulerInterface
from swh.scheduler.model import ListedOrigin
from .. import USER_AGENT
from ..pattern import CredentialsType, Lister
logger = logging.getLogger(__name__)
RepoPage = Dict[str, Any]


@dataclass
class MavenListerState:
    """State of the MavenLister"""

    last_seen_doc: int = -1
    """Last doc ID ingested during an incremental pass
    """

    last_seen_pom: int = -1
    """Last doc ID related to a pom and ingested during
    an incremental pass
    """


class MavenLister(Lister[MavenListerState, RepoPage]):
    """List origins from a Maven repository.

    Maven Central provides artifacts for Java builds.
    It includes POM files and source archives, which we download to get
    the source code of artifacts and links to their scm repository.

    This lister yields origins of types: git/svn/hg or whatever the Artifacts
    use as repository type, plus maven types for the maven loader (tgz, jar)."""

    LISTER_NAME = "maven"

    def __init__(
        self,
        scheduler: SchedulerInterface,
        url: str,
        index_url: Optional[str] = None,
        instance: Optional[str] = None,
        credentials: CredentialsType = None,
        incremental: bool = True,
    ):
        """Lister class for Maven repositories.

        Args:
            url: main URL of the Maven repository, i.e. url of the base index
                used to fetch maven artifacts. For Maven central use
                https://repo1.maven.org/maven2/
            index_url: the URL to download the exported text indexes from.
                Would typically be a local host running the export docker image.
                See README.md in this directory for more information.
            instance: Name of maven instance. Defaults to url's network location
                if unset.
            incremental: bool, defaults to True. Defines if incremental listing
                is activated or not.

        """
        self.BASE_URL = url
        self.INDEX_URL = index_url
        self.incremental = incremental

        if instance is None:
            instance = parse_url(url).host

        super().__init__(
            scheduler=scheduler, credentials=credentials, url=url, instance=instance,
        )

        self.session = requests.Session()
        self.session.headers.update(
            {"Accept": "application/json", "User-Agent": USER_AGENT,}
        )

    def state_from_dict(self, d: Dict[str, Any]) -> MavenListerState:
        return MavenListerState(**d)

    def state_to_dict(self, state: MavenListerState) -> Dict[str, Any]:
        return asdict(state)

    @throttling_retry(before_sleep=before_sleep_log(logger, logging.WARNING))
    def page_request(self, url: str, params: Dict[str, Any]) -> requests.Response:

        logger.info("Fetching URL %s with params %s", url, params)

        response = self.session.get(url, params=params)
        if response.status_code != 200:
            logger.warning(
                "Unexpected HTTP status code %s on %s: %s",
                response.status_code,
                response.url,
                response.content,
            )
        response.raise_for_status()

        return response

    def get_pages(self) -> Iterator[RepoPage]:
        """ Retrieve and parse exported maven indexes to
        identify all pom files and src archives.
        """

        # Example of returned RepoPage's:
        # [
        #   {
        #     "type": "maven",
        #     "url": "https://maven.xwiki.org/..-5.4.2-sources.jar",
        #     "time": 1626109619335,
        #     "gid": "org.xwiki.platform",
        #     "aid": "xwiki-platform-wikistream-events-xwiki",
        #     "version": "5.4.2"
        #   },
        #   {
        #     "type": "scm",
        #     "url": "scm:git:git://github.com/openengsb/openengsb-framework.git",
        #     "project": "openengsb-framework",
        #   },
        #   ...
        # ]

        # Download the main text index file.
        logger.info(f"Downloading text index from {self.INDEX_URL}.")
        assert self.INDEX_URL is not None
        response = requests.get(self.INDEX_URL, stream=True)
        response.raise_for_status()

        # Prepare regexes to parse index exports.

        # Parse doc id.
        # Example line: "doc 13"
        re_doc = re.compile(r"^doc (?P<doc>\d+)$")

        # Parse gid, aid, version, classifier, extension.
        # Example line: "    value al.aldi|sprova4j|0.1.0|sources|jar"
        re_val = re.compile(
            r"^\s{4}value (?P<gid>[^|]+)\|(?P<aid>[^|]+)\|(?P<version>[^|]+)\|"
            + r"(?P<classifier>[^|]+)\|(?P<ext>[^|]+)$"
        )

        # Parse last modification time.
        # Example line: "    value jar|1626109619335|14316|2|2|0|jar"
        re_time = re.compile(
            r"^\s{4}value ([^|]+)\|(?P<mtime>[^|]+)\|([^|]+)\|([^|]+)\|([^|]+)"
            + r"\|([^|]+)\|([^|]+)$"
        )

        # Read file line by line and process it
        out_pom: Dict = {}
        jar_src: Dict = {}
        doc_id: int = 0
        jar_src["doc"] = None
        url_src = None

        iterator = response.iter_lines(chunk_size=1024)
        for line_bytes in iterator:
            # Read the index text export and get URLs and SCMs.
            line = line_bytes.decode(errors="ignore")
            m_doc = re_doc.match(line)
            if m_doc is not None:
                doc_id = int(m_doc.group("doc"))
                if (
                    self.incremental
                    and self.state
                    and self.state.last_seen_doc
                    and self.state.last_seen_doc >= doc_id
                ):
                    # jar_src["doc"] contains the id of the current document, whatever
                    # its type (scm or jar).
                    jar_src["doc"] = None
                else:
                    jar_src["doc"] = doc_id
            else:
                # If incremental mode, we don't record any line that is
                # before our last recorded doc id.
                if self.incremental and jar_src["doc"] is None:
                    continue
                m_val = re_val.match(line)
                if m_val is not None:
                    (gid, aid, version, classifier, ext) = m_val.groups()
                    ext = ext.strip()
                    path = "/".join(gid.split("."))
                    if classifier == "NA" and ext.lower() == "pom":
                        # If incremental mode, we don't record any line that is
                        # before our last recorded doc id.
                        if (
                            self.incremental
                            and self.state
                            and self.state.last_seen_pom
                            and self.state.last_seen_pom >= doc_id
                        ):
                            continue
                        url_path = f"{path}/{aid}/{version}/{aid}-{version}.{ext}"
                        url_pom = urljoin(self.BASE_URL, url_path,)
                        out_pom[url_pom] = doc_id
                    elif (
                        classifier.lower() == "sources" or ("src" in classifier)
                    ) and ext.lower() in ("zip", "jar"):
                        url_path = (
                            f"{path}/{aid}/{version}/{aid}-{version}-{classifier}.{ext}"
                        )
                        url_src = urljoin(self.BASE_URL, url_path)
                        jar_src["gid"] = gid
                        jar_src["aid"] = aid
                        jar_src["version"] = version
                else:
                    m_time = re_time.match(line)
                    if m_time is not None and url_src is not None:
                        time = m_time.group("mtime")
                        jar_src["time"] = int(time)
                        logger.debug(f"* Yielding jar {url_src}.")
                        yield {
                            "type": "maven",
                            "url": url_src,
                            **jar_src,
                        }
                        url_src = None

        logger.info(f"Found {len(out_pom)} poms.")

        # Now fetch pom files and scan them for scm info.

        logger.info("Fetching poms..")
        for pom in out_pom:
            text = self.page_request(pom, {})
            try:
                project = xmltodict.parse(text.content.decode())
                if "scm" in project["project"]:
                    if "connection" in project["project"]["scm"]:
                        scm = project["project"]["scm"]["connection"]
                        gid = project["project"]["groupId"]
                        aid = project["project"]["artifactId"]
                        yield {
                            "type": "scm",
                            "doc": out_pom[pom],
                            "url": scm,
                            "project": f"{gid}.{aid}",
                        }
                    else:
                        logger.debug(f"No scm.connection in pom {pom}")
                else:
                    logger.debug(f"No scm in pom {pom}")
            except xmltodict.expat.ExpatError as error:
                logger.info(f"Could not parse POM {pom} XML: {error}. Next.")

    def get_origins_from_page(self, page: RepoPage) -> Iterator[ListedOrigin]:
        """Convert a page of Maven repositories into a list of ListedOrigins.
        """
        assert self.lister_obj.id is not None
        if page["type"] == "scm":
            # If origin is a scm url: detect scm type and yield.
            # Note that the official format is:
            # scm:git:git://github.com/openengsb/openengsb-framework.git
            # but many, many projects directly put the repo url, so we have to
            # detect the content to match it properly.
            m_scm = re.match(r"^scm:(?P<type>[^:]+):(?P<url>.*)$", page["url"])
            if m_scm is not None:
                scm_type = m_scm.group("type")
                scm_url = m_scm.group("url")
                origin = ListedOrigin(
                    lister_id=self.lister_obj.id, url=scm_url, visit_type=scm_type,
                )
                yield origin
            else:
                if page["url"].endswith(".git"):
                    origin = ListedOrigin(
                        lister_id=self.lister_obj.id, url=page["url"], visit_type="git",
                    )
                    yield origin
        else:
            # Origin is a source archive:
            origin = ListedOrigin(
                lister_id=self.lister_obj.id,
                url=page["url"],
                visit_type=page["type"],
                extra_loader_arguments={
                    "artifacts": [
                        {
                            "time": page["time"],
                            "gid": page["gid"],
                            "aid": page["aid"],
                            "version": page["version"],
                        }
                    ]
                },
            )
            yield origin

    def commit_page(self, page: RepoPage) -> None:
        """Update currently stored state using the latest listed doc.

        Note: this is a noop for full listing mode
        """
        if self.incremental and self.state:
            # We need to differentiate the two state counters according
            # to the type of origin.
            if page["type"] == "maven" and page["doc"] > self.state.last_seen_doc:
                self.state.last_seen_doc = page["doc"]
            elif page["type"] == "scm" and page["doc"] > self.state.last_seen_pom:
                self.state.last_seen_doc = page["doc"]
                self.state.last_seen_pom = page["doc"]

    def finalize(self) -> None:
        """Finalize the lister state, set update if any progress has been made.

        Note: this is a noop for full listing mode
        """
        if self.incremental and self.state:
            last_seen_doc = self.state.last_seen_doc
            last_seen_pom = self.state.last_seen_pom

            scheduler_state = self.get_state_from_scheduler()
            if last_seen_doc and last_seen_pom:
                if (scheduler_state.last_seen_doc < last_seen_doc) or (
                    scheduler_state.last_seen_pom < last_seen_pom
                ):
                    self.updated = True

swh/lister/maven/tasks.py Normal file
View file

@ -0,0 +1,28 @@
# Copyright (C) 2021 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

from typing import Dict

from celery import shared_task

from .lister import MavenLister


@shared_task(name=__name__ + ".FullMavenLister")
def list_maven_full(**lister_args) -> Dict[str, int]:
    """Full update of a Maven repository instance"""
    lister = MavenLister.from_configfile(incremental=False, **lister_args)
    return lister.run().dict()


@shared_task(name=__name__ + ".IncrementalMavenLister")
def list_maven_incremental(**lister_args) -> Dict[str, int]:
    """Incremental update of a Maven repository instance"""
    lister = MavenLister.from_configfile(incremental=True, **lister_args)
    return lister.run().dict()


@shared_task(name=__name__ + ".ping")
def _ping() -> str:
    return "OK"

View file

View file

@ -0,0 +1,113 @@
doc 0
field 0
name u
type string
value al.aldi|sprova4j|0.1.0|sources|jar
field 1
name m
type string
value 1626111735737
field 2
name i
type string
value jar|1626109619335|14316|2|2|0|jar
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 1
field 0
name u
type string
value al.aldi|sprova4j|0.1.0|NA|pom
field 1
name m
type string
value 1626111735764
field 2
name i
type string
value jar|1626109636636|-1|1|0|0|pom
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 2
field 0
name u
type string
value al.aldi|sprova4j|0.1.1|sources|jar
field 1
name m
type string
value 1626111784883
field 2
name i
type string
value jar|1626111425534|14510|2|2|0|jar
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 3
field 0
name u
type string
value al.aldi|sprova4j|0.1.1|NA|pom
field 1
name m
type string
value 1626111784915
field 2
name i
type string
value jar|1626111437014|-1|1|0|0|pom
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 4
field 14
name DESCRIPTOR
type string
value NexusIndex
field 15
name IDXINFO
type string
value 1.0|index
doc 5
field 16
name allGroups
type string
value allGroups
field 17
name allGroupsList
type string
value al.aldi
doc 6
field 18
name rootGroups
type string
value rootGroups
field 19
name rootGroupsList
type string
value al
END
checksum 00000000003321211082

View file

@ -0,0 +1,134 @@
doc 0
field 0
name u
type string
value al.aldi|sprova4j|0.1.0|sources|jar
field 1
name m
type string
value 1633786348254
field 2
name i
type string
value jar|1626109619335|14316|2|2|0|jar
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 1
field 0
name u
type string
value al.aldi|sprova4j|0.1.0|NA|pom
field 1
name m
type string
value 1633786348271
field 2
name i
type string
value jar|1626109636636|-1|1|0|0|pom
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 2
field 0
name u
type string
value al.aldi|sprova4j|0.1.1|sources|jar
field 1
name m
type string
value 1633786370818
field 2
name i
type string
value jar|1626111425534|14510|2|2|0|jar
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 3
field 0
name u
type string
value al.aldi|sprova4j|0.1.1|NA|pom
field 1
name m
type string
value 1633786370857
field 2
name i
type string
value jar|1626111437014|-1|1|0|0|pom
field 10
name n
type string
value sprova4j
field 11
name d
type string
value Java client for Sprova Test Management
doc 4
field 0
name u
type string
value com.arangodb|arangodb-graphql|1.2|NA|pom
field 1
name m
type string
value 1634498235946
field 2
name i
type string
value jar|1624265143830|-1|0|0|0|pom
field 10
name n
type string
value arangodb-graphql
field 11
name d
type string
value ArangoDB Graphql
doc 5
field 14
name DESCRIPTOR
type string
value NexusIndex
field 15
name IDXINFO
type string
value 1.0|index_1
doc 6
field 16
name allGroups
type string
value allGroups
field 17
name allGroupsList
type string
value com.arangodb|al.aldi
doc 7
field 18
name rootGroups
type string
value rootGroups
field 19
name rootGroupsList
type string
value com|al
END
checksum 00000000004102281591

View file

@ -0,0 +1,208 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ DISCLAIMER
~ Copyright 2019 ArangoDB GmbH, Cologne, Germany
~
~ Licensed under the Apache License, Version 2.0 (the "License");
~ you may not use this file except in compliance with the License.
~ You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
~
~ Copyright holder is ArangoDB GmbH, Cologne, Germany
~
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.arangodb</groupId>
<artifactId>arangodb-graphql</artifactId>
<version>1.2</version>
<name>arangodb-graphql</name>
<description>ArangoDB Graphql</description>
<url>https://github.com/ArangoDB-Community/arangodb-graphql-java</url>
<licenses>
<license>
<name>Apache License 2.0</name>
<url>http://www.apache.org/licenses/LICENSE-2.0</url>
<distribution>repo</distribution>
</license>
</licenses>
<developers>
<developer>
<name>Colin Findlay</name>
</developer>
<developer>
<name>Michele Rastelli</name>
<url>https://github.com/rashtao</url>
</developer>
</developers>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<java.version>1.8</java.version>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<build>
<plugins>
<plugin>
<groupId>org.sonatype.plugins</groupId>
<artifactId>nexus-staging-maven-plugin</artifactId>
<version>1.6.8</version>
<extensions>true</extensions>
<configuration>
<serverId>ossrh</serverId>
<nexusUrl>https://oss.sonatype.org/</nexusUrl>
<stagingProfileId>84aff6e87e214c</stagingProfileId>
<autoReleaseAfterClose>false</autoReleaseAfterClose>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<version>3.1.0</version>
<configuration>
<encoding>UTF-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-javadoc-plugin</artifactId>
<version>3.1.1</version>
<executions>
<execution>
<id>attach-javadocs</id>
<goals>
<goal>jar</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-deploy-plugin</artifactId>
<version>2.8.2</version>
<configuration>
<uniqueVersion>false</uniqueVersion>
<retryFailedDeploymentCount>10</retryFailedDeploymentCount>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-gpg-plugin</artifactId>
<version>1.6</version>
<executions>
<execution>
<id>sign-artifacts</id>
<phase>verify</phase>
<goals>
<goal>sign</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<executions>
<execution>
<id>assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
<configuration>
<finalName>
${project.artifactId}-${project.version}-standalone
</finalName>
<attach>false</attach>
<appendAssemblyId>false</appendAssemblyId>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>com.graphql-java</groupId>
<artifactId>graphql-java</artifactId>
<version>11.0</version>
</dependency>
<dependency>
<groupId>com.arangodb</groupId>
<artifactId>arangodb-java-driver</artifactId>
<version>6.5.0</version>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mockito</groupId>
<artifactId>mockito-core</artifactId>
<version>2.15.0</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.hamcrest</groupId>
<artifactId>hamcrest-library</artifactId>
<version>1.3</version>
<scope>test</scope>
</dependency>
</dependencies>
<distributionManagement>
<snapshotRepository>
<id>ossrh</id>
<url>https://oss.sonatype.org/content/repositories/snapshots</url>
</snapshotRepository>
<repository>
<id>ossrh</id>
<url>https://oss.sonatype.org/service/local/staging/deploy/maven2/</url>
</repository>
</distributionManagement>
<scm>
<url>https://github.com/ArangoDB-Community/arangodb-graphql-java</url>
<connection>scm:git:git://github.com/ArangoDB-Community/arangodb-graphql-java.git</connection>
<developerConnection>scm:git:git://github.com/ArangoDB-Community/arangodb-graphql-java.git</developerConnection>
</scm>
<organization>
<name>ArangoDB GmbH</name>
<url>https://www.arangodb.com</url>
</organization>
</project>

View file

@ -0,0 +1,86 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<modelVersion>4.0.0</modelVersion>
<groupId>al.aldi</groupId>
<artifactId>sprova4j</artifactId>
<version>0.1.0</version>
<name>sprova4j</name>
<description>Java client for Sprova Test Management</description>
<url>https://github.com/aldialimucaj/sprova4j</url>
<inceptionYear>2018</inceptionYear>
<licenses>
<license>
<name>The Apache Software License, Version 2.0</name>
<url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
<distribution>repo</distribution>
</license>
</licenses>
<developers>
<developer>
<id>aldi</id>
<name>Aldi Alimucaj</name>
<email>aldi.alimucaj@gmail.com</email>
</developer>
</developers>
<scm>
<connection>scm:git:git://github.com/aldialimucaj/sprova4j.git</connection>
<developerConnection>scm:git:git://github.com/aldialimucaj/sprova4j.git</developerConnection>
<url>https://github.com/aldialimucaj/sprova4j</url>
</scm>
<dependencies>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.2.3</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.3</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>3.10.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.squareup.okio</groupId>
<artifactId>okio</artifactId>
<version>1.0.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.glassfish</groupId>
<artifactId>javax.json</artifactId>
<version>1.1.2</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>javax.json</groupId>
<artifactId>javax.json-api</artifactId>
<version>1.1.2</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>javax.validation</groupId>
<artifactId>validation-api</artifactId>
<version>2.0.1.Final</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>mockwebserver</artifactId>
<version>3.10.0</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>


@ -0,0 +1,86 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd" xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<modelVersion>4.0.0</modelVersion>
<groupId>al.aldi</groupId>
<artifactId>sprova4j</artifactId>
<version>0.1.1</version>
<name>sprova4j</name>
<description>Java client for Sprova Test Management</description>
<url>https://github.com/aldialimucaj/sprova4j</url>
<inceptionYear>2018</inceptionYear>
<licenses>
<license>
<name>The Apache Software License, Version 2.0</name>
<url>http://www.apache.org/licenses/LICENSE-2.0.txt</url>
<distribution>repo</distribution>
</license>
</licenses>
<developers>
<developer>
<id>aldi</id>
<name>Aldi Alimucaj</name>
<email>aldi.alimucaj@gmail.com</email>
</developer>
</developers>
<scm>
<connection>https://github.com/aldialimucaj/sprova4j.git</connection>
<developerConnection>https://github.com/aldialimucaj/sprova4j.git</developerConnection>
<url>https://github.com/aldialimucaj/sprova4j</url>
</scm>
<dependencies>
<dependency>
<groupId>ch.qos.logback</groupId>
<artifactId>logback-classic</artifactId>
<version>1.2.3</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.8.5</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>3.10.0</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>com.squareup.okio</groupId>
<artifactId>okio</artifactId>
<version>1.14.1</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.glassfish</groupId>
<artifactId>javax.json</artifactId>
<version>1.1.2</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>javax.json</groupId>
<artifactId>javax.json-api</artifactId>
<version>1.1.2</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>javax.validation</groupId>
<artifactId>validation-api</artifactId>
<version>2.0.1.Final</version>
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.12</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>mockwebserver</artifactId>
<version>3.10.0</version>
<scope>test</scope>
</dependency>
</dependencies>
</project>
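
The two sprova4j POMs above differ mainly in their `<scm><connection>` value: 0.1.0 uses the `scm:git:git://...` form, while 0.1.1 gives a plain `https://...` URL. The tests that follow expect both to be listed as git origins (`git://github.com/aldialimucaj/sprova4j.git` and `https://github.com/aldialimucaj/sprova4j.git`). The sketch below illustrates one way such a value could be normalised. It is not the lister's code: `scm_url_from_pom` is a hypothetical helper, it uses the standard library's `xml.etree.ElementTree` rather than the `xmltodict` dependency added by this commit, and the prefix-stripping rule is an assumption inferred from the fixtures.

```python
import re
import xml.etree.ElementTree as ET
from typing import Optional

# Namespace declared by the POM files above.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def scm_url_from_pom(pom_text: str) -> Optional[str]:
    """Hypothetical helper: return the repository URL from <scm><connection>."""
    root = ET.fromstring(pom_text)
    node = root.find("m:scm/m:connection", POM_NS)
    if node is None or not node.text:
        return None
    connection = node.text.strip()
    # Assumption: a leading "scm:<provider>:" prefix is dropped, anything
    # else is returned unchanged.
    match = re.match(r"^scm:[^:]+:(.+)$", connection)
    return match.group(1) if match else connection
```

Under these assumptions, both connection styles used in the fixtures reduce to URLs that can be compared directly with the entries of `LIST_GIT`.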


@ -0,0 +1,252 @@
# Copyright (C) 2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

from pathlib import Path

import pytest
import requests

from swh.lister.maven.lister import MavenLister

MVN_URL = "https://repo1.maven.org/maven2/"  # main maven repo url
INDEX_URL = "http://indexes/export.fld"  # index directory url

URL_POM_1 = MVN_URL + "al/aldi/sprova4j/0.1.0/sprova4j-0.1.0.pom"
URL_POM_2 = MVN_URL + "al/aldi/sprova4j/0.1.1/sprova4j-0.1.1.pom"
URL_POM_3 = MVN_URL + "com/arangodb/arangodb-graphql/1.2/arangodb-graphql-1.2.pom"

LIST_GIT = (
    "git://github.com/aldialimucaj/sprova4j.git",
    "https://github.com/aldialimucaj/sprova4j.git",
)

LIST_GIT_INCR = ("git://github.com/ArangoDB-Community/arangodb-graphql-java.git",)

LIST_SRC = (
    MVN_URL + "al/aldi/sprova4j/0.1.0/sprova4j-0.1.0-sources.jar",
    MVN_URL + "al/aldi/sprova4j/0.1.1/sprova4j-0.1.1-sources.jar",
)

LIST_SRC_DATA = (
    {
        "type": "maven",
        "url": "https://repo1.maven.org/maven2/al/aldi/sprova4j"
        + "/0.1.0/sprova4j-0.1.0-sources.jar",
        "time": 1626109619335,
        "gid": "al.aldi",
        "aid": "sprova4j",
        "version": "0.1.0",
    },
    {
        "type": "maven",
        "url": "https://repo1.maven.org/maven2/al/aldi/sprova4j"
        + "/0.1.1/sprova4j-0.1.1-sources.jar",
        "time": 1626111425534,
        "gid": "al.aldi",
        "aid": "sprova4j",
        "version": "0.1.1",
    },
)


@pytest.fixture
def maven_index(datadir) -> str:
    text = Path(datadir, "http_indexes", "export.fld").read_text()
    return text


@pytest.fixture
def maven_index_incr(datadir) -> str:
    text = Path(datadir, "http_indexes", "export_incr.fld").read_text()
    return text


@pytest.fixture
def maven_pom_1(datadir) -> str:
    text = Path(datadir, "https_maven.org", "sprova4j-0.1.0.pom").read_text()
    return text


@pytest.fixture
def maven_pom_2(datadir) -> str:
    text = Path(datadir, "https_maven.org", "sprova4j-0.1.1.pom").read_text()
    return text


@pytest.fixture
def maven_pom_3(datadir) -> str:
    text = Path(datadir, "https_maven.org", "arangodb-graphql-1.2.pom").read_text()
    return text


def test_maven_full_listing(
    swh_scheduler, requests_mock, mocker, maven_index, maven_pom_1, maven_pom_2,
):
    """Covers full listing of multiple pages, checking page results and listed
    origins, statelessness."""

    lister = MavenLister(
        scheduler=swh_scheduler,
        url=MVN_URL,
        instance="maven.org",
        index_url=INDEX_URL,
        incremental=False,
    )

    # Set up test.
    index_text = maven_index
    requests_mock.get(INDEX_URL, text=index_text)
    requests_mock.get(URL_POM_1, text=maven_pom_1)
    requests_mock.get(URL_POM_2, text=maven_pom_2)

    # Then run the lister.
    stats = lister.run()

    # Start test checks.
    assert stats.pages == 4
    assert stats.origins == 4

    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results

    origin_urls = [origin.url for origin in scheduler_origins]
    assert sorted(origin_urls) == sorted(LIST_GIT + LIST_SRC)

    for origin in scheduler_origins:
        if origin.visit_type == "maven":
            for src in LIST_SRC_DATA:
                if src.get("url") == origin.url:
                    artifact = origin.extra_loader_arguments["artifacts"][0]
                    assert src.get("time") == artifact["time"]
                    assert src.get("gid") == artifact["gid"]
                    assert src.get("aid") == artifact["aid"]
                    assert src.get("version") == artifact["version"]
                    break
            else:
                raise AssertionError

    scheduler_state = lister.get_state_from_scheduler()
    assert scheduler_state is not None
    assert scheduler_state.last_seen_doc == -1
    assert scheduler_state.last_seen_pom == -1


def test_maven_incremental_listing(
    swh_scheduler,
    requests_mock,
    mocker,
    maven_index,
    maven_index_incr,
    maven_pom_1,
    maven_pom_2,
    maven_pom_3,
):
    """Covers full listing of multiple pages, checking page results and listed
    origins, with a second updated run for statefulness."""

    lister = MavenLister(
        scheduler=swh_scheduler,
        url=MVN_URL,
        instance="maven.org",
        index_url=INDEX_URL,
        incremental=True,
    )

    # Set up test.
    requests_mock.get(INDEX_URL, text=maven_index)
    requests_mock.get(URL_POM_1, text=maven_pom_1)
    requests_mock.get(URL_POM_2, text=maven_pom_2)

    # Then run the lister.
    stats = lister.run()

    # Start test checks.
    assert lister.incremental
    assert lister.updated
    assert stats.pages == 4
    assert stats.origins == 4

    # Second execution of the lister, incremental mode
    lister = MavenLister(
        scheduler=swh_scheduler,
        url=MVN_URL,
        instance="maven.org",
        index_url=INDEX_URL,
        incremental=True,
    )

    scheduler_state = lister.get_state_from_scheduler()
    assert scheduler_state is not None
    assert scheduler_state.last_seen_doc == 3
    assert scheduler_state.last_seen_pom == 3

    # Set up test.
    requests_mock.get(INDEX_URL, text=maven_index_incr)
    requests_mock.get(URL_POM_3, text=maven_pom_3)

    # Then run the lister.
    stats = lister.run()

    # Start test checks.
    assert lister.incremental
    assert lister.updated
    assert stats.pages == 1
    assert stats.origins == 1

    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results

    origin_urls = [origin.url for origin in scheduler_origins]
    assert sorted(origin_urls) == sorted(LIST_SRC + LIST_GIT + LIST_GIT_INCR)

    for origin in scheduler_origins:
        if origin.visit_type == "maven":
            for src in LIST_SRC_DATA:
                if src.get("url") == origin.url:
                    artifact = origin.extra_loader_arguments["artifacts"][0]
                    assert src.get("time") == artifact["time"]
                    assert src.get("gid") == artifact["gid"]
                    assert src.get("aid") == artifact["aid"]
                    assert src.get("version") == artifact["version"]
                    break
            else:
                raise AssertionError

    scheduler_state = lister.get_state_from_scheduler()
    assert scheduler_state is not None
    assert scheduler_state.last_seen_doc == 4
    assert scheduler_state.last_seen_pom == 4


@pytest.mark.parametrize("http_code", [400, 404, 500, 502])
def test_maven_list_http_error(
    swh_scheduler, requests_mock, mocker, maven_index, http_code
):
    """Test handling of some common HTTP errors:
    - 400: Bad request.
    - 404: Resource not found.
    - 500: Internal server error.
    - 502: Bad gateway or proxy error.
    """

    lister = MavenLister(scheduler=swh_scheduler, url=MVN_URL, index_url=INDEX_URL)

    # Test failure of index retrieval.
    requests_mock.get(INDEX_URL, status_code=http_code)

    with pytest.raises(requests.HTTPError):
        lister.run()

    # Test failure of artefacts retrieval.
    requests_mock.get(INDEX_URL, text=maven_index)
    requests_mock.get(URL_POM_1, status_code=http_code)

    with pytest.raises(requests.HTTPError):
        lister.run()

    # If the maven_index step succeeded but not the get_pom step,
    # then we get only the 2 maven-jar origins (and not the 2 additional
    # scm origins found in the pom files).
    scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
    assert len(scheduler_origins) == 2


@ -0,0 +1,45 @@
# Copyright (C) 2021 The Software Heritage developers
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information

import pytest

from swh.lister.pattern import ListerStats


def test_ping(swh_scheduler_celery_app, swh_scheduler_celery_worker):
    res = swh_scheduler_celery_app.send_task("swh.lister.maven.tasks.ping")
    assert res
    res.wait()
    assert res.successful()
    assert res.result == "OK"


@pytest.mark.parametrize(
    "task_name,incremental",
    [("IncrementalMavenLister", True), ("FullMavenLister", False)],
)
def test_task_lister_maven(
    task_name,
    incremental,
    swh_scheduler_celery_app,
    swh_scheduler_celery_worker,
    mocker,
):
    lister = mocker.patch("swh.lister.maven.tasks.MavenLister")
    lister.from_configfile.return_value = lister
    lister.run.return_value = ListerStats(pages=10, origins=500)

    kwargs = dict(
        url="https://repo1.maven.org/maven2/", index_url="http://indexes/export.fld"
    )
    res = swh_scheduler_celery_app.send_task(
        f"swh.lister.maven.tasks.{task_name}", kwargs=kwargs,
    )
    assert res
    res.wait()
    assert res.successful()

    lister.from_configfile.assert_called_once_with(incremental=incremental, **kwargs)
    lister.run.assert_called_once_with()


@ -18,6 +18,10 @@ lister_args = {
    "tuleap": {"url": "https://tuleap.net",},
    "gitlab": {"url": "https://gitlab.ow2.org/api/v4", "instance": "ow2",},
    "opam": {"url": "https://opam.ocaml.org", "instance": "opam"},
    "maven": {
        "url": "https://repo1.maven.org/maven2/",
        "index_url": "http://indexes/export.fld",
    },
}