Crates.io: Add last_update for each version of a crate

In order to reduce http api call amount made by the loader, download a
crates.io database dump, and parse its csv files to get a last_update
value for each versions of a Crate.
Those values are sent to the loader through extra_loader_arguments
'crates_metadata'.

'artifacts' and 'crates_metadata' now uses "version" as key.

Related T4104, D8171
This commit is contained in:
Franck Bret 2022-09-12 21:33:07 +02:00
parent 2e6e282d44
commit 4a09f660b3
8 changed files with 327 additions and 366 deletions

View file

@ -2,7 +2,6 @@
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
"""
Crates lister
=============
@ -20,20 +19,24 @@ versions.
Origins retrieving strategy
---------------------------
A json http api to list packages from crates.io but we choose a `different strategy`_
in order to reduce to its bare minimum the amount of http call and bandwidth.
We clone a git repository which contains a tree of directories whose last child folder
name corresponds to the package name and contains a Cargo.toml file with some json data
to describe all existing versions of the package.
It takes a few seconds to clone the repository and browse it to build a full index of
existing package and related versions.
The lister is incremental, so the first time it clones and browses the repository as
previously described then stores the last seen commit id.
Next time, it retrieves the list of new and changed files since last commit id and
returns new or changed package with all of their related versions.
A json http api to list packages from crates.io exists but we choose a
`different strategy`_ in order to reduce to its bare minimum the amount
of http call and bandwidth.
Note that all Git related operations are done with `Dulwich`_, a Python
implementation of the Git file formats and protocols.
We download a `db-dump.tar.gz`_ archives which contains csv files as an export of
the crates.io database. Crates.csv list package names, versions.csv list versions
related to package names.
It takes a few seconds to download the archive and parse csv files to build a
full index of existing package and related versions.
The archive also contains a metadata.json file with a timestamp corresponding to
the date the database dump started. The database dump is automatically generated
every 24 hours, around 02:00:00 UTC.
The lister is incremental, so the first time it downloads the db-dump.tar.gz archive as
previously described and store the last seen database dump timestamp.
Next time, it downloads the db-dump.tar.gz but retrieves only the list of new and
changed packages since last seen timestamp with all of their related versions.
Page listing
------------
@ -48,56 +51,45 @@ The data schema for each line is:
* **crate_file**: Package download url
* **checksum**: Package download checksum
* **yanked**: Whether the package is yanked or not
* **last_update**: Iso8601 last update date computed upon git commit date of the
related Cargo.toml file
* **last_update**: Iso8601 last update
Origins from page
-----------------
The lister yields one origin per page.
The origin url corresponds to the http api url for a package, for example
"https://crates.io/api/v1/crates/{package}".
"https://crates.io/crates/{crate}".
Additionally we add some data set to "extra_loader_arguments":
Additionally we add some data for each version, set to "extra_loader_arguments":
* **artifacts**: Represent data about the Crates to download, following
:ref:`original-artifacts-json specification <extrinsic-metadata-original-artifacts-json>`
* **crates_metadata**: To store all other interesting attributes that do not belongs
to artifacts. For now it mainly indicate when a version is `yanked`_.
to artifacts. For now it mainly indicate when a version is `yanked`_, and the version
last_update timestamp.
Origin data example::
{
"url": "https://crates.io/api/v1/crates/rand",
"url": "https://crates.io/api/v1/crates/regex-syntax",
"artifacts": [
{
"version": "0.1.0",
"checksums": {
"sha256": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d", # noqa: B950
"sha256": "398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944", # noqa: B950
},
"filename": "rand-0.1.1.crate",
"url": "https://static.crates.io/crates/rand/rand-0.1.1.crate",
"version": "0.1.1",
},
{
"checksums": {
"sha256": "6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7", # noqa: B950
},
"filename": "rand-0.1.2.crate",
"url": "https://static.crates.io/crates/rand/rand-0.1.2.crate",
"version": "0.1.2",
"filename": "regex-syntax-0.1.0.crate",
"url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate", # noqa: B950
},
],
"crates_metadata": [
{
"version": "0.1.1",
"yanked": False,
},
{
"version": "0.1.2",
"version": "0.1.0",
"last_update": "2017-11-30 03:37:17.449539",
"yanked": False,
},
],
}
},
Running tests
-------------
@ -128,8 +120,8 @@ You can follow lister execution by displaying logs of swh-lister service::
.. _Cargo: https://doc.rust-lang.org/cargo/guide/why-cargo-exists.html#enter-cargo
.. _Cargo.toml: https://doc.rust-lang.org/cargo/reference/manifest.html
.. _different strategy: https://crates.io/data-access
.. _Dulwich: https://www.dulwich.io/
.. _yanked: https://doc.rust-lang.org/cargo/reference/publishing.html#cargo-yank
.. _db-dump.tar.gz: https://static.crates.io/db-dump.tar.gz
"""

View file

@ -2,19 +2,20 @@
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from dataclasses import asdict, dataclass
import datetime
import io
import csv
from dataclasses import dataclass
from datetime import datetime
import json
import logging
from pathlib import Path
import shutil
import tarfile
import tempfile
from typing import Any, Dict, Iterator, List, Optional
from urllib.parse import urlparse
from dulwich import porcelain
from dulwich.patch import write_tree_diff
from dulwich.repo import Repo
import iso8601
from packaging.version import parse as parse_version
from swh.scheduler.interface import SchedulerInterface
from swh.scheduler.model import ListedOrigin
@ -30,36 +31,36 @@ CratesListerPage = List[Dict[str, Any]]
@dataclass
class CratesListerState:
"""Store lister state for incremental mode operations.
'last_commit' represents a git commit hash
'index_last_update' represents the UTC time the crates.io database dump was
started
"""
last_commit: str = ""
index_last_update: Optional[datetime] = None
class CratesLister(Lister[CratesListerState, CratesListerPage]):
"""List origins from the "crates.io" forge.
It basically fetches https://github.com/rust-lang/crates.io-index.git to a
temp directory and then walks through each file to get the crate's info on
the first run.
It downloads a tar.gz archive which contains crates.io database table content as
csv files which is automatically generated every 24 hours.
Parsing two csv files we can list all Crates.io package names and their related
versions.
In incremental mode, it relies on the same Git repository but instead of reading
each file of the repo, it get the differences through ``git log last_commit..HEAD``.
Resulting output string is parsed to build page entries.
In incremental mode, it check each entry comparing their 'last_update' value
with self.state.index_last_update
"""
# Part of the lister API, that identifies this lister
LISTER_NAME = "crates"
# (Optional) CVS type of the origins listed by this lister, if constant
VISIT_TYPE = "crates"
INSTANCE = "crates"
INDEX_REPOSITORY_URL = "https://github.com/rust-lang/crates.io-index.git"
DESTINATION_PATH = Path("/tmp/crates.io-index")
BASE_URL = "https://crates.io"
DB_DUMP_URL = "https://static.crates.io/db-dump.tar.gz"
CRATE_FILE_URL_PATTERN = (
"https://static.crates.io/crates/{crate}/{crate}-{version}.crate"
)
CRATE_API_URL_PATTERN = "https://crates.io/api/v1/crates/{crate}"
CRATE_URL_PATTERN = "https://crates.io/crates/{crate}"
def __init__(
self,
@ -69,172 +70,172 @@ class CratesLister(Lister[CratesListerState, CratesListerPage]):
super().__init__(
scheduler=scheduler,
credentials=credentials,
url=self.INDEX_REPOSITORY_URL,
url=self.BASE_URL,
instance=self.INSTANCE,
)
self.index_metadata: Dict[str, str] = {}
def state_from_dict(self, d: Dict[str, Any]) -> CratesListerState:
if "last_commit" not in d:
d["last_commit"] = ""
index_last_update = d.get("index_last_update")
if index_last_update is not None:
d["index_last_update"] = iso8601.parse_date(index_last_update)
return CratesListerState(**d)
def state_to_dict(self, state: CratesListerState) -> Dict[str, Any]:
return asdict(state)
d: Dict[str, Optional[str]] = {"index_last_update": None}
index_last_update = state.index_last_update
if index_last_update is not None:
d["index_last_update"] = index_last_update.isoformat()
return d
def get_index_repository(self) -> None:
"""Get crates.io-index repository up to date running git command."""
if self.DESTINATION_PATH.exists():
porcelain.pull(
self.DESTINATION_PATH, remote_location=self.INDEX_REPOSITORY_URL
)
else:
porcelain.clone(
source=self.INDEX_REPOSITORY_URL, target=self.DESTINATION_PATH
)
def get_crates_index(self) -> List[Path]:
"""Build a sorted list of file paths excluding dotted directories and
dotted files.
Each file path corresponds to a crate that lists all available
versions.
def is_new(self, dt_str: str):
"""Returns True when dt_str is greater than
self.state.index_last_update
"""
crates_index = sorted(
path
for path in self.DESTINATION_PATH.rglob("*")
if not any(part.startswith(".") for part in path.parts)
and path.is_file()
and path != self.DESTINATION_PATH / "config.json"
)
dt = iso8601.parse_date(dt_str)
last = self.state.index_last_update
return not last or (last is not None and last < dt)
return crates_index
def get_and_parse_db_dump(self) -> Dict[str, Any]:
"""Download and parse csv files from db_dump_path.
def get_last_commit_hash(self, repository_path: Path) -> str:
"""Returns the last commit hash of a git repository"""
assert repository_path.exists()
repo = Repo(str(repository_path))
head = repo.head()
last_commit = repo[head]
return last_commit.id.decode()
def get_last_update_by_file(self, filepath: Path) -> Optional[datetime.datetime]:
"""Given a file path within a Git repository, returns its last commit
date as iso8601
Returns a dict where each entry corresponds to a package name with its related versions.
"""
repo = Repo(str(self.DESTINATION_PATH))
# compute relative path otherwise it fails
relative_path = filepath.relative_to(self.DESTINATION_PATH)
walker = repo.get_walker(paths=[bytes(relative_path)], max_entries=1)
try:
commit = next(iter(walker)).commit
except StopIteration:
logger.error(
"Can not find %s related commits in repository %s", relative_path, repo
)
return None
else:
last_update = datetime.datetime.fromtimestamp(
commit.author_time, datetime.timezone.utc
)
return last_update
with tempfile.TemporaryDirectory() as tmpdir:
file_name = self.DB_DUMP_URL.split("/")[-1]
archive_path = Path(tmpdir) / file_name
# Download the Db dump
with self.http_request(self.DB_DUMP_URL, stream=True) as res:
with open(archive_path, "wb") as out_file:
for chunk in res.iter_content(chunk_size=1024):
out_file.write(chunk)
# Extract the Db dump
db_dump_path = Path(str(archive_path).split(".tar.gz")[0])
tar = tarfile.open(archive_path)
tar.extractall(path=db_dump_path)
tar.close()
csv.field_size_limit(1000000)
(crates_csv_path,) = list(db_dump_path.glob("*/data/crates.csv"))
(versions_csv_path,) = list(db_dump_path.glob("*/data/versions.csv"))
(index_metadata_json_path,) = list(db_dump_path.rglob("*metadata.json"))
with index_metadata_json_path.open("rb") as index_metadata_json:
self.index_metadata = json.load(index_metadata_json)
crates: Dict[str, Any] = {}
with crates_csv_path.open() as crates_fd:
crates_csv = csv.DictReader(crates_fd)
for item in crates_csv:
if self.is_new(item["updated_at"]):
# crate 'id' as key
crates[item["id"]] = {
"name": item["name"],
"updated_at": item["updated_at"],
"versions": {},
}
data: Dict[str, Any] = {}
with versions_csv_path.open() as versions_fd:
versions_csv = csv.DictReader(versions_fd)
for version in versions_csv:
if version["crate_id"] in crates.keys():
crate: Dict[str, Any] = crates[version["crate_id"]]
crate["versions"][version["num"]] = version
# crate 'name' as key
data[crate["name"]] = crate
return data
def page_entry_dict(self, entry: Dict[str, Any]) -> Dict[str, Any]:
"""Transform package version definition dict to a suitable
page entry dict
"""
crate_file = self.CRATE_FILE_URL_PATTERN.format(
crate=entry["name"], version=entry["version"]
)
filename = urlparse(crate_file).path.split("/")[-1]
return dict(
name=entry["name"],
version=entry["vers"],
checksum=entry["cksum"],
yanked=entry["yanked"],
crate_file=self.CRATE_FILE_URL_PATTERN.format(
crate=entry["name"], version=entry["vers"]
),
version=entry["version"],
checksum=entry["checksum"],
yanked=True if entry["yanked"] == "t" else False,
crate_file=crate_file,
filename=filename,
last_update=entry["updated_at"],
)
def get_pages(self) -> Iterator[CratesListerPage]:
"""Yield an iterator sorted by name in ascending order of pages.
Each page is a list of crate versions with:
- name: Name of the crate
- version: Version
- checksum: Checksum
- crate_file: Url of the crate file
- last_update: Date of the last commit of the corresponding index
file
"""Each page is a list of crate versions with:
- name: Name of the crate
- version: Version
- checksum: Checksum
- yanked: Whether the package is yanked or not
- crate_file: Url of the crate file
- filename: File name of the crate file
- last_update: Last update for that version
"""
# Fetch crates.io index repository
self.get_index_repository()
if not self.state.last_commit:
# First discovery
# List all crates files from the index repository
crates_index = self.get_crates_index()
else:
# Incremental case
# Get new package version by parsing a range of commits from index repository
repo = Repo(str(self.DESTINATION_PATH))
head = repo[repo.head()]
last = repo[self.state.last_commit.encode()]
outstream = io.BytesIO()
write_tree_diff(outstream, repo.object_store, last.tree, head.tree)
raw_diff = outstream.getvalue()
crates_index = []
for line in raw_diff.splitlines():
if line.startswith(b"+++ b/"):
filepath = line.split(b"+++ b/", 1)[1]
crates_index.append(self.DESTINATION_PATH / filepath.decode())
crates_index = sorted(crates_index)
# Fetch crates.io Db dump, then Parse the data.
dataset = self.get_and_parse_db_dump()
logger.debug("Found %s crates in crates_index", len(crates_index))
logger.debug("Found %s crates in crates_index", len(dataset))
# Each line of a crate file is a json entry describing released versions
# for a package
for crate in crates_index:
# Each entry from dataset will correspond to a page
for name, item in dataset.items():
page = []
last_update = self.get_last_update_by_file(crate)
# sort crate versions
versions: list = sorted(item["versions"].keys(), key=parse_version)
for version in versions:
v = item["versions"][version]
v["name"] = name
v["version"] = version
page.append(self.page_entry_dict(v))
with crate.open("rb") as current_file:
for line in current_file:
data = json.loads(line)
entry = self.page_entry_dict(data)
entry["last_update"] = last_update
page.append(entry)
yield page
def get_origins_from_page(self, page: CratesListerPage) -> Iterator[ListedOrigin]:
"""Iterate on all crate pages and yield ListedOrigin instances."""
assert self.lister_obj.id is not None
url = self.CRATE_API_URL_PATTERN.format(crate=page[0]["name"])
url = self.CRATE_URL_PATTERN.format(crate=page[0]["name"])
last_update = page[0]["last_update"]
artifacts = []
crates_metadata = []
for version in page:
filename = urlparse(version["crate_file"]).path.split("/")[-1]
for entry in page:
# Build an artifact entry following original-artifacts-json specification
# https://docs.softwareheritage.org/devel/swh-storage/extrinsic-metadata-specification.html#original-artifacts-json # noqa: B950
artifact = {
"filename": f"{filename}",
"checksums": {
"sha256": f"{version['checksum']}",
},
"url": version["crate_file"],
"version": version["version"],
}
artifacts.append(artifact)
data = {f"{version['version']}": {"yanked": f"{version['yanked']}"}}
crates_metadata.append(data)
artifacts.append(
{
"version": entry["version"],
"filename": entry["filename"],
"url": entry["crate_file"],
"checksums": {
"sha256": entry["checksum"],
},
}
)
crates_metadata.append(
{
"version": entry["version"],
"yanked": entry["yanked"],
"last_update": entry["last_update"],
}
)
yield ListedOrigin(
lister_id=self.lister_obj.id,
visit_type=self.VISIT_TYPE,
url=url,
last_update=last_update,
last_update=iso8601.parse_date(last_update),
extra_loader_arguments={
"artifacts": artifacts,
"crates_metadata": crates_metadata,
@ -242,18 +243,8 @@ class CratesLister(Lister[CratesListerState, CratesListerPage]):
)
def finalize(self) -> None:
last = self.get_last_commit_hash(repository_path=self.DESTINATION_PATH)
if self.state.last_commit == last:
self.updated = False
else:
self.state.last_commit = last
last: datetime = iso8601.parse_date(self.index_metadata["timestamp"])
if not self.state.index_last_update:
self.state.index_last_update = last
self.updated = True
logger.debug("Listing crates origin completed with last commit id %s", last)
# Cleanup by removing the repository directory
if self.DESTINATION_PATH.exists():
shutil.rmtree(self.DESTINATION_PATH)
logger.debug(
"Successfully removed %s directory", str(self.DESTINATION_PATH)
)

View file

@ -1,29 +1,3 @@
# Copyright (C) 2022 the Software Heritage developers
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
import os
from pathlib import PosixPath
import subprocess
from typing import Optional, Union
def prepare_repository_from_archive(
archive_path: str,
filename: Optional[str] = None,
tmp_path: Union[PosixPath, str] = "/tmp",
) -> str:
"""Given an existing archive_path, uncompress it.
Returns a file repo url which can be used as origin url.
This does not deal with the case where the archive passed along does not exist.
"""
if not isinstance(tmp_path, str):
tmp_path = str(tmp_path)
# uncompress folder/repositories/dump for the loader to ingest
subprocess.check_output(["tar", "xf", archive_path, "-C", tmp_path])
# build the origin url (or some derivative form)
_fname = filename if filename else os.path.basename(archive_path)
repo_url = f"file://{tmp_path}/{_fname}"
return repo_url

View file

@ -1,63 +1,66 @@
#!/usr/bin/env bash
# Script to generate fake-crates-repository.tar.gz
# Creates a git repository like https://github.com/rust-lang/crates.io-index
# for tests purposes
# Script to generate db-dump.tar.gz like https://static.crates.io/db-dump.tar.gz
# Creates csv and json files then build the archive for tests purposes
set -euo pipefail
# files and directories
mkdir -p tmp_dir/crates.io-index/
cd tmp_dir/crates.io-index/
# files and directories for first db dump
mkdir -p https_static.crates.io
mkdir -p tmp_dir/crates.io-db-dump/2022-08-08-020027/
cd tmp_dir/crates.io-db-dump/2022-08-08-020027/
mkdir data
mkdir -p .dot-dir
touch .dot-dir/empty
mkdir -p ra/nd
mkdir -p re/ge
echo -e '''created_at,description,documentation,downloads,homepage,id,max_upload_size,name,readme,repository,updated_at
2015-02-03 06:17:14.147783,"Random number generators and other randomness functionality.",https://docs.rs/rand,139933633,https://rust-random.github.io/book,1339,,rand,,https://github.com/rust-random/rand,2022-02-14 08:37:47.035988
2014-12-13 22:10:11.303311,"An implementation of regular expressions for Rust. This implementation uses finite automata and guarantees linear time matching on all inputs.",https://docs.rs/regex,85620996,https://github.com/rust-lang/regex,545,,regex,,https://github.com/rust-lang/regex,2022-07-05 18:00:33.712885
2015-05-27 23:19:16.839117,"A regular expression parser.",https://docs.rs/regex-syntax,84299774,https://github.com/rust-lang/regex,2233,,regex-syntax,,https://github.com/rust-lang/regex,2022-07-05 17:59:37.238137
''' > data/crates.csv
touch .dot-file
touch config.json
echo -e '''checksum,crate_id,crate_size,created_at,downloads,features,id,license,links,num,published_by,updated_at,yanked
398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944,2233,,2015-05-27 23:19:16.848643,1961,{},10855,MIT/Apache-2.0,,0.1.0,,2017-11-30 03:37:17.449539,f
343bd0171ee23346506db6f4c64525de6d72f0e8cc533f83aea97f3e7488cbf9,545,,2014-12-18 06:56:46.88489,845,{},1321,MIT/Apache-2.0,,0.1.2,,2017-11-30 02:29:20.01125,f
6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7,1339,,2015-02-03 11:15:19.001762,8211,{},4371,MIT/Apache-2.0,,0.1.2,,2017-11-30 03:14:27.545115,f
defb220c4054ca1b95fe8b0c9a6e782dda684c1bdf8694df291733ae8a3748e3,545,,2014-12-19 16:16:41.73772,1498,{},1363,MIT/Apache-2.0,,0.1.3,,2017-11-30 02:26:59.236947,f
48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d,1339,,2015-02-03 06:17:14.169972,7963,{},4362,MIT/Apache-2.0,,0.1.1,,2017-11-30 03:33:14.186028,f
f0ff1ca641d3c9a2c30464dac30183a8b91cdcc959d616961be020cdea6255c5,545,,2014-12-13 22:10:11.329494,3204,{},1100,MIT/Apache-2.0,,0.1.0,,2017-11-30 02:51:27.240551,f
a07bef996bd38a73c21a8e345d2c16848b41aa7ec949e2fedffe9edf74cdfb36,545,,2014-12-15 20:31:48.571836,889,{},1178,MIT/Apache-2.0,,0.1.1,,2017-11-30 03:03:20.143103,f
''' > data/versions.csv
# Init as a git repository
git init
git add .
git commit -m "Init fake crates.io-index repository for tests purpose"
echo -e '''{
"timestamp": "2022-08-08T02:00:27.645191645Z",
"crates_io_commit": "3e5f0b4d2a382ac0951898fd257f693734eadee2"
}
''' > metadata.json
echo '{"name":"rand","vers":"0.1.1","deps":[],"cksum":"48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d","features":{},"yanked":false}' > ra/nd/rand
git add .
git commit -m " Updating crate rand#0.1.1"
cd ../../
tar -czf db-dump.tar.gz -C crates.io-db-dump .
echo '{"name":"rand","vers":"0.1.2","deps":[{"name":"libc","req":"^0.1.1","features":[""],"optional":false,"default_features":true,"target":null,"kind":"normal"},{"name":"log","req":"^0.2.1","features":[""],"optional":false,"default_features":true,"target":null,"kind":"normal"}],"cksum":"6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7","features":{},"yanked":false}' >> ra/nd/rand
git add .
git commit -m " Updating crate rand#0.1.2"
# A second db dump with a new entry and a different timestamp
echo '{"name":"regex","vers":"0.1.0","deps":[],"cksum":"f0ff1ca641d3c9a2c30464dac30183a8b91cdcc959d616961be020cdea6255c5","features":{},"yanked":false}' > re/ge/regex
git add .
git commit -m " Updating crate regex#0.1.0"
mkdir -p crates.io-db-dump_visit1
cp -rf crates.io-db-dump/2022-08-08-020027 crates.io-db-dump_visit1/2022-09-05-020027
echo '{"name":"regex","vers":"0.1.1","deps":[{"name":"regex_macros","req":"^0.1.0","features":[""],"optional":false,"default_features":true,"target":null,"kind":"dev"}],"cksum":"a07bef996bd38a73c21a8e345d2c16848b41aa7ec949e2fedffe9edf74cdfb36","features":{},"yanked":false}' >> re/ge/regex
git add .
git commit -m " Updating crate regex#0.1.1"
cd crates.io-db-dump_visit1/2022-09-05-020027/
echo '{"name":"regex","vers":"0.1.2","deps":[{"name":"regex_macros","req":"^0.1.0","features":[""],"optional":false,"default_features":true,"target":null,"kind":"dev"}],"cksum":"343bd0171ee23346506db6f4c64525de6d72f0e8cc533f83aea97f3e7488cbf9","features":{},"yanked":false}' >> re/ge/regex
git add .
git commit -m " Updating crate regex#0.1.2"
echo -e '''{
"timestamp": "2022-09-05T02:00:27.687167108Z",
"crates_io_commit": "d3652ad81bd8bd837f2d2442ee08484ee5d4bac3"
}
''' > metadata.json
echo '{"name":"regex","vers":"0.1.3","deps":[{"name":"regex_macros","req":"^0.1.0","features":[""],"optional":false,"default_features":true,"target":null,"kind":"dev"}],"cksum":"defb220c4054ca1b95fe8b0c9a6e782dda684c1bdf8694df291733ae8a3748e3","features":{},"yanked":false}' >> re/ge/regex
git add .
git commit -m " Updating crate regex#0.1.3"
echo -e '''2019-01-08 15:11:01.560092,"A crate for safe and ergonomic pin-projection.",,48353738,,107436,,pin-project,,https://github.com/taiki-e/pin-project,2022-08-15 13:52:11.642129
''' >> data/crates.csv
echo '{"name":"regex-syntax","vers":"0.1.0","deps":[{"name":"rand","req":"^0.3","features":[""],"optional":false,"default_features":true,"target":null,"kind":"dev"},{"name":"quickcheck","req":"^0.2","features":[""],"optional":false,"default_features":true,"target":null,"kind":"dev"}],"cksum":"398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944","features":{},"yanked":false}' > re/ge/regex-syntax
git add .
git commit -m " Updating crate regex-syntax#0.1.0"
echo -e '''ad29a609b6bcd67fee905812e544992d216af9d755757c05ed2d0e15a74c6ecc,107436,56972,2022-08-15 13:52:11.642129,580330,{},602929,Apache-2.0 OR MIT,,1.0.12,33035,2022-08-15 13:52:11.642129,f
''' >> data/versions.csv
# Save some space
rm .git/hooks/*.sample
cd ../../
# Compress git directory as a tar.gz archive
cd ../
tar -cvzf fake-crates-repository.tar.gz crates.io-index
mv fake-crates-repository.tar.gz ../
tar -czf db-dump.tar.gz_visit1 -C crates.io-db-dump_visit1 .
# Move the generated tar.gz archives to a servable directory
mv db-dump.tar.gz ../https_static.crates.io/
mv db-dump.tar.gz_visit1 ../https_static.crates.io/
# Clean up tmp_dir
cd ../

View file

@ -2,142 +2,206 @@
# See the AUTHORS file at the top-level directory of this distribution
# License: GNU General Public License version 3, or any later version
# See top-level LICENSE file for more information
from pathlib import Path
from dulwich.repo import Repo
import iso8601
from swh.lister.crates.lister import CratesLister, CratesListerState
from swh.lister.crates.tests import prepare_repository_from_archive
expected_origins = [
{
"url": "https://crates.io/api/v1/crates/rand",
"url": "https://crates.io/crates/rand",
"artifacts": [
{
"version": "0.1.1",
"checksums": {
"sha256": "48a45b46c2a8c38348adb1205b13c3c5eb0174e0c0fec52cc88e9fb1de14c54d", # noqa: B950
},
"filename": "rand-0.1.1.crate",
"url": "https://static.crates.io/crates/rand/rand-0.1.1.crate",
"version": "0.1.1",
},
{
"version": "0.1.2",
"checksums": {
"sha256": "6e229ed392842fa93c1d76018d197b7e1b74250532bafb37b0e1d121a92d4cf7", # noqa: B950
},
"filename": "rand-0.1.2.crate",
"url": "https://static.crates.io/crates/rand/rand-0.1.2.crate",
"version": "0.1.2",
},
],
"metadata": [
"crates_metadata": [
{
"version": "0.1.1",
"last_update": "2017-11-30 03:33:14.186028",
"yanked": False,
},
{
"version": "0.1.2",
"last_update": "2017-11-30 03:14:27.545115",
"yanked": False,
},
],
},
{
"url": "https://crates.io/api/v1/crates/regex",
"url": "https://crates.io/crates/regex",
"artifacts": [
{
"version": "0.1.0",
"checksums": {
"sha256": "f0ff1ca641d3c9a2c30464dac30183a8b91cdcc959d616961be020cdea6255c5", # noqa: B950
},
"filename": "regex-0.1.0.crate",
"url": "https://static.crates.io/crates/regex/regex-0.1.0.crate",
"version": "0.1.0",
},
{
"version": "0.1.1",
"checksums": {
"sha256": "a07bef996bd38a73c21a8e345d2c16848b41aa7ec949e2fedffe9edf74cdfb36", # noqa: B950
},
"filename": "regex-0.1.1.crate",
"url": "https://static.crates.io/crates/regex/regex-0.1.1.crate",
"version": "0.1.1",
},
{
"version": "0.1.2",
"checksums": {
"sha256": "343bd0171ee23346506db6f4c64525de6d72f0e8cc533f83aea97f3e7488cbf9", # noqa: B950
},
"filename": "regex-0.1.2.crate",
"url": "https://static.crates.io/crates/regex/regex-0.1.2.crate",
"version": "0.1.2",
},
{
"version": "0.1.3",
"checksums": {
"sha256": "defb220c4054ca1b95fe8b0c9a6e782dda684c1bdf8694df291733ae8a3748e3", # noqa: B950
},
"filename": "regex-0.1.3.crate",
"url": "https://static.crates.io/crates/regex/regex-0.1.3.crate",
"version": "0.1.3",
},
],
"metadata": [
"crates_metadata": [
{
"version": "0.1.0",
"last_update": "2017-11-30 02:51:27.240551",
"yanked": False,
},
{
"version": "0.1.1",
"last_update": "2017-11-30 03:03:20.143103",
"yanked": False,
},
{
"version": "0.1.2",
"last_update": "2017-11-30 02:29:20.01125",
"yanked": False,
},
{
"version": "0.1.3",
"last_update": "2017-11-30 02:26:59.236947",
"yanked": False,
},
],
},
{
"url": "https://crates.io/api/v1/crates/regex-syntax",
"url": "https://crates.io/crates/regex-syntax",
"artifacts": [
{
"version": "0.1.0",
"checksums": {
"sha256": "398952a2f6cd1d22bc1774fd663808e32cf36add0280dee5cdd84a8fff2db944", # noqa: B950
},
"filename": "regex-syntax-0.1.0.crate",
"url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate",
"version": "0.1.0",
"url": "https://static.crates.io/crates/regex-syntax/regex-syntax-0.1.0.crate", # noqa: B950
},
],
"metadata": [
"crates_metadata": [
{
"version": "0.1.0",
"last_update": "2017-11-30 03:37:17.449539",
"yanked": False,
},
],
},
]
expected_origins_incremental = [expected_origins[1], expected_origins[2]]
expected_origins_incremental = {
"url": "https://crates.io/crates/pin-project",
"artifacts": [
{
"version": "1.0.12",
"url": "https://static.crates.io/crates/pin-project/pin-project-1.0.12.crate",
"filename": "pin-project-1.0.12.crate",
"checksums": {
"sha256": "ad29a609b6bcd67fee905812e544992d216af9d755757c05ed2d0e15a74c6ecc"
},
}
],
"crates_metadata": [
{
"version": "1.0.12",
"yanked": False,
"last_update": "2022-08-15 13:52:11.642129",
}
],
}
def test_crates_lister(datadir, tmp_path, swh_scheduler):
archive_path = Path(datadir, "fake-crates-repository.tar.gz")
repo_url = prepare_repository_from_archive(
archive_path, "crates.io-index", tmp_path
)
def test_crates_lister_is_new(swh_scheduler):
lister = CratesLister(scheduler=swh_scheduler)
lister.INDEX_REPOSITORY_URL = repo_url
lister.DESTINATION_PATH = tmp_path.parent / "crates.io-index-tests"
index_last_update_state = CratesListerState(
index_last_update=iso8601.parse_date("2022-08-15 13:52:11.642129")
)
lister.state = index_last_update_state
assert lister.is_new("2022-07-15 13:52:11.642129") is False # earlier
assert lister.is_new("2022-08-15 13:52:11.642129") is False # exactly equal
assert lister.is_new("2022-09-15 13:52:11.642129") is True # after
def test_crates_lister(datadir, tmp_path, swh_scheduler, requests_mock_datadir):
lister = CratesLister(scheduler=swh_scheduler)
res = lister.run()
assert res.pages == 3
assert res.origins == 3
scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
assert [
(
scheduled.visit_type,
scheduled.url,
scheduled.extra_loader_arguments["artifacts"],
scheduled.extra_loader_arguments["crates_metadata"],
)
for scheduled in sorted(scheduler_origins, key=lambda scheduled: scheduled.url)
] == [
(
"crates",
expected["url"],
expected["artifacts"],
expected["crates_metadata"],
)
for expected in sorted(expected_origins, key=lambda expected: expected["url"])
]
def test_crates_lister_incremental(
datadir, tmp_path, swh_scheduler, requests_mock_datadir_visits
):
lister = CratesLister(scheduler=swh_scheduler)
first = lister.run()
assert first.pages == 3
assert first.origins == 3
second = lister.run()
assert second.pages == 1
assert second.origins == 1
scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
expected_origins.append(expected_origins_incremental)
assert [
(
scheduled.visit_type,
@ -155,84 +219,21 @@ def test_crates_lister(datadir, tmp_path, swh_scheduler):
]
def test_crates_lister_incremental(datadir, tmp_path, swh_scheduler):
archive_path = Path(datadir, "fake-crates-repository.tar.gz")
repo_url = prepare_repository_from_archive(
archive_path, "crates.io-index", tmp_path
)
lister = CratesLister(scheduler=swh_scheduler)
lister.INDEX_REPOSITORY_URL = repo_url
lister.DESTINATION_PATH = tmp_path.parent / "crates.io-index-tests"
# The lister has not run yet, get the index repository
lister.get_index_repository()
# Set a CratesListerState with a last commit value to force incremental case
repo = Repo(lister.DESTINATION_PATH)
# Lets set this last commit to third one from head
step = list(repo.get_walker(max_entries=3))[-1]
last_commit_state = CratesListerState(last_commit=step.commit.id.decode())
lister.state = last_commit_state
res = lister.run()
assert res.pages == 2
assert res.origins == 2
scheduler_origins = swh_scheduler.get_listed_origins(lister.lister_obj.id).results
assert [
(
scheduled.visit_type,
scheduled.url,
scheduled.extra_loader_arguments["artifacts"],
)
for scheduled in sorted(scheduler_origins, key=lambda scheduled: scheduled.url)
] == [
(
"crates",
expected["url"],
expected["artifacts"],
)
for expected in sorted(
expected_origins_incremental, key=lambda expected: expected["url"]
)
]
def test_crates_lister_incremental_nothing_new(datadir, tmp_path, swh_scheduler):
def test_crates_lister_incremental_nothing_new(
datadir, tmp_path, swh_scheduler, requests_mock_datadir
):
"""Ensure incremental mode runs fine when the repository last commit is the same
than lister.state.las-_commit"""
archive_path = Path(datadir, "fake-crates-repository.tar.gz")
repo_url = prepare_repository_from_archive(
archive_path, "crates.io-index", tmp_path
)
than lister.state.last_commit"""
lister = CratesLister(scheduler=swh_scheduler)
lister.INDEX_REPOSITORY_URL = repo_url
lister.DESTINATION_PATH = tmp_path.parent / "crates.io-index-tests"
lister.get_index_repository()
repo = Repo(lister.DESTINATION_PATH)
lister.get_and_parse_db_dump()
# Set a CratesListerState with a last commit value to force incremental case
last_commit_state = CratesListerState(last_commit=repo.head().decode())
lister.state = last_commit_state
index_last_update_state = CratesListerState(
index_last_update=iso8601.parse_date(lister.index_metadata["timestamp"])
)
lister.state = index_last_update_state
res = lister.run()
assert res.pages == 0
assert res.origins == 0
def test_crates_lister_repository_cleanup(datadir, tmp_path, swh_scheduler):
archive_path = Path(datadir, "fake-crates-repository.tar.gz")
repo_url = prepare_repository_from_archive(
archive_path, "crates.io-index", tmp_path
)
lister = CratesLister(scheduler=swh_scheduler)
lister.INDEX_REPOSITORY_URL = repo_url
lister.DESTINATION_PATH = tmp_path.parent / "crates.io-index-tests"
lister.run()
# Repository directory should not exists after the lister runs
assert not lister.DESTINATION_PATH.exists()