Commit graph

62 commits

Author SHA1 Message Date
Antoine Lambert
61cfd77da1
debian: Fix error since python-debian 1.0 release
Since the python-debian 1.0 release, calling Sources.iter_paragraphs
returns an extra paragraph that does not have the expected schema,
so make sure to ignore it.
2025-03-13 13:33:33 +01:00
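
The filtering described in this fix can be sketched as follows, using plain dicts in place of python-debian paragraph objects. The checked field names (`Package`, `Version`) are an assumption about what the "expected schema" covers, and `iter_valid_paragraphs` is a hypothetical helper, not the lister's actual code.

```python
# Assumption: a usable Sources paragraph carries at least these fields.
EXPECTED_FIELDS = ("Package", "Version")


def iter_valid_paragraphs(paragraphs):
    """Yield only the paragraphs that contain every expected field."""
    for paragraph in paragraphs:
        if all(field in paragraph for field in EXPECTED_FIELDS):
            yield paragraph
```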
David Douard
cccb8c21ff Replace all remaining occurrences of the 'local' cls by 'postgresql'
The former has been deprecated for ages...
2024-10-28 14:35:29 +01:00
David Douard
714fccc3c7 python: Fix black formatting after bump to 23.1.0 in pre-commit 2023-12-05 10:33:07 +01:00
Antoine Lambert
6e7bc49ec7 Harmonize listers parameters and add test to check mandatory ones
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.

Add a new test checking that lister classes declare the mandatory parameters
in their constructors. The purpose is to avoid deployment issues in staging
or production environments, as celery tasks can fail to execute if mandatory
parameters are not handled by listers.

Related to swh/infra/sysadm-environment#5030.
2023-09-06 11:55:34 +02:00
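
A check of this kind can be sketched with `inspect.signature`. The two lister classes below are hypothetical stand-ins, and whether the real test inspects signatures this way is an assumption; only the four parameter names come from the commit message.

```python
import inspect

# Parameter names taken from the commit message.
MANDATORY_PARAMS = {"scheduler", "url", "instance", "credentials"}


def missing_mandatory_params(lister_cls):
    """Return the mandatory parameter names absent from the constructor."""
    params = inspect.signature(lister_cls.__init__).parameters
    return MANDATORY_PARAMS - set(params)


class CompleteLister:  # hypothetical lister declaring every mandatory parameter
    def __init__(self, scheduler, url=None, instance=None, credentials=None):
        self.scheduler = scheduler


class IncompleteLister:  # hypothetical lister that forgot `credentials`
    def __init__(self, scheduler, url=None, instance=None):
        self.scheduler = scheduler
```

A test could then loop over every registered lister class and assert that `missing_mandatory_params` returns an empty set.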
Nicolas Dandrimont
b2ff630c9b debian: refactor inner loop slightly to help mypy
mypy doesn't catch that multiple uses of
`self.listed_origins[origin_url]` in the same statement should be identical.
Using a temporary local variable for it seems to help.
2023-06-21 13:57:27 +02:00
Nicolas Dandrimont
e785e67315 Hook up recently introduced options to all listers
Hopefully one day we'll be able to replace all of this mess with PEP692
TypedDict kwargs, but that's only on track for Python 3.12.
2022-12-05 16:33:45 +01:00
Antoine Lambert
5426883c49 debian: Remove no longer needed code to get accurate origins count
The base lister class now ensures the count of listed origins will
be accurate.
2022-09-29 11:14:42 +02:00
Antoine Lambert
db6ce12e9e Refactor and deduplicate HTTP requests code in listers
Numerous listers were using the same page_request method or equivalent
in their implementation so prefer to deduplicate that code by adding
an http_request method in base lister class: swh.lister.pattern.Lister.

That method simply wraps a call to requests.Session.request and logs
some useful info for debugging and error reporting. An HTTPError
is raised if a request ends with an error.

All listers using that new method now benefit from request retries when
an HTTP error occurs, thanks to the use of the http_retry decorator.
2022-09-26 10:48:40 +02:00
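
The retry-on-error idea can be illustrated with a hand-rolled decorator. This is not the `http_retry` decorator mentioned above: `retry_on_http_error`, the local `HTTPError` class and `flaky_request` are all hypothetical names chosen so the sketch runs without any dependency.

```python
import functools


class HTTPError(Exception):
    """Stand-in for requests.HTTPError so the sketch has no dependencies."""


def retry_on_http_error(max_attempts=3):
    """Hypothetical decorator: call the function again when HTTPError is raised."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except HTTPError:
                    if attempt == max_attempts:
                        raise  # attempts exhausted, propagate the error

        return wrapper

    return decorator


calls = []  # records each attempt made by the fake request below


@retry_on_http_error(max_attempts=3)
def flaky_request(url):
    """Fake request that fails twice, then succeeds."""
    calls.append(url)
    if len(calls) < 3:
        raise HTTPError(f"temporary failure for {url}")
    return "ok"
```

A production decorator would normally also wait between attempts; that is left out here to keep the sketch short.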
Antoine Lambert
d38e05cff7 python: Reformat code with black 22.3.0
Related to T3922
2022-04-08 15:15:09 +02:00
Antoine Lambert
b7524bbae0 debian/test_lister: Fix typo detected with codespell 2022-02-10 16:25:42 +01:00
Antoine Lambert
15fa84cf7e debian: Update last_update for a package when required
A debian package can have sources coming from multiple suites,
so we need to make sure the last_update field in the
ListedOrigin model is updated if the suite currently being processed
has a greater modification time for its sources index.

Related to T2400
2021-12-06 10:43:28 +01:00
Antoine Lambert
93f17d4d9c debian: Provide last_update to produced ListedOrigin models
Use the value of the "Last-Modified" header from the HTTP response
to the debian sources index request.

This prevents creating loading tasks for debian packages with no
changes since the last listing.

Related to T2400
2021-12-03 16:09:44 +01:00
Antoine Lambert
605b13a676 debian: Do not raise when a component cannot be found for a suite
Not all debian suites have the same set of components.

So prefer to log that a component is missing for a suite instead of
raising an exception that will stop the listing.
2021-12-03 14:29:15 +01:00
Antoine Lambert
4ff3e44643 debian: Update extra_loader_arguments dict produced ListedOrigin models
Remove no longer used date parameter in extra_loader_arguments.

Related to T2400
2021-12-03 10:51:30 +01:00
Antoine Lambert
46425917c2 debian: Add missing file URIs in lister output
For a given package, the debian lister generates a dictionary mapping
distribution and version to a list of files to be processed by the
debian loader.

For each file to process, the debian loader expects to find a URI
in order to download it and then use its content to ingest package
source code into the archive.

However, it turns out these URIs were not computed by the lister
in its current implementation, making any debian loading task fail
because of this missing information.

So add the computation of these URIs and ensure they will be provided
in the debian loader input parameters.

Related to T2400
2021-12-02 17:30:50 +01:00
Antoine Lambert
5b4dc289b7 debian: Update archive mirror URL templates to process
Some distributions (e.g. debian-security) have a slightly different URL
for retrieving source packages metadata.

So add a new URL template to process when trying to download such data.

Related to T3032#58239
2021-02-08 14:01:59 +01:00
Antoine Lambert
4245c5046f Remove no longer used models field in dict returned by register 2021-02-02 16:33:52 +01:00
Antoine R. Dumont (@ardumont)
130ad7d73e
Make debian lister constructors compatible with credentials
In effect, it just allows adding credentials to cgit, cran and pypi listers.

This fixes instances of error [1]

[1] https://sentry.softwareheritage.org/share/issue/a5fb50f8e43e4b328c4917771576c6b0/

Related to T2998
2021-01-28 18:46:52 +01:00
Antoine Lambert
bb0184c004 debian: Reimplement lister using new Lister API
Port debian lister to `swh.lister.pattern.Lister` API.

The new implementation will produce one instance of ListedOrigin model
per package, notably containing the set of parameters expected by the
debian loader.

The lister is also stateful, meaning only new packages and those with
newly found versions since the last listing will be returned.

Closes T2979
2021-01-26 17:20:22 +01:00
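
The stateful behaviour can be sketched as a comparison between the versions recorded at the previous listing and the versions found now. The state layout (package name mapped to a set of versions) and the helper name `packages_to_list` are assumptions made for illustration.

```python
def packages_to_list(seen, current):
    """Return packages that are new or have versions not seen before.

    Both arguments map a package name to a set of version strings.
    """
    return {
        name: versions
        for name, versions in current.items()
        if versions - seen.get(name, set())
    }
```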
Antoine R. Dumont (@ardumont)
b90ffa4bdd
tests: Reduce db initialization fixtures to a minimum 2020-10-30 13:24:38 +01:00
Antoine R. Dumont (@ardumont)
e2a861c801
debian.tests: Fix test
The scheduler fixture introduced truncates tables in between tests. The debian
tests unfortunately share state and it broke when that changed. This fixes the
test by avoiding the truncation of the scheduler db table "task".

Ideally those tests need to be reworked to avoid sharing state between tests.

[1] https://jenkins.softwareheritage.org/job/DLS/job/tests/1043
2020-10-30 09:09:56 +01:00
Antoine Lambert
22f7181294 python: Reorder imports with isort
Related to T2610
2020-09-17 17:48:27 +02:00
Antoine R. Dumont (@ardumont)
5a5b7ef70b
tests: Separate lister instantiations
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.

The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
2020-09-02 12:49:00 +02:00
Antoine R. Dumont (@ardumont)
9437a643ad
pytest: Define plugin and declare it in the root conftest
Then drop all unneeded and indirect imports
2020-09-02 12:25:15 +02:00
Nicolas Dandrimont
c9963d4302 Use the new names for the swh.scheduler test fixtures 2020-07-09 17:06:50 +02:00
David Douard
93a4d8b784 Enable black
- blackify all the python files,
- enable black in pre-commit,
- add a black tox environment.
2020-04-08 16:31:22 +02:00
Gautier Pugnonblanc Yann
60adc424be add type annotations in some lister files 2020-02-17 15:58:34 +01:00
Antoine R. Dumont (@ardumont)
5b652b3070
lister.debian: Make debian init step idempotent and up-to-date 2019-12-19 13:58:11 +01:00
Antoine R. Dumont (@ardumont)
4a9608f31c
lister/tasks: Standardize return statements
This commit adapts the return statements of both the listers and their
associated tasks. This standardizes on what other modules (e.g. both dvcs and
package loaders) do.
2019-12-02 15:49:38 +01:00
Antoine R. Dumont (@ardumont)
d251201251
debian.models: Migrate tests from storage to debian lister model
Related bb5d405
2019-11-14 10:28:15 +01:00
Nicolas Dandrimont
b2e5ce32a9 Fix bogus NotImplementedError on Area.index_uris 2019-11-13 13:51:46 +01:00
Antoine R. Dumont (@ardumont)
ea7a08d05d
lister.debian: Actually use the db_engine passed to the hook function 2019-11-08 10:51:33 +01:00
Antoine R. Dumont (@ardumont)
e8a67a7650
swh.lister: Remove completely references to swh.storage.schemata
Related to 56d7cff
2019-11-06 15:46:04 +01:00
Antoine R. Dumont (@ardumont)
e0dbca759c
lister.debian: Move run method parameters to constructor 2019-11-05 17:44:45 +01:00
Antoine R. Dumont (@ardumont)
b745c5a735
lister.debian: Default to run a listing on debian distribution
That fixes the `swh lister run --lister debian` cli entrypoint.
2019-11-05 10:35:51 +01:00
Antoine R. Dumont (@ardumont)
a60e0bbc41
lister.debian: Fix task creation
By adding a `retries_left`
2019-11-05 10:35:51 +01:00
Antoine R. Dumont (@ardumont)
f872792407
debian.lister: Send origin url as load-debian task parameter
Instead of the old origin dict. That's what the debian loaders (old and new)
expect.
2019-11-05 10:35:51 +01:00
Antoine R. Dumont (@ardumont)
7c247c8a4a
debian/lister: Use url parameter name instead of origin
within the scheduled task.

Related D2135
2019-11-04 10:00:55 +01:00
Antoine R. Dumont (@ardumont)
56d7cff6e1
debian/model: Install lister model within the lister repository
This is no longer shared between the new debian loader and the lister.

The swh.storage.schemata module is still part of the swh.storage module though.
As this is still a dependency for the current swh.loader.debian production
loader. This will be cleaned up later.

Related D2135
2019-11-04 10:00:54 +01:00
Stefano Zacchiroli
6159faa2f5 mypy: add typing annotations for novel lister abstractions 2019-10-28 15:35:21 +01:00
Nicolas Dandrimont
78105940ff Stop binding tasks to a specific instance of the celery app
The celery.shared_task decorator allows late-binding of tasks to any celery app,
which is well suited for our "task plugin" architecture.
2019-10-18 18:02:25 +02:00
Antoine R. Dumont (@ardumont)
a64ae9641d
debian.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-15 12:21:24 +02:00
David Douard
b810876ef8 tasks: normalize the url argument name of most listers
Since all the listing tasks accept a URL as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.

Also remove the now useless `new_lister()` functions.
2019-09-04 15:38:01 +02:00
David Douard
8d9deeb8f8 plugins: add support for scheduler's task-type declaration
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:

- only create missing task-types (do not update them), but check that the
  backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
  plugin registry field will be checked and added if needed; tasks whose name
  starts with an underscore will not be added,
- each added task-type will have:
  - a 'type' field derived from the task's function name (with underscores
    replaced with dashes),
  - a description field set to the first line of that function's docstring,
  - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
    a simple pattern matching to have decent default values for full/incremental
    tasks),
  - these default values can be overloaded via the 'task_type' plugin registry
    entry.

For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`).

Comes with some tests.
2019-09-04 15:36:08 +02:00
David Douard
e3c0ea9d90 implement listers as plugins
Listers are declared as plugins via the `swh.workers` entry_point.

As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:

- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optional): hook (callable) used to initialize the lister's state
  (typically, create/initialize the database for this lister).
  If not set, the default implementation creates database tables (after
  optionally having deleted existing ones) according to models declared in
  the `models` register field.

There is no need to explicitly add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).

Also refactor the cli tools a bit:
- add support for the standard --config-file option at the 'lister' group
  level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
  initializing (especially with --drop-tables) the database for a single
  lister is unreliable, since all tables are created using a single MetaData
  (in the same namespace).
2019-09-03 15:02:24 +02:00
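
A registry function of the shape described in this commit might look like the sketch below. The classes, the hook and the module path are hypothetical placeholders; only the dict keys (`task_modules`, `lister`, `models`, `init`) come from the commit message.

```python
class ExampleLister:  # hypothetical placeholder for a lister class
    pass


class ExampleModel:  # hypothetical placeholder for an SQLAlchemy model
    pass


def init_lister_state(db_engine):  # hypothetical optional `init` hook
    """Create or initialize whatever state the lister needs."""


def register():
    """Return the plugin description expected from a lister entry point."""
    return {
        "task_modules": ["example_lister.tasks"],  # hypothetical module path
        "lister": ExampleLister,
        "models": [ExampleModel],
        "init": init_lister_state,  # optional
    }
```

A later commit in this log (4245c5046f) removes the `models` field from the dict returned by `register`.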
Antoine R. Dumont (@ardumont)
b3463ecddc
Drop SWH prefix in classes everywhere
It's redundant with the swh modules themselves.
2019-06-20 19:08:46 +02:00
Antoine R. Dumont (@ardumont)
64a9bc691d
lister.core: Stop creating origins when scheduling tasks
Prior to this commit, the lister also created origins in the archive. Now, we
only schedule new origins for ingestion.
2019-06-13 15:42:07 +02:00
Antoine R. Dumont (@ardumont)
b81621274b
lister: Unify credentials structure between listers
This becomes a dictionary keyed by <lister-name>, whose values are dicts keyed
by <instance-name>, whose values are lists of username/password dicts.

Related T1772
2019-05-29 14:00:11 +02:00
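
The nested structure described above can be written out as follows. The lister name, instance name and credential values are made-up examples, and `credentials_for` is a hypothetical lookup helper.

```python
# lister name -> instance name -> list of username/password dicts
credentials = {
    "example-lister": {
        "example-instance": [
            {"username": "alice", "password": "secret"},
        ],
    },
}


def credentials_for(credentials, lister_name, instance_name):
    """Return the credential list for a lister instance, or an empty list."""
    return credentials.get(lister_name, {}).get(instance_name, [])
```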
David Douard
f670de298f Remove debug logging from tasks' code
since this is now handled by the SWHTask itself.
2019-01-17 13:58:29 +01:00
David Douard
e6a4ae7619 flake8: remove unneeded imports 2019-01-15 18:17:20 +01:00