Commit graph

82 commits

Author SHA1 Message Date
Antoine R. Dumont (@ardumont)
ed73cea771
github.lister: Filter out partial repositories which break listing
This commit fixes the mapping of repositories to the model. The mapping broke
when a listed repository was either None or missing the id field [1].

[1] https://sentry.softwareheritage.org/share/issue/532d682182fc43d6a7a99400e3928811/
2020-01-20 10:25:57 +01:00
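The filtering described in this commit can be sketched as follows. This is only an illustration, assuming the listed repositories arrive as a list of dicts decoded from the API response; `filter_partial_repositories` is a hypothetical name, not the lister's actual function.

```python
from typing import Any, Dict, Iterable, List, Optional


def filter_partial_repositories(
    repos: Iterable[Optional[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Keep only entries that are not None and that carry an 'id' field.

    Hypothetical helper: the real lister may apply the filter elsewhere.
    """
    return [repo for repo in repos if repo is not None and "id" in repo]


# One complete entry, one None entry, one entry without an id.
listed = [{"id": 1, "full_name": "a/b"}, None, {"full_name": "c/d"}]
kept = filter_partial_repositories(listed)
```

Only the first entry survives the filter, so the model mapping never sees a partial repository.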
Antoine R. Dumont (@ardumont)
4b383abc56
github.lister: Use Retry-After header when rate limit reached
Following GitHub's documentation [1]

[1] https://developer.github.com/v3/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits

Related to T2170
2020-01-17 10:37:53 +01:00
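A minimal sketch of honouring a Retry-After header, under stated assumptions: the status codes checked (403 and 429), the fallback delay, and the function name `retry_delay` are illustrative choices, not taken from the lister's code. Only the "number of seconds" form of the header is handled; the HTTP-date form falls back to the default.

```python
from typing import Mapping

# Assumed fallback when the header is absent or unparsable (hypothetical value).
DEFAULT_WAIT_SECONDS = 60


def retry_delay(status_code: int, headers: Mapping[str, str]) -> int:
    """Return how many seconds to wait before retrying, or 0 for no wait."""
    if status_code not in (403, 429):  # assumed rate-limit status codes
        return 0
    value = headers.get("Retry-After")
    if value is None:
        return DEFAULT_WAIT_SECONDS
    try:
        return max(0, int(value))
    except ValueError:
        # Retry-After may also be an HTTP-date; not handled in this sketch.
        return DEFAULT_WAIT_SECONDS
```

A caller would typically sleep for the returned number of seconds and then reissue the same request. Note that a plain dict lookup is case-sensitive, whereas HTTP client libraries usually expose case-insensitive header mappings.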
Antoine R. Dumont (@ardumont)
5ab9d67d67
core: Align listers' task output (hg/git tasks) with expected format
Related to T2134
Related to D2409
Related to D2410
2019-12-09 15:12:17 +01:00
Antoine R. Dumont (@ardumont)
4a9608f31c
lister/tasks: Standardize return statements
This commit adapts the return statements of both the listers and their
associated tasks. This standardizes on what other modules (e.g. both dvcs and
package loaders) do.
2019-12-02 15:49:38 +01:00
Nicolas Dandrimont
ff7fdf24db Use a uniform User-Agent on all listers
This also adds tests to make sure that we properly send our version number to
upstreams.
2019-11-22 15:49:23 +01:00
Stefano Zacchiroli
974f80f966 typing: minimal changes to make a no-op mypy run pass 2019-10-28 15:35:21 +01:00
Nicolas Dandrimont
78105940ff Stop binding tasks to a specific instance of the celery app
The celery.shared_task decorator allows late-binding of tasks to any celery app,
which is well suited for our "task plugin" architecture.
2019-10-18 18:02:25 +02:00
Antoine R. Dumont (@ardumont)
a8cde12d72
tests: Update pytest_plugin according to latest version change 2019-10-14 18:20:15 +02:00
Antoine R. Dumont (@ardumont)
0b8b1419e1
github.lister: Add integration test which checks scheduled tasks
Related T2032
2019-10-12 03:28:39 +02:00
Antoine Lambert
04d8fdf8df github/lister: Prevent erroneous scheduler tasks disabling
Closes T2014
2019-09-19 14:30:30 +02:00
Antoine Lambert
7572228f7c listers: Ensure run can be called without bounds arguments
Closes T2001
2019-09-17 15:09:04 +02:00
David Douard
b810876ef8 tasks: normalize the url argument name of most listers
Since all the listing tasks accept a url as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.

Also remove the now-useless `new_lister()` functions.
2019-09-04 15:38:01 +02:00
David Douard
8d9deeb8f8 plugins: add support for scheduler's task-type declaration
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:

- only create missing task-types (do not update them), but check that the
  backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
  plugin registry field will be checked and added if needed; tasks whose names
  start with an underscore will not be added,
- each added task-type will have:
  - a 'type' field derived from the task's function name (with underscores
    replaced with dashes),
  - a description field set to the first line of that function's docstring,
  - default values as provided by the swh.lister.cli.DEFAULT_TASK_TYPE (with
    a simple pattern matching to have decent default values for full/incremental
    tasks),
  - these default values can be overloaded via the 'task_type' plugin registry
    entry.

For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`).

Comes with some tests.
2019-09-04 15:36:08 +02:00
David Douard
e3c0ea9d90 implement listers as plugins
Listers are declared as plugins via the `swh.workers` entry_point.

As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:

- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optional): hook (callable) used to initialize the lister's state
  (typically, create/initialize the database for this lister).
  If not set, the default implementation creates database tables (after
  optionally having deleted existing ones) according to models declared in
  the `models` register field.

There is no need to explicitly add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).

Also refactor the cli tools a bit:
- add support for the standard --config-file option at the 'lister' group
  level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
  initializing (especially with --drop-tables) the database for a single
  lister is unreliable, since all tables are created using a single MetaData
  (in the same namespace).
2019-09-03 15:02:24 +02:00
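A registry function of the shape described in this commit might look like the sketch below. Everything named `Example*`, the module path `example_lister.tasks`, and the `missing_fields` helper are hypothetical; the model class is a dependency-free stand-in rather than a real SQLAlchemy model.

```python
from typing import Any, Dict, List

# Fields the commit message says a lister plugin registry must provide.
REQUIRED_FIELDS = ("task_modules", "lister", "models")


class ExampleModel:
    """Stand-in for an SQLAlchemy model (hypothetical, no ORM dependency)."""

    __tablename__ = "example_repo"


class ExampleLister:
    """Stand-in for a lister class (hypothetical)."""

    LISTER_NAME = "example"


def register() -> Dict[str, Any]:
    """Registry function that a `swh.workers` entry point would reference."""
    return {
        "task_modules": ["example_lister.tasks"],  # hypothetical module path
        "lister": ExampleLister,
        "models": [ExampleModel],
        # "init" is optional; when omitted, the default table creation applies.
    }


def missing_fields(registry: Dict[str, Any]) -> List[str]:
    """Return the required fields absent from a registry dict."""
    return [field for field in REQUIRED_FIELDS if field not in registry]


registry = register()
```

A check such as `missing_fields` is one way a test suite could verify that every installed lister plugin exposes the expected fields.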
David Douard
b87cd5d309 github: make GitHubLister's api_baseurl init argument optional 2019-09-02 12:29:38 +02:00
Antoine R. Dumont (@ardumont)
3d473c307c
lister: Type correctly the 'indexable' column
instead of converting that column to a string.

As a side effect, for bitbucket, we were improperly providing the `after` query
parameter as a date that was not url encoded. This resulted in improper api
responses from bitbucket (from time to time we received the same next index as
the current one).

Related T1826
2019-06-26 10:58:54 +02:00
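To illustrate why the encoding matters: an ISO 8601 timestamp with a UTC offset contains `+` and `:`, and an unencoded `+` in a query string is commonly decoded as a space. The sketch below uses the standard library only; the exact date format bitbucket expects is not stated in the commit, so the ISO format and the `after_query` name are assumptions.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode


def after_query(index: datetime) -> str:
    """Encode a typed datetime index as an 'after' query string."""
    return urlencode({"after": index.isoformat()})


index = datetime(2019, 6, 26, 10, 58, 54, tzinfo=timezone.utc)
encoded = after_query(index)
# encoded == "after=2019-06-26T10%3A58%3A54%2B00%3A00"

# Without encoding, the '+' of the UTC offset is decoded as a space.
raw = "after=" + index.isoformat()
garbled = parse_qs(raw)["after"][0]
# garbled == "2019-06-26T10:58:54 00:00"
```

Keeping the column typed as a date and encoding it only when building the request avoids sending the garbled variant.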
Antoine R. Dumont (@ardumont)
b99617f976
relister: Fix consistently the behavior for the first time relisting
If nothing has been done prior to a full relisting, there is actually nothing
to list. So the relister in question does nothing.

In that context, the IndexingLister class's `db_partition_indices` method now
returns an empty list instead of raising a ValueError when there is nothing to
list.

Related T1826
Related e129e48
2019-06-25 14:48:17 +02:00
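The behavior change can be pictured with the sketch below. It is not the real `db_partition_indices` method, whose signature is not shown in the commit; `partition_indices` is a simplified stand-in that assumes integer indices and takes the minimum and maximum known index as arguments.

```python
from typing import List, Optional


def partition_indices(
    min_index: Optional[int],
    max_index: Optional[int],
    partition_size: int,
) -> List[int]:
    """Return the starting index of each partition to relist.

    When nothing has been listed yet (no min/max index), return an empty
    list instead of raising, so a first-time relisting is a no-op.
    """
    if min_index is None or max_index is None:
        return []
    return list(range(min_index, max_index + 1, partition_size))


first_run = partition_indices(None, None, 10)
later_run = partition_indices(0, 25, 10)
```

With an empty result, a relister loop that iterates over the partitions simply does nothing on its first run.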
Antoine R. Dumont (@ardumont)
5ec3067b0d
Clean up code
- Remove unneeded return instructions
- Clarify tests code regarding request_index computations
2019-06-25 14:48:13 +02:00
Antoine R. Dumont (@ardumont)
b3463ecddc
Drop SWH prefix in classes everywhere
It's redundant with the swh module names themselves.
2019-06-20 19:08:46 +02:00
Valentin Lorentz
aef7d5952e Remove columns 'description' and 'origin_id'.
They are useless.
2019-06-19 10:29:15 +02:00
Antoine R. Dumont (@ardumont)
fc92c79b7e
models: Unify tablenames using singular as main archive's convention
Related P434
2019-06-18 07:18:34 +02:00
Antoine R. Dumont (@ardumont)
b81621274b
lister: Unify credentials structure between listers
The credentials become a dictionary keyed by <lister-name>, whose values are
dictionaries keyed by <instance-name>, whose values are lists of
username/password dictionaries.

Related T1772
2019-05-29 14:00:11 +02:00
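As an illustration of the nesting described in this commit, the structure could look like the following. The lister name, instance name, and account values are made up, and `credentials_for` is a hypothetical accessor.

```python
from typing import Dict, List

Credentials = Dict[str, Dict[str, List[Dict[str, str]]]]

# <lister-name> -> <instance-name> -> list of username/password dicts.
# All values below are invented for the example.
credentials: Credentials = {
    "gitlab": {
        "gitlab.example.org": [
            {"username": "user-1", "password": "secret-1"},
            {"username": "user-2", "password": "secret-2"},
        ],
    },
}


def credentials_for(
    creds: Credentials, lister_name: str, instance_name: str
) -> List[Dict[str, str]]:
    """Return the credential list for a lister instance, or an empty list."""
    return creds.get(lister_name, {}).get(instance_name, [])


accounts = credentials_for(credentials, "gitlab", "gitlab.example.org")
```

Using the same shape for every lister means a single lookup helper works regardless of the forge.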
David Douard
e5c3559033 tasks: fix handling of unsupported promise.save() calls
The exception can also be an AttributeError.

Also do not reraise this exception (in github/tasks.py). This promise
saving feature is used for tests.
2019-04-11 11:03:48 +02:00
David Douard
f670de298f Remove debug logging from tasks' code
since this is now handled by the SWHTask itself.
2019-01-17 13:58:29 +01:00
David Douard
e31b61bee1 Do not crash range tasks if celery result backend does not support saving the group's state 2019-01-15 15:32:07 +01:00
David Douard
4fc1968f1f Rename the bitbucket and github listers to remove the 'tld' part
so that we can easily manage its configuration (especially in the docker
environment) by referring to this lister as only 'bitbucket' everywhere
(ie. python package name and config file names).
2019-01-14 12:07:57 +01:00
David Douard
f46f3e2015 Remove explicit setting of the task base class
since it's now the default base class in swh-scheduler (>= 0.0.39)
2019-01-10 09:55:17 +01:00
David Douard
b7139619fd Add tests for github tasks
In order to be able to run unit tests using celery pytest fixtures, we
use a dedicated swh_app fixture that ensures the "main" celery app
is the test app (otherwise subtasks won't work).
2019-01-08 10:35:33 +01:00
David Douard
0583b0e685 Add a 'ping' task for every lister. 2019-01-08 10:35:33 +01:00
David Douard
2d1f0643ff Heavy refactor of the task system
Get rid of the class based task definition in favor of decorator-based
task declarations.

Doing so, we can get rid of core/tasks.py

Task names are explicitly set to keep compatibility with task
definitions in schedulers' database.

This also adds debug statements at the beginning and end of each lister
task.
2019-01-08 10:33:32 +01:00
David Douard
5ff8093c5d Simplify listers Model constructor
the default implementation of SQLAlchemy's declarative API should
work just fine.
2018-12-12 18:27:11 +01:00
Antoine R. Dumont (@ardumont)
d88f1b60c9
core/lister: Make the tasks take an explicit lister_args argument
Avoid eating *all* arbitrary arguments and passing them along to the
new_lister method.
2018-07-17 15:48:48 +02:00
Antoine R. Dumont (@ardumont)
4db15aaf16
swh.lister.gitlab: Remove indexable column from gitlab lister 2018-07-12 13:41:47 +02:00
Antoine R. Dumont (@ardumont)
d640fdcc96
swh.lister.gitlab.tests: Separate properly tests per lister 2018-07-12 12:23:46 +02:00
Antoine R. Dumont (@ardumont)
4c4aa0ead2
swh.lister: Make LISTER_NAME a class attribute
swh.lister.gitlab: make the 'instance' a constructor parameter
2018-07-11 17:43:41 +02:00
Antoine R. Dumont (@ardumont)
7954e03627
swh.lister: Document swh.lister.tasks's intent
And remove unneeded indexing name from the RangeListerTask
2018-07-11 15:56:32 +02:00
Antoine R. Dumont (@ardumont)
ba146376d6
swh.lister: Add tests around the gitlab lister
Related T989
2018-07-11 15:56:32 +02:00
Antoine R. Dumont (@ardumont)
f4fe1b058b
swh.lister.*: Formatting 2018-07-03 12:17:46 +02:00
Nicolas Dandrimont
e477a46c60 Add missing __init__.py files
Helps with tests autodetection
2017-10-30 16:38:27 +01:00
Nicolas Dandrimont
cf3220d1fb github.models: handle the fork argument 2017-09-05 11:38:27 +02:00
Nicolas Dandrimont
4b56b6037c github.models: add fork information to repos 2017-09-04 19:40:45 +02:00
Nicolas Dandrimont
f6f077b789 github.tasks: the github api is rooted at api.github.com 2017-09-04 17:42:20 +02:00
Nicolas Dandrimont
75ba12e395 remove useless __init__.py file 2017-06-12 18:22:04 +02:00
Avi Kelman (fiendish)
68d77fd43f Refactor lister code
Streamline production of new listers by aggressively moving core
functionality into progressively inherited (A->B->C) base classes
with the transport layer abstracted.
This should make common individual forge listers straightforward to
produce with minimal customization. Github and Bitbucket listers
can be used as examples of the indexing type.
2017-03-06 12:35:49 +01:00
Antoine Pietri
a6e43f2777 config: use 5002 as the default storage port 2017-02-21 17:23:58 +01:00
Antoine R. Dumont (@ardumont)
b217f55cfe
Update storage configuration reading
Related T613
2016-12-15 19:07:02 +01:00
Nicolas Dandrimont
d47905b0a1 tasks: add tasks for incremental and full updates 2016-10-20 17:19:39 +02:00
Nicolas Dandrimont
6fd0184229 lister: update Copyright 2016-10-20 16:59:28 +02:00
Nicolas Dandrimont
7fa507e6ff lister: disable tasks for deleted repositories
When operating on a range of repositories, if we notice that a
repository has disappeared, we disable the task associated with that
repository.
2016-10-20 16:28:19 +02:00
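One way to detect disappeared repositories within a listed range is sketched below. It assumes integer repository ids and that the ids already known locally can be compared with the ids returned by the API for the same range; `missing_repository_ids` is a hypothetical helper, and the actual disabling of tasks in the scheduler is left out.

```python
from typing import Iterable, List


def missing_repository_ids(
    known_ids: Iterable[int],
    listed_ids: Iterable[int],
    start: int,
    end: int,
) -> List[int]:
    """Return known ids within [start, end] that the API no longer lists."""
    listed = set(listed_ids)
    return sorted(
        repo_id
        for repo_id in known_ids
        if start <= repo_id <= end and repo_id not in listed
    )


known = [10, 11, 12, 13, 20]  # ids stored from a previous listing
listed = [10, 12]  # ids returned now for the range 10..13
to_disable = missing_repository_ids(known, listed, 10, 13)
```

Restricting the comparison to the listed range matters: id 20 is absent from `listed` only because it lies outside the range, so it must not be treated as deleted.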
Nicolas Dandrimont
a1a6228e05 lister: retrieve old task and origin id if a full_name has been recycled
If a repo changed hands, it is possible that a full_name is recycled. In
that case, we reuse the task_id and origin_id from the old repository
instead of recreating them.
2016-10-20 16:26:53 +02:00
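The reuse of identifiers for a recycled full_name can be sketched as below, assuming repositories are represented as plain dicts; the `uid` key and the `inherit_ids` helper are hypothetical, while `full_name`, `task_id`, and `origin_id` come from the commit message.

```python
from typing import Any, Dict, Optional


def inherit_ids(
    new_repo: Dict[str, Any],
    old_repo: Optional[Dict[str, Any]],
) -> Dict[str, Any]:
    """Copy task_id and origin_id from an old record sharing the full_name."""
    merged = dict(new_repo)
    if old_repo is not None and old_repo.get("full_name") == new_repo.get("full_name"):
        merged["task_id"] = old_repo.get("task_id")
        merged["origin_id"] = old_repo.get("origin_id")
    return merged


old = {"uid": 1, "full_name": "owner/project", "task_id": 7, "origin_id": 42}
new = {"uid": 2, "full_name": "owner/project"}
merged = inherit_ids(new, old)
```

The new record keeps its own `uid` but inherits the existing task and origin, so no duplicates are created for the same full_name.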