117 commits

Author SHA1 Message Date
Antoine Lambert
bde37867d8 docs: Fix broken external links
Those were spotted thanks to the sphinx linkcheck builder
2025-02-20 10:15:54 +00:00
Antoine Lambert
88a715d0c1 github: Ensure range listers do not override shared lister state
Recent changes in the base Lister class implementation make the call to
self.scheduler.update_lister mandatory to update the last termination
date for a lister.

It has some side effects on the GitHub lister as there is one incremental
instance plus multiple range ones relisting previously discovered repos
executed in parallel.

Range GitHub listers should not override the shared incremental lister
state as StaleData exceptions might be raised otherwise, so override
the set_state_in_scheduler Lister method to ensure that.
2024-10-28 15:37:02 +00:00
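The override described in 88a715d0c1 can be illustrated with a minimal sketch. Only the names `set_state_in_scheduler` and `update_lister` come from the commit message; the classes, the constructor signature and the state layout below are hypothetical stand-ins, not the real swh.lister code.

```python
class RecordingScheduler:
    """Hypothetical stand-in for the scheduler; it only records updates."""

    def __init__(self):
        self.updates = []

    def update_lister(self, state):
        self.updates.append(state)


class BaseLister:
    """Hypothetical base class, not the real swh.lister implementation."""

    def __init__(self, scheduler, state):
        self.scheduler = scheduler
        self.state = state

    def set_state_in_scheduler(self):
        # The base behaviour always pushes the state to the scheduler.
        self.scheduler.update_lister(self.state)


class IncrementalLister(BaseLister):
    """Owns the shared state, so it keeps the base behaviour."""


class RangeLister(BaseLister):
    def set_state_in_scheduler(self):
        # Deliberately a no-op: a range lister must not overwrite the
        # state shared with the incremental lister.
        pass


scheduler = RecordingScheduler()
IncrementalLister(scheduler, {"last_seen_id": 100}).set_state_in_scheduler()
RangeLister(scheduler, {"last_seen_id": 42}).set_state_in_scheduler()
```

After both calls, only the incremental lister's state has reached the scheduler.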
Nicolas Dandrimont
f7abfafffe GitHub: record whether the origin is a fork
For now this information is not used downstream, but it can be useful
for specific analysis or one-shot scheduling.
2024-07-18 10:45:06 +02:00
Antoine Lambert
6e7bc49ec7 Harmonize listers parameters and add test to check mandatory ones
Ensure that all lister classes have the same set of mandatory parameters
in their constructors, notably: scheduler, url, instance and credentials.

Add a new test checking listers classes have mandatory parameters declared
in their constructors. The purpose is to avoid deployment issues on staging
or production environment as celery tasks can fail to be executed if mandatory
parameters are not handled by listers.

Related to swh/infra/sysadm-environment#5030.
2023-09-06 11:55:34 +02:00
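A check of the kind described in 6e7bc49ec7 could look roughly like the sketch below. The four parameter names come from the commit message; the helper function and the two example classes are hypothetical, and the actual test in the repository may be written differently.

```python
import inspect

# Parameter names taken from the commit message.
MANDATORY_PARAMS = {"scheduler", "url", "instance", "credentials"}


def missing_mandatory_params(lister_cls):
    """Return the mandatory parameter names absent from the constructor."""
    params = inspect.signature(lister_cls.__init__).parameters
    return MANDATORY_PARAMS - set(params)


class GoodLister:
    """Hypothetical lister declaring every mandatory parameter."""

    def __init__(self, scheduler, url=None, instance=None, credentials=None):
        self.scheduler = scheduler


class BadLister:
    """Hypothetical lister missing two mandatory parameters."""

    def __init__(self, scheduler, url=None):
        self.scheduler = scheduler


good_missing = missing_mandatory_params(GoodLister)
bad_missing = missing_mandatory_params(BadLister)
```

A test would then assert that the returned set is empty for every registered lister class.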
Antoine R. Dumont (@ardumont)
a02fdbb4c8 lister.github.utils: Drop no longer used module
This got detected when working on the deployment of the new loader-git.

Refs. swh/infra/sysadm-environment#5017
2023-08-22 11:15:04 +02:00
Antoine Lambert
bcf30aba90 github: Fix fixtures use in tests
requests_ratelimited fixture from swh-core was renamed to
github_requests_ratelimited.

remaining_requests parameter was added to the github_response_callback
function from swh-core, making it no longer compatible with requests_mock
callback for json responses.
2023-01-02 18:06:26 +01:00
Antoine Lambert
e218fbfef6 github: Fix test error with latest pytest release
2023-01-02 14:10:03 +01:00
Nicolas Dandrimont
e785e67315 Hook up recently introduced options to all listers
Hopefully one day we'll be able to replace all of this mess with PEP692
TypedDict kwargs, but that's only on track for Python 3.12.
2022-12-05 16:33:45 +01:00
Antoine R. Dumont (@ardumont)
92d494261f lister: Make sure lister that requires github tokens can use it
While deploying the nixguix lister, I realized that even though the credentials configuration
is properly set for all listers, the listers that actually require github origin
canonicalization do not have access to the github credentials. They are lost in the
constructor, which only keeps the lister's own credentials. This currently translates to
listers being rate-limited.

This commit fixes it by pushing the self.github_session instantiation into the constructor
when the lister explicitly requires the github session, hence lifting the rate limit
for the maven, packagist, nixguix, and github listers.

Related to infra/sysadm-environment#4655
2022-10-26 17:23:40 +02:00
Antoine Lambert
d5c30a3ce3 Update value of User-Agent HTTP request header used by listers
That HTTP header value will now contain the lister name but also a link
to our contact form in order for sysadmins to easily reach us if needed.

The following template is used to generate it:

"Software Heritage <lister_name> lister v<swh-lister version>
 (+https://www.softwareheritage.org/contact)"
2022-09-26 10:48:40 +02:00
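The template from d5c30a3ce3 can be filled in with a small helper. This is a sketch only: the function name, the use of `str.format`, the single space joining the two template lines and the example version number are assumptions, not the code from the commit.

```python
# Template text from the commit message; the two lines are assumed to be
# joined by a single space.
USER_AGENT_TEMPLATE = (
    "Software Heritage {lister_name} lister v{version}"
    " (+https://www.softwareheritage.org/contact)"
)


def build_user_agent(lister_name, version):
    """Hypothetical helper rendering the User-Agent header value."""
    return USER_AGENT_TEMPLATE.format(lister_name=lister_name, version=version)


# "1.2.3" is a made-up version used only for illustration.
user_agent = build_user_agent("github", "1.2.3")
```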
Antoine R. Dumont (@ardumont)
2ffe9c2aea Use swh.core.github.pytest_plugin in github tests
Related to T4232
2022-05-20 16:06:11 +02:00
Valentin Lorentz
d715aaf903 Make user_agent a parameter of GitHubSession
So it can be set when used by other packages
2022-04-26 11:08:53 +02:00
Valentin Lorentz
2d04244cc9 Move GitHubSession from github/lister.py to github/utils.py
So it can be reused by other packages without importing lister.py itself
2022-04-26 11:08:49 +02:00
Valentin Lorentz
9ee4a99f15 github: Refactor rate-limiting out of the GitHubLister class
This will allow the GitHub Metadata Fetcher to reuse the logic
by importing the GitHubSession class.
2022-04-26 11:08:45 +02:00
Valentin Lorentz
d0924f39d0 github: Remove dead code
Authentication is handled directly in the session
2022-04-21 20:32:45 +02:00
Antoine Lambert
d38e05cff7 python: Reformat code with black 22.3.0
Related to T3922
2022-04-08 15:15:09 +02:00
Nicolas Dandrimont
5f567b3c34 Deduplicate origins in the GitHub lister
In some circumstances, GitHub will return two separate repos with the
same html_url in the same page. This makes the lister fail with a
cardinality error.
2021-12-01 16:00:14 +01:00
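One way to deduplicate a page of results as described in 5f567b3c34 is sketched below. It assumes each repository is a dict with an `html_url` key and keeps the first occurrence; the function name and the sample data are hypothetical, and the commit may resolve duplicates differently.

```python
def deduplicate_by_html_url(repos):
    """Keep the first repository seen for each html_url, preserving order."""
    seen = set()
    unique = []
    for repo in repos:
        url = repo["html_url"]
        if url in seen:
            continue
        seen.add(url)
        unique.append(repo)
    return unique


# Made-up page of results containing one duplicated html_url.
page = [
    {"id": 1, "html_url": "https://github.com/example/a"},
    {"id": 2, "html_url": "https://github.com/example/b"},
    {"id": 3, "html_url": "https://github.com/example/a"},
]
unique_repos = deduplicate_by_html_url(page)
```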
Nicolas Dandrimont
879170a57d GitHub: handle edge cases with empty responses
2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
c375a61b16 GitHub: handle Server Errors
These errors happen, sometimes, when requesting large pages of results.
2021-03-19 16:53:52 +01:00
Nicolas Dandrimont
4a215e68e0 GitHub: Move rate-limit reset logic to RateLimited exception
This makes the logic easier to test.
2021-03-19 16:52:46 +01:00
Nicolas Dandrimont
cfd4169bd8 Retry GitHub requests on ChunkEncodingErrors
These happen, sometimes, when the connection to the GitHub server
resets, e.g. because of congestion on a slow link.
2021-03-19 16:52:46 +01:00
Nicolas Dandrimont
61c1d444c5 GitHub: Move rate limit handling to the request function
2021-03-19 15:58:01 +01:00
Nicolas Dandrimont
03b10e5c83 GitHub: Start moving the request logic to a separate function
2021-03-19 15:58:01 +01:00
Nicolas Dandrimont
8f7dbb7488 GitHub: Use function for requests.Session initialization
This will help us break the retry logic for the listing requests
themselves out into a separate function too.
2021-03-19 15:58:01 +01:00
Antoine Lambert
4245c5046f Remove no longer used models field in dict returned by register
2021-02-02 16:33:52 +01:00
tenma
6cd31769c1 tests: Remove no longer used conftest files
All the fixtures declared in them are not used anymore in the
tests of the listers ported to the new Lister API.
2021-01-26 17:09:04 +01:00
Antoine Lambert
ea8ecee541 tests: Fix errors after swh-scheduler API update
The PaginatedListedOriginList model has been updated in
rDSCHb93aa5be2c2d5dc2130e1027698f3e1255052d8d and the origins
field has been renamed to results.
2021-01-25 17:11:54 +01:00
Nicolas Dandrimont
b63aa83b41 Reimplement the GitHub lister using the new pattern class
This replaces the test data with some manually generated answers, which allows
us to test a few more cases for instantiating the lister.

This also expands test coverage to test behavior on rate-limited requests.
2021-01-11 11:00:29 +01:00
Antoine R. Dumont (@ardumont)
978fbbe029 lister.github.tests: Clarify lister configuration
2020-10-30 13:30:15 +01:00
Antoine Lambert
22f7181294 python: Reorder imports with isort
Related to T2610
2020-09-17 17:48:27 +02:00
Antoine R. Dumont (@ardumont)
5a5b7ef70b tests: Separate lister instantiations
Prior to this commit, all listers were instantiated at the same time even if
only one was needed. This commit separates those instantiations.

The only drawback to this is the db model initialization which now happens at
each lister instantiation. This can be dealt with if needed at another time
though.
2020-09-02 12:49:00 +02:00
Antoine R. Dumont (@ardumont)
9437a643ad pytest: Define plugin and declare it in the root conftest
Then drop all unneeded and indirect imports
2020-09-02 12:25:15 +02:00
Nicolas Dandrimont
c9963d4302 Use the new names for the swh.scheduler test fixtures
2020-07-09 17:06:50 +02:00
David Douard
93a4d8b784 Enable black
- blackify all the python files,
- enable black in pre-commit,
- add a black tox environment.
2020-04-08 16:31:22 +02:00
Gautier Pugnonblanc Yann
60adc424be add type annotations in some lister files
2020-02-17 15:58:34 +01:00
Antoine R. Dumont (@ardumont)
ed73cea771 github.lister: Filter out partial repositories which break listing
This commit fixes the repository mapping to model. It broke when the listed
repository was either None or missing the id field [1]

[1] https://sentry.softwareheritage.org/share/issue/532d682182fc43d6a7a99400e3928811/
2020-01-20 10:25:57 +01:00
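The filtering described in ed73cea771 amounts to skipping entries that are None or lack an `id`. A minimal sketch, assuming repositories are plain dicts as in the GitHub API JSON response; the function name and the sample data are hypothetical:

```python
def keep_complete_repositories(repos):
    """Drop entries that are None or that have no 'id' field."""
    return [repo for repo in repos if repo is not None and "id" in repo]


# Made-up response mixing complete and partial entries.
response = [
    {"id": 10, "html_url": "https://github.com/example/a"},
    None,
    {"html_url": "https://github.com/example/b"},
]
complete = keep_complete_repositories(response)
```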
Antoine R. Dumont (@ardumont)
4b383abc56 github.lister: Use Retry-After header when rate limit reached
Following GitHub's documentation [1]

[1] https://developer.github.com/v3/guides/best-practices-for-integrators/#dealing-with-abuse-rate-limits

Related to T2170
2020-01-17 10:37:53 +01:00
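Reading the Retry-After header mentioned in 4b383abc56 might look like the sketch below. It only handles the integer-seconds form of the header. The function name and the 60-second fallback are assumptions, not values taken from the lister.

```python
def seconds_to_wait(headers, default=60):
    """Return the delay announced by Retry-After, or a fallback value.

    Only the integer-seconds form is handled; the HTTP-date form and any
    unparsable value fall back to the (assumed) default.
    """
    value = headers.get("Retry-After")
    if value is None:
        return default
    try:
        return max(0, int(value))
    except ValueError:
        return default


announced = seconds_to_wait({"Retry-After": "30"})
fallback = seconds_to_wait({})
```

A caller would typically sleep for the returned number of seconds before retrying the request.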
Antoine R. Dumont (@ardumont)
5ab9d67d67 core: Align listers' task output (hg/git tasks) with expected format
Related to T2134
Related to D2409
Related to D2410
2019-12-09 15:12:17 +01:00
Antoine R. Dumont (@ardumont)
4a9608f31c lister/tasks: Standardize return statements
This commit adapts the return statements of both the listers and their
associated tasks. This standardizes on what other modules (e.g. both dvcs and
package loaders) do.
2019-12-02 15:49:38 +01:00
Nicolas Dandrimont
ff7fdf24db Use a uniform User-Agent on all listers
This also adds tests to make sure that we properly send our version number to
upstreams.
2019-11-22 15:49:23 +01:00
Stefano Zacchiroli
974f80f966 typing: minimal changes to make a no-op mypy run pass
2019-10-28 15:35:21 +01:00
Nicolas Dandrimont
78105940ff Stop binding tasks to a specific instance of the celery app
The celery.shared_task decorator allows late-binding of tasks to any celery app,
which is well suited for our "task plugin" architecture.
2019-10-18 18:02:25 +02:00
Antoine R. Dumont (@ardumont)
a8cde12d72 tests: Update pytest_plugin according to latest version change
2019-10-14 18:20:15 +02:00
Antoine R. Dumont (@ardumont)
0b8b1419e1 github.lister: Add integration test which checks scheduled tasks
Related to T2032
2019-10-12 03:28:39 +02:00
Antoine Lambert
04d8fdf8df github/lister: Prevent erroneous scheduler tasks disabling
Closes T2014
2019-09-19 14:30:30 +02:00
Antoine Lambert
7572228f7c listers: Ensure run can be called without bounds arguments
Closes T2001
2019-09-17 15:09:04 +02:00
David Douard
b810876ef8 tasks: normalize the url argument name of most listers
Since all the listing tasks accept a url as first argument (whatever the
argument name is), it makes sense to use a simple common argument name for
this. I've chosen 'url' instead of api_baseurl/forge_url/url.

Also kill the now useless `new_lister()` functions.
2019-09-04 15:38:01 +02:00
David Douard
8d9deeb8f8 plugins: add support for scheduler's task-type declaration
Add a new register-task-types cli that will create missing task-type entries in the
scheduler according to:

- only create missing task-types (do not update them), but check that the
  backend_name field is consistent,
- each SWHTask-based task declared in a module listed in the 'task_modules'
  plugin registry field will be checked and added if needed; tasks whose name
  starts with an underscore will not be added,
- each added task-type will have:
  - a 'type' field derived from the task's function name (with underscores
    replaced with dashes),
  - a description field set to the first line of that function's docstring,
  - default values as provided by swh.lister.cli.DEFAULT_TASK_TYPE (with
    simple pattern matching to have decent default values for full/incremental
    tasks),
  - these default values can be overridden via the 'task_type' plugin registry
    entry.

For this, we had to rename all task names (e.g. `cran_lister` -> `list_cran`).

Comes with some tests.
2019-09-04 15:36:08 +02:00
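The naming rules listed in 8d9deeb8f8 can be sketched as follows. The helper name, the shape of the returned dict and the two example functions are hypothetical. The defaults from swh.lister.cli.DEFAULT_TASK_TYPE are left out because their contents are not given in the commit message.

```python
def task_type_entry(task_func):
    """Derive 'type' and 'description' for a task function.

    Returns None for functions whose name starts with an underscore,
    since the commit message says those are not added.
    """
    name = task_func.__name__
    if name.startswith("_"):
        return None
    doc = (task_func.__doc__ or "").strip()
    description = doc.splitlines()[0] if doc else ""
    return {"type": name.replace("_", "-"), "description": description}


def list_cran():
    """List CRAN packages.

    Only the first docstring line ends up in the description.
    """


def _private_helper():
    """Never registered as a task-type."""


entry = task_type_entry(list_cran)
skipped = task_type_entry(_private_helper)
```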
David Douard
e3c0ea9d90 implement listers as plugins
Listers are declared as plugins via the `swh.workers` entry_point.

As such, the registry function is expected to return a dict with the
`task_modules` field (as for generic worker plugins), plus:

- `lister`: the lister class,
- `models`: list of SQLAlchemy models used by this lister,
- `init` (optional): hook (callable) used to initialize the lister's state
  (typically, create/initialize the database for this lister).
  If not set, the default implementation creates database tables (after
  optionally having deleted existing ones) according to models declared in
  the `models` register field.

There is no need to explicitly add lister task modules in the main
`conftest` module, but any new/extra lister to be tested must be registered
(the tested lister module must be properly installed in the test environment).

Also refactor the cli tools a bit:
- add support for the standard --config-file option at the 'lister' group
  level,
- move the --db-url to the 'lister' group,
- drop the --lister option for the `swh lister db-init` cli tool:
  initializing (especially with --drop-tables) the database for a single
  lister is unreliable, since all tables are created using a single MetaData
  (in the same namespace).
2019-09-03 15:02:24 +02:00
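A registry function of the shape described in e3c0ea9d90 might look like the sketch below. The key names `lister`, `models` and `task_modules` come from the commit message; the placeholder classes, the module path and the assumption that `task_modules` holds a list of module names are hypothetical. The optional `init` hook is omitted, so the default described above would apply.

```python
class ExampleLister:
    """Hypothetical lister class, for illustration only."""


class ExampleModel:
    """Hypothetical stand-in for an SQLAlchemy model."""


def register():
    """Return the plugin description expected from a lister entry point."""
    return {
        "lister": ExampleLister,
        "models": [ExampleModel],
        # Assumed to be a list of importable module names.
        "task_modules": ["example_lister.tasks"],
    }


plugin = register()
```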
David Douard
b87cd5d309 github: make GitHubLister's api_baseurl init argument optional
2019-09-02 12:29:38 +02:00