Commit graph

596 commits

Author SHA1 Message Date
David Douard
342964eda7 phabricator: fix the FullPhabricatorLister task
forgot the forge_url -> api_baseurl renaming in there.
2019-09-03 12:01:55 +02:00
David Douard
8785fc1a4e cgit: fix cgit's task module and tests
forgot some `url_prefix` there.
2019-09-03 12:01:55 +02:00
David Douard
87cec2f5c3 phabricator: refactor PhabricatorLister's constructor
- use the 'standard' api_baseurl as init argument,
- make it optional, with default to forge.softwareheritage.org,
- use origin_url as id.
2019-09-02 12:29:38 +02:00
David Douard
befe9a6d57 gitlab: make GitLabLister's api_baseurl init argument optional
and simplify a bit the code of the constructor.
2019-09-02 12:29:38 +02:00
David Douard
b87cd5d309 github: make GitHubLister's api_baseurl init argument optional 2019-09-02 12:29:38 +02:00
David Douard
8950b0b32d bitbucket: make BitBucketLister's api_baseurl init argument optional 2019-09-02 12:29:38 +02:00
David Douard
22f2f2c43c core: make it possible to specify the api_baseurl init argument in override_config
This is required to make lister class instantiation easier and more
reliable, especially in the context of cli tools like 'swh lister run', for which
we want to be able to specify any lister init argument as an extra parameter of the
command.
2019-09-02 12:29:38 +02:00
David Douard
3816b4d3bf cgit: rewrite the CGit lister
Simplify the code:
- do only inherit from ListerBase
- implement HTTP queries directly using requests
- get rid of convoluted code

Make the origin_url gathered from the git repo's "project" page instead of
building it from the 'url_prefix' hack. Now, the lister WILL make substantially
more requests, since it will make one request per listed git repo, but
the provided origin_url should be pretty reliable now.

When several URLs are provided as clonable URLs, choose the http/https one first;
otherwise, choose the first one in the list.

Add proper tests for the cgit lister.

Also, get rid of the 'time_updated' column in the model.
2019-09-02 12:29:31 +02:00
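A minimal sketch of the clone URL selection rule described in 3816b4d3bf: prefer an http/https URL, otherwise fall back to the first URL of the list. The function name pick_origin_url and the sample URLs are hypothetical and are not taken from the lister's code.

```python
from typing import List, Optional


def pick_origin_url(clone_urls: List[str]) -> Optional[str]:
    """Return the preferred clone URL, or None when no URL is available.

    Hypothetical helper: an http/https URL wins; otherwise the first
    URL of the list is used.
    """
    for url in clone_urls:
        if url.startswith(("http://", "https://")):
            return url
    return clone_urls[0] if clone_urls else None


if __name__ == "__main__":
    # Illustrative URLs only.
    print(pick_origin_url(["git://example.org/repo.git",
                           "https://example.org/repo.git"]))
```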
David Douard
e0ce68377d bitbucket: simplify a bit BitBucketLister's constructor
get rid of the "smart" flush_packet_db computation.
2019-08-30 17:56:19 +02:00
David Douard
d807d15f65 phabricator: randomly select the API token in the provided list
instead of picking the first one, so this behavior is consistent with
ListerHttpTransport's one.
2019-08-30 17:56:19 +02:00
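A minimal sketch of the random token selection described in d807d15f65, assuming the tokens are available as a plain list of strings. The function name pick_api_token and that list shape are assumptions made for illustration; the real credentials structure is not shown in this log.

```python
import random
from typing import List, Optional


def pick_api_token(tokens: List[str]) -> Optional[str]:
    """Pick one token at random instead of always using the first one.

    Hypothetical helper: returns None when no token is configured.
    """
    if not tokens:
        return None
    return random.choice(tokens)


if __name__ == "__main__":
    # Placeholder token values, for illustration only.
    print(pick_api_token(["token-a", "token-b", "token-c"]))
```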
David Douard
814779404c phabricator: small refactoring/simplification of the request_params method
and get rid of the unneeded _build_query_params method.
2019-08-30 17:56:19 +02:00
David Douard
83d138759c phabricator: kill PhabricatorLister's api_token argument
stick to the existing credentials mechanism provided by ListerHttpTransport.
2019-08-30 17:56:19 +02:00
David Douard
6f56d2c8d7 core: move credentials' docstring from request_params to request_instance_credentials
and fix empty values returned by the latter (empty list instead of empty dict).
2019-08-30 17:56:19 +02:00
Antoine R. Dumont (@ardumont)
4b2ab0488a
cli: Unify new_lister method name to get_lister 2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)
dee9fe93bf
cli: Bootstrap tests on cli 2019-08-28 16:29:26 +02:00
Antoine R. Dumont (@ardumont)
e0664c10cd
lister.cli: Allow to list forges with policy and priority
Example use case:

swh lister run --lister gitlab \
               --priority high \
               --policy oneshot \
               --db-url postgresql://postgres@localhost:5432/swh-listers \
               api_baseurl=https://gitlab.ow2.org/api/v4/

Related T1919
2019-08-28 16:29:26 +02:00
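The example in e0664c10cd passes api_baseurl as a trailing key=value argument. A minimal sketch of how such extra arguments could be turned into a dict of lister init arguments is given here; the function name parse_extra_args is hypothetical and this is not the actual cli implementation.

```python
from typing import Dict, Iterable


def parse_extra_args(args: Iterable[str]) -> Dict[str, str]:
    """Turn ['key=value', ...] into {'key': 'value', ...}.

    Hypothetical helper: only the first '=' splits key from value, so
    values may themselves contain '='.
    """
    params: Dict[str, str] = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        params[key] = value
    return params


if __name__ == "__main__":
    print(parse_extra_args(["api_baseurl=https://gitlab.ow2.org/api/v4/"]))
```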
Antoine R. Dumont (@ardumont)
87d2a16df0
listers: Allow to override policy and priority for scheduled tasks
Prior to this commit, the policy and priority were hard-coded.
The default values are now the old hard-coded values.

This will allow developing a cli to trigger forge listings with oneshot policy
and some priority tasks, thus ingesting those faster and without manual
intervention as we currently do.
2019-08-28 11:57:10 +02:00
Archit Agrawal
5727f15cf3 swh.lister.packagist
Implement a packagist lister to list the
names and metadata url of all the
packages.

Closes 1776
2019-07-19 19:59:30 +05:30
Archit Agrawal
08ade29e6d swh.lister.pypi: Add tests
Add tests for pypi lister
Closes T1890
2019-07-18 17:13:13 +05:30
Archit Agrawal
f424f07c7e swh.lister.core: Add test for simple lister
There were previously no tests for the listers
which use the SimpleLister class (like pypi).
Refactored test_lister.py of lister core to
accommodate tests for SimpleLister while keeping the tests
for other listers undisturbed.
2019-07-18 17:13:13 +05:30
Stefano Zacchiroli
bb2dc77788 bitbucket lister: fix typo in docstring 2019-07-04 14:40:02 +02:00
Archit Agrawal
0bf24469b7 swh.lister.cgit: Remove repo page visit step
Remove the need to visit every page and extract the
origin url by introducing a parameter url_prefix.
The origin url has the format <prefix>/<repo_name>, where
the prefix is the same for all the repos of a particular
cgit instance.
2019-06-28 20:02:07 +05:30
Archit Agrawal
7e3c79bb1d swh.lister.cgit: Add pagination support
Some cgit instances have pagination. Modify the
lister to find all the pages and list all the repos
from all the pages.
2019-06-28 19:27:25 +05:30
Archit Agrawal
b972a2a88d swh.lister.cgit
Implemented a lister to list the repos for a given CGit instance.

Closes T1659
2019-06-28 19:27:25 +05:30
Antoine Lambert
d85bcdac5b simple_lister: Split models into smaller chunks to avoid oversized db transactions
Related T1659
2019-06-28 15:44:47 +02:00
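A minimal sketch of the chunking idea behind d85bcdac5b: split a long list of models into fixed-size chunks so that each chunk can be written in its own, smaller db transaction. The helper name chunked and the chunk size used in the demo are assumptions; the actual size used by simple_lister is not stated in this log.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most `size` items (hypothetical helper)."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    models = [{"name": f"package-{i}"} for i in range(5)]
    for chunk in chunked(models, 2):
        # In a lister, each chunk would be flushed in a separate transaction.
        print(len(chunk))
```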
Archit Agrawal
5ea9d5ed39 swh.lister.cran: Add description in task_dict
Add description in the task_dict method because
the only metadata that can be found for a
package at CRAN is its description. That can
only be retrieved from the built-in API in R,
which the lister is already using. Hence, instead of
getting metadata in the loader, it is passed
by the lister.
2019-06-27 14:57:51 +05:30
Valentin Lorentz
52b1de87c5 Finish dropping the 'description' column.
I missed some in aef7d5952e.
2019-06-26 14:46:27 +02:00
Antoine R. Dumont (@ardumont)
e54531510c
indexing_lister: Add docstrings to flush_packet_db & default_min_bound
Related D1635
2019-06-26 11:27:41 +02:00
Antoine R. Dumont (@ardumont)
3d473c307c
lister: Type correctly the 'indexable' column
instead of converting that column as a string

As a side effect, for bitbucket, we improperly provided the `after` query
parameter as a date that was not url encoded. This resulted in improper api
responses from bitbucket (from time to time, we received the same next index as
the current one).

Related T1826
2019-06-26 10:58:54 +02:00
Antoine R. Dumont (@ardumont)
b99617f976
relister: Fix consistently the behavior for the first time relisting
If nothing has been done prior to a full relisting, there is actually nothing
to list. So the relister in question does nothing.

In that context, the IndexingLister class's `db_partition_indices` method now
returns an empty list instead of raising a ValueError when there is nothing to
list.

Related T1826
Related e129e48
2019-06-25 14:48:17 +02:00
Antoine R. Dumont (@ardumont)
6662ae8db5
indexing_lister: Allow to define flush packet size
Prior to this commit, indexing lister instances were flushing every packet of
20. This can now be defined per sub classes.
2019-06-25 14:48:16 +02:00
Antoine R. Dumont (@ardumont)
5ec3067b0d
Clean up code
- Remove unneeded return instructions
- Clarify tests code regarding request_index computations
2019-06-25 14:48:13 +02:00
Antoine R. Dumont (@ardumont)
45428c25df
bitbucket: Unify logging instructions 2019-06-25 14:09:59 +02:00
Antoine R. Dumont (@ardumont)
9aa8a6f7ae
bitbucket: Allow to specify the number of repos per api request
This is independent but still, it somehow fixes the issue occurring on T1826.

Related T1826
2019-06-21 17:50:23 +02:00
Antoine R. Dumont (@ardumont)
e129e48c31
bitbucket: Fix full lister with fallback [start, end] if not provided
Related T1826
2019-06-21 15:46:51 +02:00
Antoine R. Dumont (@ardumont)
b3463ecddc
Drop SWH prefix in classes everywhere
It's redundant with the swh modules in itself.
2019-06-20 19:08:46 +02:00
Archit Agrawal
f76b96b825 swh.lister.gnu: Change origin type to tar
Change origin type from 'gnu' to 'tar'
2019-06-19 17:21:02 +05:30
Valentin Lorentz
aef7d5952e Remove columns 'description' and 'origin_id'.
They are useless.
2019-06-19 10:29:15 +02:00
Antoine R. Dumont (@ardumont)
e13912f711
phabricator_tests: Add missing headers 2019-06-18 14:41:50 +02:00
Antoine R. Dumont (@ardumont)
df2754e5a6
phabricator.tasks: Remove unused code
Related T1824
Related P438
2019-06-18 14:41:20 +02:00
Antoine R. Dumont (@ardumont)
af681ac128
phabricator: model: Reference the forge's instance name in model
As phabricator is an "instance" lister (there exists multiple instances of
phabricator in the wild), we need to reference that information.

In effect, this aligns phabricator lister with for example the gitlab one.

Related T1801
Related P434
2019-06-18 07:19:14 +02:00
Antoine R. Dumont (@ardumont)
fc92c79b7e
models: Unify tablenames using singular as main archive's convention
Related P434
2019-06-18 07:18:34 +02:00
Antoine R. Dumont (@ardumont)
6d11705908
phabricator.lister: Use credentials setup from configuration file
Prior to this commit, this expected the api.token to be provided at task
initialization. That behavior has been kept for cli purposes. It's no good for
production purposes though (as this leaks the credentials in the scheduler db).

So now, the credentials are fetched from the lister's configuration file as the
other listers do.

Another change is the authentication mechanism, which is slightly different. It's
not using a basic `auth` mechanism. It's expecting an `api.token` query
parameter, so `request_params` is overridden to provide that.

Related T1809
2019-06-17 16:19:23 +02:00
Antoine R. Dumont (@ardumont)
ecdce4b0cc
gitlab.lister: Remove request_params method override
This should have been removed along with the code in b816212.

The request authentication has been reworked so that all listers use the same
credentials dict.

Related b816212
Related T1772
2019-06-14 18:26:36 +02:00
Antoine R. Dumont (@ardumont)
72e208aeed
lister_base: Clarify docstrings 2019-06-13 15:42:07 +02:00
Antoine R. Dumont (@ardumont)
64a9bc691d
lister.core: Stop creating origins when scheduling tasks
Prior to this commit, lister did create origins as well in the archive. Now, we
only schedule new origins for ingestion.
2019-06-13 15:42:07 +02:00
Archit Agrawal
a9a37a85bf swh.lister.cran
Add a lister to list all the CRAN packages.
It uses the built-in API of the R language to list the packages
and get their metadata.

Closes T1709
2019-06-11 21:26:31 +05:30
Archit Agrawal
7c6245e663 swh.lister.gnu: Add function to check for file extension.
Added a function which derives the extension from the filename
and checks whether the file extension matches the type of file that is to be
archived.
2019-06-11 15:12:53 +05:30
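A minimal sketch of the kind of extension check described in 7c6245e663. The function name and the list of accepted extensions are assumptions made for illustration; the log does not state which extensions the gnu lister accepts.

```python
# Assumed set of archive extensions, for illustration only.
ARCHIVE_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".zip")


def is_supported_archive(filename: str) -> bool:
    """Return True when the filename ends with one of the assumed extensions."""
    return filename.lower().endswith(ARCHIVE_EXTENSIONS)


if __name__ == "__main__":
    for name in ("bash-5.0.tar.gz", "bash-5.0.tar.gz.sig", "README"):
        print(name, is_supported_archive(name))
```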
Archit Agrawal
709ba8a6e5 swh.lister.gnu: Add functionality to list all the tarballs for a package.
As discussed in T1389, to ingest all packages using the base loader, it needs
a list of all the tarballs for a package.
Hence modified the lister to recursively list all the tarballs for a
package with their last updated time.
2019-06-08 21:56:00 +05:30
Archit Agrawal
ebdb959823 swh.lister.gnu: Change download method of tree.json file to requests
Previously, the gnu lister was using the same code as the tarball loader
to download, unzip and read the tree.json file.
To make the code concise, the download method is changed to use the
requests library.
2019-06-08 21:56:00 +05:30