931 commits

Author SHA1 Message Date
Antoine R. Dumont (@ardumont)
3d473c307c
lister: Type correctly the 'indexable' column
instead of converting that column as a string

As a side effect on the Bitbucket side, the `after` query parameter was
improperly sent as a date that was not URL-encoded. This resulted in improper
API responses from Bitbucket (from time to time, the next index received was
the same as the current one).

Related T1826
2019-06-26 10:58:54 +02:00
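The encoding problem described in the commit above can be illustrated with a minimal sketch. The helper `build_query` and the parameter names `after` and `pagelen` are assumptions made for illustration; this is not the lister's actual code.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs, urlencode


def build_query(after: datetime, pagelen: int = 100) -> str:
    """Hypothetical helper: build a query string carrying a date index."""
    # isoformat() produces ':' and '+' characters, which must be
    # percent-encoded in a query string; urlencode() takes care of that.
    return urlencode({"pagelen": pagelen, "after": after.isoformat()})


after = datetime(2019, 6, 26, 10, 58, 54, tzinfo=timezone.utc)
query = build_query(after)
# Decoding the query string gives back the original ISO date.
decoded_after = parse_qs(query)["after"][0]
```

Sending the raw date string instead would leave the `+` of the timezone offset to be read as a space by the server, which is one plausible way to end up with an unexpected index in the response.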
Antoine R. Dumont (@ardumont)
b99617f976
relister: Fix consistently the behavior for the first time relisting
If nothing has been done prior to a full relisting, there is actually nothing
to list. So the relister in question does nothing.

In that context, the IndexingLister class's `db_partition_indices` method now
returns an empty list instead of raising a ValueError when there is nothing to
list.

Related T1826
Related e129e48
2019-06-25 14:48:17 +02:00
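A minimal sketch of the behavior described in the commit above, assuming a hypothetical `partition_indices` function in place of the real `db_partition_indices` method: returning an empty list lets the caller loop over zero partitions instead of handling an exception.

```python
from typing import List, Optional, Tuple


def partition_indices(bounds: Optional[Tuple[int, int]],
                      partition_size: int) -> List[int]:
    """Hypothetical stand-in for the partitioning method."""
    if bounds is None:
        # Nothing has been listed yet, so there is nothing to partition.
        return []
    low, high = bounds
    return list(range(low, high + 1, partition_size))


def relist(bounds: Optional[Tuple[int, int]],
           partition_size: int = 1000) -> List[int]:
    """Hypothetical relister: visit each partition start in order."""
    visited = []
    for start in partition_indices(bounds, partition_size):
        visited.append(start)
    return visited
```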
Antoine R. Dumont (@ardumont)
6662ae8db5
indexing_lister: Allow to define flush packet size
Prior to this commit, indexing lister instances flushed in packets of 20.
The packet size can now be defined per subclass.
2019-06-25 14:48:16 +02:00
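One way to make the packet size configurable per subclass is a class attribute, sketched below. The class names and the attribute name `flush_packet_size` are hypothetical and only illustrate the idea.

```python
class IndexingListerSketch:
    """Hypothetical stand-in for an indexing lister."""

    # Number of pages processed between two flushes (assumed name).
    flush_packet_size = 20

    def __init__(self):
        self.flush_count = 0

    def flush(self):
        # A real lister would commit its database session here.
        self.flush_count += 1

    def run(self, pages):
        for count, _page in enumerate(pages, start=1):
            if count % self.flush_packet_size == 0:
                self.flush()
        # Final flush for whatever remains after the last full packet.
        self.flush()


class SmallPacketLister(IndexingListerSketch):
    # A subclass only overrides the attribute to change the packet size.
    flush_packet_size = 10
```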
Antoine R. Dumont (@ardumont)
5ec3067b0d
Clean up code
- Remove unneeded return instructions
- Clarify tests code regarding request_index computations
2019-06-25 14:48:13 +02:00
Antoine R. Dumont (@ardumont)
45428c25df
bitbucket: Unify logging instructions 2019-06-25 14:09:59 +02:00
Antoine R. Dumont (@ardumont)
9aa8a6f7ae
bitbucket: Allow to specify the number of repos per api request
This is independent but still, it somehow fixes the issue occurring on T1826.

Related T1826
2019-06-21 17:50:23 +02:00
Antoine R. Dumont (@ardumont)
e129e48c31
bitbucket: Fix full lister with fallback [start, end] if not provided
Related T1826
2019-06-21 15:46:51 +02:00
Antoine R. Dumont (@ardumont)
b3463ecddc
Drop SWH prefix in classes everywhere
It's redundant with the swh modules in itself.
2019-06-20 19:08:46 +02:00
Archit Agrawal
8d1b5d2d2d Add new page "run_a new_lister"
Add new page in lister tutorial which guides
through the process of running a new lister in
docker.
2019-06-20 14:01:13 +05:30
Archit Agrawal
b2c6ddc35b tutorial: Add testing and how to run section
Add a testing section for the lister. Also add a section on how to run a
new lister, which details the steps
required to run the new lister in docker
2019-06-20 14:01:13 +05:30
Archit Agrawal
f76b96b825 swh.lister.gnu: Change origin type to tar
Change origin type from 'gnu' to 'tar'
2019-06-19 17:21:02 +05:30
Valentin Lorentz
aef7d5952e Remove columns 'description' and 'origin_id'.
They are useless.
2019-06-19 10:29:15 +02:00
Nancy-Chauhan
83b3a75f11 README: Add missing triple back quote 2019-06-19 11:38:19 +05:30
Antoine R. Dumont (@ardumont)
e13912f711
phabricator_tests: Add missing headers 2019-06-18 14:41:50 +02:00
Antoine R. Dumont (@ardumont)
df2754e5a6
phabricator.tasks: Remove unused code
Related T1824
Related P438
2019-06-18 14:41:20 +02:00
Antoine R. Dumont (@ardumont)
af681ac128
phabricator: model: Reference the forge's instance name in model
As phabricator is an "instance" lister (there exist multiple instances of
phabricator in the wild), we need to reference that information.

In effect, this aligns phabricator lister with for example the gitlab one.

Related T1801
Related P434
2019-06-18 07:19:14 +02:00
Antoine R. Dumont (@ardumont)
fc92c79b7e
models: Unify tablenames using singular as main archive's convention
Related P434
2019-06-18 07:18:34 +02:00
Antoine R. Dumont (@ardumont)
6d11705908
phabricator.lister: Use credentials setup from configuration file
Prior to this commit, this expected the api.token to be provided at task
initialization. That behavior has been kept for cli purposes. It's no good for
production purposes though (as this leaks the credentials in the scheduler db).

So now, the credentials are fetched from the lister's configuration file as the
other listers do.

Another change is the authentication mechanism, which is slightly different. It's
not using a basic `auth` mechanism. It's expecting an `api.token` query
parameter, so the `request_params` method is overridden to provide that.

Related T1809
2019-06-17 16:19:23 +02:00
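The difference between basic `auth` and an `api.token` query parameter, as described in the commit above, might look like the following sketch. The class names, the `User-Agent` value, and the choice to read the token from the `password` field of the first credential are assumptions, not the lister's actual implementation.

```python
class HttpListerSketch:
    """Hypothetical base lister using basic authentication."""

    def __init__(self, credentials=None):
        # List of {"username": ..., "password": ...} dicts.
        self.credentials = credentials or []

    def request_params(self, identifier):
        params = {"headers": {"User-Agent": "lister-sketch"}}
        if self.credentials:
            cred = self.credentials[0]
            params["auth"] = (cred["username"], cred["password"])
        return params


class TokenListerSketch(HttpListerSketch):
    """Hypothetical lister whose API expects an `api.token` parameter."""

    def request_params(self, identifier):
        params = {"headers": {"User-Agent": "lister-sketch"}}
        if self.credentials:
            # Assumption: the token is stored in the 'password' field.
            token = self.credentials[0]["password"]
            params["params"] = {"api.token": token}
        return params
```

The resulting dict has the shape of keyword arguments accepted by `requests.get`, where `params` holds query parameters and `auth` holds basic credentials.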
Antoine R. Dumont (@ardumont)
ecdce4b0cc
gitlab.lister: Remove request_params method override
This should have been removed along with the code in b816212.

The request authentication has been reworked so that all listers use the same
credentials dict.

Related b816212
Related T1772
2019-06-14 18:26:36 +02:00
Antoine R. Dumont (@ardumont)
72e208aeed
lister_base: Clarify docstrings 2019-06-13 15:42:07 +02:00
Antoine R. Dumont (@ardumont)
64a9bc691d
lister.core: Stop creating origins when scheduling tasks
Prior to this commit, the lister also created origins in the archive. Now, we
only schedule new origins for ingestion.
2019-06-13 15:42:07 +02:00
Archit Agrawal
a9a37a85bf swh.lister.cran
Add a lister to list all the CRAN packages.
It uses the built-in API in the R language to list the packages
and get their metadata.

Closes T1709
2019-06-11 21:26:31 +05:30
Archit Agrawal
7c6245e663 swh.lister.gnu: Add function to check for file extension.
Added a function which derives the extension from the filename
and checks whether the file extension matches the type of file that is to be
archived.
2019-06-11 15:12:53 +05:30
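An extension check of the kind described in the commit above could be sketched as follows. The function name `is_archive` and the list of accepted extensions are assumptions for illustration.

```python
# Assumed set of archive extensions; the real lister may accept others.
ARCHIVE_EXTENSIONS = (".tar.gz", ".tar.bz2", ".tar.xz", ".tgz", ".zip")


def is_archive(filename: str) -> bool:
    """Hypothetical check: does the filename end with an archive extension?"""
    # str.endswith accepts a tuple and matches any of its entries, which
    # handles multi-part extensions such as '.tar.gz' directly.
    return filename.lower().endswith(ARCHIVE_EXTENSIONS)
```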
Archit Agrawal
709ba8a6e5 swh.lister.gnu: Add functionality to list all the tarballs for a package.
As discussed in T1389, to ingest all packages using the base loader, it needs
a list of all the tarballs for a package.
Hence, modified the lister to recursively list all the tarballs for a
package with their last updated time.
2019-06-08 21:56:00 +05:30
Archit Agrawal
ebdb959823 swh.lister.gnu : Change download method of tree.json file to request
Previously, the gnu lister used the same code as the tarball loader
to download, unzip and read the tree.json file.
To make the code concise, the download method is changed to use the
requests library.
2019-06-08 21:56:00 +05:30
Archit Agrawal
151f6cd223 swh.lister.gnu
Implement first pass of gnu lister to list all the
packages present in https://ftp.gnu.org/
Add GNU lister in README and cli.py

Closes T1722
2019-06-08 21:56:00 +05:30
Archit Agrawal
f8a2ae866b swh.lister.core: Remove abstractmethod
Some of the new listers like GNU and CRAN do not
follow the conventional way of making an HTTP
request, hence they do not need some of the methods
which are usually needed by a conventional HTTP
request.

But those methods are marked abstractmethod in the
core, which makes them mandatory. So it is
best to remove abstractmethod to increase the
readability of those listers.
2019-06-08 16:51:57 +05:30
Antoine R. Dumont (@ardumont)
b81621274b
lister: Unify credentials structure between listers
This becomes a dictionary of key <lister-name>, value a dict of key
<instance-name>, value list of dict username/password.

Related T1772
2019-05-29 14:00:11 +02:00
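The nested structure described in the commit above can be written out as a plain dictionary. The lister name, instance name, credential values and the `credentials_for` helper below are placeholders chosen for illustration.

```python
# <lister-name> -> <instance-name> -> list of username/password dicts.
credentials = {
    "gitlab": {
        "gitlab.example.org": [
            {"username": "user-a", "password": "placeholder-a"},
            {"username": "user-b", "password": "placeholder-b"},
        ],
    },
}


def credentials_for(creds: dict, lister_name: str, instance_name: str) -> list:
    """Hypothetical lookup returning an empty list when nothing is configured."""
    return creds.get(lister_name, {}).get(instance_name, [])
```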
David Douard
d6169c7141 cli: register the 'lister' cli subcommand
also add a cli group named 'lister' for the sake of consistency with
other swh packages and rename the command as 'db-init', like:

  swh lister db-init LISTER [...]
2019-05-22 13:36:57 +02:00
Antoine Lambert
701d833cdf Update scheduler task names to new ones
Related T1508
2019-05-21 13:27:29 +02:00
David Douard
6bf226d85d tox: workaround to pip's inability to properly solve dependency resolution
This is (hopefully) a temporary fix that can be removed as soon as
  https://github.com/pypa/pip/issues/6239
is fixed, probably thanks to
  https://github.com/pypa/pip/issues/988
2019-05-20 17:12:31 +02:00
archit
fedfd73c8e swh.lister.phabricator
Add a lister of all hosted repositories on a Phabricator instance
Closes T808
2019-05-15 19:54:33 +05:30
Antoine Lambert
4efb2ce62b core.lister_base: Ensure deterministic _task_key return value 2019-05-15 15:37:19 +02:00
Antoine Lambert
7d192a2f1b README.md: Fix outdated instructions and improve formatting 2019-05-14 10:57:17 +02:00
Antoine Lambert
977d2459c3 Remove references to no longer used lister_db_url conf entry 2019-05-13 15:18:56 +02:00
Nicolas Dandrimont
6269975d55 Update coverage gitignore 2019-04-12 12:03:08 +02:00
David Douard
e5c3559033 tasks: fix handling of unsupported promise.save() calls
the exception can also be an AttributeError.

Also do not reraise this exception (in github/tasks.py). This promise
saving feature is used for tests.
2019-04-11 11:03:48 +02:00
Antoine Lambert
0b8d1d464d npm.lister: Update loading task name
Related T1508
2019-04-10 18:24:35 +02:00
Antoine Lambert
dac7777cd6 listers: Align config filename with production 2019-04-10 18:22:45 +02:00
Antoine Lambert
8ffd8dadef cli: Fix initialization for all listers
Prior to this commit, initializing all listers was failing after the
debian lister processing because of global insert_minimum_data init

Related T1629
2019-04-10 11:33:55 +02:00
Archit Agrawal
4b27f9d9c4 Updated toplevel function names in README 2019-03-24 10:54:12 +05:30
Archit Agrawal
26232db926 Removed Extra blank space 2019-03-20 00:54:42 +05:30
Archit Agrawal
be804be0fc Removed unnecessary files 2019-03-20 00:45:30 +05:30
Archit Agrawal
fa91132364 Removed extra space from the README 2019-03-20 00:42:18 +05:30
Archit Agrawal
d7ae2f1305 Updated README for listers 2019-03-19 23:11:44 +05:30
Nicolas Dandrimont
dd148fc64d Guesstimate partition boundaries from extrema rather than using expensive offsets
Summary:
Using order by and offset makes the partitioning an n^2 operation on the number
of entries in the table, rather than an instant operation when using
min/max.

This assumes the indexable column is more or less uniform, which is not exactly
true but not the worst approximation either.

Test Plan: tox

Reviewers: #reviewers, douardda

Reviewed By: #reviewers, douardda

Subscribers: douardda, swh-public-ci

Differential Revision: https://forge.softwareheritage.org/D1267
2019-03-19 13:51:42 +01:00
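The idea of deriving partition boundaries from the extrema, as described in the commit above, can be sketched as below. The function name and signature are hypothetical; the sketch relies on the same assumption the commit states, namely that the indexable column is roughly uniformly distributed between its minimum and maximum.

```python
import math
from typing import List


def guess_partition_starts(min_index: int, max_index: int,
                           row_count: int, partition_size: int) -> List[int]:
    """Hypothetical helper: evenly spaced partition starts from min/max."""
    if row_count == 0:
        return []
    # Number of partitions needed to hold roughly partition_size rows each.
    n_partitions = max(1, math.ceil(row_count / partition_size))
    # Spread the partition starts evenly over the [min, max] interval.
    step = (max_index - min_index) / n_partitions
    return [min_index + round(i * step) for i in range(n_partitions)]
```

Only the minimum, the maximum and the row count are needed as inputs, so no `ORDER BY ... OFFSET` scan per partition is required.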
Nicolas Dandrimont
c574897e2a Guesstimate partition boundaries from extrema rather than using expensive offsets
Summary:
Using order by and offset makes the partitioning an n^2 operation on the number
of entries in the table, rather than an instant operation when using
min/max.

This assumes the indexable column is more or less uniform, which is not exactly
true but not the worst approximation either.

Test Plan: tox

Reviewers: #reviewers, douardda

Reviewed By: #reviewers, douardda

Subscribers: douardda, swh-public-ci

Differential Revision: https://forge.softwareheritage.org/D1267
2019-03-19 13:51:38 +01:00
Archit Agrawal
5acb1fefc1 Updated README.md for listers 2019-03-19 15:57:29 +05:30
Antoine R. Dumont (@ardumont)
2a588d2d5a
d/*: debian packaging files migrated to separated branches 2019-02-14 10:47:40 +01:00
Antoine R. Dumont (@ardumont)
262c297a5e
lister.cli: Fix spelling typo 2019-02-14 10:47:28 +01:00