
.. _lister-tutorial:

Tutorial: list the content of your favorite forge in just a few steps
=====================================================================

(the `original version
<https://www.softwareheritage.org/2017/03/24/list-the-content-of-your-favorite-forge-in-just-a-few-steps/>`_
of this article appeared on the Software Heritage blog)

Back in November 2016, Nicolas Dandrimont wrote about structural code changes
`leading to a massive (+15 million!) upswing in the number of repositories
archived by Software Heritage
<https://www.softwareheritage.org/2016/11/09/listing-47-million-repositories-refactoring-our-github-lister/>`_
through a combination of automatic linkage between the listing and loading
scheduler, new understanding of how to deal with extremely large repository
hosts like `GitHub <https://github.com/>`_, and activating a new set of
repositories that had previously been skipped over.

In the post, Nicolas outlined the three major phases of work in Software
Heritage's preservation process (listing, scheduling updates, loading) and
highlighted that the ability to preserve the world's free software heritage
depends on our ability to find and list the repositories.

At the time, Software Heritage was only able to list projects on
GitHub. Focusing early on GitHub, one of the largest and most active forges in
the world, allowed for a big value-to-effort ratio and a rapid launch for the
archive. As the old Italian proverb goes, "Il meglio è nemico del bene," or in
modern English parlance, "Perfect is the enemy of good," right? Right. So the
plan from the beginning was to implement a lister for GitHub, then maybe
implement another one, and then take a few giant steps backward and squint our
eyes.

Why? Because source code hosting services don't behave according to a unified
standard. Each new service requires dedicated development time to implement a
new scraping client for the non-transferable requirements and intricacies of
that service's API. At the time, doing it in an extensible and adaptable way
required a level of exposure to the myriad differences between these services
that we just didn't think we had yet.

Nicolas' post closed by saying "We haven't carved out a stable API yet that
allows you to just fill in the blanks, as we only have the GitHub lister
currently, and a proven API will emerge organically only once we have some
diversity."

That has since changed. As of March 6, 2017, the Software Heritage **lister
code has been aggressively restructured, abstracted, and commented** to make
creating new listers significantly easier. There may yet be a few kinks to iron
out, but **now making a new lister is practically like filling in the blanks**.

Fundamentally, a basic lister must follow these steps:

1. Issue a network request for a service endpoint.
2. Convert the response into a canonical format.
3. Populate a work queue for fetching and ingesting source repositories.

Steps 1 and 3 are generic problems, so they can get generic solutions hidden
away in base code, most of which never needs to change. That leaves us to
implement step 2, which can be trivially done now for services with clean web
APIs.
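
As a rough illustration of that split (a sketch, not the actual base code), the
three steps can be written as one small function. Everything here is a
hypothetical stand-in: ``list_once``, ``fake_fetch``, and ``simplify`` are made
up for the example, and the "work queue" is just a Python list::

    import json

    def list_once(fetch, simplify, enqueue, endpoint):
        """Run one listing cycle: request, convert, enqueue."""
        raw = fetch(endpoint)      # step 1: generic network request
        repos = simplify(raw)      # step 2: service-specific conversion
        for repo in repos:         # step 3: generic work-queue population
            enqueue(repo)
        return len(repos)

    # Hypothetical stand-in for a network call, so the sketch runs offline.
    def fake_fetch(endpoint):
        return json.dumps([{'id': 1, 'clone': 'https://example.org/a.git'},
                           {'id': 2, 'clone': 'https://example.org/b.git'}])

    # The only service-specific piece: map the payload to canonical dicts.
    def simplify(raw):
        return [{'uid': r['id'], 'origin_url': r['clone']}
                for r in json.loads(raw)]

    queue = []
    count = list_once(fake_fetch, simplify, queue.append,
                      'https://example.org/repos')

Only ``simplify`` would change from one hosting service to the next; the other
two steps stay the same.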

In the new code we've tried to hide away as much generic functionality as
possible, turning it into set-and-forget plumbing between a few simple
customized elements. Different hosting services might use different network
protocols, rate-limit messages, or pagination schemes, but, as long as there is
some way to get a list of the hosted repositories, we think that the new base
code will make getting those repositories much easier.

First let me give you the 30,000 foot view…

The old GitHub-specific lister code looked like this (265 lines of Python):

.. figure:: images/old_github_lister.png

By contrast, the new GitHub-specific code looks like this (34 lines of Python):

.. figure:: images/new_github_lister.png

And the new BitBucket-specific code is even shorter and looks like this (24 lines of Python):

.. figure:: images/new_bitbucket_lister.png

The code they share now lives in a few abstract base classes, with some new
features and loads of docstring comments (in red):

.. figure:: images/new_base.png

So how does the lister code work now, and **how might a contributing developer
go about making a new one**?

The first thing to know is that we now have a generic lister base class and ORM
model. A subclass of the lister base should already be able to do almost
everything needed to complete a listing task for a single service
request/response cycle with the following implementation requirements:

1. A member variable must be declared called ``MODEL``, which is equal to a
   subclass (Note: type, not instance) of the base ORM model. The reason for
   using a subclass is mostly that different services use different,
   incompatible primary identifiers for their repositories. The model
   subclasses typically add only one or two variable declarations.

2. A method called ``transport_request`` must be implemented, which takes the
   complete target identifier (e.g., a URL) and tries to request it one time
   using whatever transport protocol is required for interacting with the
   service. It should not attempt to retry on timeouts or do anything else with
   the response (that is already done for you). It should just either return
   the response or raise a ``FetchError`` exception.

3. A method called ``transport_response_to_string`` must be implemented, which
   takes the entire response of the request in (2) and converts it to a string
   for logging purposes.

4. A method called ``transport_quota_check`` must be implemented, which takes
   the entire response of the request in (2) and checks to see if the process
   has run afoul of any query quotas or rate limits. If the service says to
   wait before making more requests, the method should return ``True`` and also
   the number of seconds to wait; otherwise it returns ``False``.

5. A method called ``transport_response_simplified`` must be implemented, which
   also takes the entire response of the request in (2) and converts it to a
   Python list of dicts (one dict for each repository) with keys given
   according to the aforementioned ``MODEL`` class members.
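
To make the shape of those five requirements concrete, here is a toy sketch. It
does not import the real lister base: ``ToyListerBase``, ``ToyModel``, and
``InMemoryLister`` are hypothetical stand-ins, ``FetchError`` is redefined
locally, the "service" is a plain dict, and the ``(False, 0)`` return shape for
the quota check is borrowed from the GitHub lister shown later in this
tutorial::

    from abc import ABC, abstractmethod

    class FetchError(Exception):
        """Local stand-in for the lister's FetchError exception."""

    class ToyListerBase(ABC):
        """Hypothetical base that only records the five requirements."""
        MODEL = None  # requirement 1: a model *class*, not an instance

        @abstractmethod
        def transport_request(self, identifier):
            """Requirement 2: fetch once, or raise FetchError."""

        @abstractmethod
        def transport_response_to_string(self, response):
            """Requirement 3: render the response for logging."""

        @abstractmethod
        def transport_quota_check(self, response):
            """Requirement 4: report whether we must wait, and how long."""

        @abstractmethod
        def transport_response_simplified(self, response):
            """Requirement 5: convert the response to a list of dicts."""

    class ToyModel:
        """Hypothetical model: its members name the expected dict keys."""
        uid = None
        origin_url = None

    class InMemoryLister(ToyListerBase):
        MODEL = ToyModel
        # Pretend service: one "page" of raw repository records.
        PAGES = {'page-1': [{'id': 7, 'url': 'https://example.org/r.git'}]}

        def transport_request(self, identifier):
            try:
                return self.PAGES[identifier]
            except KeyError:
                raise FetchError(identifier)

        def transport_response_to_string(self, response):
            return repr(response)

        def transport_quota_check(self, response):
            return False, 0  # an in-memory service never rate-limits

        def transport_response_simplified(self, response):
            return [{'uid': r['id'], 'origin_url': r['url']}
                    for r in response]

    lister = InMemoryLister()
    response = lister.transport_request('page-1')
    repos = lister.transport_response_simplified(response)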

Because 2, 3, and 4 are basically dependent only on the chosen network
protocol, we also have an HTTP mix-in module, which supplements the lister base
and provides default implementations for those methods along with optional
request header injection using the Python Requests library. The
``transport_quota_check`` method as provided follows the IETF standard for
communicating rate limits with `HTTP code 429
<https://tools.ietf.org/html/rfc6585#section-4>`_, which some hosting services
have chosen not to follow, so it's possible that a specific lister will need to
override it.
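
For a feel of what a 429-based check might look like, here is a minimal sketch.
It is not the mix-in's actual implementation: the function name
``quota_check_429`` and the ``default_delay`` fallback are assumptions made for
the example. RFC 6585 says a 429 response may carry a ``Retry-After`` header,
so the sketch reads that header when it holds a number of seconds::

    def quota_check_429(status_code, headers, default_delay=10):
        """Return (must_wait, seconds_to_wait) for one HTTP response."""
        if status_code != 429:
            return False, 0
        try:
            delay = int(headers.get('Retry-After', default_delay))
        except ValueError:
            # Retry-After can also be an HTTP date; fall back in that case.
            delay = default_delay
        return True, delay

    limited = quota_check_429(429, {'Retry-After': '30'})
    not_limited = quota_check_429(200, {})

A service that signals rate limits some other way (custom headers, a different
status code) would need its own check, which is exactly the override case
mentioned above.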

On top of all of that, we also provide another layer over the base lister class
which adds support for sequentially looping over indices. What are indices?
Well, some services (`BitBucket <https://bitbucket.org/>`_ and GitHub for
example) don't send you the entire list of all of their repositories at once,
because that server response would be unwieldy. Instead they paginate their
results, and they also allow you to query their APIs like this:
``https://server_address.tld/query_type?start_listing_from_id=foo``. Changing
the value of 'foo' lets you fetch a set of repositories starting from there. We
call 'foo' an index, and we call a service that works this way an indexing
service. GitHub uses the repository unique identifier and BitBucket uses the
repository creation time, but a service can really use anything as long as the
values monotonically increase with new repositories. A good indexing service
also includes the URL of the next page with a later 'foo' in its responses.

For these indexing services we provide another intermediate lister called the
indexing lister. Instead of inheriting from :class:`SWHListerBase
<swh.lister.core.lister_base.SWHListerBase>`, the lister class would inherit
from :class:`SWHIndexingLister
<swh.lister.core.indexing_lister.SWHIndexingLister>`. Along with the
requirements of the lister base, the indexing lister base adds one extra
requirement:

1. A method called ``get_next_target_from_response`` must be defined, which
   takes a complete request response and returns the index ('foo' above) of the
   next page.
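
Here is a small sketch of what looping over indices means in practice. The
service is faked in memory, and ``REPOS``, ``PAGE_SIZE``, ``fake_query``, and
``list_all`` are all hypothetical names. Only ``get_next_target_from_response``
mirrors the requirement above, and even that is a simplified stand-in working
on a plain dict rather than a real response object::

    # Pretend service: repository ids increase monotonically over time.
    REPOS = [{'id': i} for i in (1, 2, 5, 8, 13)]
    PAGE_SIZE = 2

    def fake_query(start_listing_from_id):
        """Return one page of repositories whose id is above the index."""
        later = [r for r in REPOS if r['id'] > start_listing_from_id]
        page = later[:PAGE_SIZE]
        more = len(later) > PAGE_SIZE
        # A "good" indexing service tells us where the next page starts.
        return {'values': page, 'next': page[-1]['id'] if more else None}

    def get_next_target_from_response(response):
        return response['next']

    def list_all(start=0):
        """Walk the pages until the service stops advertising a next index."""
        seen, index = [], start
        while index is not None:
            response = fake_query(index)
            seen.extend(r['id'] for r in response['values'])
            index = get_next_target_from_response(response)
        return seen

    all_ids = list_all()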

So those are all the basic requirements. There are, of course, a few other
little bits and pieces (covered for now in the code's docstring comments), but
for the most part that's it. It sounds like a lot of information to absorb and
implement, but remember that most of the implementation requirements mentioned
above are already provided for 99% of services by the HTTP mix-in module. It
looks much simpler when we look at the actual implementations of the two
new-style indexing listers we currently have…

This is the entire source code for the BitBucket repository lister::

    # Copyright (C) 2017 the Software Heritage developers
    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    from urllib import parse
    from swh.lister.bitbucket.models import BitBucketModel
    from swh.lister.core.indexing_lister import SWHIndexingHttpLister

    class BitBucketLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/repositories?after=%s'
        MODEL = BitBucketModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['uuid'],
                    'indexable': repo['created_on'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['links']['html']['href'],
                    'origin_url': repo['links']['clone'][0]['href'],
                    'origin_type': repo['scm'],
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            body = response.json()
            if 'next' in body:
                return parse.unquote(body['next'].split('after=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()['values']
            return [self.get_model_from_repo(repo) for repo in repos]

And this is the entire source code for the GitHub repository lister::

    # Copyright (C) 2017 the Software Heritage developers
    # License: GNU General Public License version 3 or later
    # See top-level LICENSE file for more information

    import time
    from swh.lister.core.indexing_lister import SWHIndexingHttpLister
    from swh.lister.github.models import GitHubModel

    class GitHubLister(SWHIndexingHttpLister):
        PATH_TEMPLATE = '/repositories?since=%d'
        MODEL = GitHubModel

        def get_model_from_repo(self, repo):
            return {'uid': repo['id'],
                    'indexable': repo['id'],
                    'name': repo['name'],
                    'full_name': repo['full_name'],
                    'html_url': repo['html_url'],
                    'origin_url': repo['html_url'],
                    'origin_type': 'git',
                    'description': repo['description']}

        def get_next_target_from_response(self, response):
            if 'next' in response.links:
                next_url = response.links['next']['url']
                return int(next_url.split('since=')[1])
            else:
                return None

        def transport_response_simplified(self, response):
            repos = response.json()
            return [self.get_model_from_repo(repo) for repo in repos]

        def request_headers(self):
            return {'Accept': 'application/vnd.github.v3+json'}

        def transport_quota_check(self, response):
            remain = int(response.headers['X-RateLimit-Remaining'])
            if response.status_code == 403 and remain == 0:
                reset_at = int(response.headers['X-RateLimit-Reset'])
                delay = min(reset_at - time.time(), 3600)
                return True, delay
            else:
                return False, 0

We can see that there are some common elements:

* Both use the HTTP transport mixin (:class:`SWHIndexingHttpLister
  <swh.lister.core.indexing_lister.SWHIndexingHttpLister>` just combines
  :class:`SWHListerHttpTransport
  <swh.lister.core.lister_transports.SWHListerHttpTransport>` and
  :class:`SWHIndexingLister
  <swh.lister.core.indexing_lister.SWHIndexingLister>`) to get most of the
  network request functionality for free.

* Both also define ``MODEL`` and ``PATH_TEMPLATE`` variables. It should be
  clear to developers that ``PATH_TEMPLATE``, when combined with the base
  service URL (e.g., ``https://some_service.com``) and passed a value (the
  'foo' index described earlier) results in a complete identifier for making
  API requests to these services. It is required by our HTTP module.

* Both services respond using JSON, so both implementations of
  ``transport_response_simplified`` are similar and quite short.
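
To spell out the ``PATH_TEMPLATE`` point, here is a minimal sketch of the
string arithmetic involved. ``build_url`` is a hypothetical helper, not a
function from our HTTP module, and percent-encoding the timestamp index with
``urllib.parse.quote`` is an assumption about what a careful client would do
rather than a description of the real code::

    from urllib.parse import quote

    def build_url(api_baseurl, path_template, index):
        """Join the base service URL and the filled-in path template."""
        return api_baseurl + path_template % index

    # Integer index, as in the GitHub-style template.
    github_style = build_url('https://some_service.com',
                             '/repositories?since=%d', 364)

    # A timestamp index contains ':' and '+', so it is encoded first.
    bitbucket_style = build_url('https://some_service.com',
                                '/repositories?after=%s',
                                quote('2008-07-12T07:44:01.476818+00:00'))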

We can also see that there are a few differences:

* GitHub sends the next URL as part of the response header, while BitBucket
  sends it in the response body.

* GitHub differentiates API versions with a request header (our HTTP transport
  mix-in will automatically use any headers provided by an optional
  ``request_headers`` method that we implement here), while BitBucket has it as
  part of their base service URL.

* BitBucket uses the IETF standard HTTP 429 response code for their rate limit
  notifications (the HTTP transport mix-in automatically handles that), while
  GitHub uses their own custom response headers that need special treatment.

* But look at them! 58 lines of Python code, combined, to absorb all
  repositories from two of the largest and most influential source code hosting
  services.
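
The first difference (next URL in the header versus in the body) is easy to see
side by side. The sketch below is an alternative to the ``split`` calls used in
the listers above, not a copy of them: it pulls the index out with
``urllib.parse`` instead. All function names and the ``example.org`` URLs are
made up, and ``links`` is a plain dict imitating the shape of the Requests
library's ``Response.links``::

    from urllib.parse import parse_qs, urlparse

    def index_from_url(url, param):
        """Return the decoded value of one query parameter, or None."""
        values = parse_qs(urlparse(url).query).get(param)
        return values[0] if values else None

    def next_index_from_links(links):
        """Header style: the next URL arrives in the Link header."""
        if 'next' not in links:
            return None
        value = index_from_url(links['next']['url'], 'since')
        return int(value) if value is not None else None

    def next_index_from_body(body):
        """Body style: the next URL sits under a 'next' key in the JSON."""
        if 'next' not in body:
            return None
        return index_from_url(body['next'], 'after')

    header_next = next_index_from_links(
        {'next': {'url': 'https://example.org/repositories?since=369'}})
    body_next = next_index_from_body(
        {'next': 'https://example.org/repositories'
                 '?after=2008-07-12T07%3A44%3A01.476818%2B00%3A00'})

Note that ``parse_qs`` already undoes the percent-encoding, so no separate
``unquote`` call is needed in this variant.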

Ok, so what is going on behind the scenes?

To trace the operation of the code, let's start with a sample instantiation and
progress from there to see which methods get called when. What follows will be
a series of extremely reductionist pseudocode methods. This is not what the
code actually looks like (it's not even real code), but it does have the same
basic flow. Bear with me while I try to lay out lister operation in a
quasi-linear way…::

    # main task

    ghl = GitHubLister(lister_name='github.com',
                       api_baseurl='https://api.github.com')
    ghl.run()

⇓ (SWHIndexingLister.run)::

    # SWHIndexingLister.run

    identifier = None
    do
        response, repos = SWHListerBase.ingest_data(identifier)
        identifier = GitHubLister.get_next_target_from_response(response)
    while(identifier)

⇓ (SWHListerBase.ingest_data)::

    # SWHListerBase.ingest_data

    response = SWHListerBase.safely_issue_request(identifier)
    repos = GitHubLister.transport_response_simplified(response)
    injected = SWHListerBase.inject_repo_data_into_db(repos)
    return response, injected

⇓ (SWHListerBase.safely_issue_request)::

    # SWHListerBase.safely_issue_request

    repeat:
        resp = SWHListerHttpTransport.transport_request(identifier)
        retry, delay = SWHListerHttpTransport.transport_quota_check(resp)
        if retry:
            sleep(delay)
    until((not retry) or too_many_retries)
    return resp

⇓ (SWHListerHttpTransport.transport_request)::

    # SWHListerHttpTransport.transport_request

    path = (SWHListerBase.api_baseurl
            + SWHListerHttpTransport.PATH_TEMPLATE % identifier)
    headers = SWHListerHttpTransport.request_headers()
    return http.get(path, headers)

(Oh look, there's our ``PATH_TEMPLATE``)

⇓ (SWHListerHttpTransport.request_headers)::

    # SWHListerHttpTransport.request_headers

    override → GitHubLister.request_headers

↑↑ (SWHListerBase.safely_issue_request)

⇓ (SWHListerHttpTransport.transport_quota_check)::

    # SWHListerHttpTransport.transport_quota_check

    override → GitHubLister.transport_quota_check

And then we're done. From start to finish, I hope this helps you understand how
the few customized pieces fit into the new shared plumbing.
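
If you would rather poke at something runnable, the retry loop from the
``safely_issue_request`` step can be approximated in a few lines of real
Python. This is a sketch under assumptions, not the base class code: the
function signature, the ``max_retries`` default, and the injectable ``sleep``
argument are all invented for the example, and the "responses" are bare status
codes::

    import time

    def safely_issue_request(identifier, transport_request,
                             transport_quota_check,
                             max_retries=3, sleep=time.sleep):
        """Request until the quota check passes or retries run out."""
        resp = None
        for attempt in range(max_retries + 1):
            resp = transport_request(identifier)
            retry, delay = transport_quota_check(resp)
            if not retry:
                break
            if attempt < max_retries:
                sleep(delay)  # wait as long as the service asked us to
        return resp

    # Fake transport: rate-limited twice, then a successful response.
    responses = iter([429, 429, 200])
    slept = []  # record delays instead of actually sleeping
    result = safely_issue_request(
        'page-1',
        transport_request=lambda identifier: next(responses),
        transport_quota_check=lambda resp: (resp == 429, 5),
        sleep=slept.append,
    )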

Now you can go and write up a lister for a code hosting site we don't have yet!