Example walkthroughs

This walkthrough assumes you've read the Quickstart guide, and you understand how to create a new bot and validate its data.

Topics

  1. Scraping some real data
  2. Saving variables and data locally
  3. Scraping via a search form
  4. Structured bots (submitting to OpenCorporates)
  5. Finishing runs manually (incremental bots)

1. Scraping some real data

The template bot generated by turbot bots:generate only iterates over a loop. This is easy to understand, but not very useful.

The first example, therefore, illustrates scraping a real web page (in this case, something hosted on our server).

Bots must conform to the Turbot Specification to be valid. This means they must output valid lines of JSON, and must be accompanied by a valid manifest.json.
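
As a rough illustration of the first requirement, the sketch below checks that every line a scraper prints is a JSON object carrying the two fields marked as mandatory in the examples in this walkthrough (sample_date and source_url). The helper name check_output_lines is made up for this illustration; it is not part of turbotlib, and it is no substitute for turbot bots:validate.

```python
import json

# Fields marked "mandatory" in the examples in this walkthrough.
MANDATORY_FIELDS = ("sample_date", "source_url")


def check_output_lines(lines):
    """Return a list of (line_number, problem) tuples; empty means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except ValueError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for field in MANDATORY_FIELDS:
            if field not in record:
                problems.append((number, "missing %s" % field))
    return problems


# Three sample output lines: one good, one missing a field, one not JSON at all.
sample_output = [
    '{"company_name": "Foo Ltd", "sample_date": "2014-06-01", "source_url": "http://example.com/1"}',
    '{"company_name": "Bar Ltd", "sample_date": "2014-06-01"}',
    'this is not JSON',
]
for number, problem in check_output_lines(sample_output):
    print("line %d: %s" % (number, problem))
```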

To try this example, create a bot as usual, replacing <bot_name> with something unique:

turbot bots:generate --bot <bot_name> --language=ruby
turbot bots:generate --bot <bot_name> --language=python

Then cd <bot_name> and edit scraper.rb (for Ruby) or scraper.py (for Python):

# Python (scraper.py)
from datetime import date
import json

from bs4 import BeautifulSoup
import requests
import turbotlib

source_url = "http://turbot.opencorporates.com/examples/basic_bot.html"
sample_date = str(date.today())
turbotlib.log("Starting scrape...") # optional debug logging
response = requests.get(source_url)
html = response.text
doc = BeautifulSoup(html, "html.parser")  # name the parser explicitly
table = doc.find('table', class_='setoutList')
for tr in table.find('tbody').find_all('tr'):
    # Each tr element has two td elements.
    tds = tr.find_all('td')
    record = {
        'company_name': tds[0].text,
        'group_name': tds[1].text,
        'sample_date': sample_date,   # mandatory field
        'source_url': source_url      # mandatory field
    }
    # The important part of the Turbot specification is that your scraper outputs lines of JSON
    print(json.dumps(record))

# Ruby (scraper.rb)
require 'json'
require 'mechanize'
require 'turbotlib'

SOURCE_URL = "http://turbot.opencorporates.com/examples/basic_bot.html"
agent = Mechanize.new

Turbotlib.log("Starting scrape...") # optional debug logging

doc = agent.get(SOURCE_URL).parser

doc.css('.setoutList tbody tr').each do |row|
  cols = row.css('td').map {|r| r.text }
  datum = {
    company_name: cols[0],
    company_number: cols[1],
    source_url: SOURCE_URL,     # mandatory field
    sample_date: Time.now       # mandatory field
  }
  # The important part of the Turbot specification is that your scraper outputs lines of JSON
  puts JSON.dump(datum)
end

As usual, you can test the bot with ruby scraper.rb or python scraper.py.

Finally, you'll need to change the identifying_fields key in the manifest to ["company_name"].

Then you can run turbot bots:validate to verify that the data produced by the bot is valid.

2. Saving variables and data locally

Sometimes, you need to store data locally in order to complete a run.

You can save things in the directory given by Turbotlib.data_dir (Ruby) or turbotlib.data_dir() (Python); anything stored there persists between runs. For example, if you needed to write a CSV before parsing it, you might do this:

# Ruby
require 'turbotlib'

csv_file = open("#{Turbotlib.data_dir}/something.csv", "w")
csv_file.write("test,csv,line")
csv_file.close()

# Python
import turbotlib

csv_file = open("%s/something.csv" % turbotlib.data_dir(), "w")
csv_file.write("test,csv,line")
csv_file.close()

A common reason for storing state is to keep a record of a bot's scraping progress: for bots that take several hours to run, it is usually helpful to be able to resume if they are interrupted.

In turbotlib, we provide some convenience functions for saving single variables:

# Ruby
require 'json'
require 'turbotlib'

# Start from 0, or the last point we got to
start = Turbotlib.get_var("progress") || 0

(start...5).each do |n|
  data = {
    company: "Foo Inc #{n}",
    licence_number: "foo#{n}",
    sample_date: Time.now,
    source_url: "http://somewhere.com/#{n}"
  }
  Turbotlib.log("progress: #{n}")
  puts JSON.dump(data)

  # Save current progress
  Turbotlib.save_var("progress", n)

  # Wait a couple of seconds (to give you time to interrupt and test)
  sleep 2
end

# Important! we need to reset the progress when we've finished,
# so the next time we start from the beginning again
Turbotlib.save_var("progress", 0)

# Python
import json
import datetime
import turbotlib
import time

# Start from 0, or the last point we got to
try:
    start = turbotlib.get_var("progress")
except KeyError:
    start = 0

for n in range(start,5):
    data = {"company": "Foo Inc %s" % n,
            "licence_number": "foo%s" % n,
            "sample_date": datetime.datetime.now().isoformat(),
            "source_url": "http://somewhere.com/%s" % n}
    turbotlib.log("progress: %s" % n)
    print(json.dumps(data))

    # Save progress
    turbotlib.save_var("progress", n)

    # Wait a couple of seconds (to give you time to interrupt and test)
    time.sleep(2)

# Important! we need to reset the progress when we've finished,
# so the next time we start from the beginning again
turbotlib.save_var("progress", 0)

Try running this bot (i.e. with ruby scraper.rb or python scraper.py), and then interrupt it with Ctrl+C. Then run it again. It should restart where it left off.

You can also store data in a local SQLite store. We provide some convenience methods to make working with SQLite data easier.

Access to a SQL store is useful if your bot needs to do multi-step preprocessing. For example, you might want to scrape a site that lists a set of company numbers on one page, but lists the information related to each number on several other pages. In that case, it might make sense to store the numbers in a SQL table, and then iterate over them once you've collected them all.

The SQLite convenience methods are provided via the ScraperWiki libraries. See the scraperwiki Python README or the scraperwiki Ruby README for more information.

Here's a simple example:

# Ruby
require 'json'
require 'turbotlib'
require 'scraperwiki'

# Save data in SQL
5.times do |n|
  datum = {
    uid: n,
    code: "ABC-#{n}"
  }
  Turbotlib.save_sqlite([:uid], datum)
end

Turbotlib.select("* from swdata") do |row|
  datum = {"extracted_code" => row["code"]}
  puts datum.to_json
end

# Python
import json
import turbotlib

for n in range(0,5):
    datum = {"uid": n,
             "code": "ABC-%s" % n}
    turbotlib.sql.save(["uid"], datum)

for row in turbotlib.sql.select("* from swdata"):
    datum = {"extracted_code": row["code"]}
    print(json.dumps(datum))
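
The two-step pattern described above (collect the company numbers first, then visit each one) can be sketched with Python's standard sqlite3 module instead of the turbotlib helpers. Everything here is illustrative: the table name, the INDEX_PAGE and DETAIL_PAGES stand-ins for real web pages, and the in-memory database are assumptions made so the example runs on its own.

```python
import json
import sqlite3

# Stand-ins for real pages: an index listing company numbers, and one
# detail "page" per number. A real bot would fetch these over HTTP.
INDEX_PAGE = ["001", "002", "003"]
DETAIL_PAGES = {"001": "Foo Ltd", "002": "Bar Ltd", "003": "Baz Ltd"}


def collect_numbers(conn):
    """Step 1: store every company number from the index in a table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS numbers (company_number TEXT PRIMARY KEY)"
    )
    for number in INDEX_PAGE:
        # INSERT OR IGNORE keeps reruns from creating duplicate rows.
        conn.execute("INSERT OR IGNORE INTO numbers VALUES (?)", (number,))
    conn.commit()


def scrape_details(conn):
    """Step 2: iterate over the stored numbers and build one record for each."""
    records = []
    query = "SELECT company_number FROM numbers ORDER BY company_number"
    for (number,) in conn.execute(query):
        records.append(
            {"company_number": number, "company_name": DETAIL_PAGES[number]}
        )
    return records


# In-memory database for the demo; a real bot would use a file in its data directory.
conn = sqlite3.connect(":memory:")
collect_numbers(conn)
for record in scrape_details(conn):
    print(json.dumps(record))
```

Keeping the numbers in a table means the second step can be rerun, or resumed, without repeating the first.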

3. Scraping via a search form

If you're lucky, there's an easy way to iterate over every row of data. For example, your data source might make each entry available at URLs with incrementing IDs, like:

http://example.com/record?id=1
http://example.com/record?id=2

This makes your bot very easy to write; just try URLs with incrementing numbers until you get an error.
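
A minimal sketch of that loop follows. The fetch_record function is a hypothetical stand-in for an HTTP request; here it simply pretends that only IDs 1 to 3 exist and raises LookupError for anything else, which is how this example models "the site returned an error".

```python
def fetch_record(record_id):
    """Hypothetical stand-in for requesting http://example.com/record?id=<record_id>."""
    if 1 <= record_id <= 3:
        return {
            "id": record_id,
            "source_url": "http://example.com/record?id=%d" % record_id,
        }
    raise LookupError("no record with id %d" % record_id)


def scrape_sequential(fetch, first_id=1):
    """Fetch first_id, first_id + 1, ... and stop at the first error."""
    records = []
    record_id = first_id
    while True:
        try:
            records.append(fetch(record_id))
        except LookupError:
            break
        record_id += 1
    return records


records = scrape_sequential(fetch_record)
print(len(records))
```

Real sites sometimes have gaps in their numbering, so a production bot may want to tolerate a few consecutive errors before giving up.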

Very often, however, you are only given a search form for finding entries.

A common way to deal with this is to iterate over every possible 2-character permutation. For an English website, this means doing a search for "AA", "AB", "AC", ... "9X", "9Y", "9Z". It's inefficient, and you will frequently get the same record more than once, but it's often the only option.

Here's one way of doing it:

# Ruby
chars = ('a'..'z').to_a + ('0'..'9').to_a
chars.repeated_permutation(2).each do |x, y|
  get_search_results_for(x + y)  # your own function that runs one search
end

# Python
import string
import itertools

chars = string.digits + string.ascii_lowercase
for x, y in itertools.product(chars, repeat=2):
    get_search_results_for(x + y)  # your own function that runs one search

Because this is relatively inefficient, it's a good idea to store progress through the permutations, so you don't have to start from the beginning if interrupted.
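
One way this might look is sketched below. A plain dict stands in for persistent storage (in a real bot you would use something like the save_var and get_var helpers shown earlier), and the max_terms argument exists only to simulate an interruption. The function and key names are illustrative.

```python
import itertools
import string

CHARS = string.digits + string.ascii_lowercase
# Every 2-character search term, in a fixed order so an index identifies a position.
TERMS = ["".join(pair) for pair in itertools.product(CHARS, repeat=2)]


def run_searches(state, search, max_terms=None):
    """Search each remaining term, recording the index of the next one in state."""
    start = state.get("next_term", 0)
    if max_terms is None:
        stop = len(TERMS)
    else:
        stop = min(len(TERMS), start + max_terms)
    for index in range(start, stop):
        search(TERMS[index])
        state["next_term"] = index + 1  # saved only after the search succeeded
    if state.get("next_term", 0) >= len(TERMS):
        state["next_term"] = 0  # finished: start from the beginning next time


state = {}     # stand-in for persistent storage between runs
searched = []  # records which terms were searched, in order
run_searches(state, searched.append, max_terms=40)  # simulate an interrupted run
run_searches(state, searched.append)                # resume and finish
print(len(searched), len(set(searched)))
```

Saving the index of the next term, rather than the term just completed, means a resumed run does not repeat the last search.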

4. Structured bots

So far, all the examples are for bots which produce "Primary Data". Turbot can process several kinds of data, defined by the data_type field in your manifest. Primary Data is designed to stay in Turbot; all other Data Types are also stored in OpenCorporates against the main company record, where possible.

Primary Data is data that's as close as possible to the original source. For example, if you're scraping tables from a PDF document, the primary data would be rows of JSON corresponding to the rows of the PDF tables, and each JSON key would correspond with a table header. By making the raw data available in this way, without processing it or interpreting it, you're enabling others to use the data in ways you haven't thought of yet.
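
To make the "one JSON key per table header" idea concrete, here is a small sketch that turns already-extracted table rows into lines of primary data. The header names, rows and URL are invented for illustration, and extracting the rows from the PDF itself is outside the scope of the example.

```python
import json


def rows_to_json_lines(header, rows, source_url, sample_date):
    """Yield one JSON line per table row, keyed by the table's own headers."""
    for row in rows:
        record = dict(zip(header, row))
        record["source_url"] = source_url
        record["sample_date"] = sample_date
        yield json.dumps(record, sort_keys=True)


# Invented sample table: the keys are kept exactly as they appear in the source.
header = ["Company", "Licence No.", "Status"]
rows = [
    ["Bank of Foo", "foo0", "Current"],
    ["Bank of Bar", "bar1", "Revoked"],
]
for line in rows_to_json_lines(
    header, rows, "http://example.com/licences.pdf", "2014-06-01"
):
    print(line)
```

Note that the keys are left as the source wrote them; renaming them to standard field names is the job of a transformer, not of the primary data.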

However, it's also desirable to get data into the main OpenCorporates database, so it can be cross-referenced with other data about the same company.

In order to do this, in addition to Primary Data, we encourage you to output interpreted data in structured formats which are supported by OpenCorporates. The target format is declared by the data_type fields in your manifest.

You can submit several data types (e.g. "primary data" and "licences") from a single bot, by providing transformers alongside the main scraper.

Consider this bot, which outputs primary data:

# Ruby
require 'json'

5.times do |i|
  datum = {
    :company => "Bank of Foo #{i}",
    :licence_number => "foo#{i}",
    :address => "2 High Street, London",
    :address_country => "UK",
    :website => "http://foo.com",
    :sample_date => '2014-06-01',
    :source_url => "example_data_source.com/foo#{i}"
  }

  puts JSON.dump(datum)
end

# Python
import json
import datetime

for n in range(0,5):
    data = {"company": "Bank of Foo %s" % n,
            "address": "2 High Street, London",
            "address_country": "UK",
            "website": "http://foo.com",
            "licence_number": "foo%s" % n,
            "sample_date": datetime.datetime.now().isoformat(),
            "source_url": "http://somewhere.com/%s" % n}
    print(json.dumps(data))

This bot outputs primary data with several fields. A subset of these provides all the information we need to make a Licence in OpenCorporates. By referring to the officially supported data types, you can see that simple licence data may, for example, have the following fields:

    company_name
    company_jurisdiction            # e.g. "France"
    source_url
    sample_date
    category
    licence_number
    status                          # e.g. "Revoked", "Current", etc
    jurisdiction_classification     # e.g. "Pet shop"
    confidence                      # Your confidence in your interpretation of the primary data
                                    # Must be HIGH, MEDIUM, or LOW

As well as outputting primary data, it is very useful to write a transformer, which outputs a second set of standardised data. A transformer is any script that reads lines from STDIN and writes lines to STDOUT (if an input line does not produce desirable output, write an empty line to STDOUT). Here's one that converts the output of the primary data scraper above to a simple licence format. Call it licence_transformer.rb (Ruby) or licence_transformer.py (Python):

# Ruby (licence_transformer.rb)
require 'json'

STDIN.each_line do |line|
  raw_record = JSON.parse(line)

  licence_record = {
    :company_name => raw_record['company'],
    :company_jurisdiction => raw_record['address_country'],
    :source_url => raw_record['source_url'],
    :sample_date => raw_record['sample_date'],
    :licence_number => raw_record['licence_number'],
    :jurisdiction_classification => 'Bank',
    :category => "Financial",
    :confidence => 'HIGH',
  }

  puts licence_record.to_json
end

# Python (licence_transformer.py)
import sys
import json

while True:
  line = sys.stdin.readline()
  if not line:
    break
  raw_record = json.loads(line)

  licence_record = {
    "company_name": raw_record['company'],
    "company_jurisdiction": raw_record['address_country'],
    "source_url": raw_record['source_url'],
    "sample_date": raw_record['sample_date'],
    "licence_number": raw_record['licence_number'],
    "jurisdiction_classification": 'Bank',
    "category": 'Financial',
    "confidence": 'HIGH',
  }

  print(json.dumps(licence_record))

(Note that you cannot just iterate over sys.stdin, as this can lead to deadlocks because of buffering behaviour internal to Python.)
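
The transformers above emit a record for every input line. The sketch below shows one way the "write an empty line" rule might be applied. The rule for what counts as undesirable (a record with no licence_number) is an assumption chosen for illustration, and io.StringIO objects stand in for STDIN and STDOUT so the example can run without a pipe.

```python
import io
import json
import sys


def transform(line):
    """Return a simple-licence JSON line, or an empty string if the record is unusable."""
    raw_record = json.loads(line)
    if not raw_record.get("licence_number"):
        return ""  # illustrative rule: no licence number, so no licence record
    licence_record = {
        "company_name": raw_record["company"],
        "company_jurisdiction": raw_record["address_country"],
        "source_url": raw_record["source_url"],
        "sample_date": raw_record["sample_date"],
        "licence_number": raw_record["licence_number"],
        "jurisdiction_classification": "Bank",
        "category": "Financial",
        "confidence": "HIGH",
    }
    return json.dumps(licence_record)


def run(stdin, stdout):
    """Read with readline(), as in the example above, and write one line per input line."""
    while True:
        line = stdin.readline()
        if not line:
            break
        stdout.write(transform(line) + "\n")


# Two sample input lines; the second has no licence_number.
sample_input = io.StringIO(
    '{"company": "Bank of Foo 0", "address_country": "UK", "licence_number": "foo0", '
    '"sample_date": "2014-06-01", "source_url": "http://somewhere.com/0"}\n'
    '{"company": "Bank of Foo 1", "address_country": "UK", '
    '"sample_date": "2014-06-01", "source_url": "http://somewhere.com/1"}\n'
)
output = io.StringIO()
run(sample_input, output)
sys.stdout.write(output.getvalue())
```

In a real transformer you would call run(sys.stdin, sys.stdout) instead of using the in-memory streams.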

Finally, add a transformers field to the manifest. This instructs Turbot to use the transformer to produce records of type simple-licence:

{
  "bot_id": "bank_licences",
  "title": "Bank Licence scraper",
  "description": "Scrapes Banking Licences",
  "language": "ruby",
  "data_type": "primary data",
  "identifying_fields": [
    "licence_number"
  ],
  "files": ["scraper.rb", "licence_transformer.rb"],
  "transformers": [{
    "file":  "licence_transformer.rb",
    "data_type": "simple-licence",
    "identifying_fields": ["licence_number"]
  }],
  "frequency": "monthly",
  "publisher": {
    "name": "Publisher of the data",
    "url": "Publisher's website",
    "terms": "Copyright terms (e.g. Open Government License, None, etc)",
    "terms_url": "A place where these terms can be checked or verified"
  }
}

The equivalent manifest for a Python bot:

{
  "bot_id": "bank_licences",
  "title": "Bank Licence scraper",
  "description": "Scrapes Banking Licences",
  "language": "python",
  "data_type": "primary data",
  "identifying_fields": [
    "licence_number"
  ],
  "files": ["scraper.py", "licence_transformer.py"],
  "transformers": [{
    "file":  "licence_transformer.py",
    "data_type": "simple-licence",
    "identifying_fields": ["licence_number"]
  }],
  "frequency": "monthly",
  "publisher": {
    "name": "Publisher of the data",
    "url": "Publisher's website",
    "terms": "Copyright terms (e.g. Open Government License, None, etc)",
    "terms_url": "A place where these terms can be checked or verified"
  }
}

You can test a transformer in development by piping the output of your scraper to the transformer:

ruby scraper.rb | ruby licence_transformer.rb
python scraper.py | python licence_transformer.py

You can validate the combined output of your bot as usual with turbot bots:validate.

5. Finishing runs manually

Turbot has the concept of runs. If data is present in one run, but absent in the subsequent run, this allows Turbot to infer that the data is no longer true. By default, a run is considered ended when your bot finishes execution.

However, for really big bots, that take perhaps 2 days to run, it can make sense to update them incrementally, rather than rescraping the entire dataset each time. For example, a website might make recent updates available at a separate URL, or you might be able to infer that new entries are always added with sequential IDs.

In these cases, you will usually want to store state (see above), so you can work out which ID to try fetching next. In addition to this, you should add a new field to your manifest, manually_end_run: true. This tells Turbot not to consider the run completed unless explicitly told otherwise.

If you want to tell Turbot that a run should be considered complete (i.e. the bot intends to start from the very beginning on its next run), you can write a special flag to STDOUT (rather than JSON), thus:

RUN ENDED

Example to follow!
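
Until an official example is available, here is a rough sketch of how an incremental bot might use the flag. It assumes the manifest already contains manually_end_run: true, and that the flag is written on a line of its own after the last record. The scrape_updates function, its arguments and the sample URLs are all hypothetical.

```python
import json


def scrape_updates(last_seen_id, newest_id):
    """Hypothetical incremental scrape: yield records with IDs after last_seen_id."""
    for record_id in range(last_seen_id + 1, newest_id + 1):
        yield {
            "id": record_id,
            "sample_date": "2014-06-01",
            "source_url": "http://example.com/record?id=%d" % record_id,
        }


def output_lines(last_seen_id, newest_id, dataset_complete):
    """Build the lines a bot would write to STDOUT for one execution."""
    lines = [json.dumps(record) for record in scrape_updates(last_seen_id, newest_id)]
    if dataset_complete:
        # The whole dataset has been covered, so the next execution will start
        # from the very beginning: mark the run as ended.
        lines.append("RUN ENDED")
    return lines


for line in output_lines(0, 3, dataset_complete=True):
    print(line)
```

When dataset_complete is False, the bot prints only the new records and the run stays open for the next execution to continue.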