This walkthrough assumes you've read the Quickstart guide, and you understand how to create a new bot and validate its data.
The template bot generated by turbot bots:generate only iterates over a loop. This is easy to understand, but not very useful.
The first example, therefore, illustrates scraping a real web page (in this case, something hosted on our server).
Bots must conform to the Turbot Specification to be valid. This means they must output valid lines of JSON, and must be accompanied by a valid manifest.json.
To try this example, create a bot as usual, replacing <bot_name> with something unique:
turbot bots:generate --bot <bot_name> --language=ruby
turbot bots:generate --bot <bot_name> --language=python
Then cd <bot_name> and edit scraper.rb (or scraper.py):
from datetime import date
import json

from bs4 import BeautifulSoup
import requests
import turbotlib

source_url = "https://turbot.opencorporates.com/examples/basic_bot.html"
sample_date = str(date.today())

turbotlib.log("Starting scrape...")  # optional debug logging

response = requests.get(source_url)
html = response.text
doc = BeautifulSoup(html)
table = doc.find('table', class_='setoutList')

for tr in table.find('tbody').find_all('tr'):
    # Each tr element has two td elements.
    tds = tr.find_all('td')
    record = {
        'company_name': tds[0].text,
        'group_name': tds[1].text,
        'sample_date': sample_date,  # mandatory field
        'source_url': source_url     # mandatory field
    }
    # The important part of the Turbot specification is that your scraper
    # outputs lines of JSON
    print(json.dumps(record))
require 'json'
require 'mechanize'
require 'turbotlib'

SOURCE_URL = "https://turbot.opencorporates.com/examples/basic_bot.html"

agent = Mechanize.new

Turbotlib.log("Starting scrape...") # optional debug logging

doc = agent.get(SOURCE_URL).parser

doc.css('.setoutList tbody tr').each do |row|
  cols = row.css('td').map { |r| r.text }
  datum = {
    company_name: cols[0],
    company_number: cols[1],
    source_url: SOURCE_URL,  # mandatory field
    sample_date: Time.now    # mandatory field
  }
  # The important part of the Turbot specification is that your scraper
  # outputs lines of JSON
  puts JSON.dump(datum)
end
As usual, you can test the bot with ruby scraper.rb or python scraper.py.
Finally, you'll need to change the identifying_fields key in the manifest to ["company_name"]. Then you can run turbot bots:validate to verify that the data produced by the bot is valid.
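For example, the relevant line of manifest.json would end up looking like this (the rest of the generated manifest is left unchanged):

"identifying_fields": ["company_name"]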
Sometimes, you need to store data locally in order to complete a run.
You can save things in the location provided by Turbotlib.data_dir (Ruby) or turbotlib.data_dir() (Python); anything here will be persisted between runs. For example, if you needed to write a CSV before parsing it, you might do this:
require 'turbotlib'

csv_file = open("#{Turbotlib.data_dir}/something.csv", "w")
csv_file.write("test,csv,line")
csv_file.close
import turbotlib

csv_file = open("%s/something.csv" % turbotlib.data_dir(), "w")
csv_file.write("test,csv,line")
csv_file.close()
A common use case for storing state is to keep a record of a bot's scraping progress: for bots that take several hours to run, it is usually helpful to be able to resume if they are interrupted.
In turbotlib, we provide some convenience functions for saving single variables:
require 'json'
require 'turbotlib'

# Start from 0, or the last point we got to
start = Turbotlib.get_var("progress") || 0

(start...5).each do |n|
  data = {
    company: "Foo Inc #{n}",
    licence_number: "foo#{n}",
    sample_date: Time.now,
    source_url: "http://somewhere.com/#{n}"
  }
  Turbotlib.log("progress: #{n}")
  puts JSON.dump(data)
  # Save current progress
  Turbotlib.save_var("progress", n)
  # Wait a couple of seconds (to give you time to interrupt and test)
  sleep 2
end

# Important! We need to reset the progress when we've finished,
# so the next time we start from the beginning again
Turbotlib.save_var("progress", 0)
import datetime
import json
import time

import turbotlib

# Start from 0, or the last point we got to
try:
    start = turbotlib.get_var("progress")
except KeyError:
    start = 0

for n in range(start, 5):
    data = {
        "company": "Foo Inc %s" % n,
        "licence_number": "foo%s" % n,
        "sample_date": datetime.datetime.now().isoformat(),
        "source_url": "http://somewhere.com/%s" % n,
    }
    turbotlib.log("progress: %s" % n)
    print(json.dumps(data))
    # Save progress
    turbotlib.save_var("progress", n)
    # Wait a couple of seconds (to give you time to interrupt and test)
    time.sleep(2)

# Important! We need to reset the progress when we've finished,
# so the next time we start from the beginning again
turbotlib.save_var("progress", 0)
Try running this bot (i.e. with ruby scraper.rb or python scraper.py), and then interrupting it with Ctrl+C. Then run it again. It should restart where it left off.
You can also store data in a local SQLite store. We provide some convenience methods to make working with SQLite data easier.
Access to a SQL store is useful if your bot needs to do multi-step preprocessing. For example, you might want to scrape a site that lists a set of company numbers on one page, but gives the information related to each number on several other pages. In that case, it might make sense to store the numbers in a SQL table, and then iterate over them once you have collected them all (a sketch of this two-step pattern follows the simple example below).
The SQLite convenience methods are provided via the ScraperWiki libraries. See the scraperwiki Python README or the scraperwiki Ruby README for more information.
Here's a simple example:
require 'json'
require 'scraperwiki'
require 'turbotlib'

# Save data in SQL
5.times do |n|
  datum = {
    uid: n,
    code: "ABC-#{n}"
  }
  Turbotlib.save_sqlite([:uid], datum)
end

Turbotlib.select("* from swdata") do |row|
  datum = { "extracted_code" => row["code"] }
  puts datum.to_json
end
import json

import turbotlib

# Save data in SQL
for n in range(0, 5):
    datum = {"uid": n, "code": "ABC-%s" % n}
    turbotlib.sql.save(["uid"], datum)

for row in turbotlib.sql.select("* from swdata"):
    datum = {"extracted_code": row["code"]}
    print(json.dumps(datum))
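The two-step pattern described above (collect company numbers first, then fetch a page for each) might be sketched in Python roughly as follows, using the turbotlib.sql helpers shown above. The URLs, the CSS class and the detail-page structure are all invented for illustration; a real bot would adapt them to the source:

from datetime import date
import json

import requests
import turbotlib
from bs4 import BeautifulSoup

INDEX_URL = "http://example.com/companies"       # hypothetical index page
DETAIL_URL = "http://example.com/companies/%s"   # hypothetical detail page

# Step 1: collect the company numbers into a local SQL table
index_page = BeautifulSoup(requests.get(INDEX_URL).text)
for link in index_page.find_all("a", class_="company"):  # hypothetical markup
    turbotlib.sql.save(["company_number"], {"company_number": link.text.strip()})

# Step 2: iterate over the stored numbers and scrape each detail page
for row in turbotlib.sql.select("* from swdata"):
    detail_url = DETAIL_URL % row["company_number"]
    detail = BeautifulSoup(requests.get(detail_url).text)
    print(json.dumps({
        "company_number": row["company_number"],
        "company_name": detail.find("h1").text,  # hypothetical page structure
        "sample_date": str(date.today()),
        "source_url": detail_url,
    }))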
If you're lucky, there's an easy way to iterate over every row of data. For example, your data source might make each entry available at URLs with incrementing IDs, like:
http://example.com/record?id=1
http://example.com/record?id=2
This makes your bot very easy to write; just try URLs with incrementing numbers until you get an error.
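A rough sketch of this in Python (the URL pattern is invented, and we assume that a non-200 response means we have run past the last record; a real source might instead signal the end with an empty page or a redirect):

from datetime import date
import json

import requests
import turbotlib

SOURCE_URL = "http://example.com/record?id=%s"  # hypothetical URL pattern

record_id = 1
while True:
    response = requests.get(SOURCE_URL % record_id)
    if response.status_code != 200:
        # Assume an error response means there are no more records
        break
    turbotlib.log("Fetched record %s" % record_id)
    # Parsing is source-specific; here we just pass the raw page through
    print(json.dumps({
        "raw_html": response.text,
        "sample_date": str(date.today()),
        "source_url": SOURCE_URL % record_id,
    }))
    record_id += 1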
Very often, however, you are only given a search form for finding entries.
A common way to deal with this is to iterate over every possible 2-character permutation. For an English website, this means doing a search for "AA", "AB", "AC", ... "9X", "9Y", "9Z". It's inefficient, and you will frequently get the same record more than once, but it's often the only option.
Here's one way of doing it:
chars = ('a'..'z').to_a + ('0'..'9').to_a

chars.repeated_permutation(2).each do |x, y|
  get_search_results_for(x + y)
end
import itertools
import string

chars = string.digits + string.ascii_lowercase

for x, y in itertools.product(chars, repeat=2):
    get_search_results(x + y)
Because this is relatively inefficient, it's a good idea to store progress through the permutations, so you don't have to start from the beginning if interrupted.
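One way of doing this in Python is sketched below, combining save_var / get_var with the permutations; get_search_results is a stand-in for the source-specific scraping shown above:

import itertools
import string

import turbotlib


def get_search_results(term):
    # Source-specific: fetch the search results for `term` and print one
    # line of JSON per record found. Left as a stub here.
    turbotlib.log("searching for %s" % term)


chars = string.digits + string.ascii_lowercase
searches = ["".join(pair) for pair in itertools.product(chars, repeat=2)]

# Start from 0, or the last point we got to
try:
    start = turbotlib.get_var("search_progress")
except KeyError:
    start = 0

for i in range(start, len(searches)):
    get_search_results(searches[i])
    turbotlib.save_var("search_progress", i)

# Reset progress so the next run starts from the beginning again
turbotlib.save_var("search_progress", 0)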
So far, all the examples are for bots which produce "Primary Data". Turbot can process several kinds of data, defined by the data_type field in your manifest. Primary Data is designed to stay in Turbot; all other Data Types are also stored in OpenCorporates against the main company record, where possible.
Primary Data is data that's as close as possible to the original source. For example, if you're scraping tables from a PDF document, the primary data would be rows of JSON corresponding to the rows of the PDF tables, and each JSON key would correspond with a table header. By making the raw data available in this way, without processing it or interpreting it, you're enabling others to use the data in ways you haven't thought of yet.
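For instance, a row from a hypothetical PDF table whose headers are "Registered Name", "Licence No" and "Status" might be output as a line of JSON like this (field names and values are invented for illustration):

{"registered_name": "Bank of Foo", "licence_no": "12345", "status": "Current", "sample_date": "2014-06-01", "source_url": "http://example.com/register.pdf"}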
However, it's also desirable to get data into the main OpenCorporates database, so it can be cross-referenced with other data about the same company.
In order to do this, in addition to Primary Data, we encourage you to output interpreted data in structured formats which are supported by OpenCorporates. The target format is declared by the data_types in your manifest.
You can submit several data types (e.g. "primary data" and "licences") from a single bot, by providing transformers alongside the main scraper.
Consider this bot, which outputs primary data:
require 'json'

5.times do |i|
  datum = {
    :company => "Bank of Foo #{i}",
    :licence_number => "foo#{i}",
    :address => "2 High Street, London",
    :address_country => "UK",
    :website => "http://foo.com",
    :sample_date => '2014-06-01',
    :source_url => "example_data_source.com/foo#{i}"
  }
  puts JSON.dump(datum)
end
import datetime
import json

for n in range(0, 5):
    data = {
        "company": "Bank of Foo %s" % n,
        "address": "2 High Street, London",
        "address_country": "UK",
        "website": "http://foo.com",
        "licence_number": "foo%s" % n,
        "sample_date": datetime.datetime.now().isoformat(),
        "source_url": "http://somewhere.com/%s" % n,
    }
    print(json.dumps(data))
This primary data contains several fields. A subset of these comprises all the information we need to make a Licence in OpenCorporates. By referring to the officially supported data types, you can see that simple licence data may, for example, have the following fields:
company_name
company_jurisdiction          # e.g. "France"
source_url
sample_date
category
licence_number
status                        # e.g. "Revoked", "Current", etc.
jurisdiction_classification   # e.g. "Pet shop"
confidence                    # Your confidence in your interpretation of the
                              # primary data; must be HIGH, MEDIUM, or LOW
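Put together, a single record of this type, emitted as one line of JSON, might look something like the following (all values are illustrative):

{"company_name": "Bank of Foo 1", "company_jurisdiction": "UK", "source_url": "http://example.com/foo1", "sample_date": "2014-06-01", "category": "Financial", "licence_number": "foo1", "status": "Current", "jurisdiction_classification": "Bank", "confidence": "HIGH"}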
As well as outputting primary data, it is very useful to write a transformer, which outputs a second set of standardised data. A transformer is any script which reads lines from STDIN and writes them to STDOUT (if an input line does not produce desirable output, write an empty line to STDOUT). Here's one that converts the output of the primary data scraper above to a simple licence format - call it licence_transformer.rb (or licence_transformer.py):
require 'json'

STDIN.each_line do |line|
  raw_record = JSON.parse(line)
  licence_record = {
    :company_name => raw_record['company'],
    :company_jurisdiction => raw_record['address_country'],
    :source_url => raw_record['source_url'],
    :sample_date => raw_record['sample_date'],
    :licence_number => raw_record['licence_number'],
    :jurisdiction_classification => 'Bank',
    :category => 'Financial',
    :confidence => 'HIGH',
  }
  puts licence_record.to_json
end
import json
import sys

while True:
    line = sys.stdin.readline()
    if not line:
        break
    raw_record = json.loads(line)
    licence_record = {
        "company_name": raw_record['company'],
        "company_jurisdiction": raw_record['address_country'],
        "source_url": raw_record['source_url'],
        "sample_date": raw_record['sample_date'],
        "licence_number": raw_record['licence_number'],
        "jurisdiction_classification": 'Bank',
        "category": 'Financial',
        "confidence": 'HIGH',
    }
    print(json.dumps(licence_record))
(Note that you cannot just iterate over sys.stdin, as this can lead to deadlocks because of buffering behaviour internal to Python.)
Finally, add a transformers field to the manifest. This instructs Turbot to use the transformer to produce records of type simple-licence:
{ "bot_id": "bank_licences", "title": "Bank Licence scraper", "description": "Scrapes Banking Licences", "language": "ruby", "data_type": "primary data", "identifying_fields": [ "licence_number" ], "files": ["scraper.rb", "licence_transformer.rb"], "transformers": [{ "file": "licence_transformer.rb", "data_type": "simple-licence", "identifying_fields": ["licence_number"] }], "frequency": "monthly", "publisher": { "name": "Publisher of the data", "url": "Publisher's website", "terms": "Copyright terms (e.g. Open Government License, None, etc)", "terms_url": "A place where these terms can be checked or verified" } }
{ "bot_id": "bank_licences", "title": "Bank Licence scraper", "description": "Scrapes Banking Licences", "language": "python", "data_type": "primary data", "identifying_fields": [ "licence_number" ], "files": ["scraper.py", "licence_transformer.py"], "transformers": [{ "file": "licence_transformer.py", "data_type": "simple-licence", "identifying_fields": ["licence_number"] }], "frequency": "monthly", "publisher": { "name": "Publisher of the data", "url": "Publisher's website", "terms": "Copyright terms (e.g. Open Government License, None, etc)", "terms_url": "A place where these terms can be checked or verified" } }
You can test a transformer in development by piping the output of your scraper to the transformer:
ruby scraper.rb | ruby licence_transformer.rb
python scraper.py | python licence_transformer.py
You can validate the combined output of your bot as usual with turbot bots:validate.
Turbot has the concept of runs. If data is present in one run, but absent in the subsequent run, this allows Turbot to infer that the data is no longer true. By default, a run is considered ended when your bot finishes execution.
However, for really big bots that take perhaps 2 days to run, it can make sense to update them incrementally, rather than rescraping the entire dataset each time. For example, a website might make recent updates available at a separate URL, or you might be able to infer that new entries are always added with sequential IDs.
In these cases, you will usually want to store state (see above), so you can work out which ID to try fetching next. In addition to this, you should add a new field to your manifest, manually_end_run: true. This tells Turbot not to consider the run completed unless explicitly told otherwise.
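In the manifest this is just an additional key alongside the fields shown in the examples above, for instance:

"manually_end_run": true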
If you want to tell Turbot that a run should be considered complete (i.e. the bot intends to start from the very beginning on its next run), you can write a special flag to STDOUT (rather than a line of JSON), thus:
RUN ENDED

Example to follow!
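A rough sketch of how a bot split across several executions might use this flag (assuming manually_end_run: true is set in the manifest, and using an invented dataset size and batch size; a real bot would replace the stand-in record with actual scraping):

from datetime import date
import json

import turbotlib

TOTAL_RECORDS = 10  # hypothetical size of the dataset
BATCH_SIZE = 3      # hypothetical number of records to fetch per execution

# Carry on from wherever the last execution finished
try:
    start = turbotlib.get_var("progress")
except KeyError:
    start = 0

progress = start
for n in range(start, min(start + BATCH_SIZE, TOTAL_RECORDS)):
    print(json.dumps({
        "company": "Foo Inc %s" % n,  # stand-in for real scraping
        "sample_date": str(date.today()),
        "source_url": "http://example.com/%s" % n,
    }))
    progress = n + 1
    turbotlib.save_var("progress", progress)

if progress >= TOTAL_RECORDS:
    # The whole dataset has now been covered, so declare the run complete
    # and reset our progress ready for the next full pass.
    print("RUN ENDED")
    turbotlib.save_var("progress", 0)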