Turbot specification

A bot must:

  1. Be written in Python or Ruby (more languages to follow)
  2. Only use libraries that are installed in our production system
  3. Be named scraper.py or scraper.rb
  4. Write JSON objects to STDOUT, one per line, in a format that passes the bots:validate command (i.e. one that matches a supported schema)
  5. Include a valid manifest.json (see below)
  6. Be subject to a licence that permits OpenCorporates' reuse (e.g. MIT)
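
As a rough illustration of requirements 3 and 4, a minimal scraper.py might look like the sketch below. The fetch_records function and the "number" and "name" fields are hypothetical stand-ins; the fields actually required depend on the schema your bot targets, and only the bots:validate command can confirm that the output is acceptable.

```python
# scraper.py: a minimal sketch of a bot that emits one JSON object per line.
import json
import sys


def fetch_records():
    """Hypothetical stand-in for real scraping logic."""
    # Field names are illustrative only; the real ones depend on the
    # schema named by "data_type" in manifest.json.
    yield {"number": "0001", "name": "Example Ltd"}
    yield {"number": "0002", "name": "Another Ltd"}


def to_json_lines(records):
    """Serialise each record as a single line of JSON (no embedded newlines)."""
    for record in records:
        yield json.dumps(record)


if __name__ == "__main__":
    for line in to_json_lines(fetch_records()):
        # One JSON object per line, written to STDOUT.
        sys.stdout.write(line + "\n")
```

Keeping the serialisation in its own function makes it easy to check the output locally before running the validate command.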

Manifest specification

A manifest has the following properties. All are required unless marked optional:

 "bot_id": "my_amazing_bot",             # <- a unique identifier
 "title": "An amazing bot",              # <- descriptive title
 "description": "This is a simple bot",  # <- longer description
 "namespace_id": "my_amazing_bot",       # <- optional. Defaults to the value of `bot_id`.
 "language": "ruby",                     # <- language of bot (currently 'python' or 'ruby')
 "data_type": "primary data",            # <- reference to a Turbot schema
 "identifying_fields": ["number"],       # <- like primary key in a SQL database
 "files": ["scraper.rb"],                # <- list of files required for the bot to run
 "frequency": "monthly",                 # <- desired scrape frequency (once, daily, weekly, monthly
                                         #    or yearly)
 "publisher": {                          # <- essential so we can be sure it's open data
   "name": "Publisher of the data",
   "url": "Publisher's website",
   "terms": "Copyright terms (e.g. Open Government License, n/a, etc)",
   "terms_url": "A place where these terms can be checked or verified"
 }
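
The properties above can be sanity-checked locally before submitting a bot. The following is only a sketch, not the bots:validate command: the function names manifest_problems and load_manifest are hypothetical, and it assumes that all four publisher keys shown above are expected.

```python
import json

# Properties the specification lists without marking them optional.
REQUIRED = [
    "bot_id", "title", "description", "language", "data_type",
    "identifying_fields", "files", "frequency", "publisher",
]
LANGUAGES = {"python", "ruby"}
FREQUENCIES = {"once", "daily", "weekly", "monthly", "yearly"}
# Assumption: all four publisher keys shown in the specification are expected.
PUBLISHER_KEYS = ["name", "url", "terms", "terms_url"]


def manifest_problems(manifest):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing property: {key}" for key in REQUIRED if key not in manifest]
    if "language" in manifest and manifest["language"] not in LANGUAGES:
        problems.append(f"unsupported language: {manifest['language']}")
    if "frequency" in manifest and manifest["frequency"] not in FREQUENCIES:
        problems.append(f"unsupported frequency: {manifest['frequency']}")
    publisher = manifest.get("publisher")
    if isinstance(publisher, dict):
        problems += [
            f"missing publisher.{key}" for key in PUBLISHER_KEYS if key not in publisher
        ]
    return problems


def load_manifest(path="manifest.json"):
    """Parse the manifest; json.load raises an error if the file is not valid JSON."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

Calling manifest_problems(load_manifest()) from the bot's directory lists anything this sketch considers missing. Note that a real manifest.json must be strict JSON, so the # comments used in this specification cannot appear in the file itself.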

Manifests may also include the following optional fields:

"tags": ["licence", "financial"],       # <- arbitrary tags; use those suggested in missions (if any)
"manually_end_run": true,               # <- Each time it's run, the bot gets more records,
                                        #    rather than restarting
"transformers": [{                      # <- An array of transformers
  "file":  "licence_transformer.py",    # <- Path to transformer
  "data_type": "simple-licence",        # <- Data type that transformer emits
  "identifying_fields": ["licence_number"]
}],
"public_repo_url": "http://github.com/username/my_amazing_bot_repo"
                                        # <- URL where bot's code is publicly available

For more about transformers and incremental bots, see the examples.