Your bot's manifest.json should look something like this:
{
  "bot_id": "my_amazing_bot",
  "title": "An amazing bot",
  "description": "This is a simple bot",
  "language": "ruby",
  "data_type": "primary data",
  "identifying_fields": ["number"],
  "files": ["scraper.rb"],
  "frequency": "monthly"
}
See here for more about the fields in manifest.json.
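As a quick local sanity check, a manifest can be parsed and inspected from Ruby before the bot is uploaded. The sketch below is not part of Turbot: the helper name `missing_manifest_fields` is hypothetical, and treating every field from the example above as expected is an assumption made for illustration.

```ruby
require "json"

# Fields taken from the example manifest above. Treating all of them as
# mandatory is an assumption made for this sketch, not a Turbot rule.
EXPECTED_MANIFEST_FIELDS = %w[
  bot_id title description language data_type
  identifying_fields files frequency
].freeze

# Hypothetical helper: returns the expected fields that are absent or empty.
def missing_manifest_fields(manifest)
  EXPECTED_MANIFEST_FIELDS.select do |field|
    value = manifest[field]
    value.nil? || (value.respond_to?(:empty?) && value.empty?)
  end
end

# A deliberately incomplete manifest, to show what the check reports.
manifest = JSON.parse(<<~MANIFEST)
  {
    "bot_id": "my_amazing_bot",
    "title": "An amazing bot",
    "language": "ruby",
    "data_type": "primary data",
    "files": ["scraper.rb"]
  }
MANIFEST

puts missing_manifest_fields(manifest).inspect
```

In a real bot directory the same helper could be fed `JSON.parse(File.read("manifest.json"))` instead of the inline string.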
The data_type field tells Turbot what to do with your data. Turbot also uses Data Types to validate data (with turbot bots:validate).
At the moment, we are only supporting two Data Types: primary data and simple-licence. Data Types to support Addresses, Industry Codes, Subsidiaries and Trademarks will be available in the next few weeks.
Primary Data is the first Data Type you will usually output. It is data that's as close as possible to the original source. For example, if you're scraping tables from a PDF document, the primary data would be rows of JSON corresponding to the rows of the PDF tables, and each JSON key would correspond with a table header. By making the raw data available in this way, without processing it or interpreting it, you're enabling others to use the data in ways you haven't thought of yet.
However, it's also desirable to get data into the main OpenCorporates database so it can be cross-referenced with other data about the same company.
In order to do this, in addition to Primary Data, we encourage you to output interpreted data in other structured formats which are supported by OpenCorporates (see examples of how to do this).
data_type: "primary data"
This Data Type is designed for recording the raw data as it appears on a source website. Where possible, we ask you to write bots that output specific, structured data types in addition to Primary Data (example).
Every row of JSON output by a primary data bot must contain a source_url field and a sample_date field. Example:
{"name": "Foo Corp", "flavour": "Vanilla", "source_url": "http://my.data.source/foocorp", "sample_date": "2014-04-01"}
JSON schema here.
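A minimal sketch of how a Ruby bot might attach the two required fields to each scraped row. The helper name `primary_data_row` is hypothetical, and printing one JSON object per line to standard output is an assumption here; check the Turbot documentation for the exact output conventions.

```ruby
require "json"
require "date"

# Hypothetical helper: merges the two required fields into a scraped row.
def primary_data_row(fields, source_url, sample_date = Date.today)
  fields.merge(
    "source_url" => source_url,
    "sample_date" => sample_date.iso8601 # ISO 8601, e.g. "2014-04-01"
  )
end

row = primary_data_row(
  { "name" => "Foo Corp", "flavour" => "Vanilla" },
  "http://my.data.source/foocorp",
  Date.new(2014, 4, 1)
)

# One JSON object per line (assumed output convention).
puts JSON.generate(row)
```

Keeping the scraped keys untouched and only adding `source_url` and `sample_date` keeps the row as close as possible to the original source.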
data_type: "simple-licence"
In the broadest sense, simple licences are permissions by a regulatory authority for a company to do something. The only type we currently process is Financial Licences (for example, issued to Banks or Financial Advisors).
Here's an example:
{
  "source_url": "http://somewhere.com/a/4/23",  # Required
  "company_name": "Sterling Advisors Limited",  # Required
  "company_jurisdiction": "United Kingdom",  # Required
  "licence_jurisdiction": "Florida",  # Optional. Where the activity is permitted
  "regulator": "Florida State Financial Board",  # Optional
  "category": "Financial",  # Optional. Must be 'Financial' or 'Business' (at the moment)
  "licence_number": "149923-AB-33",  # Optional
  "jurisdiction_classification": "Advisors",  # Optional. Can also be an array of strings.
  "status": "Revoked",  # Optional
  "confidence": "HIGH",  # Optional. Must be HIGH, MEDIUM, or LOW
  "sample_date": "2014/04/01",  # Required
  "start_date": "2011/02/11",  # Optional. When the licence started to be valid
  "end_date": "2015/05/03"  # Optional. When the licence ended/will end
}
JSON schema here.
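The required and constrained fields annotated in the example can be checked locally before running turbot bots:validate. This is only a sketch: the rules are read off the annotations above rather than taken from the JSON schema, which remains authoritative, and the helper name `simple_licence_errors` is hypothetical.

```ruby
# Rules below are read off the annotated example, not the JSON schema itself.
REQUIRED_LICENCE_FIELDS = %w[
  source_url company_name company_jurisdiction sample_date
].freeze

ALLOWED_VALUES = {
  "category" => %w[Financial Business],
  "confidence" => %w[HIGH MEDIUM LOW]
}.freeze

# Hypothetical helper: returns a list of human-readable problems.
def simple_licence_errors(record)
  errors = REQUIRED_LICENCE_FIELDS
           .reject { |field| record.key?(field) }
           .map { |field| "missing required field: #{field}" }

  ALLOWED_VALUES.each do |field, allowed|
    next unless record.key?(field)          # optional fields may be absent
    next if allowed.include?(record[field])

    errors << "#{field} must be one of #{allowed.join(', ')}"
  end
  errors
end

# A record with one missing required field and one invalid value.
record = {
  "source_url" => "http://somewhere.com/a/4/23",
  "company_name" => "Sterling Advisors Limited",
  "confidence" => "VERY HIGH",
  "sample_date" => "2014/04/01"
}
puts simple_licence_errors(record)
```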
data_type: "licence"
The licence schema is an upgrade to, and replacement for, the simple-licence schema. It allows us to capture much richer data and more complex relationships. There is a worked example below of how fields should map to the new schema, and this reference page explains the individual header names in more detail.
Here's an example of a licence record in the wild, specifically a financial licence from the Texas Department of Banking. This could be captured into primary data as follows:
{
  "name": "Thomas Cook Group Limited",
  "source_url": "http://www.dob.texas.gov/entity-search/entity-detail?bid=2602245&eid=26&bn=0",
  "type": "Money Services Business",
  "Alternate Name": "Thomas Cook Travellers Cheques Ltd.",
  "Physical Address": "100 Yonge St., Scotia Plaza, 14th Floor, Toronto, Ontario, M5C 2W1",
  "County": "",
  "Agent": "Hemant Dullabh",
  "Date Closed": "03/01/2004",
  "Phone": "",
  "Web": "",
  "Mailing Address": "100 Yonge St., Scotia Plaza, 14th Floor, Toronto, Ontario, , M5C 2W1",
  "Status": "Closed",
  "License No": "2245",
  "FDIC No": "",
  "sample_date": "2015-05-29"
}
There are several fields here that we would not have been able to capture using the simple-licence schema. In addition to outputting the primary data, we can add a licence transformer, which would represent the above record as follows (see these examples of how to apply licence transformers):
{
  "confidence": "HIGH",
  "licence_holder": {
    "entity_properties": {
      "jurisdiction": "us_tx",
      "name": "Thomas Cook Group Limited",
      "mailing_address": "100 Yonge St., Scotia Plaza, 14th Floor, Toronto, Ontario, , M5C 2W1",
      "officers": [
        { "name": "Hemant Dullabh", "position": "Agent" }
      ],
      "alternative_names": [
        { "company_name": "Thomas Cook Travellers Cheques Ltd.", "type": "trading" }
      ]
    },
    "entity_type": "unknown"
  },
  "licenced_location": "100 Yonge St., Scotia Plaza, 14th Floor, Toronto, Ontario, M5C 2W1",
  "category": [ "Financial" ],
  "permissions": [
    { "activity_name": "Money Services Business", "permission_type": "operating" }
  ],
  "source_url": "http://www.dob.texas.gov/entity-search/entity-detail?bid=2602245&eid=26&bn=0",
  "sample_date": "2015-05-29",
  "jurisdiction_of_licence": "us_tx",
  "end_date": "2004-03-01",
  "status": "Closed",
  "licence_number": "2245",
  "licence_issuer": {
    "jurisdiction": "Texas, United States",
    "name": "Department of Banking"
  }
}
For a more in-depth description of each licence field, see the extended licence example.
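The mapping from the Texas primary data record to the licence record can be sketched as a plain Ruby function. This is an illustration rather than the transformer the Turbot examples use: the function name `to_licence` is hypothetical, the fixed values ("us_tx", the issuer details, "HIGH") are copied from the example above, and how the function is wired into a bot is left to the transformer examples linked above.

```ruby
require "date"
require "json"

# Hypothetical transformer: maps the Texas primary data record onto the
# licence fields used in the example. For simplicity it assumes "Agent" and
# "Alternate Name" are present; a real transformer should skip empty values.
def to_licence(raw)
  closed = raw["Date Closed"].to_s
  licence = {
    "confidence" => "HIGH",
    "licence_holder" => {
      "entity_type" => "unknown",
      "entity_properties" => {
        "jurisdiction" => "us_tx",
        "name" => raw["name"],
        "mailing_address" => raw["Mailing Address"],
        "officers" => [{ "name" => raw["Agent"], "position" => "Agent" }],
        "alternative_names" => [
          { "company_name" => raw["Alternate Name"], "type" => "trading" }
        ]
      }
    },
    "licenced_location" => raw["Physical Address"],
    "category" => ["Financial"],
    "permissions" => [
      { "activity_name" => raw["type"], "permission_type" => "operating" }
    ],
    "source_url" => raw["source_url"],
    "sample_date" => raw["sample_date"],
    "jurisdiction_of_licence" => "us_tx",
    "status" => raw["Status"],
    "licence_number" => raw["License No"],
    "licence_issuer" => {
      "jurisdiction" => "Texas, United States",
      "name" => "Department of Banking"
    }
  }
  # The source writes dates as MM/DD/YYYY; the licence example uses YYYY-MM-DD.
  licence["end_date"] = Date.strptime(closed, "%m/%d/%Y").iso8601 unless closed.empty?
  licence
end

raw = {
  "name" => "Thomas Cook Group Limited",
  "source_url" => "http://www.dob.texas.gov/entity-search/entity-detail?bid=2602245&eid=26&bn=0",
  "type" => "Money Services Business",
  "Alternate Name" => "Thomas Cook Travellers Cheques Ltd.",
  "Physical Address" => "100 Yonge St., Scotia Plaza, 14th Floor, Toronto, Ontario, M5C 2W1",
  "Agent" => "Hemant Dullabh",
  "Date Closed" => "03/01/2004",
  "Mailing Address" => "100 Yonge St., Scotia Plaza, 14th Floor, Toronto, Ontario, , M5C 2W1",
  "Status" => "Closed",
  "License No" => "2245",
  "sample_date" => "2015-05-29"
}
puts JSON.pretty_generate(to_licence(raw))
```

Note the date conversion: "Date Closed" is "03/01/2004" in the source and "2004-03-01" in the licence record, so the source format is month first.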
JSON schema here.
data_type: "simple-subsidiary"
A subsidiary captures a relationship of control between a parent and a child company.
Here's an example:
{
  "source_url": "http://somewhere.com/a/4/23",  # Required
  "parent_name": "Sterling Advisors Limited",  # Required
  "parent_jurisdiction": "United Kingdom",  # Required
  "direct": true,  # Optional. Whether the control is direct or via an intermediary. Omit this field if unknown.
  "percentage_controlled": "55.4",  # Optional
  "subsidiary_name": "Sterling Widgets",  # Required
  "subsidiary_jurisdiction": "United Kingdom",  # Required
  "sample_date": "2014/04/01",  # Required
  "start_date": "2011/02/11",  # Optional. When the relationship started
  "end_date": "2015/05/03"  # Optional. When the relationship ended/will end
}
JSON schema here.
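Because an unknown `direct` value should be omitted rather than guessed, a bot needs to drop unknown optional fields before emitting a record. A minimal sketch, assuming unknown values are represented as `nil` in the scraper; the helper name `subsidiary_record` is hypothetical.

```ruby
require "json"

# Hypothetical helper: builds a simple-subsidiary record, leaving out any
# optional field whose value is unknown (nil) instead of emitting null.
# A known `false` is kept, since "not direct" is real information.
def subsidiary_record(required, optional = {})
  required.merge(optional.reject { |_key, value| value.nil? })
end

record = subsidiary_record(
  {
    "source_url" => "http://somewhere.com/a/4/23",
    "parent_name" => "Sterling Advisors Limited",
    "parent_jurisdiction" => "United Kingdom",
    "subsidiary_name" => "Sterling Widgets",
    "subsidiary_jurisdiction" => "United Kingdom",
    "sample_date" => "2014/04/01"
  },
  {
    "direct" => nil,                   # unknown, so it is omitted
    "percentage_controlled" => "55.4"
  }
)
puts JSON.generate(record)
```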
data_type: "gazette-notice"
A notice published in a government gazette.
You may consult either the browsable or the raw JSON Schema for gazette notices.