title Securities and Exchange Commission of Cambodia (cambodia_securities) running
description Bot for scraping Securities and Exchange Commission data at http://www.secc.gov.kh/english/m51.php?pn=6
current run state scraping or awaiting scrape, docker status
last run single run snapshot 13 scrape succeeded on March 04, 2017 09:26
next run enqueued for a run, docker status
created by helenst (Helen ST)
last reviewed by peter.evans
State changed to running for run #10126, snapshot 14
commented about 7 years ago

A run started

State changed to scheduled for run #10126, snapshot 14
commented about 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented about 7 years ago

State changed to ingesting_data for run #10054, snapshot 13
commented about 7 years ago

The run's output is being ingested

State changed to storing_data for run #10054, snapshot 13
commented about 7 years ago

The run's output is being stored

State changed to running for run #10054, snapshot 13
commented about 7 years ago

A run started

State changed to scheduled for run #10054, snapshot 13
commented about 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented about 7 years ago

State changed to ingesting_data for run #9821, snapshot 12
commented about 7 years ago

The run's output is being ingested

State changed to storing_data for run #9821, snapshot 12
commented about 7 years ago

The run's output is being stored

State changed to running for run #9821, snapshot 12
commented about 7 years ago

A run started

State changed to scheduled for run #9821, snapshot 12
commented over 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented over 7 years ago

State changed to ingesting_data for run #9540, snapshot 11
commented over 7 years ago

The run's output is being ingested

State changed to storing_data for run #9540, snapshot 11
commented over 7 years ago

The run's output is being stored

State changed to running for run #9540, snapshot 11
commented over 7 years ago

A run started

State changed to scheduled for run #9540, snapshot 11
commented over 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented over 7 years ago

State changed to ingesting_data for run #9300, snapshot 10
commented over 7 years ago

The run's output is being ingested

State changed to storing_data for run #9300, snapshot 10
commented over 7 years ago

The run's output is being stored

State changed to running for run #9300, snapshot 10
commented over 7 years ago

A run started

State changed to scheduled for run #9300, snapshot 10
commented over 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented over 7 years ago

State changed to ingesting_data for run #9029, snapshot 9
commented over 7 years ago

The run's output is being ingested

State changed to storing_data for run #9029, snapshot 9
commented over 7 years ago

The run's output is being stored

State changed to running for run #9029, snapshot 9
commented over 7 years ago

A run started

State changed to scheduled for run #9029, snapshot 9
commented over 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented over 7 years ago

State changed to ingesting_data for run #8749, snapshot 8
commented over 7 years ago

The run's output is being ingested

State changed to storing_data for run #8749, snapshot 8
commented over 7 years ago

The run's output is being stored

State changed to running for run #8749, snapshot 8
commented over 7 years ago

A run started

State changed to scheduled for run #8749, snapshot 8
commented over 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented over 7 years ago

State changed to ingesting_data for run #8437, snapshot 7
commented over 7 years ago

The run's output is being ingested

State changed to storing_data for run #8437, snapshot 7
commented over 7 years ago

The run's output is being stored

State changed to running for run #8437, snapshot 7
commented over 7 years ago

A run started

State changed to scheduled for run #8437, snapshot 7
commented over 7 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented over 7 years ago

State changed to ingesting_data for run #8056, snapshot 6
commented over 7 years ago

The run's output is being ingested

State changed to storing_data for run #8056, snapshot 6
commented over 7 years ago

The run's output is being stored

State changed to running for run #8056, snapshot 6
commented over 7 years ago

A run started

State changed to scheduled for run #8056, snapshot 6
commented almost 8 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented almost 8 years ago

State changed to ingesting_data for run #7650, snapshot 5
commented almost 8 years ago

The run's output is being ingested

State changed to storing_data for run #7650, snapshot 5
commented almost 8 years ago

The run's output is being stored

State changed to running for run #7650, snapshot 5
commented almost 8 years ago

A run started

State changed to scheduled for run #7650, snapshot 5
commented almost 8 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented almost 8 years ago

State changed to ingesting_data for run #7390, snapshot 4
commented almost 8 years ago

The run's output is being ingested

State changed to storing_data for run #7390, snapshot 4
commented almost 8 years ago

The run's output is being stored

State changed to running
commented almost 8 years ago

For run #7390:
A run started

State changed to scheduled
commented almost 8 years ago

For run #7390:
A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented almost 8 years ago

State changed to ingesting_data
commented almost 8 years ago

For run #6791:
The run's output is being ingested

State changed to storing_data
commented almost 8 years ago

For run #6791:
The run's output is being stored

State changed to running
commented almost 8 years ago

For run #6791:
A run started

Bot state update
commented about 8 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented about 8 years ago

Bot state update
commented about 8 years ago

A run finished; its output is now being processed

Bot state update
commented about 8 years ago

A run started

Bot state update
commented about 8 years ago

A snapshot completed; scheduling the first run of the next snapshot

Saved vars cleared
helenst commented about 8 years ago

Bot state update
peter.evans commented about 8 years ago

The bot was accepted; starting run to ingest reviewed data

Bot update
commented about 8 years ago

A manual prescrape using data from /oc/openc/scrapers/output/draft/c/cambodia_securities/6478 was started

Bot state update
commented about 8 years ago

A draft run succeeded; sending for final review

Bot state update
peter.evans commented about 8 years ago

A moderator has approved the draft bot; running a full draft for final review

Saved vars cleared
helenst commented about 8 years ago

Bot state update
peter.evans commented about 8 years ago

A moderator has started reviewing the draft bot

(no subject)
peter.evans commented over 8 years ago

Great! Hopefully that's the last little fix :)
I wonder if we can connect with someone who speaks Thai on Slack ... Hera is pretty good at finding help with that sort of thing.
Cheers
Peter

Re: (turbot bot [cambodia_securities])
helenst commented over 8 years ago

Of course... done!
Work on the Thai bot is mainly getting my head around processing at the
moment since (like many of these sites) they're not well structured :-)
Will undoubtedly have questions when it comes to looking at the meaning of
all that data!
thanks
Helen

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

(no subject)
peter.evans commented over 8 years ago

Hi Helen,
Thanks for this, just gave it another check and realised that I'd overlooked the identifying_fields for the primary data which should also reflect this change.
"identifying_fields": [
  "name",
  "source_url"
],
If you don't mind editing this then we'll get the bot accepted.
Looking forward to having a look at some Thai securities output. Do feel free to get in touch here or on Slack if you have any questions about the data source or Turbot.
Thanks
Peter

Re: (turbot bot [cambodia_securities])
helenst commented over 8 years ago

Hi Peter
No worries, I was just making the most of a bit of quiet time over
Christmas! Hope you had a good break.
I have changed the identifying fields as suggested... glad everything else
seems OK and I feel like I'm starting to get the hang of it! I've already
made a start on my next bot which is Thai securities and is a bit more
involved - might be a little while before you start seeing anything of that!
Cheers and thanks for the feedback,
Helen

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

(no subject)
peter.evans commented over 8 years ago

Hi Helen,
Thanks for adding a licence transformer here, and apologies for being out of touch over Christmas and New Year; I've been on leave. Hope you had a good festive time!
The transformed output for this scraper is looking absolutely fine, which is great.
You're right that making the identifying_fields more specific is preferable to de-duping the output. In this case I think the permissions field would be a slightly better one to use than source_url, as below
"identifying_fields": [
  "permissions",
  "licence_holder.entity_properties.name"
]
The reason for this is that the URL of a record may change while the data remains the same; in that instance Turbot would treat the updated record as a new record. Using permissions instead isn't an ideal solution, but I think it's much less open to problems than using a URL.
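
To illustrate the point about identifying fields, here is a minimal Python sketch. It is not Turbot's actual implementation: `get_path`, `identity_key`, the hashing scheme, and the sample record are all hypothetical. It only shows why a key built from `source_url` changes when a page moves, while a key built from `permissions` and the holder name does not.

```python
import hashlib
import json


def get_path(record, dotted_path):
    """Follow a dotted path such as 'licence_holder.entity_properties.name'."""
    value = record
    for part in dotted_path.split("."):
        value = value[part]
    return value


def identity_key(record, identifying_fields):
    """Hypothetical record identity: a hash of the identifying field values."""
    values = [get_path(record, field) for field in identifying_fields]
    payload = json.dumps(values, sort_keys=True, ensure_ascii=False)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Made-up record; only the field names come from the thread.
old = {
    "source_url": "http://example.com/licences?page=1",
    "permissions": ["Market Operators"],
    "licence_holder": {"entity_properties": {"name": "Example Securities Plc"}},
}
# Same data, but the publisher has moved the page.
moved = dict(old, source_url="http://example.com/licences?page=2")

stable_fields = ["permissions", "licence_holder.entity_properties.name"]
url_fields = ["licence_holder.entity_properties.name", "source_url"]

print(identity_key(old, stable_fields) == identity_key(moved, stable_fields))
print(identity_key(old, url_fields) == identity_key(moved, url_fields))
```

With the stable fields the two keys match, so the moved record would be recognised as the same licence; with the URL-based fields they differ.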
Anyway otherwise I think the bot is ready to accept & the data should then be fine to go straight into opencorporates, which is great :)
All the best,
Peter

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Re: (turbot bot [cambodia_securities])
helenst commented over 8 years ago

Hi Peter
Good to see you last week too, it was nice to have the chance to chat with
everybody.
I've made a start on the transformer script for the licence schema and have
submitted it. I had one case where I found the same bank (Acleda) was
listed on two different pages - since these are different categories I
figured it might be useful information so added the source URL as an
identifying field rather than de-duping, but I'd be interested in your
feedback about that - and of course anything else! Let me know what you
think!
Thanks and have a great Christmas!
Helen

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Bot state update
commented over 8 years ago

A draft run failed

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

(no subject)
peter.evans commented over 8 years ago

Hi Helen,
Thanks for this & good to see you yesterday, I thought that that was a fun lunch.
So the scraper output for the Cambodia bot is looking fine now, which is great! Next up, if you have time, would be to add a transformer. You can choose which transformer to apply, but the best one to use for licences is the `licence` schema, as that should allow us to transform all of the scraped fields. The `simple-licence` schema is also fine, but we should expect not to be able to display quite as much in OpenCorporates.
You've probably spotted this already but this is a good example of how to add a transformer: http://turbot.opencorporates.com/docs/examples#structured-bots
& the licence schema is explained a little in the docs: http://turbot.opencorporates.com/docs/supported_data_types#licence and http://turbot.opencorporates.com/docs/extended_licence_example
The schema itself can be found here: https://github.com/openc/openc-schema/blob/master/schemas/licence-schema.json - If anything is not clear anywhere please do let me know, I'm sure there are some places where we could refine the documentation, so it would be great to get feedback on that.
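
As a rough idea of what such a transformer could look like, here is a hedged Python sketch. It assumes the transformer receives one scraped JSON record per line on stdin and prints one transformed record per line, and that the scraped records carry `name`, `type` and `source_url`. Only `permissions` and `licence_holder.entity_properties.name` are confirmed by this bot's config; the `entity_type` value and everything else about the output shape are assumptions to check against the licence schema linked above.

```python
import json
import sys


def to_licence(raw):
    """Map one scraped record onto a licence-shaped dict (illustrative only)."""
    return {
        "source_url": raw["source_url"],
        "licence_holder": {
            "entity_type": "company",  # assumption, not taken from the thread
            "entity_properties": {"name": raw["name"]},
        },
        "permissions": [raw["type"]],
    }


def transform_lines(lines):
    """Yield one JSON string per non-empty input line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        yield json.dumps(to_licence(json.loads(line)))


if __name__ == "__main__":
    for transformed in transform_lines(sys.stdin):
        print(transformed)
```

The real schema may well require further fields, so the output of a sketch like this would still need validating against it.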
Thanks! & feel free to be in touch here or on Slack.
Peter

Re: (turbot bot [cambodia_securities])
helenst commented over 8 years ago

OK, I think I have got those fields behaving!
Resubmitted.
I have restricted it to those four pages, and moved the category / type
fields into a single concatenated field.
Look forward to your feedback!
Helen

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Bot state update
commented over 8 years ago

A draft run failed

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

(no subject)
peter.evans commented over 8 years ago

Hi Helen,
Thank you for those responses. As usual, it sounds like the main job is understanding the nature of the quirky data source.
- I'm not sure why your code change to normalise email / e-mail etc. would not work on the site. Have you pushed the latest code? If so, perhaps there is something in the normalisation that is not working on the server. If you paste a snippet, I could run it by one of the devs here.
- I would say that the only relevant pages to scrape would be these:
Intermediaries
Market Operators
Cash Settlement Agent
Securities Registrar, Transfer Agent & Paying Agent
I think those are the only financial licences, the others seem more like supporting services in the same sector.
- I agree that the headings from the sidebar would make better "type" fields. Also, I think for "Intermediaries" we could concatenate the category and the type, e.g. "Intermediaries - Securities underwriters".
That's it! A few small changes, I think, and we can look at the transformer & start getting data into OpenCorporates.
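
The concatenation suggested in the last point could be sketched in Python as follows. The function name `licence_type` and its arguments are hypothetical, and the sketch assumes the category is empty on pages without table subheadings.

```python
def licence_type(sidebar_heading, category=None):
    """Combine the sidebar heading with an optional table subheading."""
    heading = sidebar_heading.strip()
    category = (category or "").strip()
    if category:
        return "{} - {}".format(heading, category)
    # Pages without subheadings keep the sidebar heading on its own.
    return heading


print(licence_type("Intermediaries", "Securities underwriters"))
print(licence_type("Market Operators"))
```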
Thanks,
Peter

Re: (turbot bot [cambodia_securities])
helenst commented over 8 years ago

Hi Peter
Thank you for your feedback.
* I'm not sure what is going on with the "email" and "e-mail" fields. My
scraper was originally outputting both of those as it's different on
different entries, but then I added some code to normalize them - same for
"web" and "website". My scraper script output for that is looking fine now,
but the output on the site isn't. (it should be just "website" and
"email"). I'm not quite sure how to get this showing correctly on the site.
* With regard to the categories - this is where I am a little limited in my
understanding of the content - can I check which of the pages in the
sidebar of http://www.secc.gov.kh/english/m52.php?pn=6 you think would be
relevant? Obviously the first couple of pages, but things like Printing
services perhaps not?
* The 'type' field is coming from the page heading, which seems to be
"Market Operators" for three of those pages - possibly an oversight in the
site content itself, it may be better for me to take that from the sidebar.
* The "category" field is coming from (at least on the first page) the
subheadings within the tables.... but since that only applies on one page,
the rest are blank.
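
For reference, the key normalisation described in the first point might look roughly like this in Python. This is a sketch, not the bot's actual code: `KEY_ALIASES` and `normalise_keys` are hypothetical names, and lower-casing the labels is an assumption about how they vary on the site.

```python
# Map variant labels onto one canonical key each.
KEY_ALIASES = {"e-mail": "email", "web": "website"}


def normalise_keys(record):
    """Return a copy of record with variant keys merged into canonical ones."""
    normalised = {}
    for key, value in record.items():
        canonical = key.strip().lower()
        canonical = KEY_ALIASES.get(canonical, canonical)
        # If two variants collide, keep the first non-empty value.
        if normalised.get(canonical):
            continue
        normalised[canonical] = value
    return normalised


print(normalise_keys({"E-mail": "info@example.com", "Web": "http://example.com"}))
```

Applying this to every record just before it is written out means downstream output only ever sees "email" and "website".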
I was doing this at the meetup the other week so ran out of time but I'll
continue and tidy up some loose ends. I'd be happy to do a transformer
after the scraper's complete!
Helen

(no subject)
peter.evans commented over 8 years ago

Hi Helen,
Thank you for pushing your code for the bot scraping the Securities and Exchange Commission of Cambodia. The output of the scraper is looking good. There are a couple of small things that we should address, and then if possible I think we should add a transformer so that we can output the data in a standardised format, which will allow us to import it into our main OpenCorporates database.
Let's look at the scraper output first:
I think generally this is fine. We seem to be outputting 2 email fields, `email` and `e-mail`, which we could possibly combine into one. We could do this in the transformer though, if that's easier.
Secondly, I'm a little confused about the licence types that are being scraped. We have a general category in the `type` field of Intermediaries and Market Operators, but we seem to be taking records from pages beyond just those two types?
Also, we have sub-categories like Securities Underwriters and Securities Dealers in the `category` field, but some of these seem to be blank. Is that intentional, i.e. were those sub-categories not available on the website?
Apologies if I've missed something obvious there! Thanks for your work on this scraper!
Peter

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Bot state update
commented over 8 years ago

Run succeeded; sending for draft review

Bot state update
helenst commented over 8 years ago

The bot was pushed; scheduling a draft run

Run history

event metadata
single run snapshot draft scrape succeeded on November 24, 2015 19:59 18 rows in less than a minute
single run snapshot draft scrape succeeded on November 24, 2015 20:08 18 rows in less than a minute
single run snapshot draft scrape succeeded on November 24, 2015 20:12 18 rows in less than a minute
single run snapshot draft scrape succeeded on November 24, 2015 20:15 18 rows in less than a minute
single run snapshot draft scrape succeeded on November 24, 2015 20:31 18 rows in less than a minute
single run snapshot draft scrape succeeded on November 24, 2015 20:39 18 rows in less than a minute
single run snapshot draft scrape failed on December 14, 2015 08:22 0 rows in less than a minute
single run snapshot draft scrape succeeded on December 14, 2015 08:30 18 rows in less than a minute
single run snapshot draft scrape failed on December 23, 2015 19:52 0 rows in less than a minute
single run snapshot draft scrape succeeded on December 23, 2015 19:56 18 rows in less than a minute
single run snapshot draft scrape succeeded on December 29, 2015 17:26 18 rows in less than a minute
single run snapshot draft scrape succeeded on January 08, 2016 08:56 18 rows in less than a minute
single run snapshot draft scrape succeeded on January 08, 2016 15:42 18 rows in less than a minute
single run snapshot final draft scrape succeeded on March 04, 2016 09:25 18 rows in less than a minute
single run snapshot 1 prescrape scrape succeeded on March 04, 2016 09:26 18 rows in less than a minute
single run snapshot 2 scrape succeeded on April 04, 2016 09:26 18 rows in 1 minute
single run snapshot 3 scrape succeeded on May 04, 2016 09:26 18 rows in less than a minute
single run snapshot 4 scrape succeeded on June 04, 2016 09:26 18 rows in less than a minute
single run snapshot 5 scrape succeeded on July 04, 2016 09:26 18 rows in less than a minute
single run snapshot 6 scrape succeeded on August 04, 2016 09:26 18 rows in less than a minute
single run snapshot 7 scrape succeeded on September 04, 2016 09:26 18 rows in less than a minute
single run snapshot 8 scrape succeeded on October 04, 2016 09:26 18 rows in less than a minute
single run snapshot 9 scrape succeeded on November 04, 2016 09:26 18 rows in less than a minute
single run snapshot 10 scrape succeeded on December 04, 2016 09:26 18 rows in less than a minute
single run snapshot 11 scrape succeeded on January 04, 2017 09:26 28 rows in less than a minute
single run snapshot 12 scrape succeeded on February 04, 2017 09:26 28 rows in less than a minute
single run snapshot 13 scrape succeeded on March 04, 2017 09:26 28 rows in less than a minute
single run snapshot 14 scrape scheduled on April 04, 2017 09:26 0 rows

Config

{
  "bot_id": "cambodia_securities",
  "title": "Securities and Exchange Commission of Cambodia",
  "description": "Bot for scraping Securities and Exchange Commission data at http://www.secc.gov.kh/english/m51.php?pn=6",
  "language": "python",
  "data_type": "primary data",
  "identifying_fields": [
    "name",
    "type"
  ],
  "files": [
    "scraper.py",
    "licence_transformer.py"
  ],
  "transformers": [
    {
      "file": "licence_transformer.py",
      "data_type": "licence",
      "identifying_fields": [
        "permissions",
        "licence_holder.entity_properties.name"
      ]
    }
  ],
  "frequency": "monthly",
  "publisher": {
    "name": "Securities and Exchange Commission of Cambodia",
    "url": "http://www.secc.gov.kh/",
    "terms": "n/a",
    "terms_url": "n/a"
  },
  "public_repo_url": "http://github.com/helenst/turbot-securities-cambodia"
}