Code: Scraping Reference

chrislkeller edited this page Oct 20, 2014 · 1 revision

DATA SOURCES:


QUALIFICATION PROCESS - ORDER OF STEPS

  1. attorney-general-information: pending title/summary at Attorney General's office
  2. cleared-for-circulation
  3. pending-signature-verification
  4. failed-to-qualify
  5. qualified-ballot-measures

Note: if an initiative appears on more than one list on the SoS site, set its status to the latest step in the process above. Example: an initiative on both the "pending signature verification" and "failed to qualify" lists takes the "failed to qualify" status.
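
The precedence rule in the note can be sketched roughly as follows. This is a minimal sketch, not existing project code: the names `STATUS_ORDER` and `latest_status` are hypothetical, and it assumes statuses are stored as the page slugs from the list above.

```python
# Order of steps copied from the qualification process list; later entries win.
STATUS_ORDER = [
    "attorney-general-information",
    "cleared-for-circulation",
    "pending-signature-verification",
    "failed-to-qualify",
    "qualified-ballot-measures",
]


def latest_status(statuses):
    """Return the status that sits furthest along in STATUS_ORDER.

    Raises ValueError for an empty list or an unknown status.
    """
    return max(statuses, key=STATUS_ORDER.index)
```

Calling `latest_status(["pending-signature-verification", "failed-to-qualify"])` picks "failed-to-qualify", matching the example in the note.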


URL STRUCTURE FOR SEC STATE SCRAPING:

  • BASE_URL = http://www.sos.ca.gov/elections/ballot-measures/
  • QUALIFIED = qualified-ballot-measures
  • PENDING = pending-signature-verification
  • CLEARED = cleared-for-circulation
  • FAILED = failed-to-qualify
  • PENDING AG OFFICE = attorney-general-information

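Each full URL is BASE_URL plus the page name plus '.htm', e.g. http://www.sos.ca.gov/elections/ballot-measures/qualified-ballot-measures.htm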
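
One possible way to hold these values in code, as a sketch only. The dictionary name `SOS_PAGES` and the helper `sos_url` are hypothetical; the base URL and page names are the ones listed above.

```python
BASE_URL = "http://www.sos.ca.gov/elections/ballot-measures/"

# Page names as listed above, keyed by a short label.
SOS_PAGES = {
    "QUALIFIED": "qualified-ballot-measures",
    "PENDING": "pending-signature-verification",
    "CLEARED": "cleared-for-circulation",
    "FAILED": "failed-to-qualify",
    "PENDING_AG_OFFICE": "attorney-general-information",
}


def sos_url(key):
    """Build the full SoS URL for one list page, including the '.htm' suffix."""
    return BASE_URL + SOS_PAGES[key] + ".htm"
```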


URL STRUCTURE FOR ATTORNEY GENERAL SCRAPING:

  • BASE_URL = http://oag.ca.gov/initiatives/
  • QUALIFIED = qualified-for-ballot
    • as is: current qualified measures (monitor this: at least one initiative qualified in 2014, but the AG is not using the 2014 filter)
    • ?field_initiative_date_value%5Bvalue%5D%5Byear%5D=: takes a 4-digit year; for previous election cycles? (unclear if we'll need this)
  • ACTIVE = active-measures
  • INACTIVE = inactive-measures
    • as is: defaults to current year
    • ?field_initiative_date_value%5Bvalue%5D%5Byear%5D=: filtered for specific year
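
A rough sketch of building the AG URLs with the optional year filter. The helper name `ag_url` is hypothetical; the query parameter is the decoded form of the one shown above, and `urlencode` re-encodes the brackets as %5B and %5D.

```python
from urllib.parse import urlencode

AG_BASE_URL = "http://oag.ca.gov/initiatives/"

# Decoded form of field_initiative_date_value%5Bvalue%5D%5Byear%5D
YEAR_PARAM = "field_initiative_date_value[value][year]"


def ag_url(page, year=None):
    """Build an AG URL; with no year the page is requested 'as is'."""
    url = AG_BASE_URL + page
    if year is not None:
        url += "?" + urlencode({YEAR_PARAM: year})
    return url
```

For example, `ag_url("inactive-measures", 2014)` requests the inactive list filtered for 2014, while `ag_url("inactive-measures")` leaves the default in place.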

URL STRUCTURE FOR LEGISLATIVE ANALYST SCRAPING:

Need to document...


URL STRUCTURE FOR DOCUMENTS:


STRATEGY FOR UPDATES:

What is the step-by-step strategy for pulling and updating data from Sec State and AG?

  • Match by ID with fallback steps (SOS ID, then AG ID, then prop #?)
  • Overwrite all data with newest, or selectively?
  • when would you update?
    • if status has changed
    • for "pending signature verification" initiatives, if random sample update is newer
    • when prop # has been assigned
  • what would you update? the following, only if different from current data:
    • for "pending signature verification" signature count updates:
      • status
      • date_sample_updated
      • sig_count_link
      • id_note
      • date_raw_count_due
    • for "failed to qualify":
      • status
      • date_failed
    • for "pending AG review" and "cleared for circulation":
      • status
      • date_circulation_deadline
      • date_sum
      • ag_id
      • sos_id
      • id_note
      • title
      • summary
      • full_text_link
      • proponent
      • email
      • phone
      • sigs_req
      • fiscal_impact_link
    • for "qualified":
      • prop_num
      • id_note
      • status
      • date_sample_update
      • sig_count_link
      • date_qualified
      • election
      • full_text_link (again when official voter guide avail.?)
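
The "only if different from current data" idea could look something like this. It is a sketch under assumptions: records are plain dicts, statuses are keyed by page slug, and `FIELDS_BY_STATUS` and `changed_fields` are hypothetical names. Only two of the field groups above are shown; the others would follow the same pattern.

```python
# Field names copied from the lists above; the dict layout is an assumption.
FIELDS_BY_STATUS = {
    "pending-signature-verification": [
        "status",
        "date_sample_updated",
        "sig_count_link",
        "id_note",
        "date_raw_count_due",
    ],
    "failed-to-qualify": ["status", "date_failed"],
}


def changed_fields(current, scraped, status):
    """Return {field: new_value} for tracked fields whose scraped value differs."""
    changes = {}
    for field in FIELDS_BY_STATUS[status]:
        if field in scraped and scraped[field] != current.get(field):
            changes[field] = scraped[field]
    return changes
```

An empty result means nothing needs to be written, which also answers part of the "when would you update?" question above.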

NOTES ON DEADLINES AND UPDATE SCHEDULES:

  • "failed to qualify" initiatives stay on the SoS site for only 60 DAYS. Data must be scraped and stored; an initiative's absence from this page doesn't necessarily mean its status has changed.
  • "cleared for circulation" initiatives fail 9 DAYS after the circulation deadline if counties haven't notified the state that signatures were submitted.
  • "pending signature verification" initiatives are updated regularly with new random sample spreadsheets. More than 110% valid signatures in a random sample qualifies automatically; 95% to 110% needs a full check; less than 95% fails. SPECIAL HANDLING NEEDED: CHECK DATE ON LATEST RANDOM SAMPLE TO DETERMINE IF UPDATE NEEDED.
  • "pending AG review" initiatives: this is the first step in the process. The proponent has submitted a proposal, but the initiative doesn't have a title or summary yet. The AG takes roughly 60 DAYS to create the title and summary. There may already be a document associated at this point, so users could at least read the full proposal/draft initiative.
  • "qualified ballot measures" require special scraping and change from SoS/AG id numbers to official Proposition numbers.
    • initial state: no prop numbers, final check signatures still available
    • transitional: prop numbers AND id numbers during 'public display' period…how long?
    • final: prop numbers ONLY, NO id numbers
    • the best time to connect the two appears to be the transitional period, when both the prop number and the SOS/AG ID are listed
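
The random sample thresholds and the "check date on latest random sample" rule above might be sketched like this. Two assumptions to flag: the percentages are taken to mean projected valid signatures relative to the required count (`sigs_req`), and the function names and return labels are hypothetical. Integer arithmetic is used so the 95% and 110% boundaries are exact.

```python
from datetime import date


def classify_random_sample(projected_valid, sigs_req):
    """Apply the 95% / 110% thresholds from the notes above."""
    if projected_valid * 100 > sigs_req * 110:
        return "qualified"
    if projected_valid * 100 < sigs_req * 95:
        return "failed"
    return "full-check"  # 95% to 110% inclusive


def sample_update_needed(stored_date, scraped_date):
    """True when the scraped random sample is newer than the one on record."""
    return stored_date is None or scraped_date > stored_date
```

With `sigs_req=100`, a projection of 111 returns "qualified", 110 or 95 returns "full-check", and 94 returns "failed".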

HANDLING REFERENDA:

  • may need manual update when prop # is assigned, but some alerts could help us know when to do that:
    • flag in admin on any referenda so that producers know they will need to manually update
    • email alert when any Prop # shows up on the SoS page without a clear connection to a current initiative in the database. This notifies the producer to check the content of the Prop to see which initiative it came from; the intro to the measure's full text should make that clear.
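
A possible shape for the "no clear connection" check, as a sketch only. It assumes each scraped row from the qualified page is a dict with `prop_num` and, when the page still shows them, `sos_id` and/or `ag_id`; the function name and the ID values in the usage note are made up for illustration.

```python
def props_needing_review(scraped_rows, known_ids):
    """Return prop numbers whose row carries no ID that matches the database."""
    flagged = []
    for row in scraped_rows:
        # Collect whichever IDs the page still lists for this prop.
        ids = {row.get("sos_id"), row.get("ag_id")} - {None}
        if ids.isdisjoint(known_ids):
            flagged.append(row["prop_num"])
    return flagged
```

For instance, with `known_ids={"sos-1"}`, the rows `[{"prop_num": "1", "sos_id": "sos-1"}, {"prop_num": "2"}]` would flag only prop "2" for a producer to check by hand.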

EMAIL ALERTS:

  • when any initiative is added
  • when any initiative changes status
  • special notice when referendum is added: action needed
  • when prop # is assigned that doesn't match a current initiative (may need match to referendum): action needed
  • when new random sample update on signatures is provided

These could also be useful for our political reporters.
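
Several of the alert triggers listed above can be derived by comparing the stored record with the freshly scraped one. The following is a sketch under assumptions: records are dicts, `is_referendum` is a hypothetical flag, and the alert strings are placeholders. The unmatched prop # alert needs page-level data and is left out here.

```python
def alerts_for(old, new):
    """old: stored record dict, or None if unseen; new: freshly scraped dict."""
    if old is None:
        alerts = ["initiative added"]
        if new.get("is_referendum"):  # hypothetical flag
            alerts.append("referendum added: action needed")
        return alerts

    alerts = []
    if new.get("status") != old.get("status"):
        alerts.append("status changed")
    new_sample = new.get("date_sample_updated")
    if new_sample and new_sample != old.get("date_sample_updated"):
        alerts.append("new random sample update")
    return alerts
```

Returning a list rather than sending mail directly keeps the decision separate from delivery, so the same output could feed a reporter-facing digest as well as producer emails.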