Code: Scraping Reference
chrislkeller edited this page Oct 20, 2014 · 1 revision
- Secretary of State (primary source for initiatives and status updates)
- Attorney General (for active, inactive and historical initiatives back to 2005, and for linking initiative IDs to Prop #s)
- UC Hastings College of Law (for full historical info on initiatives, including qualified/failed, proponent, full text, etc.)
- Legislative Analyst's Office (for fiscal impact, background, analysis, for pre-ballot initiatives and propositions)
- attorney-general-information: pending title/summary at Attorney General's office
- cleared-for-circulation
- pending-signature-verification
- failed-to-qualify
- qualified-ballot-measures
Note: if an initiative appears in two places on the SoS site, set its status from the latest step in the process. Example: an initiative on both the "pending signature verification" and "failed to qualify" lists takes the "failed to qualify" status.
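A minimal sketch of that rule, assuming the slugs listed above are used as status values. The numeric ranks are an assumption inferred from the note, not something the SoS site publishes; "failed" and "qualified" are both treated as end states.

```python
# Assumed rank of each step in the process (higher = later).
STATUS_RANK = {
    "attorney-general-information": 0,
    "cleared-for-circulation": 1,
    "pending-signature-verification": 2,
    "failed-to-qualify": 3,
    "qualified-ballot-measures": 3,
}


def latest_status(statuses):
    """Return the status that is furthest along in the process."""
    return max(statuses, key=lambda s: STATUS_RANK[s])


# The example from the note: listed as both pending and failed.
print(latest_status(["pending-signature-verification", "failed-to-qualify"]))
```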
- BASE_URL = http://www.sos.ca.gov/elections/ballot-measures/
- QUALIFIED = qualified-ballot-measures
- PENDING = pending-signature-verification
- CLEARED = cleared-for-circulation
- FAILED = failed-to-qualify
- PENDING AG OFFICE = attorney-general-information
...appending '.htm' to complete each URL
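The URL construction above could be sketched like this. The dictionary name `SOS_PAGES` and the helper `sos_url` are hypothetical; the base URL and slugs are copied from the list.

```python
SOS_BASE_URL = "http://www.sos.ca.gov/elections/ballot-measures/"

# Page slugs as listed above, keyed by a short name.
SOS_PAGES = {
    "QUALIFIED": "qualified-ballot-measures",
    "PENDING": "pending-signature-verification",
    "CLEARED": "cleared-for-circulation",
    "FAILED": "failed-to-qualify",
    "PENDING_AG_OFFICE": "attorney-general-information",
}


def sos_url(key):
    """Build the full SoS page URL by appending '.htm' to the slug."""
    return SOS_BASE_URL + SOS_PAGES[key] + ".htm"


for key in SOS_PAGES:
    print(key, sos_url(key))
```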
- BASE_URL = http://oag.ca.gov/initiatives/
- QUALIFIED = qualified-for-ballot
  - as is: current qualified measures (monitor this: at least one initiative qualified 2014 but AG not using 2014 filter)
  - ?field_initiative_date_value%5Bvalue%5D%5Byear%5D=: value is a 4-digit year; for previous election cycles? (unclear if we'll need this)
- ACTIVE = active-measures
- INACTIVE = inactive-measures
  - as is: defaults to current year
  - ?field_initiative_date_value%5Bvalue%5D%5Byear%5D=: filtered for a specific year
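A small sketch of building the AG URLs with the optional year filter. The helper name `ag_url` is hypothetical; `urlencode` is used so the square brackets in the parameter name come out percent-encoded in the same form as the query string shown above.

```python
from urllib.parse import urlencode

AG_BASE_URL = "http://oag.ca.gov/initiatives/"

# Decoded form of field_initiative_date_value%5Bvalue%5D%5Byear%5D
YEAR_PARAM = "field_initiative_date_value[value][year]"


def ag_url(page, year=None):
    """Return the AG listing URL, optionally filtered to a 4-digit year."""
    url = AG_BASE_URL + page
    if year is not None:
        url += "?" + urlencode({YEAR_PARAM: year})
    return url


print(ag_url("active-measures"))
print(ag_url("inactive-measures", 2012))
```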
Need to document...
- Signature Count
  - BASE URL = http://www.sos.ca.gov/elections/pend_sig/init-sample-<SOS_ID>-.pdf
  - <SOS_ID> = 4-digit Secretary of State initiative ID ("####")
  - date segment, apparently following the final hyphen in the URL above = MMDDYY
- Full Text of Initiative
  - URL is difficult to predict: unique combination of initiative ID/keywords
  - in some cases the link may be partial, e.g. '/elections/ballot-measures/pdf/sca-17.pdf'
  - if so, prefix with 'http://www.sos.ca.gov'?
- Fiscal Impact Estimate Report
  - BASE URL = http://oag.ca.gov/system/files/initiatives/pdfs/fiscal-impact-estimate-report%28<AG_ID>%29.pdf?
  - <AG_ID> = 6-digit Attorney General initiative ID ("##-####")
  - NOTE: monitor to be sure this is consistent. I think I've seen URLs end in "%2.pdf?" rather than "%29.pdf?", or a similar minor tweak
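Two of the link rules above lend themselves to a quick sketch: completing partial full-text links, and filling in the fiscal impact URL. The helper names are hypothetical, the AG ID "13-0001" is a made-up example value, and the trailing "?" from the pattern above is left off here on the assumption that it is an empty query string.

```python
import re
from urllib.parse import urljoin

SOS_ROOT = "http://www.sos.ca.gov"
FISCAL_TEMPLATE = (
    "http://oag.ca.gov/system/files/initiatives/pdfs/"
    "fiscal-impact-estimate-report%28{ag_id}%29.pdf"
)


def absolute_sos_link(href):
    """Prefix partial links with the SoS root; leave full URLs untouched."""
    return urljoin(SOS_ROOT, href)


def fiscal_impact_url(ag_id):
    """Fill the AG ID ("##-####") into the fiscal impact report URL."""
    if not re.fullmatch(r"\d{2}-\d{4}", ag_id):
        raise ValueError("unexpected AG ID format: %r" % ag_id)
    return FISCAL_TEMPLATE.format(ag_id=ag_id)


print(absolute_sos_link("/elections/ballot-measures/pdf/sca-17.pdf"))
print(fiscal_impact_url("13-0001"))  # hypothetical ID
```

Since the NOTE warns the pattern may vary, the generated fiscal impact URL should probably be checked with a request before it is stored.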
What is the step-by-step strategy for pulling and updating data from Sec State and AG?
- Match by ID with fail steps (SOS ID then AG ID, then prop #?)
- Overwrite all data with newest, or selectively?
- when would you update?
  - if status has changed
  - for "pending signature verification" initiatives, if random sample update is newer
  - when prop # has been assigned
- what would you update? the following, only if different from current data:
  - for "pending signature verification" signature count updates:
    - status
    - date_sample_updated
    - sig_count_link
    - id_note
    - date_raw_count_due
  - for "failed to qualify":
    - status
    - date_failed
  - for "pending AG review" and "cleared for circulation":
    - status
    - date_circulation_deadline
    - date_sum
    - ag_id
    - sos_id
    - id_note
    - title
    - summary
    - full_text_link
    - proponent
    - phone
    - sigs_req
    - fiscal_impact_link
  - for "qualified":
    - prop_num, id_note, status, date_sample_update, sig_count_link, date_qualified, election, full_text_link (again when official voter guide avail.?)
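One possible shape for the selective option, as a sketch only: keep a per-status list of fields and write just the ones whose scraped value differs from what is stored. The status keys reuse the SoS slugs, records are plain dicts, and `changed_fields` is a hypothetical helper; field names are spelled as in the lists above.

```python
CIRCULATION_FIELDS = [
    "status", "date_circulation_deadline", "date_sum", "ag_id", "sos_id",
    "id_note", "title", "summary", "full_text_link", "proponent", "phone",
    "sigs_req", "fiscal_impact_link",
]

FIELDS_BY_STATUS = {
    "pending-signature-verification": [
        "status", "date_sample_updated", "sig_count_link", "id_note",
        "date_raw_count_due",
    ],
    "failed-to-qualify": ["status", "date_failed"],
    "attorney-general-information": CIRCULATION_FIELDS,
    "cleared-for-circulation": CIRCULATION_FIELDS,
    "qualified-ballot-measures": [
        "prop_num", "id_note", "status", "date_sample_update",
        "sig_count_link", "date_qualified", "election", "full_text_link",
    ],
}


def changed_fields(current, scraped):
    """Return only the fields allowed for the scraped status that differ."""
    allowed = FIELDS_BY_STATUS.get(scraped.get("status"), [])
    return {
        field: scraped[field]
        for field in allowed
        if field in scraped and scraped[field] != current.get(field)
    }


current = {"status": "pending-signature-verification", "title": "Example"}
scraped = {"status": "failed-to-qualify", "date_failed": "2014-10-01",
           "title": "Changed title"}
print(changed_fields(current, scraped))
```

In this example the title change is ignored because "title" is not in the "failed to qualify" field list, so only status and date_failed would be written.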
NOTES ON DEADLINES AND UPDATE SCHEDULES:
- "failed to qualify" initiatives are only on the SoS site for 60 DAYS. Data must be scraped and stored, and the absence of an initiative from this page doesn't necessarily mean its status has changed.
- "cleared for circulation" initiatives fail 9 DAYS after the circulation deadline if counties haven't notified the state of signatures submitted.
- "pending signature verification" initiatives are updated regularly with new sample update spreadsheets. A random sample projecting more than 110% of the required valid signatures qualifies automatically. 95% to 110% needs a full check. Less than 95% fails. SPECIAL HANDLING NEEDED: CHECK DATE ON LATEST RANDOM SAMPLE TO DETERMINE IF UPDATE NEEDED.
- "pending AG review" initiatives: this is the first step in the process. It means the proponent has submitted a proposal but the initiative doesn't have a title or summary yet. The AG takes roughly 60 DAYS to create the title and summary. I think there may already be a document associated at this point, so users can at least read the full proposal/draft initiative.
- "qualified ballot measures" require special scraping and a change from SoS/AG id numbers to official Proposition numbers.
  - initial state: no prop numbers, final check signatures still available
  - transitional: prop numbers AND id numbers during 'public display' period…how long?
  - final: prop numbers ONLY, NO id numbers
  - the best time to connect the two appears to be during that transitional period, when both Prop Num and SOS/AG_ID are listed
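The random sample thresholds and the 9-day rule from the notes above can be sketched as two small helpers. This assumes the percentages are measured against the number of signatures required (sigs_req) and that the 9 days are calendar days; both function names are hypothetical.

```python
from datetime import date, timedelta


def classify_random_sample(projected_valid, sigs_req):
    """Apply the 110% / 95% thresholds to a random sample projection."""
    # Integer arithmetic avoids float rounding right at the thresholds.
    if projected_valid * 100 > sigs_req * 110:
        return "qualified"
    if projected_valid * 100 >= sigs_req * 95:
        return "full-check"
    return "failed"


def circulation_fail_date(deadline):
    """Date a 'cleared for circulation' initiative fails if no signatures
    have been reported (assumed: 9 calendar days after the deadline)."""
    return deadline + timedelta(days=9)


print(classify_random_sample(111, 100))
print(classify_random_sample(100, 100))
print(circulation_fail_date(date(2014, 10, 20)))
```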
HANDLING REFERENDA:
- may need manual update when prop assigned, but some alerts could help us know when to do that:
  - flag in admin on any referendum so that producers know they will need to manually update
  - email alert when any Prop # shows up on the SoS page that doesn't have a clear connection to a current initiative in the database. This will notify the producer that they need to check the content of the Prop to see which initiative it came from. They should be able to ascertain this by reading the intro to the measure's full text.
- when any initiative is added
- when any initiative changes status
- special notice when referendum is added: action needed
- when prop # is assigned that doesn't match a current initiative (may need match to referendum): action needed
- when new random sample update on signatures is provided
These could also be useful for our political reporters.
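As a rough sketch of the "prop # that doesn't match a current initiative" alert: compare the measures scraped from the qualified page against the SOS IDs already in the database and flag anything with a prop number but no matching ID. The record layout, the function name and the ID values below are all hypothetical.

```python
def props_needing_review(scraped_measures, known_sos_ids):
    """Return prop numbers that cannot be tied to a stored initiative."""
    flagged = []
    for measure in scraped_measures:
        has_prop = bool(measure.get("prop_num"))
        if has_prop and measure.get("sos_id") not in known_sos_ids:
            flagged.append(measure["prop_num"])
    return flagged


scraped = [
    {"prop_num": "45", "sos_id": "1599"},  # matches a stored initiative
    {"prop_num": "46", "sos_id": None},    # no ID listed: may be a referendum
]
print(props_needing_review(scraped, known_sos_ids={"1599"}))
```

Whatever this returns could feed the email alert, with the producer doing the final match by hand.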