
Set default older_than to tomorrow #1224


Merged: 14 commits merged into dev on Nov 14, 2023

Conversation

@krivard (Contributor) commented Jun 30, 2023

Summary:

We currently wait until a day is over before pulling covid_hosp input files posted that day. Unfortunately, the new healthdata.gov posting schedule for covid_hosp data is Fridays and Mondays at ~12:10pm. Forecasting would like to use the fresh Monday data for the Monday forecasts they submit at 3pm (Slack convo link). This means we should change the default pipeline behavior to permit same-day healthdata.gov files to be included in each update.

This PR updates the default selection criteria for which healthdata.gov files to import so that files posted today are included.

WARNING: if healthdata.gov posts a file after 12:30 on a Monday or a Friday, that file will never be picked up by the regular pipeline and must be handled with a manual invocation of the pipeline that selects for the day the late file was posted. 7/17 edit: a fix for this edge case is in progress, thanks george!

This is a semi-hotfix. The covid_hosp acquisition materials are in the midst of a massive overhaul, but we'd like to implement this behavior change before the new code will be ready, so this PR modifies the old code.
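For reference, here is roughly what the change amounts to (a minimal sketch, assuming the pipeline treats older_than as an exclusive upper bound on posting dates; the function name below is illustrative, not the real module layout):

```python
from datetime import date, timedelta

def default_older_than() -> date:
    # Sketch of the new default, not the actual diff: the pipeline selects
    # files with posting dates strictly before older_than. The old default of
    # date.today() excluded files posted today; defaulting to tomorrow makes
    # same-day healthdata.gov postings eligible for each update.
    return date.today() + timedelta(days=1)
```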

Prerequisites:

  • Unless it is a documentation hotfix, it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

@krivard requested review from melange396 and rzats, July 3, 2023 20:52
@melange396 (Collaborator) commented

this is untested in any way, and will probably need unit/integration test changes too.

with the commit i just pushed, it "should" now look back to the same day as the last issue to see if there were any other new data revisions issued, instead of just looking back to the next day after the last issue. any revisions from that last day that were already collected are skipped.

the new usage of the with database.connect() block could maybe be widened in scope (i.e., pulled outside of some number of the nested blocks), but i didn't want to keep the connection open while datasets are fetched and merged (which could potentially take a long time).

i think this also means, in normal usage, we should never again see the "no new issues; nothing to do" log message, but will now see the new log message "already collected revision:" on every run.
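In sketch form, the revised selection logic behaves something like this (function and attribute names here are assumptions for illustration, not the real module's API):

```python
def select_new_revisions(database, metadata):
    # keep the connection scope narrow: don't hold it open while datasets
    # are fetched and merged, since that can take a long time
    with database.connect() as db:
        last_issue = db.get_max_issue()  # most recent issue already acquired
        already_collected = db.get_revision_keys(since=last_issue)
    # look back to the *same day* as the last issue (>=), not just the day
    # after it (>), so revisions posted later on that same day are seen...
    candidates = [r for r in metadata if r.update_date >= last_issue]
    # ...and skip any revision from that day we already have, which is why
    # "already collected revision:" now logs on every normal run
    return [r for r in candidates if r.key not in already_collected]
```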

@melange396 mentioned this pull request Jul 14, 2023
@krivard (Contributor, Author) commented Jul 17, 2023

oh noooooo, this needs an ON DUPLICATE KEY UPDATE since we're no longer guaranteed to only insert once for each reference date + issue pair (see the upsert sketch after the test notes below)

unit tests:
* factor out mocks for "it's not there yet" / "it's already there" cases
* check both cases
* tests pass

integration tests (specifically state_daily):
* move second 3/15 entry and the 3/16 entry to a separate metadata file
* add new dataset file for 3/16 showing new day of data
* make sure first and second 3/15 entries have different data
* add checks for pretend-it's-3/16
* test still broken; ran out of steam when it came time to complete the ON DUPLICATE KEY UPDATE clause
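For concreteness, the kind of upsert the TODO calls for (a hedged sketch; the real state_daily table has many more columns, and the table and column names below are simplified placeholders, not the actual schema):

```python
# placeholder names, not the real covid_hosp schema
insert_sql = """
    INSERT INTO covid_hosp_state_daily (state, date, issue, value)
    VALUES (%s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE
        value = VALUES(value)  -- take the later revision's value on collision
"""
```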
@krivard (Contributor, Author) commented Jul 17, 2023

Okay @melange396, I've pushed up what I have and left a TODO where the ON DUPLICATE KEY UPDATE clause should go

@melange396 (Collaborator) commented

i initially had a different change for the ON DUPLICATE KEY stuff where i rewrote more of the surrounding code for clarity and efficiency, but decided against that because of how it would complicate the integration with #1203 down the road.

@dshemetov (Contributor) commented Jul 26, 2023

EDIT: Updated after fixing.

Simplified the metadata files so they're easier to read:

# metadata.csv
Update Date,Days Since Update,User,Rows,Row Change,Columns,Column Change,Metadata Published,Metadata Updates,Column Level Metadata,Column Level Metadata Updates,Archive Link
03/13/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test0.csv
03/15/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test1.csv

# metadata2.csv
Update Date,Days Since Update,User,Rows,Row Change,Columns,Column Change,Metadata Published,Metadata Updates,Column Level Metadata,Column Level Metadata Updates,Archive Link
03/13/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test0.csv
03/15/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test1.csv
03/15/2021 00:00:01 AM,0,0,0,0,0,0,0,0,0,0,https://test2.csv
03/15/2021 00:00:02 AM,0,0,0,0,0,0,0,0,0,0,https://test3.csv
03/16/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test4.csv
03/16/2021 00:00:01 AM,0,0,0,0,0,0,0,0,0,0,https://test5.csv

Some commentary on the tests in test_scenarios.py:

  • on line 73, we initialize the db with data from the default dataset dataset.csv and using the two issues from metadata.csv
  • on line 106, we test a second acquisition which is a noop because we use metadata.csv again
  • on line 123, we test a third acquisition with metadata2.csv; it shares two issues with metadata.csv, but contains 4 new issues; each of these 4 updates contains an update to a row for WY data
    • the first two (duplicate) issues are ignored correctly
    • 03/15/2021 00:00:01 and 03/15/2021 00:00:02 check that we correctly handle two updates for the same row for the same issue for an already existing row
    • 03/16/2021 00:00:00 and 03/16/2021 00:00:01 check that we correctly handle two updates for the same row for the same issue for a brand new row
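The overall shape of that test sequence, heavily paraphrased (mocked_network and run_acquisition below are stand-ins for the real test scaffolding, not its actual API):

```python
def test_acquisition_sequence(self):
    # 1) initialize the db from dataset.csv with the two issues in metadata.csv
    with self.mocked_network(metadata="metadata.csv"):
        run_acquisition()
    # 2) acquire again with the same metadata: a no-op, both issues are
    #    already collected
    with self.mocked_network(metadata="metadata.csv"):
        run_acquisition()
    # 3) acquire with metadata2.csv: its 2 shared issues are skipped, its 4
    #    new issues are acquired, exercising ON DUPLICATE KEY UPDATE both for
    #    an already-existing WY row and for a brand-new row
    with self.mocked_network(metadata="metadata2.csv"):
        run_acquisition()
```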

@sonarqubecloud commented

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 3 (rating A)
No coverage information
Duplication: 0.0%

@melange396 (Collaborator) commented

Looks like we aren't even testing ON DUPLICATE KEY UPDATE (at least not in the place you're pointing out)... For WY, dataset0.csv has value 5 and date 2020/12/09 (the date is under column "reporting_cutoff_start", which eventually gets renamed/mapped to "date" in the db), and dataset1.csv has value 8 and date 2020/12/10... So even with matching "issue"s, there's no key collision because the "date"s are different, and the value isn't overwritten. Line 125 is correct in failing: the value should be 5.
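Schematically (a sketch with a simplified key; the real unique key has more columns):

```python
# with a unique key on (state, date, issue), these two rows don't collide,
# so the second insert never reaches the ON DUPLICATE KEY UPDATE branch and
# WY's dataset0.csv value (5) is never overwritten by dataset1.csv's (8)
row0 = {"state": "WY", "date": "2020-12-09", "issue": "2021-03-15", "value": 5}  # dataset0.csv
row1 = {"state": "WY", "date": "2020-12-10", "issue": "2021-03-15", "value": 8}  # dataset1.csv
assert (row0["state"], row0["date"], row0["issue"]) != (row1["state"], row1["date"], row1["issue"])
```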

@dshemetov (Contributor) commented Nov 11, 2023

Good call @melange396, I fixed the test thanks to your pointer. I also simplified some things specific to this test (see updated comment above).

@dshemetov self-assigned this Nov 13, 2023
@melange396 (Collaborator) commented

Nice, thanks @dshemetov ! I especially appreciate you getting rid of those extra (and confusing) csv files.


@melange396 (Collaborator) left a comment

LGTM, i think this is ready for merge and release!

@melange396 merged commit b1a540d into dev Nov 14, 2023