Set default older_than to tomorrow #1224
this is untested in any way, and will probably need unit/integration test changes too.

with the commit i just pushed, it "should" now look back to the same day as the last issue to see if there were any other new data revisions issued, instead of just looking back to the next day after the last issue. any revisions from that last day that were already collected are skipped. the new usage of the

i think this also means, in normal usage, we should never again see the "no new issues; nothing to do" log message, but will now see the new log message "already collected revision:" on every run.
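A minimal sketch of the lookback behavior described above, not the actual pipeline code. The function name `select_new_revisions` and the `(issue_date, revision_url)` tuple layout are assumptions made for illustration; only the log message text comes from the comment.

```python
from datetime import date


def select_new_revisions(available, last_issue, already_collected):
    """Pick revisions to fetch, looking back to the day of the last issue.

    available: iterable of (issue_date, revision_url) tuples (hypothetical layout)
    last_issue: date of the most recent issue already stored, or None
    already_collected: set of revision URLs that are already in the database
    """
    selected = []
    for issue_date, revision_url in sorted(available):
        # Days strictly before the last issue were handled by earlier runs.
        if last_issue is not None and issue_date < last_issue:
            continue
        # Same-day (or newer) revisions that are already stored are skipped.
        if revision_url in already_collected:
            print(f"already collected revision: {revision_url}")
            continue
        selected.append((issue_date, revision_url))
    return selected
```

Because the last issue's own day is always re-examined, at least one "already collected revision:" line is expected on a normal run, which matches the note about the log output.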
oh noooooo this needs an ON DUPLICATE KEY UPDATE since we're no longer guaranteed to only insert once for each reference date + issue pair
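For illustration, a sketch of how such a statement could be assembled in Python. The helper `build_upsert_sql` and any table or column names passed to it are hypothetical, not taken from the repository. It relies on MySQL's `INSERT ... ON DUPLICATE KEY UPDATE` with the `VALUES()` function, and it only has an effect if the table has a unique key over the key columns.

```python
def build_upsert_sql(table, key_columns, value_columns):
    """Build a MySQL upsert statement with %s placeholders.

    Hypothetical helper: key_columns are assumed to be covered by a UNIQUE
    (or PRIMARY) key, so a second insert for the same key updates the row
    instead of failing or creating a duplicate.
    """
    columns = list(key_columns) + list(value_columns)
    column_list = ", ".join(f"`{c}`" for c in columns)
    placeholders = ", ".join(["%s"] * len(columns))
    # Only non-key columns are overwritten when the key already exists.
    updates = ", ".join(f"`{c}` = VALUES(`{c}`)" for c in value_columns)
    return (
        f"INSERT INTO `{table}` ({column_list}) VALUES ({placeholders}) "
        f"ON DUPLICATE KEY UPDATE {updates}"
    )
```

The resulting string would be passed to a cursor's `execute` together with the row values, in the same column order.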
unit tests:
* factor out mocks for "it's not there yet" / "it's already there" cases
* check both cases
* tests pass

integration tests (specifically state_daily):
* move second 3/15 entry and the 3/16 entry to a separate metadata file
* add new dataset file for 3/16 showing new day of data
* make sure first and second 3/15 entries have different data
* add checks for pretend-it's-3/16
* test still broken; ran out of steam when it came time to complete the ON DUPLICATE KEY UPDATE clause
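The "factor out mocks" step might look roughly like the sketch below. Everything here is a stand-in: `collect_if_new`, `make_database_mock`, and the mocked methods `contains_revision` and `insert_metadata` are assumed names for illustration and may not match the real test module.

```python
from unittest.mock import MagicMock


def collect_if_new(database, publication_date, revision_url):
    """Stand-in for the code under test: insert only unseen revisions."""
    if database.contains_revision(revision_url):
        return False
    database.insert_metadata(publication_date, revision_url)
    return True


def make_database_mock(already_there):
    """Shared mock factory covering both cases from the commit message."""
    database = MagicMock()
    database.contains_revision.return_value = already_there
    return database


def check_both_cases():
    # "it's not there yet": the revision should be inserted exactly once.
    fresh = make_database_mock(already_there=False)
    assert collect_if_new(fresh, 20210315, "revision-url") is True
    fresh.insert_metadata.assert_called_once_with(20210315, "revision-url")

    # "it's already there": nothing should be inserted.
    seen = make_database_mock(already_there=True)
    assert collect_if_new(seen, 20210315, "revision-url") is False
    seen.insert_metadata.assert_not_called()
```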
Okay @melange396 I've pushed up what I have and left a TODO where the ON DUPLICATE KEY should go
i initially had a different change for the ON DUPLICATE KEY stuff where i rewrote more of the surrounding code for clarity and efficiency, but decided against that because of how it will complicate the integration with #1203 down the road.
EDIT: Updated after fixing. Simplified the metadata files so they're easier to read:
Some commentary on the tests
Kudos, SonarCloud Quality Gate passed!
Looks like we aren't even testing
Good call @melange396, I fixed the test thanks to your pointer. I also simplified some things specific to this test (see updated comment above).
Nice, thanks @dshemetov! I especially appreciate you getting rid of those extra (and confusing) csv files.
…ogic in incoming metadata processing
LGTM, i think this is ready for merge and release!
Summary:
We currently wait until a day is over before pulling covid_hosp input files posted that day. Unfortunately, the new healthdata.gov posting schedule for covid_hosp data is Fridays and Mondays at ~12:10pm. Forecasting would like to use the fresh Monday data for the Monday forecasts they submit at 3pm (Slack convo link). This means we should change the default pipeline behavior to permit same-day healthdata.gov files to be included in each update.
This PR updates the default selection criteria for which healthdata.gov files to import so that it includes data posted today.
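As a rough sketch of the selection change, assuming exclusive date bounds on both sides (the function names and the `(posted_date, name)` tuple layout are hypothetical, not the repository's API):

```python
from datetime import date, timedelta


def default_older_than(today=None):
    """New default upper bound: tomorrow, so files posted today qualify."""
    today = today or date.today()
    return today + timedelta(days=1)


def select_files(posted, newer_than, older_than=None, today=None):
    """Return files whose posting date lies strictly between the bounds.

    posted: iterable of (posted_date, name) tuples (hypothetical layout)
    """
    if older_than is None:
        older_than = default_older_than(today)
    return [(d, name) for d, name in posted if newer_than < d < older_than]
```

With an upper bound of today, a file posted at ~12:10pm would be excluded until the next day. With an upper bound of tomorrow, it is selected by the first run that happens after it is posted.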
WARNING: if healthdata.gov posts a file after 12:30 on a Monday or a Friday, that file will never be picked up by the regular pipeline and must be handled with a manual invocation of the pipeline that selects for the day the late file was posted.

7/17 edit: a fix for this edge case is in progress, thanks george!

This is a semi-hotfix. The covid_hosp acquisition materials are in the midst of a massive overhaul, but we'd like to implement this behavior change before the new code will be ready, so this PR modifies the old code.
Prerequisites:
dev branch