
Set default older_than to tomorrow #1224


Merged: 14 commits merged into dev on Nov 14, 2023

Conversation

@krivard (Contributor) commented Jun 30, 2023

Summary:

We currently wait until a day is over before pulling covid_hosp input files posted that day. Unfortunately, the new healthdata.gov posting schedule for covid_hosp data is Fridays and Mondays at ~12:10pm. Forecasting would like to use the fresh Monday data for the Monday forecasts they submit at 3pm (Slack convo link). This means we should change the default pipeline behavior to permit same-day healthdata.gov files to be included in each update.

This PR updates the default selection criteria for which healthdata.gov files to import so that files posted today are included.

WARNING: if healthdata.gov posts a file after 12:30 on a Monday or a Friday, that file will never be picked up by the regular pipeline and must be handled with a manual invocation of the pipeline that selects for the day the late file was posted. 7/17 edit: a fix for this edge case is in progress, thanks george!

This is a semi-hotfix. The covid_hosp acquisition materials are in the midst of a massive overhaul, but we'd like to implement this behavior change before the new code will be ready, so this PR modifies the old code.
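For reference, here is roughly what the change amounts to (a minimal sketch, assuming the pipeline treats older_than as an exclusive upper bound on posting dates; the function name below is illustrative, not the real module layout):

```python
from datetime import date, timedelta

def default_older_than() -> date:
    # Sketch of the new default, not the actual diff: the pipeline selects
    # files with posting dates strictly before older_than. The old default of
    # date.today() excluded files posted today; defaulting to tomorrow makes
    # same-day healthdata.gov postings eligible for each update.
    return date.today() + timedelta(days=1)
```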

Prerequisites:

  • Unless it is a documentation hotfix, it should be merged against the dev branch
  • Branch is up-to-date with the branch to be merged with, i.e. dev
  • Build is successful
  • Code is cleaned up and formatted

@krivard requested review from melange396 and rzats, July 3, 2023 20:52
@melange396 (Collaborator) commented

this is untested in any way, and will probably need unit/integration test changes too.

with the commit i just pushed, it "should" now look back to the same day as the last issue to see if there were any other new data revisions issued, instead of just looking back to the next day after the last issue. any revisions from that last day that were already collected are skipped.

the new usage of the with database.connect() block could maybe be widened in scope (i.e., pulled outside of some number of the nested blocks), but i didn't want to keep the connection open while datasets are fetched and merged (which could potentially take a long time).

i think this also means, in normal usage, we should never again see the "no new issues; nothing to do" log message, but will now see the new log message "already collected revision:" on every run.
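In sketch form, the revised selection logic behaves something like this (function and attribute names here are assumptions for illustration, not the real module's API):

```python
def select_new_revisions(database, metadata):
    # keep the connection scope narrow: don't hold it open while datasets
    # are fetched and merged, since that can take a long time
    with database.connect() as db:
        last_issue = db.get_max_issue()  # most recent issue already acquired
        already_collected = db.get_revision_keys(since=last_issue)
    # look back to the *same day* as the last issue (>=), not just the day
    # after it (>), so revisions posted later on that same day are seen...
    candidates = [r for r in metadata if r.update_date >= last_issue]
    # ...and skip any revision from that day we already have, which is why
    # "already collected revision:" now logs on every normal run
    return [r for r in candidates if r.key not in already_collected]
```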

@melange396 mentioned this pull request Jul 14, 2023
@krivard (Contributor, Author) commented Jul 17, 2023

oh noooooo, this needs an ON DUPLICATE KEY UPDATE since we're no longer guaranteed to only insert once for each reference date + issue pair (see the upsert sketch after the test notes below)

unit tests:
* factor out mocks for "it's not there yet" / "it's already there" cases
* check both cases
* tests pass

integration tests (specifically state_daily):
* move second 3/15 entry and the 3/16 entry to a separate metadata file
* add new dataset file for 3/16 showing new day of data
* make sure first and second 3/15 entries have different data
* add checks for pretend-it's-3/16
* test still broken; ran out of steam when it came time to complete the ON DUPLICATE KEY UPDATE clause
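For concreteness, the kind of upsert the TODO calls for (a hedged sketch; the real state_daily table has many more columns, and the table and column names below are simplified placeholders, not the actual schema):

```python
# placeholder names, not the real covid_hosp schema
insert_sql = """
    INSERT INTO covid_hosp_state_daily (state, date, issue, value)
    VALUES (%s, %s, %s, %s)
    ON DUPLICATE KEY UPDATE
        value = VALUES(value)  -- take the later revision's value on collision
"""
```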
@krivard (Contributor, Author) commented Jul 17, 2023

Okay @melange396, I've pushed up what I have and left a TODO where the ON DUPLICATE KEY UPDATE clause should go

@melange396 (Collaborator) commented

i initially had a different change for the ON DUPLICATE KEY stuff where i rewrote more of the surrounding code for clarity and efficiency, but decided against that because of how it would complicate the integration with #1203 down the road.

@dshemetov (Contributor) commented Jul 26, 2023

EDIT: Updated after fixing.

Simplified the metadata files so they're easier to read:

# metadata.csv
Update Date,Days Since Update,User,Rows,Row Change,Columns,Column Change,Metadata Published,Metadata Updates,Column Level Metadata,Column Level Metadata Updates,Archive Link
03/13/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test0.csv
03/15/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test1.csv

# metadata2.csv
Update Date,Days Since Update,User,Rows,Row Change,Columns,Column Change,Metadata Published,Metadata Updates,Column Level Metadata,Column Level Metadata Updates,Archive Link
03/13/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test0.csv
03/15/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test1.csv
03/15/2021 00:00:01 AM,0,0,0,0,0,0,0,0,0,0,https://test2.csv
03/15/2021 00:00:02 AM,0,0,0,0,0,0,0,0,0,0,https://test3.csv
03/16/2021 00:00:00 AM,0,0,0,0,0,0,0,0,0,0,https://test4.csv
03/16/2021 00:00:01 AM,0,0,0,0,0,0,0,0,0,0,https://test5.csv

Some commentary on the tests in test_scenarios.py:

  • on line 73, we initialize the db with data from the default dataset dataset.csv and using the two issues from metadata.csv
  • on line 106, we test a second acquisition which is a noop because we use metadata.csv again
  • on line 123, we test a third acquisition with metadata2.csv; it shares two issues with metadata.csv, but contains 4 new issues; each of these 4 updates contains an update to a row for WY data
    • the first two (duplicate) issues are ignored correctly
    • 03/15/2021 00:00:01 and 03/15/2021 00:00:02 check that we correctly handle two updates for the same row for the same issue for an already existing row
    • 03/16/2021 00:00:00 and 03/16/2021 00:00:01 check that we correctly handle two updates for the same row for the same issue for a brand new row
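The overall shape of that test sequence, heavily paraphrased (mocked_network and run_acquisition below are stand-ins for the real test scaffolding, not its actual API):

```python
def test_acquisition_sequence(self):
    # 1) initialize the db from dataset.csv with the two issues in metadata.csv
    with self.mocked_network(metadata="metadata.csv"):
        run_acquisition()
    # 2) acquire again with the same metadata: a no-op, both issues are
    #    already collected
    with self.mocked_network(metadata="metadata.csv"):
        run_acquisition()
    # 3) acquire with metadata2.csv: its 2 shared issues are skipped, its 4
    #    new issues are acquired, exercising ON DUPLICATE KEY UPDATE both for
    #    an already-existing WY row and for a brand-new row
    with self.mocked_network(metadata="metadata2.csv"):
        run_acquisition()
```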

@sonarqubecloud commented

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 3 (rating A)
No coverage information
Duplication: 0.0%

@melange396 (Collaborator) commented

Looks like we aren't even testing ON DUPLICATE KEY UPDATE (at least not in the place you're pointing out)... For WY, dataset0.csv has value 5 and date 2020/12/09 (the date is under column "reporting_cutoff_start", which eventually gets renamed/mapped to "date" in the db), and dataset1.csv has value 8 and date 2020/12/10... So even with matching "issue"s, there's no key collision because the "date"s are different, and the value isn't overwritten. Line 125 is correct in failing: the value should be 5.
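Schematically (a sketch with a simplified key; the real unique key has more columns):

```python
# with a unique key on (state, date, issue), these two rows don't collide,
# so the second insert never reaches the ON DUPLICATE KEY UPDATE branch and
# WY's dataset0.csv value (5) is never overwritten by dataset1.csv's (8)
row0 = {"state": "WY", "date": "2020-12-09", "issue": "2021-03-15", "value": 5}  # dataset0.csv
row1 = {"state": "WY", "date": "2020-12-10", "issue": "2021-03-15", "value": 8}  # dataset1.csv
assert (row0["state"], row0["date"], row0["issue"]) != (row1["state"], row1["date"], row1["issue"])
```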

@dshemetov (Contributor) commented Nov 11, 2023

Good call @melange396, I fixed the test thanks to your pointer. I also simplified some things specific to this test (see updated comment above).

@dshemetov self-assigned this Nov 13, 2023
@melange396 (Collaborator) commented

Nice, thanks @dshemetov ! I especially appreciate you getting rid of those extra (and confusing) csv files.


@melange396 (Collaborator) left a comment

LGTM, i think this is ready for merge and release!

@melange396 merged commit b1a540d into dev Nov 14, 2023