Deduplicating IPs within different time granularities #75
Comments
@aaronkaplan can you tell us what you think here? Relates back to #42
Across different feeds? There is only one file per week per feed.
@kxyne what do you mean by "average of weeks that fall in that month"?
Results I think should be:
In this scenario we cannot simply sum the monthly counts to get the result for a quarter or a year.
@kxyne Another reason we are switching aggregation to day level is that a scan may not finish on the day it starts: it can start in one month and end in the next while still falling within the same week.
Somehow I didn't get the update on this one. For the sake of argument, I'd probably assign each week to the month that its Wednesday falls in; the ISO standard gives no guidance on this.
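A minimal sketch of the Wednesday rule suggested above, assuming weeks are identified by ISO year and ISO week number and that Python 3.8+ is available (for `date.fromisocalendar`). The function name `month_of_iso_week` is hypothetical and not part of the project.

```python
from datetime import date


def month_of_iso_week(iso_year, iso_week):
    """Return (year, month) of the Wednesday in the given ISO week.

    Hypothetical helper: a week is assigned to the month that
    contains its Wednesday (ISO weekday 3).
    """
    wednesday = date.fromisocalendar(iso_year, iso_week, 3)
    return (wednesday.year, wednesday.month)


# 2017-W05 runs Mon 30 Jan to Sun 5 Feb; its Wednesday is 1 Feb.
week_5 = month_of_iso_week(2017, 5)
# 2017-W22 runs Mon 29 May to Sun 4 Jun; its Wednesday is 31 May.
week_22 = month_of_iso_week(2017, 22)
```

With this rule a week that straddles a month boundary is counted in exactly one month, so no week is split or counted twice.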
@kxyne sorry for the late response... OK, I think I've got you re week/month splits.
Well, we deduplicate by the week only, and we average the weekly counts across a month/year rather than using cumulative counts. Either that, or we redo everything and pre-aggregate on the back end before deduplication, but that could be a later enhancement.
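A rough sketch of averaging weekly deduplicated counts across a month, as described above. The counts are made-up illustration data, the function name `monthly_average` is hypothetical, and assigning each week to the month of its Wednesday is one possible convention, assumed here only so the example is complete (requires Python 3.8+).

```python
from collections import defaultdict
from datetime import date

# Hypothetical weekly deduplicated IP counts, keyed by (ISO year, ISO week).
weekly_counts = {
    (2017, 18): 120,
    (2017, 19): 100,
    (2017, 20): 110,
    (2017, 21): 90,
    (2017, 22): 80,
    (2017, 23): 60,
}


def monthly_average(counts):
    """Average weekly counts per (year, month).

    Each week is assigned to the month containing its Wednesday
    (an assumed convention, not a project rule).
    """
    by_month = defaultdict(list)
    for (iso_year, iso_week), count in counts.items():
        wednesday = date.fromisocalendar(iso_year, iso_week, 3)
        by_month[(wednesday.year, wednesday.month)].append(count)
    return {month: sum(vals) / len(vals) for month, vals in by_month.items()}


averages = monthly_average(weekly_counts)
```

Here weeks 18 to 22 land in May 2017 and week 23 in June 2017, so the result is a per-month mean of weekly counts (a trend figure) rather than a total number of distinct IPs.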
The way I see it, we're interested in trends, not counts, but maybe we should throw this to the stats group?
@rufuspollock this makes me think we should switch back to week as the base granularity... What would you say?
@aaronkaplan @kxyne this is something to discuss on the next tech team call.
@rufuspollock @aaronkaplan @kxyne we should include this in the list of items to be discussed on the next team call.
Moving to backlog as this needs a team discussion.
Currently IPs are deduplicated at week level: if the same IP appears more than once within the same week and risk, the repeats are ignored (not counted).
If deduplication is done at day level instead, IPs are deduplicated only within a single day. A simple rollup by date then won't work: the sum of daily counts over any coarser time granularity won't be accurate, because an IP seen on several days is counted once per day.
This may require separate fact tables for different time granularities.
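To illustrate why daily counts cannot simply be summed, here is a small sketch with made-up observations (the IPs, the risk label and the helper `distinct_count` are hypothetical, not taken from the project's data or code). It counts distinct (IP, risk) pairs per day and per ISO week and compares the totals.

```python
from datetime import date

# Hypothetical observations: (ip, risk, day seen). All three days fall
# in the same ISO week.
observations = [
    ("192.0.2.1", "risk-a", date(2017, 5, 1)),
    ("192.0.2.1", "risk-a", date(2017, 5, 2)),  # same IP, next day
    ("192.0.2.2", "risk-a", date(2017, 5, 2)),
    ("192.0.2.1", "risk-a", date(2017, 5, 3)),  # same IP again
]


def distinct_count(rows, bucket):
    """Count distinct (ip, risk) pairs within each time bucket."""
    seen = {}
    for ip, risk, day in rows:
        seen.setdefault(bucket(day), set()).add((ip, risk))
    return {key: len(pairs) for key, pairs in seen.items()}


# Deduplicate per day, then per (ISO year, ISO week).
daily = distinct_count(observations, lambda d: d)
weekly = distinct_count(observations, lambda d: d.isocalendar()[:2])

sum_of_daily = sum(daily.values())   # 192.0.2.1 counted on each of 3 days
weekly_total = sum(weekly.values())  # each IP counted once for the week
```

In this toy data the daily counts add up to 4 while the week contains only 2 distinct IPs, which is the over-counting that a rollup of day-level facts would produce.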