
Deduplicating IPs within different time granularities #75

Open
zelima opened this issue Dec 6, 2016 · 12 comments

@zelima
Contributor

zelima commented Dec 6, 2016

Currently IPs are deduplicated at the week level, meaning that if the same IP appears within the same week for the same risk, it is ignored (not counted).

When this is done at the day level, IPs will only be deduplicated within the same day. That means a simple rollup by date just won't work: the sum of daily counts, for any coarser time granularity, won't be accurate (it will include lots of duplicated IPs).

This may mean we need to create different fact tables for different time granularities.
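
To make the problem concrete, here is a minimal sketch (plain Python over made-up observations; all values are illustrative) of how summing per-day deduplicated counts over-counts an IP that a week-level rollup would count once:

from datetime import date

# (day, ip) observations: the same IP is seen on three days of one week.
observations = [
    (date(2016, 12, 5), "111.11.111.11"),
    (date(2016, 12, 6), "111.11.111.11"),
    (date(2016, 12, 7), "111.11.111.11"),
]

# Deduplicate per day, then sum the daily counts (the naive rollup).
daily_unique = {}
for day, ip in observations:
    daily_unique.setdefault(day, set()).add(ip)
sum_of_daily = sum(len(ips) for ips in daily_unique.values())

# Deduplicate across the whole week instead.
weekly_unique = len({ip for _, ip in observations})

print(sum_of_daily)   # 3 -- the same IP counted once per day
print(weekly_unique)  # 1 -- the figure a week-level dedup would report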

@rufuspollock
Contributor

@aaronkaplan can you tell us what you think here? Relates back to #42

@kxyne

kxyne commented Dec 8, 2016

Across different feeds? There is only one file per week per feed.
We're basing everything on a week, so if you are doing a month it needs to be the average of the weeks that fall in that month. I'm very much in favour of moving to ISO8601 weeks to make all of this clearer.
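
For illustration, a minimal sketch (plain Python; the weekly figures are hypothetical, and each week is naively keyed to the month its start date falls in, pending the week-to-month rule discussed below) of producing monthly values as the average of the weekly deduplicated counts:

from collections import defaultdict
from datetime import date
from statistics import mean

weekly_counts = {            # hypothetical weekly deduplicated counts
    date(2016, 11, 28): 2,
    date(2016, 12, 5): 1,
    date(2016, 12, 12): 1,
    date(2016, 12, 19): 1,
}

by_month = defaultdict(list)
for week_start, count in weekly_counts.items():
    # Naive assignment: a week belongs to the month its start date is in.
    by_month[(week_start.year, week_start.month)].append(count)

monthly = {month: mean(counts) for month, counts in by_month.items()}
print(monthly)  # {(2016, 11): 2, (2016, 12): 1}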

@zelima
Contributor Author

zelima commented Dec 9, 2016

@kxyne what do you mean by "the average of the weeks that fall in that month"?
Could you give the expected output for these example scan results:

# data loaded into table
ts,ip,risk_id,asn,cc
2016-11-28T00:00:00+00:00,111.11.111.11,2,33333,GB 
2016-11-28T00:00:00+00:00,555.55.555.55,2,33333,GB 
2016-12-05T00:00:00+00:00,111.11.111.11,2,33333,GB  
2016-12-12T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-12-19T00:00:00+00:00,111.11.111.11,2,33333,GB 

Results I think should be:

  • weekly aggregation
date,risk,country,asn,count
2016-11-28T00:00:00+00:00,2,GB,33333,2
2016-12-05T00:00:00+00:00,2,GB,33333,1
2016-12-12T00:00:00+00:00,2,GB,33333,1
2016-12-19T00:00:00+00:00,2,GB,33333,1 
  • Monthly
date,risk,country,asn,count
2016-11-01T00:00:00+00:00,2,GB,33333,2
2016-12-01T00:00:00+00:00,2,GB,33333,1  # count is still 1 because it's the same IP within the same month
  • Annual
date,risk,country,asn,count
2016-01-01T00:00:00+00:00,2,GB,33333,2 

In this scenario we cannot just sum up monthly counts to get results for a quarter or year.
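
These expected results amount to counting distinct IPs per (period, risk, country, ASN) group at each granularity, rather than summing finer-grained counts. A sketch, assuming pandas is available, that reproduces them from the sample rows above:

import io

import pandas as pd

csv = """ts,ip,risk_id,asn,cc
2016-11-28T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-11-28T00:00:00+00:00,555.55.555.55,2,33333,GB
2016-12-05T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-12-12T00:00:00+00:00,111.11.111.11,2,33333,GB
2016-12-19T00:00:00+00:00,111.11.111.11,2,33333,GB
"""
df = pd.read_csv(io.StringIO(csv), parse_dates=["ts"])

# Weekly (weeks ending Monday, matching the Monday timestamps), monthly
# (month start) and annual (year start) periods; nunique() plays the role
# of COUNT(DISTINCT ip) within each period.
for freq in ("W-MON", "MS", "YS"):
    counts = (
        df.groupby([pd.Grouper(key="ts", freq=freq), "risk_id", "cc", "asn"])["ip"]
        .nunique()
    )
    print(counts)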

@zelima
Contributor Author

zelima commented Dec 9, 2016

@kxyne Also, the reason we are switching aggregation to the day level is that a scan may not end within the same day: it may start in one month and end in the next, while still falling within the same week.
E.g. a scan started on 2016-12-31 and ended on 2017-01-01: same week, but a different month, quarter and year.
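
That example can be checked with the standard library's ISO-calendar support; both dates land in ISO week 2016-W52:

from datetime import date

for d in (date(2016, 12, 31), date(2017, 1, 1)):
    iso_year, iso_week, iso_weekday = d.isocalendar()
    print(d, f"-> {iso_year}-W{iso_week:02d}, weekday {iso_weekday}")

# 2016-12-31 -> 2016-W52, weekday 6
# 2017-01-01 -> 2016-W52, weekday 7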

@kxyne

kxyne commented Dec 14, 2016

Somehow I didn't get the update on this one. For the sake of argument I'd probably assign weeks to the month that their Wednesday falls in; the ISO standard gives no guidance on this.
We've been looking at how we process weeks in the ETL (things like "which file do I pick up?"), and using the ISO week that days fall in makes it all nicely aligned across the feeds; otherwise it'll sawtooth between threats based on the day the scan happens on.
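
A sketch of that Wednesday rule (standard library only, Python 3.8+ for date.fromisocalendar; the helper name is illustrative):

from datetime import date, timedelta

def month_of_iso_week(iso_year, iso_week):
    """Return the (year, month) that the given ISO week's Wednesday falls in."""
    monday = date.fromisocalendar(iso_year, iso_week, 1)
    wednesday = monday + timedelta(days=2)
    return wednesday.year, wednesday.month

# 2016-W52 spans 2016-12-26 .. 2017-01-01; its Wednesday is 2016-12-28,
# so under this rule the whole week counts toward December 2016.
print(month_of_iso_week(2016, 52))  # (2016, 12)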

@zelima
Contributor Author

zelima commented Dec 20, 2016

@kxyne sorry for the late response... OK, I think I've got you re the week/month split.
What about deduplicating? If the same IP with the same risk comes up in 3 different weeks, should it be counted as 3 or 1 when, say, we want annual results?

@zelima zelima added this to the Beta-3 milestone Dec 20, 2016
@kxyne

kxyne commented Dec 20, 2016

Well, we deduplicate by the week only, and we average the weeks across a month/year rather than accumulating counts.

Either that, or we redo everything and pre-aggregate on the back end before deduplication, but that could be a later enhancement.

@kxyne

kxyne commented Dec 20, 2016

The way I see it we're interested in trends, not counts, but maybe we should throw this to the stats group?

@zelima
Contributor Author

zelima commented Dec 20, 2016

@rufuspollock this makes me think we should switch back to the week as the base granularity... What would you say?

@rufuspollock
Contributor

@aaronkaplan @kxyne this is something to discuss on next tech team call.

@zelima
Contributor Author

zelima commented Jan 30, 2017

@rufuspollock @aaronkaplan @kxyne we should include this in the list of items to be discussed on the next team call.
Moving to the current milestone.

@rufuspollock
Contributor

Moving to backlog as this needs a team discussion.
