chg: [join] fix overflowing filters on join operation #18

Open
wants to merge 1 commit into
base: master

Conversation

gallypette

bloom does not correctly check the receiving filter's capacity before merging another filter into it. For instance:

$ bloom create -p 0.001 -n 100 test1.bloom
$ bloom create -p 0.001 -n 100 test2.bloom
$ find /usr/bin/ -type f -print0 | xargs -0 sha1sum | awk '{ print $1 }' > corpus.txt
$ head -n 60 corpus.txt | bloom insert test1.bloom
$ tail -n 60 corpus.txt | bloom insert test2.bloom
$ bloom join test1.bloom test2.bloom
$ bloom show test1.bloom

result:

File:			/home/jlouis/Play/bloom/test1.bloom
Capacity:		100
Elements present:	120
FP probability:		1.00e-03
Bits:			1437
Hash functions:		10

This PR aims to correct this small quirk:

Error: addition of member counts would overflow

cheers,
jlouis

@adulau

adulau commented Sep 24, 2022

@satta Do you think it would be possible to review and merge it? We are using the library for various projects and we would like to keep getting the library upstream. Thanks a lot.

@satta
Member

satta commented Sep 24, 2022

Sorry, I just didn't notice the PR before, or I would have responded earlier.

I deliberately did not check the sum of the filter element counts against the target capacity because any target count calculation is only correct when both filters are disjoint. If there is a large overlap between the filters, the actual number of elements in the target can in fact be lower than the sum of both input element counts. Hence, joining such filters need not exceed the capacity. See also the corresponding comment in the code (https://github.com/DCSO/bloom/blob/master/bloom.go#L237) stating that the new count is only to be considered an upper bound.
Since IMHO merging overlapping filters is a valid use case that is worth trading off exact element counts for, there is no check for an 'overflow' here.
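
To illustrate why the summed count is only an upper bound, here is a small sketch that uses plain string slices as a stand-in for filter contents. It is not the DCSO/bloom API; `unionCount` is a hypothetical helper written for this example, and each slice is assumed to hold no duplicates of its own.

```go
package main

import "fmt"

// unionCount returns the number of distinct elements across a and b.
// A real Bloom filter cannot enumerate its members, so a join can only
// add the two counters; this helper shows what the true count would be.
func unionCount(a, b []string) int {
	seen := make(map[string]struct{}, len(a)+len(b))
	for _, s := range a {
		seen[s] = struct{}{}
	}
	for _, s := range b {
		seen[s] = struct{}{}
	}
	return len(seen)
}

func main() {
	a := []string{"h1", "h2", "h3", "h4"}
	b := []string{"h3", "h4", "h5"} // overlaps with a in two elements

	fmt.Println("sum of counts:", len(a)+len(b))        // what a join would record
	fmt.Println("distinct elements:", unionCount(a, b)) // what is actually stored
}
```

With two shared elements the recorded count would be 7 while only 5 distinct elements are present, so a capacity check on the sum can reject a join that would in fact fit.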

So long story short, I would rather not disallow such joins in the library code altogether. I'd rather handle such cases in the code that uses the library, where more is known about the usage pattern.
One idea would be to still perform the join in this case but return a separate named error, e.g. bloom.OverflowPossibleError, that signals the situation and that the calling code may or may not handle. Any comments?
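
A minimal sketch of how such a named error could look from the caller's side. Nothing here exists in the library: `toyFilter`, `join`, and `ErrOverflowPossible` (standing in for the proposed `bloom.OverflowPossibleError`) are hypothetical names, and the bit-array merge itself is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrOverflowPossible is a hypothetical sentinel error: the join was
// performed, but the summed element count now exceeds the capacity.
var ErrOverflowPossible = errors.New("join may exceed filter capacity")

// toyFilter models only the two counters relevant to this discussion.
type toyFilter struct {
	capacity uint64 // number of elements the filter was sized for
	count    uint64 // upper bound on the elements present
}

// join always merges other into f, then reports a possible overflow.
// (The OR of the underlying bit arrays is omitted in this sketch.)
func (f *toyFilter) join(other *toyFilter) error {
	f.count += other.count
	if f.count > f.capacity {
		return ErrOverflowPossible
	}
	return nil
}

func main() {
	a := &toyFilter{capacity: 100, count: 60}
	b := &toyFilter{capacity: 100, count: 60}

	// The caller decides whether the warning matters for its usage pattern.
	if err := a.join(b); errors.Is(err, ErrOverflowPossible) {
		fmt.Println("warning:", err, "- count is now", a.count)
	}
}
```

Callers that know their filters overlap heavily could ignore the error, while callers that rely on the configured false positive rate could treat it as fatal.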

@satta satta self-assigned this Sep 24, 2022
@gallypette
Author

gallypette commented Sep 24, 2022

I understand the rationale. But I am a little worried that joining some filters, mostly disjoint ones, may exceed their maximum capacity and make their false positive rate incorrect. I am not sure I can still trust p once I have joined two filters while leaving open the possibility of an overflow.
Interesting problem :)
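
To put a rough number on that concern, the sketch below plugs the values from the `bloom show` output above (1437 bits, 10 hash functions) into the standard approximation p ≈ (1 - e^(-kn/m))^k. It assumes all n elements are distinct, which is the worst case for a join; `fpRate` is a helper written for this example, not a library function.

```go
package main

import (
	"fmt"
	"math"
)

// fpRate returns the usual Bloom filter false positive approximation
// (1 - e^(-k*n/m))^k for m bits, k hash functions and n distinct elements.
func fpRate(m, k, n float64) float64 {
	return math.Pow(1-math.Exp(-k*n/m), k)
}

func main() {
	const m, k = 1437, 10 // "Bits" and "Hash functions" from the example output

	fmt.Printf("n=100: p ~ %.2e\n", fpRate(m, k, 100)) // at the configured capacity
	fmt.Printf("n=120: p ~ %.2e\n", fpRate(m, k, 120)) // after the join in the example
}
```

At the configured capacity of 100 this gives about 1e-3, as requested at creation time. With 120 distinct elements the estimate rises to roughly 3.4e-3, so the advertised p would no longer hold if the two filters were in fact disjoint.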

@gallypette
Author

An alternative would be to forbid joining filters that exceed the capacity, and only allow it when a force flag -f is set.

@satta
Member

satta commented Sep 27, 2022

> An alternative would be to forbid joining filters that exceed the capacity, and only allow it when a force flag -f is set.

This would be my preferred solution, but it would obviously only affect the command line tool, not the library. If that's fine with you as well, could you adjust your PR accordingly, please? :)
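
For reference, a rough sketch of what such a guard in the command line tool might look like. It is not taken from the PR: `joinAllowed` and the wiring of the `-f` flag are assumptions for illustration, and the arguments are hard-coded so the example runs on its own.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// joinAllowed reports whether the CLI should go ahead with a join.
// dstCount and srcCount are the element counters of the two filters and
// capacity is that of the receiving filter. All names are hypothetical.
func joinAllowed(dstCount, srcCount, capacity uint64, force bool) error {
	// Written without dstCount+srcCount to avoid uint64 wrap-around.
	fits := dstCount <= capacity && srcCount <= capacity-dstCount
	if force || fits {
		return nil
	}
	return errors.New("summed element counts exceed capacity (use -f to force)")
}

func main() {
	fs := flag.NewFlagSet("join", flag.ContinueOnError)
	force := fs.Bool("f", false, "join even if the summed element count exceeds the capacity")

	// Hard-coded arguments; a real tool would pass os.Args[2:] or similar.
	if err := fs.Parse([]string{"test1.bloom", "test2.bloom"}); err != nil {
		fmt.Println(err)
		return
	}

	// Counts and capacity taken from the example in the PR description.
	if err := joinAllowed(60, 60, 100, *force); err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("joined")
}
```

Keeping the check in the tool leaves the library's join unrestricted, so library users with overlapping filters are unaffected.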

@gallypette
Author

sure, will do when I have some free cycles ;)

@satta
Member

satta commented Sep 30, 2022

> sure, will do when I have some free cycles ;)

Thanks!

3 participants