chg: [join] fix overflowing filters on join operation #18

gallypette · 2022-09-16T12:47:20Z

bloom does not correctly check the receiving filter's capacity before merging another filter into it. For instance:

$ bloom create -p 0.001 -n 100 test1.bloom
$ bloom create -p 0.001 -n 100 test2.bloom
$ find /usr/bin/ -type f -print0 | xargs -0 sha1sum | awk '{ print $1 }' > corpus.txt
$ head -n 60 corpus.txt | bloom insert test1.bloom
$ tail -n 60 corpus.txt | bloom insert test2.bloom
$ bloom join test1.bloom test2.bloom
$ bloom show test1.bloom

result:

File:			/home/jlouis/Play/bloom/test1.bloom
Capacity:		100
Elements present:	120
FP probability:		1.00e-03
Bits:			1437
Hash functions:		10

This PR aims to correct this small quirk. With the change, the join above fails with:

Error: addition of member counts would overflow

cheers,
jlouis

adulau · 2022-09-24T06:46:00Z

@satta Do you think it would be possible to review and merge it? We are using the library for various projects and we would like to keep using the upstream version. Thanks a lot.

satta · 2022-09-24T12:25:19Z

Sorry, I just didn't notice the PR before, or I would have responded earlier.

I deliberately did not check the sum of the filter element counts against the target capacity because any target count calculation is only correct when both filters are disjoint. If there is a large overlap between the filters, then the actual number of elements in the target can in fact be lower than the sum of both input element counts. Hence joining such filters need not necessarily exceed the capacity. See also the corresponding comment in the code (https://github.com/DCSO/bloom/blob/master/bloom.go#L237) stating that the new count is only to be considered an upper bound.
Since IMHO merging overlapping filters is a valid use case that is worth trading off exact element counts for, there is no check for an 'overflow' here.

So long story short, I would rather not disallow such joins in the library code altogether. I'd rather handle such cases in the code that uses the library, where more is known about the usage pattern.
One idea would be to return a separate named error, e.g. bloom.OverflowPossibleError, that signals such a situation but still performs the join, and that the calling code may or may not handle. Any comments?

gallypette · 2022-09-24T12:59:58Z

I understand the rationale. But I am a little worried that joining filters that are mostly disjoint may exceed their maximum capacity and make them incorrect with regard to their false positive rate. I am not sure I can still trust p once I have joined two filters while leaving the possibility of an overflow.
Interesting problem :)
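
To put a rough number on this concern, the sketch below evaluates the standard approximation p ≈ (1 - e^(-kn/m))^k for the filter from the example above (m = 1437 bits, k = 10 hash functions). The function name `falsePositiveRate` is hypothetical, and the formula is the usual textbook estimate, not necessarily what the library computes internally.

```go
package main

import (
	"fmt"
	"math"
)

// falsePositiveRate returns the textbook estimate (1 - e^(-k*n/m))^k for a
// Bloom filter with m bits, k hash functions and n distinct elements.
func falsePositiveRate(m, k, n uint64) float64 {
	exponent := -float64(k) * float64(n) / float64(m)
	return math.Pow(1-math.Exp(exponent), float64(k))
}

func main() {
	// At the design capacity of 100 elements.
	fmt.Printf("n=100: %.2e\n", falsePositiveRate(1437, 10, 100))
	// After joining two disjoint sets of 60 elements each.
	fmt.Printf("n=120: %.2e\n", falsePositiveRate(1437, 10, 120))
}
```

Under this estimate, 120 distinct elements push the false positive rate to roughly three times the configured 1.00e-03, so the reported p no longer holds if the joined sets really are disjoint.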

gallypette · 2022-09-26T06:59:15Z

An alternative would be to forbid joining filters that exceed the capacity, and only allow it when a force flag -f is set.

satta · 2022-09-27T07:35:36Z

> An alternative would be to forbid joining filters that exceed the capacity, and only allow it when a force flag -f is set.

This would be my preferred solution, but would obviously only affect the command line tool instead of the library. If that's fine with you as well, could you adjust your PR accordingly please? :)
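
A minimal sketch of what such a check could look like in the command line tool, leaving the library untouched. The `checkJoin` helper and its parameters are hypothetical; the error text mirrors the message shown in the PR description, and the wiring of an actual `-f` flag into the tool is not shown.

```go
package main

import (
	"errors"
	"fmt"
)

// checkJoin is a hypothetical CLI-side guard: it refuses a join whose summed
// element counts exceed the target capacity, unless force is set.
func checkJoin(capacity, countA, countB uint64, force bool) error {
	if force {
		return nil
	}
	// Compare without computing countA+countB, which could wrap around.
	if countA > capacity || countB > capacity-countA {
		return errors.New("addition of member counts would overflow")
	}
	return nil
}

func main() {
	fmt.Println(checkJoin(100, 60, 60, false)) // refused without the flag
	fmt.Println(checkJoin(100, 60, 60, true))  // allowed when forced
}
```

Keeping the check in the tool means library users who merge overlapping filters are not affected.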

gallypette · 2022-09-27T09:02:47Z

sure, will do when I have some free cycles ;)

satta · 2022-09-30T09:03:17Z

> sure, will do when I have some free cycles ;)

Thanks!

chg: [join] fix overflowing filters on join operation

fd46ed2

satta added the enhancement label Sep 24, 2022

satta self-assigned this Sep 24, 2022
