-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Support DocIdSetBuilder with partition bounds #15383
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Prudhvi Godithi <[email protected]>
Signed-off-by: Prudhvi Godithi <[email protected]>
Signed-off-by: Prudhvi Godithi <[email protected]>
|
Hey all, pending to add some tests/validations and code clean up from my end but before this I would like to get some early feedback on the approach to see if the idea would make sense. |
|
Adding @jainankitk @getsaurabh02 to the conversation. |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Signed-off-by: Prudhvi Godithi <[email protected]>
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
1 similar comment
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
|
Ok the exists checks and tests are now green, let me add some tests in |
| public sealed interface BulkAdder | ||
| permits FixedBitSetAdder, | ||
| BufferAdder, | ||
| PartitionAwareFixedBitSetAdder, | ||
| PartitionAwareBufferAdder { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now megamorphic :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. We should run the benchmark to quantify the impact due to virtual calls and megamorphism. Also assuming the impact is significant, I am wondering if we can use directly PartitionAwareFixedBitSetAdder instead of FixedBitSetAdder?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes good point, we can unify them. For non-partitioned case the minDocId = 0, maxDocId = maxDoc and offset = 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me try to implement this and run the tests part of TestDocIdSetBuilder.java .
Signed-off-by: Prudhvi Godithi <[email protected]>
|
This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR. |
Signed-off-by: Prudhvi Godithi <[email protected]>
Signed-off-by: Prudhvi Godithi <[email protected]>
|
I have added some decent tests in |
Thanks @prudhvigodithi for adding the tests. Will be good to see the performance benchmark numbers and ensure there isn't any regression due to the offset logic |
Signed-off-by: Prudhvi Godithi <[email protected]>
|
Thinking through the logic here, the only benefit is in terms of the size of arrays allocated. We're still doing just as many allocations in total and the individual partitions will each traverse the same range of the point tree (just collecting different doc IDs, while others get excluded by the partition filter). I'm skeptical that there is measurable benefit (unless you have a lot of slices over a big segment). I find the change to add a (*) Of course, it's also a very significant change. @javanna -- I'd be curious to get your opinion on it. I feel like it could be a way of addressing #13745 incrementally. The default behavior could be to get a |
As per my understanding the slice generation logic is fairly aggressive. So even in case of 4/8 slices for a segment, this change should reduce the operating memory by 4x / 8x for that segment The suggestion to create a synchronized |
Reading from a single |
I guess we can do that, but still not sure if it will seamlessly integrate into existing abstractions on top of that. I was initially thinking about say cost function of this iterator, but there seems to be implementation for specific docId range Also from the performance perspective, even simple iteration on this |
|
Yes this is the issue to handle the duplicate work per segment #13745, the main target of this is to reduce the size allocation per partition, without this PR today each partition thread allocates full segment-sized structure, so the number of allocations is the same, but the size is vastly different. IMO this change should be still useful when we come up with strategy to stop duplicate work per segment #13745 Before I run the full benchmarks I guess I can quickly test the final DocIdSet's |
This doesn't need to be demonstrated by a test. It's obvious that if you have a segment with N docs and you split it into two partitions and allocate two arrays with N/2 bits each it will use half the memory of two arrays with N bits each. Nobody is disputing the reduction in heap usage. The question is whether the reduction in heap usage will have a measurable impact, which we can only see from benchmarks. Also, if we can reduce the number of tree traversals (i.e. only do one tree traversal per segment instead of per partition), then we would expect to see a performance benefit, since we're doing less work. |
Thanks @msfroh, true we have to do that eventually for intra segment search, my point is this change is to just have partition aware DocIdSetBuilder and can be leveraged in Followup similar to |
Yes I'm playing with https://github.com/mikemccand/luceneutil/ (dealing with some setup issues) and will post the results. |
Description
Coming from #14485 and #13745 (Initial implementation of intra-segment search concurrency #13542), when splitting a segment into partitions for intra segment search, each partition would create a
DocIdSetBuilderthat allocates memory based on the entire segment size, even though it only collects documents within a small partition range. This PR adds partition aware support toDocIdSetBuilderwhich creates bitsets and buffers scoped to its doc ID range instead of the entire segment size, this change will have memory efficiency during intra segment search.Example for a Segment with 1M documents split into 4 partitions of 250K docs each and now each partition creates a FixedBitSet(1M) which is not required.
PartitionAwareBufferAdder:PartitionAwareFixedBitSetAdderOffsetBitDocIdSet&OffsetDocIdSetIteratorFixedBitSetuses the doc ID parameter directly as an array index. When we create partition sized bitsets to save memory, we store documents using relative indices (0 to partitionSize-1) internally, but the Lucene API requires iterators to return absolute doc IDs. These wrapper classes handle the conversion automatically.PartitionAwareFixedBitSetAdderis used). This is to convert partition relative indices back to absolute doc IDs.Without Optimization (Old Way):
With Optimization (New Way):