
Conversation

@prudhvigodithi
Contributor

@prudhvigodithi prudhvigodithi commented Oct 30, 2025

Description

Coming from #14485 and #13745 (Initial implementation of intra-segment search concurrency #13542): when splitting a segment into partitions for intra-segment search, each partition creates a DocIdSetBuilder that allocates memory based on the entire segment size, even though it only collects documents within a small partition range. This PR adds partition-aware support to DocIdSetBuilder, creating bitsets and buffers scoped to the partition's doc ID range instead of the entire segment, which improves memory efficiency during intra-segment search.

For example, for a segment with 1M documents split into 4 partitions of 250K docs each, today each partition creates a FixedBitSet(1M) (~125 KB each, ~500 KB total), even though a FixedBitSet(250K) (~31 KB each, ~125 KB total) would suffice.

PartitionAwareBufferAdder (sketched below):

  • Filters documents to only accept those within the [minDocId, maxDocId) range.
  • Stores absolute doc IDs in buffers (used for sparse results below the threshold) and rejects documents that are not part of the partition range.
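
A minimal, self-contained sketch of this filtering behavior (illustrative only, not the PR's actual code; a plain list stands in for the builder's internal buffers):

import java.util.ArrayList;
import java.util.List;

class PartitionAwareBufferAdderSketch {
  private final int minDocId, maxDocId;                   // partition range, [min, max)
  private final List<Integer> buffer = new ArrayList<>(); // sparse storage

  PartitionAwareBufferAdderSketch(int minDocId, int maxDocId) {
    this.minDocId = minDocId;
    this.maxDocId = maxDocId;
  }

  void add(int doc) {
    if (doc < minDocId || doc >= maxDocId) {
      return; // outside this partition; another partition collects it
    }
    buffer.add(doc); // absolute doc IDs are stored as-is in the sparse case
  }
}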

PartitionAwareFixedBitSetAdder (sketched below):

  • Filters documents to only accept those within the partition range.
  • Uses a partition-sized bitset instead of a segment-sized one.
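
A matching sketch for the dense case (again illustrative; names and structure are assumptions, not the PR's code):

import org.apache.lucene.util.FixedBitSet;

class PartitionAwareFixedBitSetAdderSketch {
  private final int minDocId, maxDocId; // partition range, [min, max)
  private final FixedBitSet bits;       // sized to the partition, not the segment

  PartitionAwareFixedBitSetAdderSketch(int minDocId, int maxDocId) {
    this.minDocId = minDocId;
    this.maxDocId = maxDocId;
    this.bits = new FixedBitSet(maxDocId - minDocId); // partition-sized
  }

  void add(int doc) {
    if (doc < minDocId || doc >= maxDocId) {
      return; // outside this partition
    }
    bits.set(doc - minDocId); // store the partition-relative index
  }
}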

OffsetBitDocIdSet & OffsetDocIdSetIterator:

  • FixedBitSet uses the doc ID parameter directly as an array index. When we create partition-sized bitsets to save memory, we store documents using relative indices (0 to partitionSize-1) internally, but the Lucene API requires iterators to return absolute doc IDs. These wrapper classes handle the conversion automatically.
  • The wrappers add the offset back during iteration (when PartitionAwareFixedBitSetAdder is used), converting partition-relative indices to absolute doc IDs (see the iterator sketch after the worked example below).
  • Callers should always receive absolute doc IDs.
Worked example:

Segment: 100,000 documents
Partition: [50,000, 60,000) - only 10,000 docs

Without Optimization (Old Way):

Create bitset for ENTIRE segment:
FixedBitSet(100,000 bits)

Bit position:  0     1     2  ... 50000 ... 50500 ... 55000 ... 59999 ... 99999
                ↓     ↓     ↓       ↓        ↓         ↓         ↓         ↓
Bit value:      0     0     0       1        1         1         1         0
                               

With Optimization (New Way):

Create bitset with ONLY partition size:
FixedBitSet(10,000 bits)

Bit position:  0    1    2    ... 500  ... 5000 ... 9000 ... 9999
               ↓    ↓    ↓        ↓        ↓        ↓        ↓
Bit value:     1    0    0        1        1        1        0
               └───────────────────────────────────────────────┘
                All bits used efficiently!
                
Storage mapping (with offset):
  Doc 50,000 → Bit[0]     (50,000 - 50,000 = 0)
  Doc 50,500 → Bit[500]   (50,500 - 50,000 = 500)
  Doc 55,000 → Bit[5,000] (55,000 - 50,000 = 5,000)
  Doc 59,999 → Bit[9,999] (59,999 - 50,000 = 9,999)
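
A minimal sketch of the offset-mapping iterator described above (illustrative; not the PR's actual class):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

// Wraps an iterator over partition-relative indices (0..partitionSize-1)
// and adds the partition offset back, so callers only ever see absolute
// doc IDs.
class OffsetDocIdSetIteratorSketch extends DocIdSetIterator {
  private final DocIdSetIterator in; // iterates partition-relative indices
  private final int offset;          // the partition's minDocId

  OffsetDocIdSetIteratorSketch(DocIdSetIterator in, int offset) {
    this.in = in;
    this.offset = offset;
  }

  @Override
  public int docID() {
    int doc = in.docID();
    // -1 (unpositioned) and NO_MORE_DOCS pass through untranslated
    return (doc == -1 || doc == NO_MORE_DOCS) ? doc : doc + offset;
  }

  @Override
  public int nextDoc() throws IOException {
    int doc = in.nextDoc();
    return doc == NO_MORE_DOCS ? doc : doc + offset;
  }

  @Override
  public int advance(int target) throws IOException {
    // translate the absolute target into the wrapped iterator's space
    int doc = in.advance(Math.max(0, target - offset));
    return doc == NO_MORE_DOCS ? doc : doc + offset;
  }

  @Override
  public long cost() {
    return in.cost();
  }
}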

@prudhvigodithi
Contributor Author

prudhvigodithi commented Oct 31, 2025

Hey all, I still need to add some tests/validations and clean up the code on my end, but before that I would like to get some early feedback on the approach, to see if the idea makes sense.

@prudhvigodithi prudhvigodithi marked this pull request as ready for review October 31, 2025 15:34
@prudhvigodithi
Contributor Author

Adding @jainankitk @getsaurabh02 to the conversation.

Signed-off-by: Prudhvi Godithi <[email protected]>
@github-actions
Contributor

github-actions bot commented Nov 3, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

Signed-off-by: Prudhvi Godithi <[email protected]>

Signed-off-by: Prudhvi Godithi <[email protected]>

Signed-off-by: Prudhvi Godithi <[email protected]>
Signed-off-by: Prudhvi Godithi <[email protected]>

Signed-off-by: Prudhvi Godithi <[email protected]>

Signed-off-by: Prudhvi Godithi <[email protected]>

Signed-off-by: Prudhvi Godithi <[email protected]>

@prudhvigodithi
Contributor Author

Ok, the existing checks and tests are now green; let me add some tests in TestDocIdSetBuilder.

Comment on lines 44 to 48
public sealed interface BulkAdder
    permits FixedBitSetAdder,
        BufferAdder,
        PartitionAwareFixedBitSetAdder,
        PartitionAwareBufferAdder {
Member

@benwtrent benwtrent Nov 3, 2025


This is now megamorphic :(

Contributor


That's a good point. We should run the benchmark to quantify the impact of the virtual calls and megamorphism. Also, assuming the impact is significant, I am wondering if we can directly use PartitionAwareFixedBitSetAdder instead of FixedBitSetAdder?

Contributor Author


Yes, good point, we can unify them. For the non-partitioned case, minDocId = 0, maxDocId = maxDoc, and offset = 0 (see the sketch below).
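
A sketch of that degenerate case, reusing the illustrative adder from the PR description above (names are assumptions, not the final API):

int maxDoc = 100_000;

// Non-partitioned: the "partition" is the whole segment, so the bitset is
// segment-sized and the offset is 0 - identical behavior to today.
var wholeSegment = new PartitionAwareFixedBitSetAdderSketch(0, maxDoc);

// Partitioned: a 10K-doc slice whose offset is 50_000.
var onePartition = new PartitionAwareFixedBitSetAdderSketch(50_000, 60_000);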

Contributor Author


Let me try to implement this and run the tests in TestDocIdSetBuilder.java.

Signed-off-by: Prudhvi Godithi <[email protected]>

@prudhvigodithi
Contributor Author

I have added some decent tests in TestDocIdSetBuilder. Please let me know what could be the next steps here.

@jainankitk
Contributor

I have added some decent tests in TestDocIdSetBuilder. Please let me know what could be the next steps here.

Thanks @prudhvigodithi for adding the tests. It will be good to see the performance benchmark numbers and ensure there isn't any regression due to the offset logic.

Signed-off-by: Prudhvi Godithi <[email protected]>
@msfroh
Contributor

msfroh commented Nov 13, 2025

Thinking through the logic here, the only benefit is in terms of the size of the arrays allocated. We're still doing just as many allocations in total, and the individual partitions will each traverse the same range of the point tree (just collecting different doc IDs, while the others get excluded by the partition filter). I'm skeptical that there is a measurable benefit (unless you have a lot of slices over a big segment).

I find the change to add a scorerSupplier(LeafReaderContextPartition) method much more interesting (*). I'm imagining that an implementation in PointRangeQuery's anonymous Weight could create a synchronized scorerSupplier per segment, where each partition would wrap it with something that filters over its own doc IDs. That way, you'd go back to only creating one FixedBitSet per segment, regardless of how many slices there are (though the other threads would block until the winning thread finishes collecting). A rough sketch of the idea follows below.

(*) Of course, it's also a very significant change. @javanna -- I'd be curious to get your opinion on it. I feel like it could be a way of addressing #13745 incrementally. The default behavior could be to get a ScorerSupplier for the whole segment, but query-specific implementations might be able to do less work per partition (or share work across partitions). I'm not 100% convinced that it's the best solution, but I think it may work.
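
A very rough, hypothetical sketch of the shared-per-segment idea (all names are illustrative; this is not a proposed API): the winning thread builds the segment-wide match set once, and losing threads block until it is ready:

import java.util.function.Supplier;
import org.apache.lucene.util.FixedBitSet;

// One instance per segment, shared by all of that segment's partitions.
class SharedSegmentMatches {
  private final Supplier<FixedBitSet> segmentCollector; // one tree traversal
  private FixedBitSet bits;

  SharedSegmentMatches(Supplier<FixedBitSet> segmentCollector) {
    this.segmentCollector = segmentCollector;
  }

  // Losing threads block here until the winning thread finishes collecting.
  synchronized FixedBitSet matches() {
    if (bits == null) {
      bits = segmentCollector.get();
    }
    return bits;
  }
}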

@jainankitk
Contributor

I'm skeptical that there is a measurable benefit (unless you have a lot of slices over a big segment).

As per my understanding, the slice generation logic is fairly aggressive. So even in the case of 4 or 8 slices for a segment, this change should reduce the operating memory by 4x or 8x for that segment.

The suggestion to create a synchronized scorerSupplier per segment is interesting. I was initially concerned about the synchronization overhead, but that happens just once per segment. Although, I feel a partitioned FixedBitSet would add even more value in that case: we can have the winner thread populate a partitioned FixedBitSet for each segment partition, and after it is done, those partitioned FixedBitSets can be processed concurrently by the collector for each partition, without worrying about synchronization with other threads. The primary additional overhead I can think of is for the winner thread to place each matching document into the correct FixedBitSet.

@msfroh
Contributor

msfroh commented Nov 14, 2025

those partitioned FixedBitSets can be processed concurrently by the collector for each partition, without worrying about synchronization with other threads.

Reading from a single FixedBitSet can be done by multiple threads with no synchronization. It's just reading from a long[].
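
For instance (a hedged sketch; iteration state lives in the BitSetIterator, not in the shared FixedBitSet, so each thread can have its own):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.FixedBitSet;

class PartitionReads {
  // Each partition creates its own iterator over the shared, read-only bits.
  static DocIdSetIterator forPartition(FixedBitSet shared, int minDocId) throws IOException {
    BitSetIterator it = new BitSetIterator(shared, shared.approximateCardinality());
    if (minDocId > 0) {
      it.advance(minDocId); // position at this partition's first set bit
    }
    return it;
  }
}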

@jainankitk
Contributor

Reading from a single FixedBitSet can be done by multiple threads with no synchronization. It's just reading from a long[].

I guess we can do that, but I am still not sure if it will seamlessly integrate into the existing abstractions on top of it. I was initially thinking about, say, the cost function of this iterator, but there seems to be an implementation for a specific docId range, cardinality(int from, int to). I am still concerned that a few other unknowns might pop up.

Also, from the performance perspective, even simple iteration over this long[] will be randomized, due to different threads accessing different parts of the array. So it might be more efficient to partition it into a long[][] where each row is accessed sequentially by one thread.
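
For the cost-function point above, a minimal sketch (assuming the cardinality(int from, int to) range overload on FixedBitSet mentioned in this comment):

import org.apache.lucene.util.FixedBitSet;

class PartitionCost {
  // A partition iterator's cost() could be the set-bit count of just its
  // own [minDocId, maxDocId) range of the shared segment-wide bitset.
  static long cost(FixedBitSet segmentBits, int minDocId, int maxDocId) {
    return segmentBits.cardinality(minDocId, maxDocId);
  }
}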

@prudhvigodithi
Contributor Author

Yes, #13745 is the issue for handling the duplicate work per segment. The main target of this PR is to reduce the size of the allocation per partition: without it, each partition thread today allocates a full segment-sized structure, so the number of allocations stays the same but the sizes are vastly different.

IMO this change should still be useful once we come up with a strategy to stop the duplicate work per segment (#13745).

Before I run the full benchmarks, I guess I can quickly test the final DocIdSet's ramBytesUsed? That should show the reduction with the partition-aware DocIdSetBuilder.

@msfroh
Contributor

msfroh commented Nov 14, 2025

Before I run the full benchmarks, I guess I can quickly test the final DocIdSet's ramBytesUsed? That should show the reduction with the partition-aware DocIdSetBuilder.

This doesn't need to be demonstrated by a test. It's obvious that if you have a segment with N docs and you split it into two partitions, allocating two arrays of N/2 bits each will use half the memory of two arrays of N bits each. Nobody is disputing the reduction in heap usage.

The question is whether the reduction in heap usage will have a measurable impact, which we can only see from benchmarks. Also, if we can reduce the number of tree traversals (i.e. only do one tree traversal per segment instead of per partition), then we would expect to see a performance benefit, since we're doing less work.

@prudhvigodithi
Contributor Author

if we can reduce the number of tree traversals (i.e. only do one tree traversal per segment instead of per partition), then we would expect to see a performance benefit, since we're doing less work.

Thanks @msfroh. True, we have to do that eventually for intra-segment search; my point is that this change just makes DocIdSetBuilder partition-aware so that it can be leveraged in PointRangeQuery.

Follow-up: similar to public DocIdSetBuilder(int maxDoc, PointValues values, int minDocId, int maxDocId), we should also have a partition-aware overload of public DocIdSetBuilder(int maxDoc, Terms terms).

@prudhvigodithi
Contributor Author

The question is whether the reduction in heap usage will have a measurable impact, which we can only see from benchmarks

Yes I'm playing with https://github.com/mikemccand/luceneutil/ (dealing with some setup issues) and will post the results.
