
Conversation

@k0ushal k0ushal commented May 30, 2025

  • Fixed the tier based candidate selection
  • Default tiers are powers of 4 with the first tier being 0-4M followed by 4-16M, 16-64M and so on.
  • Fixed consolidation window of size 4

Note

Replaces the tiered consolidation algorithm with a cleanup-first, skew-aware, templated selection engine and updates tests accordingly.

  • Index utils (core):
    • Introduce tier::ConsolidationConfig, SegmentAttributes, and templated ConsolidationCandidate with sliding-window, skew-based scoring.
    • Add findBestCleanupCandidate (prefers low live-doc% segments) and findBestConsolidationCandidate (size-based, skew-thresholded) helpers.
    • Wire new flow in ConsolidateTier policy: filter, early-exit on small sets, try cleanup candidates first, then consolidation; copy candidates via iterator range.
    • Move/factor helpers (FillFactor, SizeWithoutRemovals) and define tier::SegmentStats in header; add getSegmentDimensions accessor.
    • Note TODO on "too large segments" threshold formula.
  • API/Test adjustments:
    • Extend AssertCandidates to accept error message.
    • Add and rewrite tests for cleanup vs consolidation preference, singleton/threshold behavior, skew handling (including over-threshold no-merge), window pop/push, and combined live-percentage cases.
  • Misc:
    • Add default and test constructors to SegmentInfo for convenience/testing.

Written by Cursor Bugbot for commit ae1f202.

@k0ushal k0ushal self-assigned this May 30, 2025
@k0ushal k0ushal (Author) commented May 30, 2025

Documentation:
https://github.com/arangodb/documents/pull/145

@k0ushal k0ushal requested a review from alexbakharew May 30, 2025 08:09
@k0ushal k0ushal marked this pull request as draft July 9, 2025 08:18
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 57714c9 to c1e6ebb Compare July 11, 2025 12:34
@k0ushal k0ushal changed the base branch from master to bugfix/iresearch-address-table-tests July 14, 2025 09:03
@k0ushal k0ushal changed the base branch from bugfix/iresearch-address-table-tests to master July 14, 2025 09:04
@k0ushal k0ushal changed the base branch from master to bugfix/iresearch-address-table-tests July 14, 2025 09:05
@goedderz goedderz (Member) left a comment

Comments as we talked about. Looks good to me!

Comment on lines 51 to 57
mergeBytes += itrMeta->byte_size;
skew = static_cast<double>(itrMeta->byte_size) / mergeBytes;
delCount += (itrMeta->docs_count - itrMeta->live_docs_count);
mergeScore = skew + (1.0 / (1 + delCount));
cost = mergeBytes * mergeScore;

size_t size_before_consolidation = 0;
size_t size_after_consolidation = 0;
size_t size_after_consolidation_floored = 0;
for (auto& segment_stat : consolidation) {
  size_before_consolidation += segment_stat.meta->byte_size;
  size_after_consolidation += segment_stat.size;
  size_after_consolidation_floored +=
      std::max(segment_stat.size, floor_segment_bytes);
} while (itr++ != end);
Member:

Probably inconsequential, but it would suffice to calculate skew, mergeScore and cost once after the loop for the last element.

Comment on lines 90 to 92
size_t nextTier = ConsolidationConfig::tier1;
while (nextTier < num)
nextTier = nextTier << 2;
Member:

Minor: You could probably use std::countl_zero and get rid of the loop.

mergeBytes = mergeBytes - removeMeta->byte_size + addMeta->byte_size;
skew = static_cast<double>(addMeta->byte_size) / mergeBytes;
delCount = delCount - getDelCount(removeMeta) + getDelCount(addMeta);
mergeScore = skew + (1 / (1 + delCount));
Member:

As already discussed:

We should think about whether calculating the mergeScore this way is sensible. What seems strange is that while the skew is a ratio (of byte-sizes), the second summand is an inverse count. This seems off: intuitively I'd expect e.g. a ratio of live and total documents to be considered alongside the skew.

@goedderz goedderz (Member) commented Jul 15, 2025:

This is actually quite bad the way it is, worse than we noticed yesterday @k0ushal.

Note that $\mathrm{skew} \in (0, 1)$. With $\mathrm{delCount} = 1$, we get

$$\mathrm{mergeScore} = \mathrm{skew} + \frac{1}{1 + \mathrm{delCount}} = \mathrm{skew} + \frac{1}{2} \leq \frac{3}{2} = \mathrm{maxMergeScore}.$$

So this way we are always allowed to consolidate if only one document has been deleted, no matter the size of the files or number of documents therein.

Let us at least do

    mergeScore = skew + live_docs_count / total_docs_count;

instead, as discussed - this has more reasonable properties.

And as a second observation @neunhoef made today while discussing this: Adding these two values is probably not right, either. They should be multiplied instead; the maxMergeScore will need to be adjusted to 0.5 to get a similar effect.

So we should actually do

    mergeScore = skew * live_docs_count / total_docs_count;

(and adapt maxMergeScore).

To understand this better, we should still do some formal worst-case analysis and some tests (specifically unit tests of the consolidation algorithm that play out certain usage scenarios).

Comment on lines 162 to 241
for (auto idx = start; idx != sorted_segments.end();) {
  if (getSize(*idx) <= currentTier) {
    idx++;
    continue;
  }

  tiers.emplace_back(start, idx - 1);

  // The next tier may not necessarily be in the
  // next power of 4.
  // Consider this example,
  //   [2, 4, 6, 8, 900]
  // While the 2, 4 fall in the 0-4 tier and 6, 8 fall
  // in the 4-16 tier, the last segment falls in
  // the [256-1024] tier.

  currentTier = getConsolidationTier(getSize(*idx));
  start = idx++;
}
Member:

As discussed: finding the tier-boundaries could be done by binary search, possibly utilizing std::lower_bound / std::upper_bound.

@k0ushal k0ushal force-pushed the bugfix/iresearch-address-table-tests branch from 07286d8 to 872d553 Compare July 16, 2025 19:35
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch 2 times, most recently from fb73fcd to f6305e3 Compare July 17, 2025 07:34
@k0ushal k0ushal deleted the branch master July 18, 2025 07:56
@k0ushal k0ushal closed this Jul 18, 2025
@goedderz goedderz reopened this Jul 23, 2025
@goedderz goedderz changed the base branch from bugfix/iresearch-address-table-tests to master July 23, 2025 13:02
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from f6305e3 to 21a2f95 Compare July 23, 2025 13:05
k0ushal added 2 commits July 24, 2025 15:42
- Fixed the tier based candidate selection
- Default tiers are powers of 4 with the first tier
being 0-4M followed by 4-16M, 16-64M and so on.
- Fixed consolidation window of size 4
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 21a2f95 to d91b909 Compare July 24, 2025 15:43
@k0ushal k0ushal requested a review from goedderz August 25, 2025 07:49
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 79070ae to 5165f01 Compare August 26, 2025 07:41
@k0ushal k0ushal force-pushed the bugfix/consolidation-issues branch from 5165f01 to 9cfc1fc Compare August 26, 2025 07:56
@k0ushal k0ushal marked this pull request as ready for review August 26, 2025 11:36
uint64_t& docs_count,
uint64_t& live_docs_count) {

auto itrMeta = itr->meta;
Member:

Personally, I have a slight preference towards the following, but feel free to keep it as is if you prefer it that way:

Suggested change:
-auto itrMeta = itr->meta;
+auto* itrMeta = itr->meta;

Author:

auto* is definitely more appropriate, since we expect itr->meta to always be a pointer.
Changed it.

Comment on lines 96 to 100
void getSegmentDimensions(
    std::vector<tier::SegmentStats>::const_iterator itr,
    uint64_t& byte_size,
    uint64_t& docs_count,
    uint64_t& live_docs_count);
Member:

Just curious, why did you choose return-parameters instead of a product type (tuple or struct)? Due to existing code style?

Author:

Changed this to struct.

Comment on lines 142 to 145
const auto removeSegment = first();
const auto lastSegment = last();

std::advance(segments.first, 1);
Member:

Should this method get a check or assertion that segments is a non-empty range?

Author:

ConsolidationCandidate only receives the first and last iterators. Previously it didn't play much of a role in deciding the best candidate; it only represented a candidate, and all the decision making was done outside this class.
That is why I left it as the caller's responsibility to ensure that the std::advance operation won't trigger an assertion failure.
I've added a note to the function header.

Comment on lines 168 to 170
const auto addSegment = segments.second + 1;

std::advance(segments.second, 1);
Member:

Should this method get an assertion, checking we have enough space?

Author:

I've left it to be the caller's responsibility to check that before calling push_back() or pop_front().
The reason being that ConsolidationCandidate was designed to only receive the first and last segment iterators by the predecessors. It doesn't get the full sorted_segments vector.
I'll add some documentation to the function.

Comment on lines 197 to 198
template<typename Segment>
bool findBestCleanupCandidate(
Member:

This is only ever used with Segment = tier::SegmentStats if I'm not mistaken; does it need to be a template?

Author:

TBH, I'm conflicted about this myself. I templatized the function to make writing tests easier. For instance, findBestCleanupCandidate() is only concerned with the docs_count and live_docs_count attributes of the segment; we shouldn't have to initialize and pass the entire SegmentStats struct, which comprises a nested hierarchy of structs. So I templatized this function and added an accessor method argument to make it easier to use and to achieve decoupling.

But on the other hand, there is an AddSegment() method in the tests that sets up the complex SegmentStats structure.
Perhaps you can make this decision for me. I can see tradeoffs on both sides.

Comment on lines 308 to 309
template<typename Segment>
bool findBestConsolidationCandidate1(
Member:

Is this a leftover that should be deleted?

Author:

Yes, it was. Sorry about that.
Removed it.

Comment on lines 268 to 269
// sort segments in increasing order of the segment byte size
std::sort(sorted_segments.begin(), sorted_segments.end());
Member:

I don't have a final (nor a strong) opinion on this one; but now that we're using different segment orders in different functions, should we still keep the size-order as the default one via operator<, or should we rather pass an explicit comparison function here as well and remove < from SegmentStats? WDYT? I'm also fine with just leaving it as it is regardless, it's not a real issue either way.

Author:

You're right. It made sense to have the operator < in the past. I've removed it now from SegmentStats.

Comment on lines 288 to 291
continue;
}

if (candidate.mergeScore > prev_score ||
Member:

Nit-pick, for consistency:

Suggested change:
-    continue;
-  }
-
-  if (candidate.mergeScore > prev_score ||
+  } else if (candidate.mergeScore > prev_score ||


while ((candidate.first() + 1) < sorted_segments.end()) {

  if (!best.initialized || (best.mergeScore > candidate.mergeScore && candidate.mergeBytes <= max_segments_bytes))
Member:

best will possibly be initialized with an invalid candidate (that violates the size limit). This will later prevent valid candidates from being selected if they have a worse score.

So I suggest

Suggested change:
-if (!best.initialized || (best.mergeScore > candidate.mergeScore && candidate.mergeBytes <= max_segments_bytes))
+if (candidate.mergeBytes <= max_segments_bytes && (!best.initialized || best.mergeScore > candidate.mergeScore))

Member:

And I think the candidates checked here can also be below the min window size, though I haven't checked whether this can cause a problem or not.

Comment on lines 281 to 302
while ((candidate.first() + 1) < sorted_segments.end()) {
  if (!best.initialized || (best.mergeScore > candidate.mergeScore &&
                            candidate.mergeBytes <= max_segments_bytes))
    best = candidate;

  if (std::distance(candidate.first(), candidate.last()) < (minWindowSize - 1)) {
    candidate.push_back();
    continue;
  }

  if (candidate.mergeScore > prev_score ||
      candidate.mergeBytes > max_segments_bytes ||
      candidate.last() == (sorted_segments.end() - 1)) {
    prev_score = candidate.mergeScore;
    candidate.pop_front();
  } else if (candidate.mergeScore <= prev_score &&
             candidate.last() < (sorted_segments.end() - 1) &&
             candidate.mergeBytes <= max_segments_bytes) {
    prev_score = candidate.mergeScore;
    candidate.push_back();
  }
}
Member:

I don't quite understand the implementation (which may just be me). I've tried to consolidate it with my own picture of the same algorithm, which goes very roughly like this:

auto left = sorted_segments.begin();
auto best = nullopt;

for(auto right = sorted_segments.begin() + 1; right < sorted_segments.end(); ++right) {
  // shrink candidate set from the left until the size limit is undercut,
  // or until there are only two segments left
  while(estimatedSize(left, right) > maxSize && left + 1 < right) {
    ++left;
  }
  if (estimatedSize(left, right) > maxSize) {
    assert(left + 1 == right);
    // no more valid candidates possible due to size
    break;
  }
  if (!best || skew(best) > skew(left, right)) {
    best = (left, right);
  }
}

I'm uncertain what you're using prev_score for, and relatedly haven't quite grasped when the front or back of the candidate are moved.

Member:

As discussed and to document it, the condition for the best candidate selection in the above algorithm is incorrect. It rather needs to be

if (skew(left, right) <= skew_threshhold && (!best || estimatedSize(best) < estimatedSize(left, right))) {
  best = (left, right);
}

@cursor cursor bot left a comment

This PR is being reviewed by Cursor Bugbot



struct ConsolidationConfig {
  static constexpr size_t candidate_size { 2 }; // candidate selection window size: 4
  static constexpr double maxMergeScore { 0.4 }; // max score allowed for candidates consolidation.

Bug: Candidate Size Mismatch Causes Suboptimal Consolidation

The candidate_size constant is set to 2, but its comment and the PR description indicate an intended value of 4. This means the consolidation algorithm uses a minimum candidate selection window of 2 instead of the intended 4, which may lead to suboptimal consolidation decisions.


uint64_t minWindowSize { tier::ConsolidationConfig::candidate_size };
auto front = segments.begin();
auto rear = front + minWindowSize - 1;
tier::ConsolidationCandidate<Segment> candidate(front, rear, getSegmentAttributes);

Bug: Template Function Fails Vector Size Validation

The findBestConsolidationCandidate template function doesn't validate that the segments vector has at least minWindowSize (2) elements. If the vector is smaller, an invalid iterator is passed to the ConsolidationCandidate constructor, causing undefined behavior when its loop dereferences it. This affects direct calls, such as in tests, that may lack external size checks.

