Consolidation issue fix. #7
Conversation
Force-pushed from 57714c9 to c1e6ebb
Comments as we talked about. Looks good to me!
core/utils/index_utils.cpp (Outdated)

```cpp
  mergeBytes += itrMeta->byte_size;
  skew = static_cast<double>(itrMeta->byte_size) / mergeBytes;
  delCount += (itrMeta->docs_count - itrMeta->live_docs_count);
  mergeScore = skew + (1.0 / (1 + delCount));
  cost = mergeBytes * mergeScore;

  // ...

  size_t size_before_consolidation = 0;
  size_t size_after_consolidation = 0;
  size_t size_after_consolidation_floored = 0;
  for (auto& segment_stat : consolidation) {
    size_before_consolidation += segment_stat.meta->byte_size;
    size_after_consolidation += segment_stat.size;
    size_after_consolidation_floored +=
      std::max(segment_stat.size, floor_segment_bytes);
  }
} while (itr++ != end);
```
Probably inconsequential, but it would suffice to calculate `skew`, `mergeScore`, and `cost` once after the loop, for the last element.
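A rough sketch of that change, reusing the names from the snippet above (the do-while shape and `end` pointing at the last included segment are assumptions, not the PR's code):

```cpp
// Accumulate only the running sums inside the loop...
do {
  auto* itrMeta = itr->meta;
  mergeBytes += itrMeta->byte_size;
  delCount += (itrMeta->docs_count - itrMeta->live_docs_count);
} while (itr++ != end);

// ...and derive the scores once, from the final sums.
auto* lastMeta = end->meta;  // assumed: `end` is the last included segment
skew = static_cast<double>(lastMeta->byte_size) / mergeBytes;
mergeScore = skew + (1.0 / (1 + delCount));
cost = mergeBytes * mergeScore;
```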
core/utils/index_utils.cpp (Outdated)

```cpp
size_t nextTier = ConsolidationConfig::tier1;
while (nextTier < num)
  nextTier = nextTier << 2;
```
Minor: you could probably use `std::countl_zero` and get rid of the loop.
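A loop-free sketch, assuming (as in the snippet) that `ConsolidationConfig::tier1` is a power of two and tiers grow by factors of 4:

```cpp
#include <bit>
#include <cstdint>

// Smallest tier boundary tier1 * 4^k with tier1 * 4^k >= num,
// computed from the leading zeros instead of the shift loop.
// Assumes num >= 1 and tier1 is a power of two.
uint64_t nextTier(uint64_t num, uint64_t tier1) {
  if (num <= tier1) {
    return tier1;
  }
  const uint64_t r = (num - 1) / tier1;         // r >= 1 here
  const unsigned b = 63 - std::countl_zero(r);  // floor(log2(r))
  const unsigned shift = (b / 2 + 1) * 2;       // round up to an even shift
  return tier1 << shift;
}
```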
core/utils/index_utils.cpp (Outdated)

```cpp
mergeBytes = mergeBytes - removeMeta->byte_size + addMeta->byte_size;
skew = static_cast<double>(addMeta->byte_size) / mergeBytes;
delCount = delCount - getDelCount(removeMeta) + getDelCount(addMeta);
mergeScore = skew + (1 / (1 + delCount));
```
As already discussed: we should think about whether calculating the `mergeScore` this way is sensible. What seems strange is that while the skew is a ratio (of byte sizes), the second summand is an inverse count. This seems off: intuitively I'd expect e.g. a ratio of live to total documents to be considered alongside the skew.
This is actually quite bad the way it is, worse than we noticed yesterday @k0ushal.
Note that `1 / (1 + delCount)` is evaluated in integer arithmetic, so the second summand is `0` as soon as `delCount >= 1`.
So this way we are always allowed to consolidate if only one document has been deleted, no matter the size of the files or the number of documents therein.
Let us at least do `mergeScore = skew + live_docs_count / total_docs_count;` instead, as discussed - this has more reasonable properties.
And as a second observation @neunhoef made today while discussing this: adding these two values is probably not right, either. They should be multiplied instead; the `maxMergeScore` will need to be adjusted to `0.5` to get a similar effect. So we should actually do `mergeScore = skew * live_docs_count / total_docs_count;` (and adapt `maxMergeScore`).
To understand this better, we should still do some formal worst-case analysis and some tests (specifically unit tests of the consolidation algorithm that play out certain usage scenarios).
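For concreteness, a sketch of the proposed multiplicative score; the names follow the snippets above, and the explicit casts avoid the integer-division pitfall noted here:

```cpp
#include <cstdint>

// Proposed scoring from the discussion; not the PR's current code.
double computeMergeScore(uint64_t largest_segment_bytes,  // skew numerator
                         uint64_t merge_bytes,            // total bytes in the window
                         uint64_t live_docs_count,
                         uint64_t total_docs_count) {
  const double skew =
    static_cast<double>(largest_segment_bytes) / merge_bytes;
  const double live_ratio =
    static_cast<double>(live_docs_count) / total_docs_count;
  // multiplied rather than added; maxMergeScore would be adapted
  // accordingly (e.g. to 0.5) to keep a comparable threshold
  return skew * live_ratio;
}
```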
core/utils/index_utils.hpp (Outdated)

```cpp
for (auto idx = start; idx != sorted_segments.end();) {
  if (getSize(*idx) <= currentTier) {
    idx++;
    continue;
  }

  tiers.emplace_back(start, idx - 1);

  // The next tier may not necessarily be in the
  // next power of 4.
  // Consider this example,
  // [2, 4, 6, 8, 900]
  // While the 2, 4 fall in the 0-4 tier and 6, 8 fall
  // in the 4-16 tier, the last segment falls in
  // the [256-1024] tier.

  currentTier = getConsolidationTier(getSize(*idx));
  start = idx++;
}
```
As discussed: finding the tier boundaries could be done by binary search, possibly utilizing `std::lower_bound` / `std::upper_bound`.
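A sketch of the binary-search variant on a plain vector of sorted sizes (the actual code iterates segment stats, so a projection or comparator would be needed there):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// First segment whose size exceeds the current tier's upper bound;
// everything in [sorted_sizes.begin(), tierEnd(...)) is in the tier.
std::vector<uint64_t>::const_iterator tierEnd(
    const std::vector<uint64_t>& sorted_sizes, uint64_t tierUpperBound) {
  return std::upper_bound(sorted_sizes.begin(), sorted_sizes.end(),
                          tierUpperBound);
}
```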
Force-pushed from 07286d8 to 872d553
Force-pushed from fb73fcd to f6305e3
Force-pushed from f6305e3 to 21a2f95

- Fixed the tier-based candidate selection.
- Default tiers are powers of 4, with the first tier being 0-4M, followed by 4-16M, 16-64M, and so on.
- Fixed the consolidation window at size 4.

Force-pushed from 21a2f95 to d91b909

Changed consolidation config defaults
Disabled irrelevant tests

Force-pushed from 79070ae to 5165f01
Force-pushed from 5165f01 to 9cfc1fc
core/utils/index_utils.cpp (Outdated)

```cpp
uint64_t& docs_count,
uint64_t& live_docs_count) {

  auto itrMeta = itr->meta;
```
Personally, I have a slight preference towards

```diff
- auto itrMeta = itr->meta;
+ auto* itrMeta = itr->meta;
```

but feel free to keep it as is if you prefer it that way.
`auto*` definitely is more appropriate, since we are expecting `itr->meta` always to be a pointer.
Changed it.
core/utils/index_utils.hpp (Outdated)

```cpp
void getSegmentDimensions(
  std::vector<tier::SegmentStats>::const_iterator itr,
  uint64_t& byte_size,
  uint64_t& docs_count,
  uint64_t& live_docs_count);
```
Just curious, why did you choose return-parameters instead of a product type (tuple or struct)? Due to existing code style?
Changed this to a struct.
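For illustration, the struct-returning shape might look roughly like this; the field names are taken from the former out-parameters, and the body is an assumption based on the `itr->meta` snippet above:

```cpp
#include <cstdint>
#include <vector>

// Hypothetical product type replacing the three out-parameters.
struct SegmentDimensions {
  uint64_t byte_size;
  uint64_t docs_count;
  uint64_t live_docs_count;
};

SegmentDimensions getSegmentDimensions(
    std::vector<tier::SegmentStats>::const_iterator itr) {
  const auto* meta = itr->meta;
  return {meta->byte_size, meta->docs_count, meta->live_docs_count};
}
```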
core/utils/index_utils.hpp (Outdated)

```cpp
const auto removeSegment = first();
const auto lastSegment = last();

std::advance(segments.first, 1);
```
Should this method get a check or assertion that `segments` is a non-empty range?
ConsolidationCandidate only gets the first and last pointers. Previously it didn't play much of a role in deciding the best candidate: it only represented a candidate, and all the decision making was done outside of this class.
That is why I left it as the caller's responsibility to ensure that the std::advance operation won't trip an assertion.
I've added a note to the function header.
core/utils/index_utils.hpp (Outdated)

```cpp
const auto addSegment = segments.second + 1;

std::advance(segments.second, 1);
```
Should this method get an assertion, checking we have enough space?
I've left it as the caller's responsibility to check that before calling push_back() or pop_front().
The reason is that ConsolidationCandidate was designed to only receive the first and last segment iterators; it doesn't get the full sorted_segments vector.
I'll add some documentation to the function.
core/utils/index_utils.hpp (Outdated)

```cpp
template<typename Segment>
bool findBestCleanupCandidate(
```
This is only ever used with `Segment = tier::SegmentStats`, if I'm not mistaken; does it need to be a template?
TBH, I'm conflicted about this myself. I templatized the function to make writing tests easier. For instance, findBestCleanupCandidate() is only concerned with the `docs_count` and `live_docs_count` attributes of the segment; we shouldn't have to initialize and pass the entire SegmentStats struct, which comprises a nested hierarchy of structs. So I templatized this function and added an accessor-method argument to make using it easier and to achieve decoupling.
But on the other hand, there is an AddSegment() method in the tests that does the setting up of the complex SegmentStats structure.
Perhaps you can make this decision for me; I can see tradeoffs on both sides.
core/utils/index_utils.hpp (Outdated)

```cpp
template<typename Segment>
bool findBestConsolidationCandidate1(
```
Is this a leftover that should be deleted?
Yes, it was. Sorry about that.
Removed it.
core/utils/index_utils.hpp (Outdated)

```cpp
// sort segments in increasing order of the segment byte size
std::sort(sorted_segments.begin(), sorted_segments.end());
```
I don't have a final (nor a strong) opinion on this one; but now that we're using different segment orders in different functions, should we still keep the size order as the default one via `operator<`, or should we rather pass an explicit comparison function here as well and remove `<` from `SegmentStats`? WDYT? I'm also fine with just leaving it as it is regardless; it's not a real issue either way.
You're right. It made sense to have the `operator<` in the past. I've removed it from SegmentStats now.
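For reference, a sketch of the call site with an explicit comparator instead of the removed `operator<` (assuming the byte size is reachable via `meta->byte_size`, as in the other snippets):

```cpp
// sort segments in increasing order of the segment byte size,
// without relying on SegmentStats::operator<
std::sort(sorted_segments.begin(), sorted_segments.end(),
          [](const tier::SegmentStats& lhs, const tier::SegmentStats& rhs) {
            return lhs.meta->byte_size < rhs.meta->byte_size;
          });
```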
core/utils/index_utils.hpp (Outdated)

```cpp
  continue;
}

if (candidate.mergeScore > prev_score ||
```
nit-pick, for consistency:

```diff
   continue;
-}
-if (candidate.mergeScore > prev_score ||
+} else if (candidate.mergeScore > prev_score ||
```
core/utils/index_utils.hpp (Outdated)

```cpp
while ((candidate.first() + 1) < sorted_segments.end()) {

  if (!best.initialized || (best.mergeScore > candidate.mergeScore && candidate.mergeBytes <= max_segments_bytes))
```
`best` will possibly be initialized with an invalid candidate (one that violates the size limit). This will later prevent valid candidates from being selected if they have a worse score.
So I suggest:

```diff
-if (!best.initialized || (best.mergeScore > candidate.mergeScore && candidate.mergeBytes <= max_segments_bytes))
+if (candidate.mergeBytes <= max_segments_bytes && (!best.initialized || best.mergeScore > candidate.mergeScore))
```
And I think the candidates checked here can also be below the min window size, though I haven't checked whether this can cause a problem or not.
core/utils/index_utils.hpp (Outdated)

```cpp
while ((candidate.first() + 1) < sorted_segments.end()) {

  if (!best.initialized || (best.mergeScore > candidate.mergeScore && candidate.mergeBytes <= max_segments_bytes))
    best = candidate;

  if (std::distance(candidate.first(), candidate.last()) < (minWindowSize - 1)) {
    candidate.push_back();
    continue;
  }

  if (candidate.mergeScore > prev_score ||
      candidate.mergeBytes > max_segments_bytes ||
      candidate.last() == (sorted_segments.end() - 1)) {
    prev_score = candidate.mergeScore;
    candidate.pop_front();
  }
  else if (candidate.mergeScore <= prev_score && candidate.last() < (sorted_segments.end() - 1) &&
           candidate.mergeBytes <= max_segments_bytes) {
    prev_score = candidate.mergeScore;
    candidate.push_back();
  }
}
```
I don't quite understand the implementation (which may just be me). I've tried to reconcile it with my own picture of the same algorithm, which goes very roughly like this:

```cpp
auto left = sorted_segments.begin();
auto best = std::nullopt;
for (auto right = sorted_segments.begin() + 1; right < sorted_segments.end(); ++right) {
  // shrink candidate set from the left until the size limit is undercut,
  // or until there are only two segments left
  while (estimatedSize(left, right) > maxSize && left + 1 < right) {
    ++left;
  }
  if (estimatedSize(left, right) > maxSize) {
    assert(left + 1 == right);
    // no more valid candidates possible due to size
    break;
  }
  if (!best || skew(best) > skew(left, right)) {
    best = {left, right};
  }
}
```
I'm uncertain what you're using `prev_score` for, and relatedly haven't quite grasped when the front or back of the candidate are moved.
As discussed, and to document it: the condition for the best-candidate selection in the above algorithm is incorrect. It rather needs to be

```cpp
if (skew(left, right) <= skew_threshold && (!best || estimatedSize(best) < estimatedSize(left, right))) {
  best = (left, right);
}
```
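Putting the two sketches together, the corrected selection loop would look roughly like this; `SizeFn` and `SkewFn` stand in for the `estimatedSize`/`skew` helpers and are assumptions, not the PR's code:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

// Window loop with the corrected condition: candidates must respect the
// skew threshold, and among those we prefer the one covering the most bytes.
template <typename It, typename SizeFn, typename SkewFn>
std::optional<std::pair<It, It>> findBestCandidate(
    It begin, It end, uint64_t maxSize, double skewThreshold,
    SizeFn estimatedSize, SkewFn skew) {
  std::optional<std::pair<It, It>> best;
  auto left = begin;
  for (auto right = begin + 1; right < end; ++right) {
    // shrink from the left until the size limit is undercut,
    // or until only two segments remain
    while (estimatedSize(left, right) > maxSize && left + 1 < right) {
      ++left;
    }
    if (estimatedSize(left, right) > maxSize) {
      assert(left + 1 == right);
      break;  // with sorted sizes, no later window can fit either
    }
    if (skew(left, right) <= skewThreshold &&
        (!best || estimatedSize(best->first, best->second) <
                      estimatedSize(left, right))) {
      best = std::make_pair(left, right);
    }
  }
  return best;
}
```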
```cpp
struct ConsolidationConfig {
  static constexpr size_t candidate_size { 2 };   // candidate selection window size: 4
  static constexpr double maxMergeScore { 0.4 };  // max score allowed for candidates consolidation.
```
Bug: Candidate Size Mismatch Causes Suboptimal Consolidation
The `candidate_size` constant is set to `2`, but its comment and the PR description indicate an intended value of `4`. This means the consolidation algorithm uses a minimum candidate-selection window of `2` instead of the intended `4`, which may lead to suboptimal consolidation decisions.
```cpp
uint64_t minWindowSize { tier::ConsolidationConfig::candidate_size };
auto front = segments.begin();
auto rear = front + minWindowSize - 1;
tier::ConsolidationCandidate<Segment> candidate(front, rear, getSegmentAttributes);
```
Bug: Template Function Fails Vector Size Validation
The `findBestConsolidationCandidate` template function doesn't validate that the `segments` vector has at least `minWindowSize` (2) elements. If the vector is smaller, an invalid iterator is passed to the `ConsolidationCandidate` constructor, causing undefined behavior when its loop dereferences it. This affects direct calls, such as in tests, that may lack external size checks.
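A minimal sketch of the missing guard, reusing the names from the snippet above (the `false` return is an assumption about the function's convention):

```cpp
// Hypothetical early-out before constructing the candidate window:
if (segments.size() < minWindowSize) {
  return false;  // too few segments to form a minimum window
}
```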
Note
Replaces the tiered consolidation algorithm with a cleanup-first, skew-aware, templated selection engine and updates tests accordingly.

- Introduce `tier::ConsolidationConfig`, `SegmentAttributes`, and a templated `ConsolidationCandidate` with sliding-window, skew-based scoring.
- Add `findBestCleanupCandidate` (prefers low live-doc% segments) and `findBestConsolidationCandidate` (size-based, skew-thresholded) helpers.
- Rework the `ConsolidateTier` policy: filter, early-exit on small sets, try cleanup candidates first, then consolidation; copy candidates via iterator range.
- Move `FillFactor` and `SizeWithoutRemovals`, and define `tier::SegmentStats` in the header; add a `getSegmentDimensions` accessor.
- Update `AssertCandidates` to accept an error message.
- Add `SegmentInfo` for convenience/testing.

Written by Cursor Bugbot for commit ae1f202.