Description
Problem
MMA's WriteBandwidth (WB) normalization divides raw bytes/s by 128 KiB and truncates to an integer (load.go:565-567).
On clusters with low store-level WriteBandwidth, this normalization causes stores with similar WB to have very different load summaries. For example, a store at 920 KB/s (normalized load=7) and one at 910 KB/s (normalized load=6) are 16.7% apart in fractionAbove despite a roughly 1% difference in actual throughput. With the 10% mean-fraction threshold, load=7 is overloadSlow while load=6 is loadNormal.
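The arithmetic behind this example can be reproduced with a minimal sketch. The names `quantize` and `fractionAbove` are hypothetical stand-ins rather than the functions in load.go, and the `fractionAbove` formula (load divided by a reference value, minus one) is an assumption about how the comparison is computed. KB is taken as decimal (1000 bytes), which is what makes 920 KB/s and 910 KB/s land on either side of 7 × 128 KiB.

```go
package main

import "fmt"

// quantizationUnit is the 128 KiB divisor described in the issue.
const quantizationUnit = 128 << 10 // 131072 bytes

// quantize is a hypothetical stand-in for the normalization: Go integer
// division truncates, so any remainder below 128 KiB is discarded.
func quantize(bytesPerSec int64) int64 {
	return bytesPerSec / quantizationUnit
}

// fractionAbove is an assumed definition: how far load sits above a
// reference value, as a fraction of that reference.
func fractionAbove(load, ref float64) float64 {
	return load/ref - 1
}

func main() {
	a, b := int64(920_000), int64(910_000) // bytes/s
	qa, qb := quantize(a), quantize(b)
	fmt.Printf("quantized loads: %d and %d\n", qa, qb)
	fmt.Printf("raw gap:       %.1f%%\n", 100*fractionAbove(float64(a), float64(b)))
	fmt.Printf("quantized gap: %.1f%%\n", 100*fractionAbove(float64(qa), float64(qb)))
}
```

Under these assumptions the raw gap is about 1.1% while the quantized gap is about 16.7%, which is the cliff described above.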
This has two consequences:
- Stores are falsely classified as overloaded, entering the shedding pool every tick despite having nearly identical throughput to their peers.
- Moves can't fix the overload. When per-range WB is small relative to the 128 KiB quantization unit, shedding a range doesn't change the store's integer-truncated load level, and it is essentially impossible to arrive at a state where no stores are overloaded on WB, regardless of where replicas are moved.
This can result in indefinite thrashing: MMA continuously moves replicas, each incurring raft snapshot and disk I/O cost, with zero WB benefit. On a test cluster, this produced 13,260 successful but useless replica moves over 6 hours. See write-up for the full investigation.
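To illustrate why individual moves rarely change the quantized load, here is a rough sketch that counts how many ranges a store has to shed before its truncated level drops by one. The helper `movesUntilLevelDrops`, the store throughputs, and the 2 KiB/s per-range WB are all hypothetical values chosen for illustration; they are not taken from the test cluster.

```go
package main

import "fmt"

// quantizationUnit is the 128 KiB divisor described in the issue.
const quantizationUnit = 128 << 10

// movesUntilLevelDrops is a hypothetical helper: it sheds ranges of
// perRangeBytes from a store until the store's integer-truncated level
// differs from its starting level, and returns the number of sheds.
func movesUntilLevelDrops(storeBytes, perRangeBytes int64) int {
	if perRangeBytes <= 0 {
		return 0 // nothing to shed; avoid looping forever
	}
	start := storeBytes / quantizationUnit
	moves := 0
	for storeBytes > 0 && storeBytes/quantizationUnit == start {
		storeBytes -= perRangeBytes
		moves++
	}
	return moves
}

func main() {
	const perRange = 2 << 10 // assumed 2 KiB/s of WB per range
	for _, store := range []int64{920_000, 1_040_000} {
		fmt.Printf("store=%d B/s moves=%d\n",
			store, movesUntilLevelDrops(store, perRange))
	}
}
```

In this sketch the number of moves needed depends on where the store sits inside its 128 KiB bucket rather than on how far its throughput is from its peers, and every move before the last one leaves the store's level unchanged. The sketch ignores the receiving store, whose level can be pushed across a boundary in the other direction.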
Possible approaches
We should address this related TODO:
`cockroach/pkg/kv/kvserver/allocator/mmaprototype/load.go`, lines 606 to 607 in 83279e5:

    // TODO(sumeer): consider adding a summaryUpperBound for small
    // WriteBandwidth values too.
In addition/instead, we could consider:
- Using a much smaller divisor (e.g., 8 KiB) so that per-range deltas meaningfully change the load level
- Performing the `fractionAbove` comparison on raw bytes/s, keeping quantization only for display or bucketing. This eliminates the cliff entirely.
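The two alternatives above can be compared against the current divisor with a small sketch. Everything here is an assumption made for illustration: the `overloaded` helper, the use of a floating-point mean, the five-store cluster, and the rule that a store is overloaded when it exceeds the mean by more than the 10% threshold. A divisor of 1 stands in for comparing raw bytes/s.

```go
package main

import "fmt"

// overloadThreshold is the 10% mean-fraction threshold from the issue.
const overloadThreshold = 0.10

// overloaded is a hypothetical classifier: it quantizes each raw load by
// divisor (1 means raw bytes/s), then flags stores whose level exceeds
// the mean level by more than overloadThreshold.
func overloaded(rawLoads []int64, divisor int64) []bool {
	levels := make([]float64, len(rawLoads))
	var sum float64
	for i, l := range rawLoads {
		levels[i] = float64(l / divisor) // integer division truncates
		sum += levels[i]
	}
	mean := sum / float64(len(levels))
	out := make([]bool, len(levels))
	for i, l := range levels {
		out[i] = mean > 0 && l/mean-1 > overloadThreshold
	}
	return out
}

func main() {
	// Assumed cluster: one store at 920 KB/s, four peers at 910 KB/s.
	loads := []int64{920_000, 910_000, 910_000, 910_000, 910_000}
	for _, d := range []int64{128 << 10, 8 << 10, 1} {
		fmt.Printf("divisor=%d overloaded=%v\n", d, overloaded(loads, d))
	}
}
```

Under these assumptions only the 128 KiB divisor flags the 920 KB/s store. A smaller divisor shrinks the cliff but keeps one at every bucket boundary, whereas the raw comparison has none.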
Epic: CRDB-56265
Jira issue: CRDB-60867