Add cgroup v2 QoS support for RBD volumes #6274
7cac3cd to d3126aa
|
/test ci/centos/mini-e2e/k8s-1.35/rbd |
/test ci/centos/mini-e2e/k8s-1.33/rbd |
|
/test ci/centos/mini-e2e/k8s-1.34/rbd |
Implement cgroup v2 based QoS for krbd by applying io.max limits to container cgroups. This addresses the limitation that rbd-nbd QoS doesn't work with krbd.
Key features:
- Parse VolumeAttributesClass parameters (MaxReadIOPS, MaxWriteIOPS, MaxReadBytesPerSecond, MaxWriteBytesPerSecond)
- Discover pod cgroup path based on pod UID and QoS class
- Find all container cgroups within the pod
- Apply io.max limits to each container's cgroup
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
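As a rough illustration of the last step, the four VolumeAttributesClass parameters map onto the `rbps`, `wbps`, `riops` and `wiops` keys of a cgroup v2 `io.max` entry. The sketch below is not the PR's code: the `ioLimits` type and `ioMaxLine` function are hypothetical names, and treating a zero value as "no limit" (rendered as `max`) is an assumption of this example.

```go
package main

import (
	"fmt"
	"strings"
)

// ioLimits holds the four limits named in the commit message.
// Hypothetical type; a zero value means "no limit requested" in this sketch.
type ioLimits struct {
	MaxReadIOPS            uint64
	MaxWriteIOPS           uint64
	MaxReadBytesPerSecond  uint64
	MaxWriteBytesPerSecond uint64
}

// ioMaxLine renders one io.max entry for the block device major:minor.
func ioMaxLine(major, minor uint32, l ioLimits) string {
	// field prints "key=value", or "key=max" when no limit is set.
	field := func(key string, v uint64) string {
		if v == 0 {
			return key + "=max"
		}
		return fmt.Sprintf("%s=%d", key, v)
	}
	parts := []string{
		fmt.Sprintf("%d:%d", major, minor),
		field("rbps", l.MaxReadBytesPerSecond),
		field("wbps", l.MaxWriteBytesPerSecond),
		field("riops", l.MaxReadIOPS),
		field("wiops", l.MaxWriteIOPS),
	}
	return strings.Join(parts, " ")
}

func main() {
	limits := ioLimits{
		MaxReadIOPS:            1250,
		MaxWriteIOPS:           1250,
		MaxReadBytesPerSecond:  5242880,
		MaxWriteBytesPerSecond: 5242880,
	}
	// Prints: 251:80 rbps=5242880 wbps=5242880 riops=1250 wiops=1250
	fmt.Println(ioMaxLine(251, 80, limits))
}
```

The printed entry has the same shape as the `io.max:` lines in the test results posted on this PR.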
Add NodePublishSecretRef field to RBD configuration to support secret lookup during NodePublishVolume operation. This enables cgroup QoS to retrieve QoS metadata from RBD images.
Supports two secret sources:
1. StorageClass parameter: csi.storage.k8s.io/node-publish-secret-name
2. CSI ConfigMap fallback: rbd.nodePublishSecretRef
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Allow krbd volumes to use cgroup v2 QoS parameters alongside traditional rbd-nbd QoS.
ControllerModifyVolume now:
- Rejects requests mixing cgroup QoS and traditional NBD QoS
- Only enforces rbd-nbd mounter check for traditional QoS parameters
- Saves cgroup QoS parameters to image metadata for retrieval during NodePublishVolume
- Removes QoS metadata when VAC is removed or keys not present
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Apply cgroup v2 QoS limits during NodePublishVolume when:
- Pod UID is available (podInfoOnMount enabled)
- Cgroup QoS parameters are stored in image metadata
- Device path is available from staging metadata
Features:
- Retrieve QoS parameters from image metadata
- Get device major:minor number from mapped RBD device
- Apply io.max limits to the pod
- Support secret retrieval from StorageClass or CSI ConfigMap
- Non-blocking: QoS failure logs error but doesn't fail mount.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
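For the "device major:minor" step, one common approach on Linux is to stat the mapped device node and decode the `st_rdev` value. How the PR obtains the number is not shown on this page, so the sketch below only illustrates the decoding, using the standard Linux `dev_t` bit layout; `devMajor`, `devMinor` and `makeDev` are hypothetical helper names.

```go
package main

import "fmt"

// devMajor extracts the major number from a Linux dev_t value
// (for example the Rdev field returned by stat on /dev/rbd0).
func devMajor(dev uint64) uint32 {
	major := uint32((dev & 0x00000000000fff00) >> 8)
	major |= uint32((dev & 0xfffff00000000000) >> 32)
	return major
}

// devMinor extracts the minor number from a Linux dev_t value.
func devMinor(dev uint64) uint32 {
	minor := uint32(dev & 0x00000000000000ff)
	minor |= uint32((dev & 0x00000ffffff00000) >> 12)
	return minor
}

// makeDev is the inverse encoding, used here only to build sample input.
func makeDev(major, minor uint32) uint64 {
	dev := (uint64(major) & 0x00000fff) << 8
	dev |= (uint64(major) & 0xfffff000) << 32
	dev |= uint64(minor) & 0x000000ff
	dev |= (uint64(minor) & 0xffffff00) << 12
	return dev
}

func main() {
	dev := makeDev(251, 80)
	// Prints: 251:80
	fmt.Printf("%d:%d\n", devMajor(dev), devMinor(dev))
}
```

The resulting `major:minor` pair is the key that prefixes each `io.max` entry.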
Add configuration examples for cgroup v2 QoS support:
- New VolumeAttributesClass example demonstrating cgroup v2 QoS parameters
- Updated CSI ConfigMap sample with nodePublishSecretRef
The cgroup QoS VAC shows all four parameters:
- MaxReadIOPS: maximum read IOPS
- MaxWriteIOPS: maximum write IOPS
- MaxReadBytesPerSecond: maximum read bandwidth
- MaxWriteBytesPerSecond: maximum write bandwidth
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Update VolumeAttributesClass documentation to include:
- Overview of two QoS types (traditional and cgroup v2)
- Prerequisites for cgroup v2 QoS
- Configuration examples for both approaches
- Secret configuration options (StorageClass and ConfigMap)
- Instructions for removing QoS limits
- Explanation of how cgroup v2 QoS works
The documentation clearly distinguishes between traditional rbd-nbd QoS and new cgroup v2 QoS for krbd volumes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Add end-to-end tests for the cgroup v2 QoS feature that enforces I/O limits on RBD volumes using Kubernetes VolumeAttributesClass (VAC). This testing validates the complete lifecycle of QoS management including metadata storage, enforcement, and edge cases.
Test Coverage (13 scenarios):
Access Mode Coverage:
- RWO (ReadWriteOnce) filesystem with metadata validation
- RWO block with I/O enforcement testing using dd
- RWOP (ReadWriteOncePod) access mode validation
- RWX (ReadWriteMany) block with 3-replica deployment
- ROX (ReadOnlyMany) filesystem with clone and deployment
Critical Functionality:
- Multi-PVC pod with 3 volumes having different QoS tiers: validates the updateIOMaxForDevice() read-modify-write logic that prevents io.max file overwrites when multiple volumes share a pod's cgroup
- Snapshot/clone non-propagation: verifies .rbd.csi.ceph.com prefix prevents QoS metadata inheritance during clone/snapshot operations
VAC Lifecycle Operations:
- VAC modification (low tier → high tier)
- Partial VAC update (all parameters → IOPS only)
- VAC removal (validate complete metadata cleanup)
- Add VAC to existing PVC without QoS
Integration Testing:
- Volume expansion with QoS persistence
- Encrypted volumes with QoS compatibility
Helper Functions (rbd_helper.go):
- validateCgroupQoS(): validates RBD image metadata with cgroup QoS prefix
- testIOEnforcement(): tests I/O limits using dd with 20% tolerance
- createMultiPVCPod(): creates pods with multiple RBD volumes
- createKRBDStorageClassWithModifySecret(): SC setup for VAC testing
YAML Templates:
- Multi-PVC pod templates (filesystem and block modes)
- Three-tier VolumeAttributesClass templates (low/medium/high QoS)
The tests require Kubernetes >= 1.34 for VolumeAttributesClass support and cgroup v2 enabled on cluster nodes. All tests include cleanup validation using validateRBDImageCount() and validateOmapCount().
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
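The read-modify-write behaviour that the multi-PVC test targets can be pictured as a merge over the existing `io.max` contents: keep entries for other devices, replace the entry for the device being updated, and append it if it is new. The body of `updateIOMaxForDevice()` is not shown on this page, so the sketch below uses a hypothetical `mergeIOMax` helper that works on strings only and leaves the file I/O out.

```go
package main

import (
	"fmt"
	"strings"
)

// mergeIOMax returns the io.max contents that result from setting entry
// (for example "251:96 rbps=... wiops=...") while preserving the lines
// that belong to other devices. Hypothetical helper for illustration.
func mergeIOMax(existing, entry string) string {
	entryFields := strings.Fields(entry)
	if len(entryFields) == 0 {
		return existing // nothing to merge
	}
	dev := entryFields[0] // "major:minor" key of the entry being written

	var out []string
	replaced := false
	for _, line := range strings.Split(existing, "\n") {
		fields := strings.Fields(line)
		if len(fields) == 0 {
			continue // skip blank lines
		}
		if fields[0] == dev {
			out = append(out, entry) // update the matching device in place
			replaced = true
			continue
		}
		out = append(out, line) // keep other devices untouched
	}
	if !replaced {
		out = append(out, entry)
	}
	return strings.Join(out, "\n") + "\n"
}

func main() {
	current := "251:80 rbps=5242880 wbps=5242880 riops=1250 wiops=1250\n"
	next := mergeIOMax(current, "251:96 rbps=52428800 wbps=52428800 riops=12000 wiops=12000")
	fmt.Print(next)
}
```

With three volumes in one pod, calling the merge once per device leaves three entries in place rather than only the last one written.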
Add NodePublish secret to the SC created in the e2e testing.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
When a PVC is created with a VAC name already set, Kubernetes does not call ControllerModifyVolume. That RPC is only invoked when the VolumeAttributesClass is changed after creation.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
modifyPVCVolumeAttributesClass does not check for a nil pointer before accessing it.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Store the devicePath in NodeStageVolume irrespective of the mounter.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The container scope files are not created before the NodePublish RPC call, so QoS cannot be applied at the container level and must be applied at the pod level instead. Update the design to apply QoS at the pod level.
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
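To make the pod-level target concrete, the sketch below builds the pod cgroup directory from the pod UID and Kubernetes QoS class. It assumes the kubelet's systemd cgroup driver layout under a cgroup v2 mount; the cgroupfs driver uses a different layout, and the PR's actual discovery logic is not shown on this page. `podCgroupPath` is a hypothetical name.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// podCgroupPath returns the pod-level cgroup directory for the given pod UID
// and QoS class, assuming the systemd cgroup driver layout.
func podCgroupPath(root, podUID, qosClass string) (string, error) {
	// systemd unit names cannot contain "-" inside the UID, so the kubelet
	// replaces it with "_".
	uid := strings.ReplaceAll(podUID, "-", "_")

	switch qosClass {
	case "Guaranteed":
		// Guaranteed pods sit directly under kubepods.slice.
		return path.Join(root, "kubepods.slice", "kubepods-pod"+uid+".slice"), nil
	case "Burstable":
		return path.Join(root, "kubepods.slice", "kubepods-burstable.slice",
			"kubepods-burstable-pod"+uid+".slice"), nil
	case "BestEffort":
		return path.Join(root, "kubepods.slice", "kubepods-besteffort.slice",
			"kubepods-besteffort-pod"+uid+".slice"), nil
	default:
		return "", fmt.Errorf("unknown QoS class %q", qosClass)
	}
}

func main() {
	// The UID below is a made-up sample value.
	p, err := podCgroupPath("/sys/fs/cgroup", "0a1b-2c3d", "Burstable")
	if err != nil {
		panic(err)
	}
	// The io.max file to write would then be path.Join(p, "io.max").
	fmt.Println(p)
}
```

Because the pod slice exists before any container scope is created, a limit written there is already in place when the containers start.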
|
/test ci/centos/mini-e2e/k8s-1.34/rbd |
Test results
===================================================================
Step 5: Running Performance Tests
===================================================================
===================================================================
Testing Volume Mode: BLOCK
===================================================================
-------------------------------------------------------------------
Tier: BASELINE (No QoS Limits) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
No io.max (expected - no QoS)
[1/4] Sequential Read
33790.67 MiB/s
✓ Read Bandwidth: No limit (baseline)
[2/4] Sequential Write
328.92 MiB/s
✓ Write Bandwidth: No limit (baseline)
[3/4] Random Read IOPS
12693 IOPS
✓ Read IOPS: No limit (baseline)
[4/4] Random Write IOPS
2967 IOPS
✓ Write IOPS: No limit (baseline)
-------------------------------------------------------------------
Tier: BASELINE (No QoS Limits) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
No io.max (expected - no QoS)
[1/4] Sequential Read
30553.34 MiB/s
✓ Read Bandwidth: No limit (baseline)
[2/4] Sequential Write
305.72 MiB/s
✓ Write Bandwidth: No limit (baseline)
[3/4] Random Read IOPS
13939 IOPS
✓ Read IOPS: No limit (baseline)
[4/4] Random Write IOPS
2964 IOPS
✓ Write IOPS: No limit (baseline)
-------------------------------------------------------------------
Tier: BASELINE (No QoS Limits) - K8s QoS: GUARANTEED
-------------------------------------------------------------------
No io.max (expected - no QoS)
[1/4] Sequential Read
31278.91 MiB/s
✓ Read Bandwidth: No limit (baseline)
[2/4] Sequential Write
233.16 MiB/s
✓ Write Bandwidth: No limit (baseline)
[3/4] Random Read IOPS
19007 IOPS
✓ Read IOPS: No limit (baseline)
[4/4] Random Write IOPS
2963 IOPS
✓ Write IOPS: No limit (baseline)
-------------------------------------------------------------------
Tier: 5MB (5 MB/s, 1250 IOPS) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
io.max: 251:80 rbps=5242880 wbps=5242880 riops=1250 wiops=1250
[1/4] Sequential Read
5.01 MiB/s
✓ PASS: Read Bandwidth within limit (5262764 ≤ 6815744, 100.3% used)
[2/4] Sequential Write
5.50 MiB/s
✓ PASS: Write Bandwidth within limit (5771139 ≤ 6815744, 110.0% used)
[3/4] Random Read IOPS
1233 IOPS
✓ PASS: Read IOPS within limit (1233 ≤ 1625, 98.6% used)
[4/4] Random Write IOPS
1236 IOPS
✓ PASS: Write IOPS within limit (1236 ≤ 1625, 98.8% used)
-------------------------------------------------------------------
Tier: 5MB (5 MB/s, 1250 IOPS) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
io.max: 251:96 rbps=5242880 wbps=5242880 riops=1250 wiops=1250
[1/4] Sequential Read
4.99 MiB/s
✓ PASS: Read Bandwidth within limit (5238871 ≤ 6815744, 99.9% used)
[2/4] Sequential Write
6.51 MiB/s
✗ FAIL: Write Bandwidth EXCEEDED limit!
Expected: ≤5242880 (max with tolerance: 6815744)
Actual: 6833142 (130.3% of limit)
[3/4] Random Read IOPS
1233 IOPS
✓ PASS: Read IOPS within limit (1233 ≤ 1625, 98.6% used)
[4/4] Random Write IOPS
1234 IOPS
✓ PASS: Write IOPS within limit (1234 ≤ 1625, 98.7% used)
-------------------------------------------------------------------
Tier: 5MB (5 MB/s, 1250 IOPS) - K8s QoS: GUARANTEED
-------------------------------------------------------------------
io.max: 251:48 rbps=5242880 wbps=5242880 riops=1250 wiops=1250
[1/4] Sequential Read
5.01 MiB/s
✓ PASS: Read Bandwidth within limit (5262854 ≤ 6815744, 100.3% used)
[2/4] Sequential Write
6.24 MiB/s
✓ PASS: Write Bandwidth within limit (6548514 ≤ 6815744, 124.9% used)
[3/4] Random Read IOPS
1234 IOPS
✓ PASS: Read IOPS within limit (1234 ≤ 1625, 98.7% used)
[4/4] Random Write IOPS
1236 IOPS
✓ PASS: Write IOPS within limit (1236 ≤ 1625, 98.8% used)
-------------------------------------------------------------------
Tier: 50MB (50 MB/s, 12000 IOPS) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
io.max: 251:64 rbps=52428800 wbps=52428800 riops=12000 wiops=12000
[1/4] Sequential Read
49.98 MiB/s
✓ PASS: Read Bandwidth within limit (52415119 ≤ 68157440, 99.9% used)
[2/4] Sequential Write
49.92 MiB/s
✓ PASS: Write Bandwidth within limit (52350625 ≤ 68157440, 99.8% used)
[3/4] Random Read IOPS
11984 IOPS
✓ PASS: Read IOPS within limit (11984 ≤ 15600, 99.8% used)
[4/4] Random Write IOPS
2958 IOPS
✓ PASS: Write IOPS within limit (2958 ≤ 15600, 24.6% used)
-------------------------------------------------------------------
Tier: 50MB (50 MB/s, 12000 IOPS) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
io.max: 251:112 rbps=52428800 wbps=52428800 riops=12000 wiops=12000
[1/4] Sequential Read
50.01 MiB/s
✓ PASS: Read Bandwidth within limit (52441183 ≤ 68157440, 100.0% used)
[2/4] Sequential Write
50.38 MiB/s
✓ PASS: Write Bandwidth within limit (52832043 ≤ 68157440, 100.7% used)
[3/4] Random Read IOPS
11984 IOPS
✓ PASS: Read IOPS within limit (11984 ≤ 15600, 99.8% used)
[4/4] Random Write IOPS
2961 IOPS
✓ PASS: Write IOPS within limit (2961 ≤ 15600, 24.6% used)
-------------------------------------------------------------------
Tier: 50MB (50 MB/s, 12000 IOPS) - K8s QoS: GUARANTEED
-------------------------------------------------------------------
io.max: 251:32 rbps=52428800 wbps=52428800 riops=12000 wiops=12000
[1/4] Sequential Read
49.97 MiB/s
✓ PASS: Read Bandwidth within limit (52402741 ≤ 68157440, 99.9% used)
[2/4] Sequential Write
50.03 MiB/s
✓ PASS: Write Bandwidth within limit (52462672 ≤ 68157440, 100.0% used)
[3/4] Random Read IOPS
11984 IOPS
✓ PASS: Read IOPS within limit (11984 ≤ 15600, 99.8% used)
[4/4] Random Write IOPS
2962 IOPS
✓ PASS: Write IOPS within limit (2962 ≤ 15600, 24.6% used)
-------------------------------------------------------------------
Tier: 100MB (100 MB/s, 24000 IOPS) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
io.max: 251:144 rbps=104857600 wbps=104857600 riops=24000 wiops=24000
[1/4] Sequential Read
99.91 MiB/s
✓ PASS: Read Bandwidth within limit (104768029 ≤ 136314880, 99.9% used)
[2/4] Sequential Write
100.24 MiB/s
✓ PASS: Write Bandwidth within limit (105114121 ≤ 136314880, 100.2% used)
[3/4] Random Read IOPS
23983 IOPS
✓ PASS: Read IOPS within limit (23983 ≤ 31200, 99.9% used)
[4/4] Random Write IOPS
2958 IOPS
✓ PASS: Write IOPS within limit (2958 ≤ 31200, 12.3% used)
-------------------------------------------------------------------
Tier: 100MB (100 MB/s, 24000 IOPS) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
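The PASS/FAIL verdicts in the output above compare each measurement with the configured limit plus a tolerance. The reported ceilings (6815744 for 5242880, 1625 for 1250, 15600 for 12000) are all exactly 1.3 times the limit, so the sketch below assumes a 30% tolerance. The script that produced this output is not part of the page, and `withinTolerance` is a hypothetical name.

```go
package main

import "fmt"

// withinTolerance reports whether actual stays at or below
// limit plus tolerancePct percent of limit.
func withinTolerance(actual, limit, tolerancePct uint64) bool {
	ceiling := limit + limit*tolerancePct/100
	return actual <= ceiling
}

func main() {
	const limit = 5242880 // 5 MiB/s tier, bytes per second

	// GUARANTEED sequential write: 6548514 <= 6815744, reported as PASS.
	fmt.Println(withinTolerance(6548514, limit, 30))

	// BURSTABLE sequential write: 6833142 > 6815744, reported as FAIL.
	fmt.Println(withinTolerance(6833142, limit, 30))
}
```

The one failing case overshoots the ceiling by 17398 bytes per second, about 0.3% of the limit.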
This PR implements cgroup v2 QoS enforcement for RBD volumes using Kubernetes VolumeAttributesClass (VAC). It enables administrators to apply I/O limits (IOPS and bandwidth) to krbd-backed volumes at the pod cgroup level, providing fine-grained I/O control without modifying the underlying RBD image configuration.