Add cgroup v2 QoS support for RBD volumes #6274

Open
Madhu-1 wants to merge 12 commits into ceph:devel from Madhu-1:implement-qos

@Madhu-1 Madhu-1 commented May 13, 2026

This PR implements cgroup v2 QoS enforcement for RBD volumes using Kubernetes VolumeAttributesClass (VAC). It enables administrators to apply I/O limits (IOPS and bandwidth) to krbd-backed volumes at the pod cgroup level, providing fine-grained I/O control without modifying the underlying RBD image configuration.

@Madhu-1 Madhu-1 force-pushed the implement-qos branch 2 times, most recently from 7cac3cd to d3126aa Compare May 13, 2026 10:49

Madhu-1 commented May 13, 2026

/test ci/centos/mini-e2e/k8s-1.35/rbd

@Madhu-1 Madhu-1 added the ci/skip/multi-arch-build skip building on multiple architectures label May 13, 2026

Madhu-1 commented May 13, 2026

/test ci/centos/mini-e2e/k8s-1.35/rbd

Madhu-1 commented May 13, 2026

/test ci/centos/mini-e2e/k8s-1.33/rbd

Madhu-1 commented May 18, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 18, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 18, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 18, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 19, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 20, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 20, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 21, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 21, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 25, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 and others added 12 commits May 27, 2026 08:57
Implement cgroup v2 based QoS for krbd by applying io.max limits
to container cgroups. This addresses the limitation that rbd-nbd
QoS doesn't work with krbd.

Key features:
- Parse VolumeAttributesClass parameters (MaxReadIOPS, MaxWriteIOPS,
  MaxReadBytesPerSecond, MaxWriteBytesPerSecond)
- Discover pod cgroup path based on pod UID and QoS class
- Find all container cgroups within the pod
- Apply io.max limits to each container's cgroup

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
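
The parameter parsing and io.max formatting described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the type `ioLimits` and the functions `parseLimits` and `formatIOMax` are hypothetical names, and the sketch assumes every parameter is a plain base-10 integer and that a missing parameter means "no limit". The four parameter keys come from the commit message, and the `MAJ:MIN rbps=... wbps=... riops=... wiops=...` line format is the cgroup v2 io.max format.

```go
package main

import (
	"fmt"
	"strconv"
)

// ioLimits holds the four limits; zero means "no limit" in this sketch.
type ioLimits struct {
	ReadIOPS, WriteIOPS, ReadBPS, WriteBPS uint64
}

// parseLimits reads the four VolumeAttributesClass parameters named in the
// commit message. Missing keys stay at zero (assumed to mean unlimited).
func parseLimits(params map[string]string) (ioLimits, error) {
	var l ioLimits
	fields := map[string]*uint64{
		"MaxReadIOPS":            &l.ReadIOPS,
		"MaxWriteIOPS":           &l.WriteIOPS,
		"MaxReadBytesPerSecond":  &l.ReadBPS,
		"MaxWriteBytesPerSecond": &l.WriteBPS,
	}
	for key, dst := range fields {
		val, ok := params[key]
		if !ok {
			continue
		}
		n, err := strconv.ParseUint(val, 10, 64)
		if err != nil {
			return ioLimits{}, fmt.Errorf("invalid %s %q: %w", key, val, err)
		}
		*dst = n
	}
	return l, nil
}

// formatIOMax renders one io.max line for the device major:minor.
// A zero limit is written as "max", the cgroup v2 keyword for unlimited.
func formatIOMax(major, minor uint32, l ioLimits) string {
	val := func(n uint64) string {
		if n == 0 {
			return "max"
		}
		return strconv.FormatUint(n, 10)
	}
	return fmt.Sprintf("%d:%d rbps=%s wbps=%s riops=%s wiops=%s",
		major, minor, val(l.ReadBPS), val(l.WriteBPS), val(l.ReadIOPS), val(l.WriteIOPS))
}

func main() {
	limits, err := parseLimits(map[string]string{
		"MaxReadIOPS":            "1250",
		"MaxWriteIOPS":           "1250",
		"MaxReadBytesPerSecond":  "5242880",
		"MaxWriteBytesPerSecond": "5242880",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(formatIOMax(251, 80, limits))
}
```

With the 5 MB/s tier values from the test log further down, this produces the same line the log reports for device 251:80.
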
Add NodePublishSecretRef field to RBD configuration to support
secret lookup during NodePublishVolume operation. This enables
cgroup QoS to retrieve QoS metadata from RBD images.

Supports two secret sources:
1. StorageClass parameter: csi.storage.k8s.io/node-publish-secret-name
2. CSI ConfigMap fallback: rbd.nodePublishSecretRef

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Allow krbd volumes to use cgroup v2 QoS parameters alongside
traditional rbd-nbd QoS. ControllerModifyVolume now:

- Rejects requests mixing cgroup QoS and traditional NBD QoS
- Only enforces rbd-nbd mounter check for traditional QoS parameters
- Saves cgroup QoS parameters to image metadata for retrieval during
  NodePublishVolume
- Removes QoS metadata when VAC is removed or keys not present

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
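
The "reject mixed requests" rule from the commit message above might look something like the sketch below. The function `validateQoSParams` and the error `errMixedQoS` are hypothetical names. The commit message does not list the traditional rbd-nbd QoS keys, so this sketch simply assumes that any key outside the four cgroup QoS keys is a traditional one; the real ControllerModifyVolume logic may classify keys differently.

```go
package main

import (
	"errors"
	"fmt"
)

// cgroupQoSKeys are the four cgroup v2 QoS parameters named in the PR.
var cgroupQoSKeys = map[string]bool{
	"MaxReadIOPS":            true,
	"MaxWriteIOPS":           true,
	"MaxReadBytesPerSecond":  true,
	"MaxWriteBytesPerSecond": true,
}

var errMixedQoS = errors.New("cgroup v2 QoS and rbd-nbd QoS parameters cannot be mixed")

// validateQoSParams reports whether the request uses cgroup QoS keys and
// rejects requests that combine them with other keys.
// Assumption: every key not in cgroupQoSKeys is a traditional rbd-nbd key.
func validateQoSParams(params map[string]string) (usesCgroup bool, err error) {
	var cgroup, other int
	for key := range params {
		if cgroupQoSKeys[key] {
			cgroup++
		} else {
			other++
		}
	}
	if cgroup > 0 && other > 0 {
		return false, errMixedQoS
	}
	return cgroup > 0, nil
}

func main() {
	// "hypotheticalNbdKey" stands in for a traditional QoS parameter.
	_, err := validateQoSParams(map[string]string{
		"MaxReadIOPS":        "1250",
		"hypotheticalNbdKey": "100",
	})
	fmt.Println(err)
}
```
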
Apply cgroup v2 QoS limits during NodePublishVolume
when:
- Pod UID is available (podInfoOnMount enabled)
- Cgroup QoS parameters are stored in image metadata
- Device path is available from staging metadata

Features:
- Retrieve QoS parameters from image metadata
- Get device major:minor number from mapped RBD device
- Apply io.max limits to the pod
- Support secret retrieval from StorageClass or CSI
  ConfigMap
- Non-blocking: a QoS failure is logged as an error but does not
  fail the mount.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
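
The "get device major:minor number" step in the commit message above comes down to decoding the device number of the mapped block device. The sketch below shows only that decoding, using the Linux `dev_t` bit layout. The helper names `devMajor` and `devMinor` are hypothetical. In practice the raw value would come from a `stat` call on the device path (the `st_rdev` field), which is omitted here so the example stays platform independent.

```go
package main

import "fmt"

// devMajor extracts the major number from a Linux dev_t value.
func devMajor(dev uint64) uint32 {
	return uint32((dev>>32)&0xfffff000) | uint32((dev>>8)&0x00000fff)
}

// devMinor extracts the minor number from a Linux dev_t value.
func devMinor(dev uint64) uint32 {
	return uint32((dev>>12)&0xffffff00) | uint32(dev&0x000000ff)
}

func main() {
	// 64336 encodes major 251, minor 80 (251<<8 | 80), matching one of
	// the devices shown in the test log of this PR.
	var rdev uint64 = 64336
	fmt.Printf("%d:%d\n", devMajor(rdev), devMinor(rdev))
}
```
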
Add configuration examples for cgroup v2 QoS support:

- New VolumeAttributesClass example
  demonstrating cgroup v2 QoS parameters
- Updated CSI ConfigMap sample with nodePublishSecretRef

The cgroup QoS VAC shows all four parameters:
- MaxReadIOPS: maximum read IOPS
- MaxWriteIOPS: maximum write IOPS
- MaxReadBytesPerSecond: maximum read bandwidth
- MaxWriteBytesPerSecond: maximum write bandwidth

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Update VolumeAttributesClass documentation to include:

- Overview of two QoS types (traditional and cgroup v2)
- Prerequisites for cgroup v2 QoS
- Configuration examples for both approaches
- Secret configuration options (StorageClass and ConfigMap)
- Instructions for removing QoS limits
- Explanation of how cgroup v2 QoS works

The documentation clearly distinguishes between traditional rbd-nbd
QoS and new cgroup v2 QoS for krbd volumes.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Add end-to-end tests for the cgroup v2 QoS feature, which enforces
I/O limits on RBD volumes using Kubernetes VolumeAttributesClass
(VAC). The tests validate the complete lifecycle of QoS management,
including metadata storage, enforcement, and edge cases.

Test Coverage (13 scenarios):

Access Mode Coverage:
- RWO (ReadWriteOnce) filesystem with metadata validation
- RWO block with I/O enforcement testing using dd
- RWOP (ReadWriteOncePod) access mode validation
- RWX (ReadWriteMany) block with 3-replica deployment
- ROX (ReadOnlyMany) filesystem with clone and deployment

Critical Functionality:
- Multi-PVC pod with 3 volumes having different QoS tiers - validates
  the updateIOMaxForDevice() read-modify-write logic that prevents
  io.max file overwrites when multiple volumes share a pod's cgroup
- Snapshot/clone non-propagation - verifies .rbd.csi.ceph.com prefix
  prevents QoS metadata inheritance during clone/snapshot operations
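
The read-modify-write idea mentioned above, where several volumes share one pod cgroup, can be illustrated with a pure function that merges one device's entry into existing io.max content. This is only a sketch of the concept: `mergeIOMax` is a hypothetical name, and the PR's `updateIOMaxForDevice()` may be implemented differently.

```go
package main

import (
	"fmt"
	"strings"
)

// mergeIOMax returns the io.max content with the entry for device
// (e.g. "251:80") replaced by settings, or appended if it is absent.
// Entries for other devices are preserved unchanged.
func mergeIOMax(existing, device, settings string) string {
	entry := device + " " + settings
	var out []string
	replaced := false
	for _, line := range strings.Split(existing, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue
		}
		// The first field of each io.max line is the MAJ:MIN device key.
		if strings.Fields(line)[0] == device {
			out = append(out, entry)
			replaced = true
			continue
		}
		out = append(out, line)
	}
	if !replaced {
		out = append(out, entry)
	}
	return strings.Join(out, "\n") + "\n"
}

func main() {
	current := "251:80 rbps=5242880 wbps=5242880 riops=1250 wiops=1250\n"
	fmt.Print(mergeIOMax(current, "251:96", "rbps=52428800 wbps=52428800 riops=12000 wiops=12000"))
}
```
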

VAC Lifecycle Operations:
- VAC modification (low tier → high tier)
- Partial VAC update (all parameters → IOPS only)
- VAC removal (validate complete metadata cleanup)
- Add VAC to existing PVC without QoS

Integration Testing:
- Volume expansion with QoS persistence
- Encrypted volumes with QoS compatibility

Helper Functions (rbd_helper.go):
- validateCgroupQoS() - validates RBD image metadata with cgroup QoS prefix
- testIOEnforcement() - tests I/O limits using dd with 20% tolerance
- createMultiPVCPod() - creates pods with multiple RBD volumes
- createKRBDStorageClassWithModifySecret() - SC setup for VAC testing

YAML Templates:
- Multi-PVC pod templates (filesystem and block modes)
- Three-tier VolumeAttributesClass templates (low/medium/high QoS)

The tests require Kubernetes >= 1.34 for VolumeAttributesClass support
and cgroup v2 enabled on cluster nodes. All tests include cleanup
validation using validateRBDImageCount() and validateOmapCount().

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Add the NodePublish secret to the
StorageClass created in the e2e tests.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
When a PVC is created with the VAC name
already set, Kubernetes does not call
ControllerModifyVolume. That RPC is only
invoked when the VolumeAttributesClass
is changed after creation.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
modifyPVCVolumeAttributesClass does not
check for a nil pointer before
dereferencing it.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Store the devicePath in NodeStageVolume
irrespective of the mounter.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
The containers' cgroup scope files are not
created before the NodePublish RPC call,
so QoS cannot be applied at the container
level and must be applied at the pod
level instead. Update the design to
apply QoS at the pod level.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
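
Locating the pod-level cgroup from the pod UID and QoS class, as the commits above describe, could be sketched as below. The function `podCgroupPath` is a hypothetical name. The layout assumes the kubelet runs with the systemd cgroup driver and that the cgroup v2 root is mounted at `/sys/fs/cgroup`; the cgroupfs driver uses a different directory layout, and the PR may discover the path in another way.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// podCgroupPath builds the pod-level cgroup directory for a pod.
// Assumption: systemd cgroup driver, where dashes in the pod UID are
// replaced with underscores and Guaranteed pods sit directly under
// kubepods.slice.
func podCgroupPath(root, qosClass, podUID string) (string, error) {
	uid := strings.ReplaceAll(podUID, "-", "_")
	switch q := strings.ToLower(qosClass); q {
	case "guaranteed":
		return path.Join(root, "kubepods.slice", "kubepods-pod"+uid+".slice"), nil
	case "burstable", "besteffort":
		return path.Join(root, "kubepods.slice",
			"kubepods-"+q+".slice",
			"kubepods-"+q+"-pod"+uid+".slice"), nil
	default:
		return "", fmt.Errorf("unknown QoS class %q", qosClass)
	}
}

func main() {
	// The UID below is a made-up example value.
	dir, err := podCgroupPath("/sys/fs/cgroup", "Burstable", "1234-abcd")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// The limits would then be written to the io.max file in this directory.
	fmt.Println(path.Join(dir, "io.max"))
}
```
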

Madhu-1 commented May 27, 2026

/test ci/centos/mini-e2e/k8s-1.34/rbd

Madhu-1 commented May 27, 2026

Test results

===================================================================
Step 5: Running Performance Tests
===================================================================

===================================================================
Testing Volume Mode: BLOCK
===================================================================

-------------------------------------------------------------------
Tier: BASELINE (No QoS Limits) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
No io.max (expected - no QoS)
[1/4] Sequential Read
  33790.67 MiB/s
  ✓ Read Bandwidth: No limit (baseline)
[2/4] Sequential Write
  328.92 MiB/s
  ✓ Write Bandwidth: No limit (baseline)
[3/4] Random Read IOPS
  12693 IOPS
  ✓ Read IOPS: No limit (baseline)
[4/4] Random Write IOPS
  2967 IOPS
  ✓ Write IOPS: No limit (baseline)

-------------------------------------------------------------------
Tier: BASELINE (No QoS Limits) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
No io.max (expected - no QoS)
[1/4] Sequential Read
  30553.34 MiB/s
  ✓ Read Bandwidth: No limit (baseline)
[2/4] Sequential Write
  305.72 MiB/s
  ✓ Write Bandwidth: No limit (baseline)
[3/4] Random Read IOPS
  13939 IOPS
  ✓ Read IOPS: No limit (baseline)
[4/4] Random Write IOPS
  2964 IOPS
  ✓ Write IOPS: No limit (baseline)

-------------------------------------------------------------------
Tier: BASELINE (No QoS Limits) - K8s QoS: GUARANTEED
-------------------------------------------------------------------
No io.max (expected - no QoS)
[1/4] Sequential Read
  31278.91 MiB/s
  ✓ Read Bandwidth: No limit (baseline)
[2/4] Sequential Write
  233.16 MiB/s
  ✓ Write Bandwidth: No limit (baseline)
[3/4] Random Read IOPS
  19007 IOPS
  ✓ Read IOPS: No limit (baseline)
[4/4] Random Write IOPS
  2963 IOPS
  ✓ Write IOPS: No limit (baseline)

-------------------------------------------------------------------
Tier: 5MB (5 MB/s, 1250 IOPS) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
io.max: 251:80 rbps=5242880 wbps=5242880 riops=1250 wiops=1250
[1/4] Sequential Read
  5.01 MiB/s
  ✓ PASS: Read Bandwidth within limit (5262764 ≤ 6815744, 100.3% used)
[2/4] Sequential Write
  5.50 MiB/s
  ✓ PASS: Write Bandwidth within limit (5771139 ≤ 6815744, 110.0% used)
[3/4] Random Read IOPS
  1233 IOPS
  ✓ PASS: Read IOPS within limit (1233 ≤ 1625, 98.6% used)
[4/4] Random Write IOPS
  1236 IOPS
  ✓ PASS: Write IOPS within limit (1236 ≤ 1625, 98.8% used)

-------------------------------------------------------------------
Tier: 5MB (5 MB/s, 1250 IOPS) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
io.max: 251:96 rbps=5242880 wbps=5242880 riops=1250 wiops=1250
[1/4] Sequential Read
  4.99 MiB/s
  ✓ PASS: Read Bandwidth within limit (5238871 ≤ 6815744, 99.9% used)
[2/4] Sequential Write
  6.51 MiB/s
  ✗ FAIL: Write Bandwidth EXCEEDED limit!
    Expected: ≤5242880 (max with tolerance: 6815744)
    Actual: 6833142 (130.3% of limit)
[3/4] Random Read IOPS
  1233 IOPS
  ✓ PASS: Read IOPS within limit (1233 ≤ 1625, 98.6% used)
[4/4] Random Write IOPS
  1234 IOPS
  ✓ PASS: Write IOPS within limit (1234 ≤ 1625, 98.7% used)

-------------------------------------------------------------------
Tier: 5MB (5 MB/s, 1250 IOPS) - K8s QoS: GUARANTEED
-------------------------------------------------------------------
io.max: 251:48 rbps=5242880 wbps=5242880 riops=1250 wiops=1250
[1/4] Sequential Read
  5.01 MiB/s
  ✓ PASS: Read Bandwidth within limit (5262854 ≤ 6815744, 100.3% used)
[2/4] Sequential Write
  6.24 MiB/s
  ✓ PASS: Write Bandwidth within limit (6548514 ≤ 6815744, 124.9% used)
[3/4] Random Read IOPS
  1234 IOPS
  ✓ PASS: Read IOPS within limit (1234 ≤ 1625, 98.7% used)
[4/4] Random Write IOPS
  1236 IOPS
  ✓ PASS: Write IOPS within limit (1236 ≤ 1625, 98.8% used)

-------------------------------------------------------------------
Tier: 50MB (50 MB/s, 12000 IOPS) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
io.max: 251:64 rbps=52428800 wbps=52428800 riops=12000 wiops=12000
[1/4] Sequential Read
  49.98 MiB/s
  ✓ PASS: Read Bandwidth within limit (52415119 ≤ 68157440, 99.9% used)
[2/4] Sequential Write
  49.92 MiB/s
  ✓ PASS: Write Bandwidth within limit (52350625 ≤ 68157440, 99.8% used)
[3/4] Random Read IOPS
  11984 IOPS
  ✓ PASS: Read IOPS within limit (11984 ≤ 15600, 99.8% used)
[4/4] Random Write IOPS
  2958 IOPS
  ✓ PASS: Write IOPS within limit (2958 ≤ 15600, 24.6% used)

-------------------------------------------------------------------
Tier: 50MB (50 MB/s, 12000 IOPS) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
io.max: 251:112 rbps=52428800 wbps=52428800 riops=12000 wiops=12000
[1/4] Sequential Read
  50.01 MiB/s
  ✓ PASS: Read Bandwidth within limit (52441183 ≤ 68157440, 100.0% used)
[2/4] Sequential Write
  50.38 MiB/s
  ✓ PASS: Write Bandwidth within limit (52832043 ≤ 68157440, 100.7% used)
[3/4] Random Read IOPS
  11984 IOPS
  ✓ PASS: Read IOPS within limit (11984 ≤ 15600, 99.8% used)
[4/4] Random Write IOPS
  2961 IOPS
  ✓ PASS: Write IOPS within limit (2961 ≤ 15600, 24.6% used)

-------------------------------------------------------------------
Tier: 50MB (50 MB/s, 12000 IOPS) - K8s QoS: GUARANTEED
-------------------------------------------------------------------
io.max: 251:32 rbps=52428800 wbps=52428800 riops=12000 wiops=12000
[1/4] Sequential Read
  49.97 MiB/s
  ✓ PASS: Read Bandwidth within limit (52402741 ≤ 68157440, 99.9% used)
[2/4] Sequential Write
  50.03 MiB/s
  ✓ PASS: Write Bandwidth within limit (52462672 ≤ 68157440, 100.0% used)
[3/4] Random Read IOPS
  11984 IOPS
  ✓ PASS: Read IOPS within limit (11984 ≤ 15600, 99.8% used)
[4/4] Random Write IOPS
  2962 IOPS
  ✓ PASS: Write IOPS within limit (2962 ≤ 15600, 24.6% used)

-------------------------------------------------------------------
Tier: 100MB (100 MB/s, 24000 IOPS) - K8s QoS: BESTEFFORT
-------------------------------------------------------------------
io.max: 251:144 rbps=104857600 wbps=104857600 riops=24000 wiops=24000
[1/4] Sequential Read
  99.91 MiB/s
  ✓ PASS: Read Bandwidth within limit (104768029 ≤ 136314880, 99.9% used)
[2/4] Sequential Write
  100.24 MiB/s
  ✓ PASS: Write Bandwidth within limit (105114121 ≤ 136314880, 100.2% used)
[3/4] Random Read IOPS
  23983 IOPS
  ✓ PASS: Read IOPS within limit (23983 ≤ 31200, 99.9% used)
[4/4] Random Write IOPS
  2958 IOPS
  ✓ PASS: Write IOPS within limit (2958 ≤ 31200, 12.3% used)

-------------------------------------------------------------------
Tier: 100MB (100 MB/s, 24000 IOPS) - K8s QoS: BURSTABLE
-------------------------------------------------------------------
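
The pass/fail verdicts in the log above are consistent with a 30% tolerance on top of the configured limit: 5242880 × 1.3 = 6815744 for bandwidth and 1250 × 1.3 = 1625 for IOPS. That figure is inferred from the numbers in the log, not from the test script itself, and it is separate from the 20% tolerance mentioned for the `testIOEnforcement()` e2e helper. A minimal sketch of such a check, with the hypothetical name `withinTolerance`:

```go
package main

import "fmt"

// withinTolerance reports whether actual stays at or below the limit plus
// tolerancePct percent of the limit (integer arithmetic).
func withinTolerance(actual, limit, tolerancePct uint64) bool {
	maxAllowed := limit + limit*tolerancePct/100
	return actual <= maxAllowed
}

func main() {
	const limit = 5242880 // 5 MiB/s write bandwidth limit from the log

	// Guaranteed tier write result from the log: passes.
	fmt.Println(withinTolerance(6548514, limit, 30))
	// Burstable tier write result from the log: fails.
	fmt.Println(withinTolerance(6833142, limit, 30))
}
```
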

@Madhu-1 Madhu-1 removed DNM DO NOT MERGE Not Ready For Review ci/skip/multi-arch-build skip building on multiple architectures labels May 27, 2026
@Madhu-1 Madhu-1 requested review from a team, Rakshith-R and nixpanic May 27, 2026 05:35