1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -17,6 +17,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).
- Allow the configuration of plugins in the binder service. [#1480](https://github.com/kai-scheduler/KAI-Scheduler/pull/1480) - [davidLif](https://github.com/davidLif)
- Added support for configuring scheduler log level and custom scheduler args via Helm values (`scheduler.args`) [#1452](https://github.com/kai-scheduler/KAI-Scheduler/pull/1452) [dttung2905](https://github.com/dttung2905)
- Added `crdupgrader.image.registry` Helm value to override `global.registry` for the `crd-upgrader` pre-install/pre-upgrade hook image, allowing the hook image to be served from a separate mirror without redirecting all chart images. [#1404](https://github.com/kai-scheduler/KAI-Scheduler/issues/1404)
- Added support for externally-created PodGroups. Workloads can opt out of podgrouper mutation with `kai.scheduler/skip-podgrouper: "true"` on the pod or owner chain, join an existing PodGroup via `pod-group-name`, and now get a pod condition when they reference a non-existent subgroup. [#1420](https://github.com/kai-scheduler/KAI-Scheduler/issues/1420)

### Changed
- **Breaking:** JobSet PodGroups no longer auto-calculate `minAvailable` from `parallelism × replicas`. The default is now 1. Use the `kai.scheduler/batch-min-member` annotation to set a custom value.
23 changes: 23 additions & 0 deletions docs/batch/README.md
@@ -18,6 +18,29 @@ This will create a job with parallelism of 6, but requires at least 2 pods to be

For JobSets, the annotation overrides the calculated minAvailable for all PodGroups created by the JobSet.

## External PodGroups

KAI also supports PodGroups that are created outside the podgrouper. This is useful when multiple workloads should join the same gang or when an external controller owns the PodGroup lifecycle.

Use the following contract:

- Create the `PodGroup` explicitly.
- Set `pod-group-name` on the pod template annotations to join that PodGroup.
- Set `kai.scheduler/subgroup-name` on the pod template labels when using non-default subgroups.
- Set `kai.scheduler/skip-podgrouper: "true"` on the workload or any readable owner in the owner chain to prevent podgrouper from creating or rewriting PodGroup membership.
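
A minimal sketch of the pod template metadata this contract expects (the PodGroup name `my-podgroup` is illustrative; the full Job/PodGroup pair is shown in the example referenced below):

```yaml
# Fragment of a pod template joining an externally-created PodGroup.
metadata:
  annotations:
    pod-group-name: my-podgroup            # joins the pre-created PodGroup
  labels:
    kai.scheduler/subgroup-name: workers   # only for non-default subgroups
```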

Example:

```bash
kubectl apply -f examples/batch/external-podgroup-job.yaml
```

Behavior notes:

- `PodGroup.spec.queue` is authoritative for scheduling.
- If a pod references a PodGroup that does not exist yet, the existing behavior is preserved and no new pod condition is set.
- If a pod references a subgroup that does not exist in the PodGroup, KAI ignores only that pod for scheduling and sets a pod condition explaining the invalid subgroup.
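
To see the condition KAI sets for an invalid subgroup reference, inspect the affected pod (`my-pod` is a placeholder name):

```bash
# The explanation for the invalid subgroup appears under Conditions.
kubectl describe pod my-pod
```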

## PyTorchJob
To run in a distributed way across multiple pods, you can use PyTorchJob.

10 changes: 10 additions & 0 deletions docs/developer/pod-grouper.md
@@ -22,6 +22,16 @@ The Pod Grouper uses the PodGroup Custom Resource Definition (CRD) to represent

While users or third-party tools can manually create PodGroup resources, the Pod Grouper automates this process by analyzing incoming pods and applying appropriate grouping logic based on the pod's characteristics and ownership.

### External PodGroups

Podgrouper can also be told to leave PodGroup membership unchanged. When a pod or any readable object in its owner chain has `kai.scheduler/skip-podgrouper: "true"`, podgrouper does not create or update a PodGroup for that pod and does not patch `pod-group-name` or `kai.scheduler/subgroup-name`.

This is the supported path for externally-created PodGroups. External controllers or manifests still need to:

- Create the `PodGroup` resource explicitly.
- Set `pod-group-name` on the pod template annotations.
- Set `kai.scheduler/subgroup-name` on the pod template labels when using non-default subgroups.
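
Because the annotation is honored anywhere in the readable owner chain, setting it once on the top-level owner is enough. A minimal sketch, assuming a Deployment whose pods join a pre-created PodGroup (all names are illustrative):

```yaml
# Sketch: one annotation on the Deployment opts out every pod it owns;
# pods inherit the skip via the owner chain (Deployment -> ReplicaSet -> Pod).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: externally-grouped
  annotations:
    kai.scheduler/skip-podgrouper: "true"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: externally-grouped
  template:
    metadata:
      labels:
        app: externally-grouped
      annotations:
        pod-group-name: my-external-podgroup  # pre-created PodGroup
    spec:
      schedulerName: kai-scheduler
      containers:
        - name: main
          image: ubuntu
          args: ["sleep", "infinity"]
```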

## Plugin Architecture
The Pod Grouper uses a plugin-based architecture similar to the scheduler's plugin framework. Each plugin implements specific grouping logic for different types of workloads:

15 changes: 15 additions & 0 deletions examples/batch/README.md
@@ -9,3 +9,18 @@ kubectl apply -f batch-job-min-member.yaml
```

This creates a job with `parallelism: 6` but requires at least 2 pods to be schedulable before any pod starts running. The annotation value must be a positive integer.
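
A hedged sketch of the annotation placement (illustrative fragment, not the shipped `batch-job-min-member.yaml`):

```yaml
# A Job asking KAI to gang-schedule at least 2 of its 6 pods.
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-job-min-member
  annotations:
    kai.scheduler/batch-min-member: "2"
spec:
  parallelism: 6
```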

## External PodGroup

Use `external-podgroup-job.yaml` when the PodGroup is created manually or by another controller and the Job should join it without podgrouper interference.

```bash
kubectl apply -f external-podgroup-job.yaml
```

This example shows:

- An explicit `PodGroup` resource with queue and subgroup definitions.
- `kai.scheduler/skip-podgrouper: "true"` on the Job.
- `pod-group-name` on the pod template annotations.
- `kai.scheduler/subgroup-name` on the pod template labels.
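
After applying, you can confirm the Job's pods reference the external PodGroup (`job-name` is the standard label the Job controller adds):

```bash
# Show each pod together with the PodGroup it references.
kubectl get pods -l job-name=external-batch-job \
  -o custom-columns='NAME:.metadata.name,PODGROUP:.metadata.annotations.pod-group-name'
```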
43 changes: 43 additions & 0 deletions examples/batch/external-podgroup-job.yaml
@@ -0,0 +1,43 @@
# Copyright 2025 NVIDIA CORPORATION
# SPDX-License-Identifier: Apache-2.0

# This example demonstrates how to use an externally-created PodGroup with a
# batch Job. The PodGroup is created explicitly and the Job opts out of
# podgrouper reconciliation while attaching its pods to the external PodGroup.

apiVersion: scheduling.run.ai/v2alpha2
kind: PodGroup
metadata:
  name: external-batch-job
spec:
  minMember: 2
  queue: default-queue
  subGroups:
    - name: workers
      minMember: 2
---
apiVersion: batch/v1
kind: Job
metadata:
  name: external-batch-job
  annotations:
    kai.scheduler/skip-podgrouper: "true"
spec:
  parallelism: 2
  completions: 2
  template:
    metadata:
      annotations:
        pod-group-name: external-batch-job
      labels:
        kai.scheduler/subgroup-name: workers
    spec:
      schedulerName: kai-scheduler
      restartPolicy: Never
      containers:
        - name: main
          image: ubuntu
          args: ["sleep", "infinity"]
          resources:
            limits:
              nvidia.com/gpu: "1"
1 change: 1 addition & 0 deletions pkg/common/constants/constants.go
@@ -33,6 +33,7 @@ const (

	// Annotations
	PodGroupAnnotationForPod = "pod-group-name"
	SkipPodGrouperAnnotation = "kai.scheduler/skip-podgrouper"
	GpuFraction              = "gpu-fraction"
	GpuFractionContainerName = "gpu-fraction-container-name"
	GpuMemory                = "gpu-memory"
31 changes: 30 additions & 1 deletion pkg/podgrouper/pod_controller.go
@@ -10,6 +10,7 @@ import (

v1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/api/errors"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
"k8s.io/apimachinery/pkg/runtime"
"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/tools/record"
@@ -96,6 +97,10 @@ func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.R
		}
	}()

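	// Pods that explicitly opt out via the skip annotation are left to
	// external PodGroup management and are not reconciled further.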
if shouldSkipPodGrouper(&pod) {
return ctrl.Result{}, nil
}

if isOrphanPodWithPodGroup(&pod) {
return ctrl.Result{}, nil
}
@@ -110,11 +115,18 @@ func (r *PodReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.R
		return ctrl.Result{}, err
	}

	if shouldSkipAnyOwner(allOwners) {
		return ctrl.Result{}, nil
	}

	metadata, err := r.podGrouper.GetPGMetadata(ctx, &pod, topOwner, allOwners)
	if err != nil {
		logger.V(1).Error(err, "Failed to create pod group metadata for pod", req.Namespace, req.Name)
		return ctrl.Result{}, err
	}
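	// Assumption: nil metadata with a nil error means no PodGroup should be
	// created or updated for this pod.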
	if metadata == nil {
		return ctrl.Result{}, nil
	}

	if len(r.configs.NodePoolLabelKey) > 0 {
		addNodePoolLabel(metadata, &pod, r.configs.NodePoolLabelKey)
@@ -219,7 +231,24 @@ func addNodePoolLabel(metadata *podgroup.Metadata, pod *v1.Pod, nodePoolKey stri

func isOrphanPodWithPodGroup(pod *v1.Pod) bool {
	_, foundPGAnnotation := pod.Annotations[constants.PodGroupAnnotationForPod]
	return foundPGAnnotation && pod.OwnerReferences == nil
	return foundPGAnnotation && len(pod.OwnerReferences) == 0
}

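// shouldSkipAnyOwner reports whether any object in the pod's owner chain
// carries the skip-podgrouper annotation.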
func shouldSkipAnyOwner(owners []*metav1.PartialObjectMetadata) bool {
	for _, owner := range owners {
		if shouldSkipPodGrouper(owner) {
			return true
		}
	}
	return false
}

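// shouldSkipPodGrouper reports whether the object opted out of podgrouper
// reconciliation via the kai.scheduler/skip-podgrouper annotation.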
func shouldSkipPodGrouper(obj metav1.Object) bool {
	if obj == nil {
		return false
	}

	return obj.GetAnnotations()[constants.SkipPodGrouperAnnotation] == "true"
}

// +kubebuilder:rbac:groups="",resources=namespaces,verbs=get;list;watch