This Kubernetes controller integrates with the ADO autoconf custom experiment to automatically set resource requirements for AI tuning jobs that use fms-hf-tuning.
The controller inspects the command line of your AI jobs (either standalone `PyTorchJob` objects or `PyTorchJob` objects wrapped inside an `AppWrapper`) to extract details such as the following (see the example below):
- Model name
- Tuning method
- Effective batch size (i.e., `per_device_batch_size * NUM_GPUS`)
- Maximum sequence length
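For illustration, here is the kind of command line the controller parses. This is a hedged sketch: the image, entrypoint, and model name are assumptions, and the exact flag names depend on your fms-hf-tuning version.

```yaml
# Illustrative container spec from a PyTorchJob worker (not a complete manifest).
containers:
  - name: pytorch
    image: my-registry/fms-hf-tuning:stable        # hypothetical image reference
    command: ["python", "-m", "tuning.sft_trainer"]
    args:
      - --model_name_or_path=ibm-granite/granite-3.1-8b-base  # model name
      - --peft_method=lora                                     # tuning method
      - --per_device_train_batch_size=4   # x NUM_GPUS = effective batch size
      - --max_seq_length=2048             # maximum sequence length
```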
It combines these with the target GPU model to request recommendations from the autoconf experiment and then:
- Patches the resource requests/limits of `PyTorchJob` objects, or
- Creates a new `AppWrapper` (when direct patching of a nested `PyTorchJob` is not supported by `AppWrapper`).
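As an illustration, a patched training container might end up with resources like the following. The numbers are invented; actual values come from the autoconf recommendation.

```yaml
# Illustrative patch result on the PyTorchJob's training container.
resources:
  requests:
    nvidia.com/gpu: "4"   # recommended GPU count
    cpu: "8"              # max(1, 2 * NUM_GPUS) when --patch-cpu-request=true
  limits:
    nvidia.com/gpu: "4"
    cpu: "8"
```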
We use this controller to improve the execution of AI workloads on Kubernetes clusters: by assigning jobs the right number of GPUs, we avoid running out of GPU memory. The controller's design also lets us plug in different resource-recommendation algorithms, which we plan to explore in the future.
We are working with the Kueue maintainers on a Kubernetes Enhancement Proposal (KEP) to improve how external Kubernetes controllers interact with jobs managed by Kueue (including AI workloads). The design discussion is tracked here: kubernetes-sigs/kueue#6915.
Until that work lands, this controller demonstrates a way to interact with Kueue-managed jobs while operating within current Kueue capabilities.
You can run the controller as a local process while it manages one or more namespaces on your cluster.
Assumptions for the example below
- You use a Kueue-managed namespace called `tuning` to run `AppWrapper` workloads.
- These workloads:
  - use the Kueue `LocalQueue` named `default-queue`,
  - wrap a `PyTorchJob` that uses fms-hf-tuning,
  - request one or more NVIDIA GPUs of the same model (e.g., `NVIDIA-A100-SXM4-80GB`),
  - are subject to Kyverno policies requiring the `kueue.x-k8s.io/queue-name` label on `AppWrapper` objects.
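A quick way to sanity-check the first two assumptions, assuming Kueue's CRDs are installed on the cluster:

```bash
# Verify the namespace exists and has the expected LocalQueue.
kubectl get namespace tuning
kubectl get localqueue default-queue -n tuning
```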
Steps
- Create and activate a Python virtual environment, then install the ADO autoconf client:
  ```bash
  python3 -m venv .venv && source .venv/bin/activate
  pip install ado-autoconf
  ```
- Log in to your cluster (via `kubectl` or `oc`).
- Build the controller locally:
  ```bash
  make
  ```
- Start the controller with flags appropriate for the scenario:
  ```bash
  ./bin/manager \
    --done-label-key=kueue.x-k8s.io/queue-name \
    --done-label-value=default-queue \
    --namespaces "tuning" \
    --enable-appwrapper=true \
    --enable-pytorchjob=true \
    --unsuspend-patched-jobs=false \
    --default-gpu-model=NVIDIA-A100-SXM4-80GB \
    --path-wrapper-script=./cmd/wrapper_autoconf.py
  ```

- Create an `AppWrapper` or `PyTorchJob` workload with the following labels:

  ```yaml
  # This setup both satisfies Kyverno (which requires a queue-name) and
  # allows Kueue to temporarily ignore the job until the controller updates it.
  kueue.x-k8s.io/queue-name: fake
  autoconf-plugin-name: resource-requirements-appwrapper
  ```
Example AppWrapper and PyTorchJob manifests are available under examples.
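For orientation, a minimal `PyTorchJob` header carrying those labels might look as follows. The name is illustrative, and creating the job suspended is an assumption for this setup; see the manifests under examples for the complete spec.

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: granite-lora-tuning          # illustrative name
  namespace: tuning
  labels:
    kueue.x-k8s.io/queue-name: fake
    autoconf-plugin-name: resource-requirements-appwrapper
spec:
  runPolicy:
    suspend: true   # assumption: keep the job idle until the controller patches it
  # pytorchReplicaSpecs elided; see the manifests under examples
```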
If you prefer to run the controller in-cluster (e.g., as a Deployment), the high-level process is:
- Build an image for the controller.
- Create RBAC: a ServiceAccount, Role/ClusterRole, and bindings that permit reading/patching the resources you plan to manage (i.e., `AppWrapper` and/or `PyTorchJob`); a minimal ClusterRole sketch follows this list.
- Deploy a `Deployment` for the controller, setting the desired command-line flags (see Configuration below). Enable leader election if you run multiple replicas.
- Optionally expose metrics/webhooks via a `Service` if you enable those endpoints.
- Label workloads so the controller can discover them (see `--watch-label-key`/`--watch-label-value`), then create your `AppWrapper`/`PyTorchJob` objects.
- Observe logs and job status to confirm resources are being recommended and applied as expected.
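A minimal ClusterRole sketch for the RBAC step, assuming the standard API groups for `AppWrapper` (`workload.codeflare.dev`) and `PyTorchJob` (`kubeflow.org`); extend the verbs and resources to match your deployment:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: autoconf-controller   # illustrative name
rules:
  - apiGroups: ["workload.codeflare.dev"]
    resources: ["appwrappers"]
    verbs: ["get", "list", "watch", "create", "patch", "update"]
  - apiGroups: ["kubeflow.org"]
    resources: ["pytorchjobs"]
    verbs: ["get", "list", "watch", "patch", "update"]
```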
Below are the controller’s command-line options:
- `--default-autoconf-model-version string` — Default autoconf model version to use (default `2.0.0`).
- `--default-gpu-model string` — Default GPU model if not specified in the job.
- `--patch-cpu-request` — Set job CPU request/limit to `max(1, 2 * NUM_GPUS)` (default `true`).
- `--unsuspend-patched-jobs` — Unsuspend jobs after patching.
- `--path-wrapper-script string` — Path to the local Python wrapper for running models. Mutually exclusive with `--url-ado`; exactly one of the two must be set.
- `--url-ado string` — URL of the ADO REST API serving the models. Mutually exclusive with `--path-wrapper-script`; exactly one of the two must be set (see the example after this list).
- `--namespaces string` — Comma-separated list of namespaces to watch.
- `--watch-label-key string` — Limit monitoring to objects labeled `key=value` (default key `autoconf-plugin-name`).
- `--watch-label-value string` — Label value used with `--watch-label-key` (default `resource-requirements-appwrapper`).
- `--enable-appwrapper` — Watch and patch `AppWrapper` objects.
- `--enable-pytorchjob` — Watch and patch `PyTorchJob` objects.
- `--done-label-key string` — Label key inserted when patching is complete (default `autoconf-plugin-done`).
- `--done-label-value string` — Label value inserted when patching is complete (default `yes`).
- `--waiting-for-ado-request-id-label string` — Label used to mark jobs waiting for an ADO request ID (default `waiting-for-ado-request-id`).
- `--zap-devel` — Use development-mode defaults (console encoder, debug log level, stack traces at warn); when `false`, production-mode defaults apply (JSON encoder, info log level, stack traces at error). Default `true`.
- `--zap-encoder [json|console]` — Zap log encoding.
- `--zap-log-level value` — Log verbosity (`debug`, `info`, `error`, `panic`, or an integer > 0 for custom levels).
- `--zap-stacktrace-level [info|error|panic]` — Level at and above which stack traces are captured.
- `--zap-time-encoding [epoch|millis|nano|iso8601|rfc3339|rfc3339nano]` — Time encoding (default `epoch`).
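For example, to use a remote ADO REST API instead of the local wrapper script (as noted under `--url-ado` above; the URL is a placeholder):

```bash
./bin/manager \
  --url-ado=http://ado-api.example.com:8080 \
  --namespaces "tuning" \
  --enable-pytorchjob=true \
  --default-gpu-model=NVIDIA-A100-SXM4-80GB
```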