Commit 1a558bc

fix(cli): keep GPU fallback mode internal
1 parent be9cefd commit 1a558bc

6 files changed: 25 additions, 34 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -128,7 +128,7 @@ OpenShell can pass host GPUs into sandboxes for local inference, fine-tuning, or
 openshell sandbox create --gpu --from [gpu-enabled-sandbox] -- claude
 ```

-The CLI auto-bootstraps a GPU-enabled gateway on first use. GPU intent is also inferred automatically for community images with `gpu` in the name.
+The CLI auto-bootstraps a GPU-enabled gateway on first use, auto-selecting CDI when available and otherwise falling back to Docker's NVIDIA GPU request path (`--gpus all`). GPU intent is also inferred automatically for community images with `gpu` in the name.

 **Requirements:** NVIDIA drivers and the [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) must be installed on the host. The sandbox image itself must include the appropriate GPU drivers and libraries for your workload — the default `base` image does not. See the [BYOC example](https://github.com/NVIDIA/OpenShell/tree/main/examples/bring-your-own-container) for building a custom sandbox image with GPU support.

architecture/gateway-single-node.md

Lines changed: 2 additions & 7 deletions
@@ -318,12 +318,7 @@ Host GPU drivers & NVIDIA Container Toolkit

 ### `--gpu` flag

-The `--gpu` flag on `gateway start` accepts an optional value that overrides the automatic injection mode:
-
-| Invocation | Behaviour |
-|---|---|
-| `--gpu` | Auto-select: CDI when enabled on the daemon, `--gpus all` otherwise |
-| `--gpu=legacy` | Force `--gpus all` |
+The `--gpu` flag on `gateway start` enables GPU passthrough. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise.

 The expected smoke test is a plain pod requesting `nvidia.com/gpu: 1` with `runtimeClassName: nvidia` and running `nvidia-smi`.

@@ -392,7 +387,7 @@ When `openshell sandbox create` cannot connect to a gateway (connection refused,
 1. `should_attempt_bootstrap()` in `crates/openshell-cli/src/bootstrap.rs` checks the error type. It returns `true` for connectivity errors and missing default TLS materials, but `false` for TLS handshake/auth errors.
 2. If running in a terminal, the user is prompted to confirm.
 3. `run_bootstrap()` deploys a gateway named `"openshell"`, sets it as active, and returns fresh `TlsOptions` pointing to the newly-written mTLS certs.
-4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation.
+4. When `sandbox create` requests GPU explicitly (`--gpu`) or infers it from an image whose final name component contains `gpu` (such as `nvidia-gpu`), the bootstrap path enables gateway GPU support before retrying sandbox creation, using the same CDI-or-fallback selection as `gateway start --gpu`.

 ## Container Environment Variables
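
Editor's note: step 4 of the bootstrap sequence in the hunk above infers GPU intent from an image "whose final name component contains `gpu`". The commit does not show that check, so the following is only a minimal sketch of what such a test might look like. The function name `image_name_implies_gpu` is hypothetical, and ignoring tags and digests and matching case-insensitively are assumptions rather than documented OpenShell behaviour.

```rust
/// Hypothetical helper (not from the OpenShell source): returns true when the
/// final path component of an image reference contains "gpu".
/// Assumptions: tags (":latest") and digests ("@sha256:...") are ignored, and
/// the comparison is case-insensitive.
fn image_name_implies_gpu(image: &str) -> bool {
    // Drop a digest suffix such as "@sha256:...".
    let without_digest = image.split('@').next().unwrap_or(image);
    // Keep only the final path component, e.g. "nvidia-gpu:latest".
    let last = without_digest.rsplit('/').next().unwrap_or(without_digest);
    // Drop a tag suffix such as ":latest".
    let name = last.split(':').next().unwrap_or(last);
    name.to_ascii_lowercase().contains("gpu")
}
```

Under these assumptions, `nvidia-gpu` and `ghcr.io/org/nvidia-gpu:latest` both match, while `gpu-registry.example.com/base` does not, because only the last component is inspected.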

crates/openshell-bootstrap/src/docker.rs

Lines changed: 3 additions & 3 deletions
@@ -28,7 +28,7 @@ const REGISTRY_NAMESPACE_DEFAULT: &str = "openshell";
 /// | Input | Output |
 /// |--------------|--------------------------------------------------------------|
 /// | `[]` | `[]` — no GPU |
-/// | `["legacy"]` | `["legacy"]` — pass through |
+/// | `["legacy"]` | `["legacy"]` — pass through to the non-CDI fallback path |
 /// | `["auto"]` | `["nvidia.com/gpu=all"]` if CDI enabled, else `["legacy"]` |
 /// | `[cdi-ids…]` | unchanged |
 pub(crate) fn resolve_gpu_device_ids(gpu: &[String], cdi_enabled: bool) -> Vec<String> {
@@ -569,8 +569,8 @@ pub async fn ensure_container(
 //
 // The list is pre-resolved by `resolve_gpu_device_ids` before reaching here:
 // [] — no GPU passthrough
-// ["legacy"] — legacy nvidia DeviceRequest (driver="nvidia", count=-1);
-// relies on the NVIDIA Container Runtime hook
+// ["legacy"] — internal non-CDI fallback path: `driver="nvidia"`,
+// `count=-1`; relies on the NVIDIA Container Runtime hook
 // [cdi-ids…] — CDI DeviceRequest (driver="cdi") with the given device IDs;
 // Docker resolves them against the host CDI spec at /etc/cdi/
 match device_ids {
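
Editor's note: the two doc comments in this file describe a two-stage mapping, first from the requested IDs to resolved IDs, then from resolved IDs to a Docker device request. The function body is not part of this diff, so the sketch below only re-implements the documented input/output table. `resolve_gpu_device_ids` uses the signature shown in the hunk, but its body here is a reconstruction; the `GpuRequest` enum and `gpu_request` function are hypothetical stand-ins for the Docker `DeviceRequest` that the real code builds.

```rust
/// Sketch of the resolution table from the doc comment (body is an assumption).
fn resolve_gpu_device_ids(gpu: &[String], cdi_enabled: bool) -> Vec<String> {
    match gpu {
        // ["auto"]: CDI when the daemon has it enabled, else the fallback sentinel.
        [mode] if mode == "auto" => {
            if cdi_enabled {
                vec!["nvidia.com/gpu=all".to_string()]
            } else {
                vec!["legacy".to_string()]
            }
        }
        // [], ["legacy"], and explicit CDI IDs pass through unchanged.
        other => other.to_vec(),
    }
}

/// Hypothetical summary of the request that `ensure_container` would build.
#[derive(Debug, PartialEq)]
enum GpuRequest {
    /// No GPU passthrough.
    None,
    /// Non-CDI fallback: driver="nvidia", count=-1.
    NvidiaAll,
    /// CDI request: driver="cdi" with the given device IDs.
    Cdi(Vec<String>),
}

/// Maps already-resolved device IDs to the kind of request described above.
fn gpu_request(device_ids: &[String]) -> GpuRequest {
    match device_ids {
        [] => GpuRequest::None,
        [mode] if mode == "legacy" => GpuRequest::NvidiaAll,
        ids => GpuRequest::Cdi(ids.to_vec()),
    }
}
```

With this commit, the CLI only ever supplies `[]` or `["auto"]`, so `["legacy"]` would reach the second stage only as the result of resolution on a daemon without CDI enabled.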

crates/openshell-bootstrap/src/lib.rs

Lines changed: 5 additions & 5 deletions
@@ -115,8 +115,8 @@ pub struct DeployOptions {
 /// GPU device IDs to inject into the gateway container.
 ///
 /// - `[]` — no GPU passthrough (default)
-/// - `["legacy"]` — legacy nvidia DeviceRequest (driver="nvidia", count=-1)
-/// - `["auto"]` — resolved at deploy time: CDI if enabled on the daemon, else legacy
+/// - `["legacy"]` — internal non-CDI fallback path (`driver="nvidia"`, `count=-1`)
+/// - `["auto"]` — resolved at deploy time: CDI if enabled on the daemon, else the non-CDI fallback
 /// - `[cdi-ids…]` — CDI DeviceRequest with the given device IDs
 pub gpu: Vec<String>,
 /// When true, destroy any existing gateway resources before deploying.
@@ -193,9 +193,9 @@ impl DeployOptions {

 /// Set GPU device IDs for the cluster container.
 ///
-/// Pass `vec!["auto"]` to auto-select between CDI and legacy based on Docker
-/// version at deploy time, or an explicit list of CDI device IDs, or
-/// `vec!["legacy"]` to force the legacy nvidia DeviceRequest.
+/// Pass `vec!["auto"]` to auto-select between CDI and the non-CDI fallback
+/// based on daemon capabilities at deploy time. The `legacy` sentinel is an
+/// internal implementation detail for the fallback path.
 #[must_use]
 pub fn with_gpu(mut self, gpu: Vec<String>) -> Self {
 self.gpu = gpu;

crates/openshell-cli/src/main.rs

Lines changed: 13 additions & 17 deletions
@@ -808,12 +808,11 @@ enum GatewayCommands {
 /// `nvidia.com/gpu` resources. Requires NVIDIA drivers and the
 /// NVIDIA Container Toolkit on the host.
 ///
-/// An optional argument controls the injection mode:
-///
-/// --gpu Auto-select: CDI when enabled on the daemon, legacy otherwise
-/// --gpu=legacy Force legacy nvidia DeviceRequest
-#[arg(long = "gpu", num_args = 0..=1, default_missing_value = "auto", value_name = "MODE")]
-gpu: Option<String>,
+/// When enabled, OpenShell auto-selects CDI when the Docker daemon has
+/// CDI enabled and falls back to Docker's NVIDIA GPU request path
+/// (`--gpus all`) otherwise.
+#[arg(long)]
+gpu: bool,
 },

 /// Stop the gateway (preserves state).
@@ -1117,8 +1116,10 @@ enum SandboxCommands {
 /// Request GPU resources for the sandbox.
 ///
 /// When no gateway is running, auto-bootstrap starts a GPU-enabled
-/// gateway. GPU intent is also inferred automatically for known
-/// GPU-designated image names such as `nvidia-gpu`.
+/// gateway using the same automatic injection selection as
+/// `openshell gateway start --gpu`. GPU intent is also inferred
+/// automatically for known GPU-designated image names such as
+/// `nvidia-gpu`.
 #[arg(long)]
 gpu: bool,

@@ -1575,15 +1576,10 @@ async fn main() -> Result<()> {
 registry_token,
 gpu,
 } => {
-let gpu = match gpu.as_deref() {
-None => vec![],
-Some("auto") => vec!["auto".to_string()],
-Some("legacy") => vec!["legacy".to_string()],
-Some(other) => {
-return Err(miette::miette!(
-"unknown --gpu value: {other:?}; expected `legacy`"
-));
-}
+let gpu = if gpu {
+vec!["auto".to_string()]
+} else {
+vec![]
 };
 run::gateway_admin_deploy(
 &name,
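
Editor's note: after this change the CLI can only produce `[]` or `["auto"]`, which it presumably hands to the `with_gpu` builder shown in `lib.rs`. How `gateway_admin_deploy` actually does that is not visible in the diff. The sketch below is a stripped-down illustration under that assumption: the `DeployOptions` struct here keeps only the `gpu` field (the real struct has more), and `deploy_options_for` is a hypothetical helper, not part of the OpenShell API.

```rust
/// Reduced stand-in for the real `DeployOptions`; only the `gpu` field is kept.
#[derive(Debug, Default)]
struct DeployOptions {
    gpu: Vec<String>,
}

impl DeployOptions {
    /// Builder-style setter with the same shape as the one in the diff.
    #[must_use]
    fn with_gpu(mut self, gpu: Vec<String>) -> Self {
        self.gpu = gpu;
        self
    }
}

/// Hypothetical helper: maps the boolean `--gpu` flag to device IDs the same
/// way the new `main.rs` code does, then stores them on the options.
fn deploy_options_for(gpu_flag: bool) -> DeployOptions {
    let gpu = if gpu_flag {
        vec!["auto".to_string()]
    } else {
        vec![]
    };
    DeployOptions::default().with_gpu(gpu)
}
```

Because the flag is now a plain boolean, the `legacy` sentinel can no longer be requested from the command line; it remains reachable only through the resolution step at deploy time.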

docs/sandboxes/manage-gateways.md

Lines changed: 1 addition & 1 deletion
@@ -168,7 +168,7 @@ $ openshell gateway info --name my-remote-cluster

 | Flag | Purpose |
 |---|---|
-| `--gpu` | Enable NVIDIA GPU passthrough. Requires NVIDIA drivers and the Container Toolkit on the host. Accepts an optional value: omit for auto-select (CDI when enabled on the daemon, `--gpus all` otherwise), or `--gpu=legacy` to force `--gpus all`. |
+| `--gpu` | Enable NVIDIA GPU passthrough. Requires NVIDIA drivers and the Container Toolkit on the host. OpenShell auto-selects CDI when enabled on the daemon and falls back to Docker's NVIDIA GPU request path (`--gpus all`) otherwise. |
 | `--plaintext` | Listen on HTTP instead of mTLS. Use behind a TLS-terminating reverse proxy. |
 | `--disable-gateway-auth` | Skip mTLS client certificate checks. Use when a reverse proxy cannot forward client certs. |
 | `--registry-username` | Username for registry authentication. Defaults to `__token__` when `--registry-token` is set. Only needed for private registries. Also configurable with `OPENSHELL_REGISTRY_USERNAME`. |
