**File:** `.agents/skills/debug-navigator-cluster/SKILL.md` (+38 −38)
Diagnose why an openshell cluster failed to start after `openshell gateway start`.

1. **Pre-deploy check**: `openshell gateway start` in interactive mode prompts to **reuse** (keep volume, clean stale nodes) or **recreate** (destroy everything, fresh start). `mise run cluster` always recreates before deploy.
2. Ensure the cluster image is available (local build or remote pull).
3. Create the Docker network (`openshell-cluster`) and volume (`openshell-cluster-{name}`).
4. Create and start a privileged Docker container (`openshell-cluster-{name}`).
5. Wait for k3s to generate the kubeconfig (up to 60s).
6. **Clean stale nodes**: Remove any `NotReady` k3s nodes left over from previous container instances that reused the same persistent volume.
7. **Prepare local images** (if `OPENSHELL_PUSH_IMAGES` is set): In `internal` registry mode, bootstrap waits for the in-cluster registry and pushes tagged images there. In `external` mode, bootstrap uses legacy `ctr -n k8s.io images import` push-mode behavior.
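Step 6's stale-node cleanup boils down to filtering `kubectl get nodes` output for `NotReady` entries. A hedged sketch of that filter, run here on sample lines rather than a live cluster (in the real flow the matching names would be piped to `kubectl delete node`):

```bash
# Sketch of the step-6 filter: find NotReady nodes left by a previous
# container instance. The sample lines mimic `kubectl get nodes --no-headers`.
printf '%s\n' \
  'node-old   NotReady   control-plane   12d   v1.28.2' \
  'node-new   Ready      control-plane   5m    v1.29.0' |
awk '$2 == "NotReady" { print $1 }'
# → node-old
```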
The host port is configurable via `--port` on `openshell gateway start` (default 8080).

The TCP host is also added as an extra gateway TLS SAN so mTLS hostname validation succeeds.

The default cluster name is `openshell`. The container is `openshell-cluster-{name}`.
## Prerequisites
When the user asks to debug a cluster failure, **run diagnostics automatically**.

Before running commands, establish:

1. **Cluster name**: Default is `openshell`, giving container name `openshell-cluster-openshell`.
2. **Remote or local**: If the user deployed with `--remote <host>`, all Docker commands must target that host.
The Envoy Gateway provides HTTP/gRPC ingress:

```bash
# Gateway status
docker exec openshell-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get gateway/navigator-gateway'

# Check port bindings on the host
docker port openshell-cluster-<name>
```

Expected ports: `6443/tcp`, `30051/tcp` (mapped to configurable host port, default 8080; set via `--port` on deploy).

Only one local cluster can run on a Docker host at a time because `6443` is fixed. `mise run cluster` handles this by removing conflicting local `openshell-cluster-*` containers first.

If ports are missing or conflicting, another process may be using them. Check with:
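One way to run that check, sketched here as an assumption (command availability varies by host; on macOS, `lsof -nP -iTCP:6443 -sTCP:LISTEN` is the rough equivalent):

```bash
# See whether anything already listens on the fixed k3s API port 6443.
# Prints a friendly note when the port is free (or `ss` is unavailable).
{ ss -ltn 2>/dev/null || true; } | awk '
  NR > 1 && $4 ~ /:6443$/ { print "in use:", $0; found = 1 }
  END { if (!found) print "port 6443 looks free" }'
```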
If using Docker-in-Docker (`DOCKER_HOST=tcp://docker:2375`), verify metadata poi…

Component images (server, sandbox, pki-job) can reach kubelet via two paths:

**Local/external pull mode** (default local via `mise run cluster`): Local images are tagged to the configured local registry base (default `127.0.0.1:5000/openshell/*`), pushed to that registry, and pulled by k3s via `registries.yaml` mirror endpoint (typically `host.docker.internal:5000`). The `cluster` task pushes prebuilt local tags (`openshell/*:dev`, falling back to `localhost:5000/openshell/*:dev` or `127.0.0.1:5000/openshell/*:dev`).

```bash
# Verify image refs currently used by openshell deployment
docker exec openshell-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get deploy navigator -o jsonpath="{.spec.template.spec.containers[*].image}"'
```
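That jsonpath prints a space-separated list of image refs. A hypothetical follow-up check that every ref uses the expected local registry base (the sample refs below stand in for real jsonpath output):

```bash
# Sample refs stand in for the jsonpath output of the command above.
images='127.0.0.1:5000/openshell/server:dev 127.0.0.1:5000/openshell/sandbox:dev'
status=ok
for img in $images; do
  case "$img" in
    127.0.0.1:5000/openshell/*) ;;               # expected local registry base
    *) echo "unexpected image base: $img"; status=bad ;;
  esac
done
echo "image base check: $status"
# → image base check: ok
```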
**Legacy push mode**: Images are imported into the k3s containerd `k8s.io` namespace.

```bash
# Check if images were imported into containerd (k3s default namespace is k8s.io)
docker exec openshell-cluster-<name> ctr -a /run/k3s/containerd/containerd.sock images ls | grep navigator
```

If images are missing, re-import with:

```bash
docker save <image-ref> | docker exec -i openshell-cluster-<name> ctr -a /run/k3s/containerd/containerd.sock images import -
```
**External pull mode** (remote deploy, or local with `OPENSHELL_REGISTRY_HOST`/`IMAGE_REPO_BASE` pointing at a non-local registry): Images are pulled from an external registry at runtime. The entrypoint generates `/etc/rancher/k3s/registries.yaml`.

```bash
# Verify registries.yaml exists and has credentials
# Test pulling an image manually from inside the cluster
docker exec openshell-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml crictl pull ghcr.io/nvidia/nemoclaw/server:latest'
```

If `registries.yaml` is missing or has wrong values, verify env wiring (`OPENSHELL_REGISTRY_HOST`, `OPENSHELL_REGISTRY_INSECURE`, username/password for authenticated registries).
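For reference, k3s's `registries.yaml` follows a standard mirrors/configs layout; a sketch with placeholder values only — the exact keys the entrypoint emits, and the registry host, are assumptions here:

```yaml
# Hypothetical shape, not the entrypoint's actual output. Substitute the
# host from OPENSHELL_REGISTRY_HOST and the credential env vars.
mirrors:
  "ghcr.io":
    endpoint:
      - "https://ghcr.io"
configs:
  "ghcr.io":
    auth:
      username: "<registry-user>"
      password: "<registry-password>"
```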
TLS certificates are generated by the `navigator-bootstrap` crate (using `rcgen`…).

```bash
# Check if the three TLS secrets exist
docker exec openshell-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-server-tls navigator-server-client-ca navigator-client-tls'

# Inspect server cert expiry (if openssl is available in the container)
docker exec openshell-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl -n navigator get secret navigator-server-tls -o jsonpath="{.data.tls\.crt}" | base64 -d | openssl x509 -noout -dates 2>/dev/null || echo "openssl not available"'

# Check if CLI-side mTLS files exist locally
ls -la ~/.config/openshell/clusters/<name>/mtls/
```
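The `notAfter=` line that `openssl x509 -noout -dates` prints can be turned into an expired/valid verdict. A sketch assuming GNU `date` (BSD `date` needs `-j -f` instead); the sample line stands in for output from the real secret:

```bash
# Parse an example notAfter line; with a real cert, substitute the
# output of the openssl command above.
not_after='notAfter=Jan  1 00:00:00 2031 GMT'
exp=$(date -u -d "${not_after#notAfter=}" +%s)
now=$(date -u +%s)
if [ "$exp" -le "$now" ]; then
  echo "server cert EXPIRED"
else
  echo "server cert valid for $(( (exp - now) / 86400 )) more days"
fi
```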
Common mTLS issues:

Events catch scheduling failures, image pull errors, and resource issues:

```bash
docker exec openshell-cluster-<name> sh -lc 'KUBECONFIG=/etc/rancher/k3s/k3s.yaml kubectl get events -A --sort-by=.lastTimestamp' | tail -n 50
```
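When that tail is noisy, the `Warning` events are usually the signal. A sketch of the narrowing filter, run on sample rows shaped like `kubectl get events -A` output (column 3 is the event type):

```bash
# Sample rows stand in for real events from the command above.
printf '%s\n' \
  'navigator   2m   Normal    Pulled    pod/navigator-abc   Successfully pulled image' \
  'navigator   1m   Warning   BackOff   pod/navigator-abc   Back-off restarting failed container' |
awk '$3 == "Warning"'
```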
Look for:

DNS misconfiguration is a common root cause, especially on remote/Linux hosts:

| `-g`, `--gateway <NAME>` | Gateway to operate on. Also settable via `OPENSHELL_GATEWAY` env var. Falls back to active gateway in `~/.config/openshell/active_gateway`. |

## Environment Variables

| Variable | Description |
|----------|-------------|
| `OPENSHELL_GATEWAY` | Override active gateway name (same as `--gateway`) |
| `OPENSHELL_SANDBOX_POLICY` | Path to default sandbox policy YAML (fallback when `--policy` is not provided) |

---
Print or start an SSH tunnel for kubectl access to a remote cluster.

### `openshell gateway select [name]`

Set the active gateway. Writes to `~/.config/openshell/active_gateway`. When called without arguments, lists all provisioned gateways with the active one marked with `*`.

---
Delete one or more providers by name.

---

## Inference Commands

### `openshell inference set`

Configure the managed gateway inference route used by `inference.local`. Both flags are required.

| Flag | Default | Description |
|------|---------|-------------|
| `--provider <NAME>` | -- | Provider record name (required) |
| `--model <ID>` | -- | Model identifier to use for generation requests (required) |
### `openshell inference update`

Partially update the gateway inference configuration. Fetches the current config and applies only the provided overrides. At least one flag is required.

| Flag | Default | Description |
|------|---------|-------------|
| `--provider <NAME>` | unchanged | Provider record name |