Architecture Review: K8s Troubleshooting with Nemoclaw + Nemotron-3-120B #966

uzunenes · 2026-03-26T08:40:30Z

uzunenes
Mar 26, 2026

Hi everyone, looking for best practice advice on this architecture:

The Stack: nvidia/nemotron-3-super-120b-a12b (2x H200) + Nemoclaw on a separate VM (same DC, HTTP endpoint).

The Plan: Give Nemoclaw K8s API access (via token) to a test namespace. Trigger it via CronJob or Telegram to scan for issues and send SMTP alerts.

Important context: I already use Grafana alerts for deterministic problems. This LLM setup is strictly for rapid detection of complex, non-deterministic edge cases.

My Questions:

Is querying the K8s API directly the best practice for Nemoclaw in this scenario?
Alternatively, should I ship all namespace logs/events to my Grafana stack first and have Nemoclaw analyze them from there instead of direct K8s access?
Any quick tips on filtering the data to avoid blowing up the context window?
Thanks!

BenediktSchackenberg · 2026-03-26T12:58:19Z

BenediktSchackenberg
Mar 26, 2026

Interesting setup. I've been running OpenClaw in a similar split-infra pattern (not K8s specifically, but agent + external APIs on separate hosts). A few thoughts:

1. Direct K8s API vs Grafana

Direct K8s API is better for your use case. The agent can query exactly what it needs (kubectl get events --field-selector reason=BackOff, pod logs, describe output) instead of parsing pre-aggregated dashboards. Grafana is great for time-series, but LLMs work better with structured, point-in-time snapshots.

That said — don't give it cluster-admin. Create a ServiceAccount with a Role scoped to your test namespace: get/list/watch on pods, events, deployments, replicasets, and get on logs. That's enough for troubleshooting without risk.

2. Context window management

This is the real challenge. A busy namespace can produce megabytes of events and logs per hour. What works:

Pre-filter in the query, not in the prompt. Use --since=15m on logs, --field-selector on events. Don't dump everything and ask the LLM to find the problem.
Two-pass approach: first pass queries high-level state (events, pod status, restart counts), second pass drills into specific pod logs only if the first pass flags something. Keeps most runs small.
Truncate per-pod logs to last ~200 lines. If the root cause isn't in the last 200 lines of a crashing pod, it's usually in events or describe output instead.

With Nemotron-3-120B you have a decent context window, but token cost per run adds up fast if you're scanning every 5 minutes.

3. Practical tip for the CronJob trigger

Rather than a fixed interval, consider having your CronJob check for recent warning events first (kubectl get events --field-selector type=Warning --since=5m). If the count is 0, skip the full analysis. Saves a lot of inference calls on quiet clusters.

What namespace complexity are we talking about? (number of pods, typical churn rate) That would help narrow down the filtering strategy.

0 replies

uzunenes · 2026-03-28T17:57:51Z

uzunenes
Mar 28, 2026
Author

The test namespace doesn't exist yet. I am currently creating it from scratch just to test nemoclaw. My plan is to start small by setting up a dummy test backend-app's, test databases etc. to validate the workflow and see how the LLM handles the context.

Once I prove this concept works, my main goal is to deploy it to real applications. I will probably split my applications into different namespaces and assign different roles across various VMs.

I will definitely apply your filtering and RBAC tips while building the system. I'll make sure to share my test results and findings here once the setup is up and running.

Thanks again.

0 replies

andresl123 · 2026-03-31T16:03:08Z

andresl123
Mar 31, 2026

Were you able to get OpenClaw to access your K8s cluster without using an SSH tunnel? I tried running it directly using kubectl get pods (with roles and everything set up), but I’m getting this error: 'Direct access to standard Kubernetes control-plane ports (like port 6443) is unconditionally hard-blocked.'

As far as I understand, it's blocked by OpenShell and cannot be changed. But maybe I misunderstood something? :)

0 replies

BenediktSchackenberg · 2026-03-31T16:10:54Z

BenediktSchackenberg
Mar 31, 2026

Yeah that's by design — OpenShell hard-blocks a handful of ports that could let the sandbox mess with the host's control plane. Port 6443 (K8s API) is one of them, and you can't override it through the network policy file.

What I'd do instead: run kubectl outside the sandbox and expose the results to the agent. Two options depending on how far you want to go:

Quick and dirty: Write a small wrapper script on the host that runs your kubectl commands and dumps the output to a file the sandbox can read. Trigger it via cron or inotify. The agent reads the file, analyzes it, done. Zero network access needed.
Cleaner: Stand up a lightweight REST API on the host (even a simple Flask/Express app) that wraps your kubectl calls behind specific endpoints like /api/pods, /api/events?since=5m, etc. Add that host+port to the sandbox network policy. That way the agent can query K8s state on demand without having direct API server access.

Option 2 is basically what @uzunenes would end up building anyway for the production setup — a controlled interface between the agent and the cluster, rather than handing it raw API access.

The hard-block exists for a good reason though. If the sandbox could talk to 6443, a compromised agent could potentially escalate privileges through the K8s API. The indirection layer forces you to decide exactly what the agent is allowed to see and do.

1 reply

andresl123 Mar 31, 2026

Thanks for the reply :)
I got it. That's a great approach. The second option is much more dynamic and we can adapt it as needed. I will definitely try this one!

What I was doing before was creating an SSH tunnel using this command:
ssh -o ProxyCommand="corkscrew 10.200.0.1 3128 %h %p" -L 16443:127.0.0.1:6443 user@192.168.x.x

And then running my kubectl (something like this: "kubectl --server=https://127.0.0.1:16443 --insecure-skip-tls-verify=true get nodes") commands through that local port to access the K8s API. Asked to the agent create an skill to use it more dynamically

Of course, I created a custom kubeconfig for this, and the user has their own RBAC policies limiting their access."

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Architecture Review: K8s Troubleshooting with Nemoclaw + Nemotron-3-120B #966

Uh oh!

{{title}}

Uh oh!

Replies: 4 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Architecture Review: K8s Troubleshooting with Nemoclaw + Nemotron-3-120B #966

Uh oh!

uzunenes Mar 26, 2026

Replies: 4 comments · 1 reply

Uh oh!

BenediktSchackenberg Mar 26, 2026

Uh oh!

uzunenes Mar 28, 2026 Author

Uh oh!

andresl123 Mar 31, 2026

Uh oh!

BenediktSchackenberg Mar 31, 2026

Uh oh!

andresl123 Mar 31, 2026

uzunenes
Mar 26, 2026

Replies: 4 comments 1 reply

BenediktSchackenberg
Mar 26, 2026

uzunenes
Mar 28, 2026
Author

andresl123
Mar 31, 2026

BenediktSchackenberg
Mar 31, 2026