AI sentinel agent #98
CI Commands
The following CI workflows run automatically on every push and pull request:
The following commands can be used by maintainers to trigger additional tests that require access to secrets:
docker push ${{ env.NPD_IMAGE }}:latest

- name: Build and push Sentinel image
  uses: docker/build-push-action@v6
Check warning (Code scanning / CodeQL): Unpinned tag for a non-immutable Action in workflow. Severity: Medium
"labels": dict(node.metadata.labels or {}),
"annotations": {
    k: v for k, v in (node.metadata.annotations or {}).items()
    if k.startswith("gcm-sentinel") or k.startswith("node.kubernetes.io/")
Check failure (Code scanning / CodeQL): Incomplete URL substring sanitization. Severity: High
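For reference, the flagged filter can be written as an exact match on the annotation key's prefix segment (the part before the first `/`) instead of a raw `startswith` on a domain-like string. The sketch below is only an illustration: `ALLOWED_PREFIX_DOMAINS`, `ALLOWED_KEY_PREFIXES`, `keep_annotation`, and `filter_annotations` are hypothetical names that do not appear in the PR, and whether this form satisfies the CodeQL rule has not been verified here. It keeps the same keys as the flagged expression.

```python
from typing import Dict, Optional

# Hypothetical constants, not taken from the PR.
ALLOWED_PREFIX_DOMAINS = ("node.kubernetes.io",)  # matched exactly against the key prefix
ALLOWED_KEY_PREFIXES = ("gcm-sentinel",)          # matched with startswith, as in the diff


def keep_annotation(key: str) -> bool:
    """Return True if the annotation key should be exposed to the agent."""
    # Annotation keys have the form "<prefix>/<name>"; compare the prefix
    # segment as a whole instead of testing a substring of the key.
    prefix, sep, _name = key.partition("/")
    if sep and prefix in ALLOWED_PREFIX_DOMAINS:
        return True
    return key.startswith(ALLOWED_KEY_PREFIXES)


def filter_annotations(annotations: Optional[Dict[str, str]]) -> Dict[str, str]:
    """Keep only the annotations whose keys pass keep_annotation."""
    return {k: v for k, v in (annotations or {}).items() if keep_annotation(k)}
```

Keeping the allowed values in module-level tuples makes the filter easy to audit and extend in one place.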
Summary:
gcm-sentinel: AI-powered GPU cluster investigation agent
Summary
Adds gcm-sentinel, a new component to the GCM ecosystem that uses an LLM (Claude or GPT) to investigate GPU hardware failures detected by GCM Health Checks. When a node condition changes (e.g. GcmXidErrorsProblem), the agent queries Prometheus metrics, Kubernetes state, pod logs, and GPU exporter data, then produces a severity assessment, root cause analysis, and recommended remediation action.
Key design decisions
- Observe-only by default (actionMode=recommend). Remediation tools are absent from the LLM's schema unless the agent is explicitly set to execute mode. Safety is enforced at the code level, not the prompt level.
- Plugin-based data sources. Each data source (Prometheus, DCGM direct, node-exporter, K8s core, workloads, GCM health, Alertmanager) is a self-contained Python class. Adding a new one takes one file plus one line of registration.
- Multi-LLM support. Backend abstraction supports Anthropic and OpenAI APIs. Default: Claude Sonnet 4.6.
- Separate Helm chart at charts/gcm-sentinel/ (alongside existing charts/gcm/). Separate PyPI package (gcm-sentinel). Independent deployment, RBAC, and lifecycle from the core GCM DaemonSets.
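The first two design decisions can be pictured with a minimal sketch. It is not the PR's code: the `DATA_SOURCES` registry, the `register` decorator, the two example classes, the tool names (including the remediation tool `cordon_node`), and `build_tool_schema` are all hypothetical, chosen only to show how a registry of data source classes and a code-level filter on the action mode might fit together.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping a data source name to its class.
DATA_SOURCES: Dict[str, type] = {}


def register(name: str) -> Callable[[type], type]:
    """Class decorator standing in for the 'one line of registration'."""
    def wrap(cls: type) -> type:
        DATA_SOURCES[name] = cls
        return cls
    return wrap


@register("prometheus")
class PrometheusSource:
    """Illustrative read-only source; a real one would query Prometheus."""
    def tools(self) -> List[dict]:
        return [{"name": "query_prometheus", "read_only": True}]


@register("k8s_core")
class K8sCoreSource:
    """Illustrative source exposing one read tool and one remediation tool."""
    def tools(self) -> List[dict]:
        return [
            {"name": "get_node", "read_only": True},
            {"name": "cordon_node", "read_only": False},  # hypothetical remediation tool
        ]


def build_tool_schema(action_mode: str = "recommend") -> List[dict]:
    """Collect the tools offered to the LLM for the given action mode.

    In "recommend" mode the remediation tools are left out of the schema
    entirely, so the model cannot call them whatever the prompt says.
    """
    if action_mode not in ("recommend", "execute"):
        raise ValueError(f"unknown action mode: {action_mode!r}")
    schema: List[dict] = []
    for name in sorted(DATA_SOURCES):
        for tool in DATA_SOURCES[name]().tools():
            if tool["read_only"] or action_mode == "execute":
                schema.append(tool)
    return schema
```

Under these assumptions, `build_tool_schema("recommend")` returns only the read-only tools, and `cordon_node` appears only when `"execute"` is passed explicitly.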
Deploy to a cluster (observe-only):
```
# Create API key secret
kubectl create secret generic gcm-sentinel-llm \
  --namespace monitoring \
  --from-literal=api-key=YOUR_KEY

# Deploy: observe-only, single node
helm install gcm-sentinel oci://ghcr.io/facebookresearch/charts/gcm-sentinel \
  --namespace monitoring \
  --set llm.existingSecret=gcm-sentinel-llm \
  --set sentinel.nodeAllowlist="YOUR_GPU_NODE_NAME"

# Verify
kubectl logs -n monitoring -l app=gcm-sentinel | head -10
kubectl get events -A --field-selector reason=GCMSentinel --watch
```
Differential Revision: D96650734
ddfd9af to 521ff6b
Claude finished @luccabb's task in 1m 50s
PR Review: AI Sentinel Agent
This is a well-structured addition. The plugin-based datasource architecture, three-tier safety model (LLM schema → code guards → RBAC), and the observe-by-default posture are all solid design choices. Below are findings ranging from critical to minor.
Critical: Race condition on
Hi @luccabb! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with
If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!