-
Notifications
You must be signed in to change notification settings - Fork 86
feat: cloud-native GPU health event management #795
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
ArangoGutierrez
wants to merge
58
commits into
main
Choose a base branch
from
feat/cloud-native-healthevents
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Draft
Changes from 48 commits
Commits
Show all changes
58 commits
Select commit
Hold shift + click to select a range
006c572
chore - Add small set of GitHub checks and copilot rules (#691)
ArangoGutierrez 35c2bbd
feat: introduce k8s-idiomatic Go SDK for Device API (#692)
pteranodan bf8db76
chore: update documentation (#693)
pteranodan 85db526
api: add ProviderService proto for device-api-server
ArangoGutierrez 9d40418
feat: add device-api-server with NVML fallback provider
ArangoGutierrez f28013e
fix(security): bind gRPC TCP listener to localhost by default
ArangoGutierrez ef3c5a8
docs(provider): clarify Heartbeat RPC is reserved for future use
ArangoGutierrez 6fe8f63
feat(consumer): populate ListMeta.ResourceVersion in ListGpus response
ArangoGutierrez 76fb570
docs: document module structure and naming conventions
ArangoGutierrez 69c9efd
fix: align module paths to github.com/nvidia/nvsentinel
ArangoGutierrez c9360cd
refactor: consolidate to unified GpuService with standard CRUD methods
ArangoGutierrez 3820941
fix(build): use Go 1.25 for container builds
ArangoGutierrez a9f9432
fix(nvml-provider): parse command-line flags before returning config
ArangoGutierrez da15558
fix(helm): remove invalid --provider-address flag from server args
ArangoGutierrez 7f07d0b
fix(helm): correct sidecar test values for cluster deployment
ArangoGutierrez 15efc6e
feat(demo): improve cross-platform builds and idempotency
ArangoGutierrez 996aed0
fix(demo): remove unreliable metrics check from verify_gpu_registration
ArangoGutierrez af47a00
fix(demo): use correct container name 'nvsentinel' instead of 'device…
ArangoGutierrez 7b65691
fix(ci): update protoc version to v33.4 to match generated files
ArangoGutierrez 196c165
docs: add hybrid device-apiserver design for PR #718 + #720 merge
ArangoGutierrez 4f236c4
chore: add .worktrees/ and temp docs to .gitignore
ArangoGutierrez 4457249
feat: implement cloud-native GPU health event management
ArangoGutierrez 1a8dd5d
docs: add AGENTS.md for AI assistant context persistence
ArangoGutierrez 7e73ba3
docs(AGENTS.md): add E2E test migration plan
ArangoGutierrez 5d15de0
docs(AGENTS.md): complete Phase 1 E2E test audit
ArangoGutierrez 6288d60
feat(tests): add HealthEvent CRD test helpers
ArangoGutierrez bd3fc9b
docs: update AGENTS.md with Phase 2 completion
ArangoGutierrez a7f0996
test(e2e): migrate smoke_test.go to CRD-based flow
ArangoGutierrez ac9f421
docs: update AGENTS.md with Phase 3 progress
ArangoGutierrez 30dad6e
test(e2e): migrate fault_quarantine_test.go to CRD-based flow
ArangoGutierrez 2385015
test(e2e): migrate node_drainer_test.go to CRD-based flow
ArangoGutierrez 3c9ad1a
test(e2e): migrate fault_remediation_test.go to CRD-based flow
ArangoGutierrez b46933b
docs: complete Phase 3 E2E test migration
ArangoGutierrez a7f3cc7
fix(tests): register HealthEvent types and fix context keys
ArangoGutierrez abcad80
docs: document Phase 4 validation results
ArangoGutierrez 0efe0dc
fix: restore complete .versions.yaml from main
ArangoGutierrez f10a510
fix: restore complete Makefile from main
ArangoGutierrez c825c99
ci: remove legacy MongoDB/PostgreSQL E2E tests
ArangoGutierrez 43070a1
fix(deviceapiserver): migrate to metadata.resource_version after prot…
ArangoGutierrez b22c586
fix(ci): add missing strategy.matrix to go job
ArangoGutierrez c63e1f3
fix(health-provider): remove duplicate flag.Parse() calls
ArangoGutierrez a5f10ac
fix(controller-test): use metricsAddr flag in manager options
ArangoGutierrez ad0e661
fix(api): remove duplicate resource_version and clarify CreateGpu docs
ArangoGutierrez 4e0c60d
chore(deps): update go.mod for codegen dependencies
ArangoGutierrez fc9cfa9
fix(codegen): use Function() for fmt.Errorf and fix connection leak
ArangoGutierrez dbd2156
fix(docs): correct import path, remove duplicates, add license
ArangoGutierrez ec1d459
fix(client-go): add nil guards and fix resource leak
ArangoGutierrez f56b1fc
style: fix lint issues in api and client-go
ArangoGutierrez ff326b7
fix: address PR review findings across cache, controllers, and codegen
ArangoGutierrez 260812f
fix: drain controller accepts Draining phase on re-entry
ArangoGutierrez 33209ed
fix: remediation controller requeues while reboot job is running
ArangoGutierrez bde3e4d
fix: executeGPUReset uses Patch instead of Update
ArangoGutierrez a952254
fix: quarantine controller cancels event when node not found
ArangoGutierrez 6d78945
fix: bind gRPC to localhost in static manifest
ArangoGutierrez 213dbc6
fix: correct Dockerfile EXPOSE ports to match actual defaults
ArangoGutierrez e2ad468
chore: remove unused Prometheus metrics from healthevents controllers
ArangoGutierrez 36782b1
fix: reject malformed resource_version with InvalidArgument
ArangoGutierrez d677a61
perf: broadcast events outside write lock in MarkProviderGPUsUnknown
ArangoGutierrez File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,35 @@ | ||
| # Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | ||
| # | ||
| # Licensed under the Apache License, Version 2.0 (the "License"); | ||
| # you may not use this file except in compliance with the License. | ||
| # You may obtain a copy of the License at | ||
| # | ||
| # http://www.apache.org/licenses/LICENSE-2.0 | ||
| # | ||
| # Unless required by applicable law or agreed to in writing, software | ||
| # distributed under the License is distributed on an "AS IS" BASIS, | ||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
| # See the License for the specific language governing permissions and | ||
| # limitations under the License. | ||
|
|
||
| # ============================================================================== | ||
| # GIT ATTRIBUTES | ||
| # ============================================================================== | ||
| # Use 'linguist-generated=true' to hide generated code from GitHub PR diffs. | ||
| # ============================================================================== | ||
|
|
||
| # Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc) | ||
| zz_generated.*.go linguist-generated=true | ||
|
|
||
| # Hide generated Protobuf Go bindings | ||
| *.pb.go linguist-generated=true | ||
|
|
||
| # Hide generated client library | ||
| client-go/client/** linguist-generated=true | ||
| client-go/listers/** linguist-generated=true | ||
| client-go/informers/** linguist-generated=true | ||
|
|
||
| # Hide copied, unmodified upstream code | ||
| code-generator/cmd/client-gen/types/** linguist-generated=true | ||
| code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true | ||
| code-generator/cmd/client-gen/generators/util/** linguist-generated=true | ||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,260 +1,115 @@ | ||
| # GitHub Copilot Instructions for NVSentinel | ||
| # Copilot Instructions for NVSentinel | ||
|
|
||
| ## Project Overview | ||
|
|
||
| NVSentinel is a GPU Node Resilience System for Kubernetes that automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It's designed for high-performance computing environments running NVIDIA GPUs. | ||
| This file provides repository-level instructions for GitHub Copilot to improve | ||
| code reviews and suggestions. | ||
|
|
||
| **Status**: Experimental/Preview Release - APIs and configurations may change. | ||
| ## Project Overview | ||
|
|
||
| ## Architecture & Technologies | ||
| NVSentinel is the NVIDIA Device API — a Protocol Buffer and gRPC-based API for | ||
| GPU device management. The codebase consists primarily of `.proto` definitions | ||
| and their generated Go code. | ||
|
|
||
| ### Core Technologies | ||
| - **Language**: Go 1.25+ (primary), Python 3.10+ (monitoring tools) | ||
| - **Container Platform**: Kubernetes 1.25+ | ||
| - **Deployment**: Helm 3.0+, Tilt (development) | ||
| - **Storage**: MongoDB (event store with change streams) | ||
| - **Communication**: gRPC with Protocol Buffers | ||
| - **GPU Monitoring**: NVIDIA DCGM (Data Center GPU Manager) | ||
| ## Repository Structure | ||
|
|
||
| ### Project Structure | ||
| ``` | ||
| ├── health-monitors/ # Pluggable health detection modules | ||
| │ ├── gpu-health-monitor/ # DCGM-based GPU monitoring (Python) | ||
| │ ├── csp-health-monitor/ # Cloud provider health checks (Go) | ||
| │ └── syslog-health-monitor/ # System log analysis (Go) | ||
| ├── health-events-analyzer/ # Event classification and routing | ||
| ├── fault-quarantine/ # Node isolation (cordon) | ||
| ├── node-drainer/ # Workload eviction | ||
| ├── fault-remediation/ # Break-fix automation | ||
| ├── labeler/ # Node labeling (DCGM version, driver status, Kata detection) | ||
| ├── janitor/ # State cleanup and maintenance | ||
| ├── platform-connectors/ # CSP integration (GCP, AWS, Azure) | ||
| ├── commons/ # Shared utilities | ||
| ├── data-models/ # Protocol Buffer definitions | ||
| ├── store-client/ # MongoDB client library | ||
| └── distros/kubernetes/ # Helm charts | ||
| api/ | ||
| ├── proto/ # Protocol Buffer source definitions (edit these) | ||
| │ └── device/v1alpha1/ | ||
| │ └── gpu.proto | ||
| ├── gen/go/ # Generated Go code (DO NOT edit manually) | ||
| │ └── device/v1alpha1/ | ||
| │ ├── gpu.pb.go | ||
| │ └── gpu_grpc.pb.go | ||
| ├── go.mod | ||
| └── Makefile | ||
| ``` | ||
|
|
||
| ## Coding Standards | ||
|
|
||
| ### Go Code Guidelines | ||
|
|
||
| #### Module Organization | ||
| - Each service is a separate Go module with its own `go.mod` | ||
| - Use semantic import versioning | ||
| - Keep dependencies minimal and up-to-date | ||
| - Use `commons/` for shared utilities across modules | ||
|
|
||
| #### Code Style | ||
| - Follow standard Go conventions (gofmt, golint) | ||
| - Use structured logging via `log/slog` | ||
| - Error handling: wrap errors with context using `fmt.Errorf("context: %w", err)` | ||
| - Within `retry.RetryOnConflict` blocks, return errors **without wrapping** to preserve retry behavior | ||
| - Use meaningful variable names (`synced` over `ok` for cache sync checks) | ||
|
|
||
| #### Kubernetes Integration | ||
| - Use `client-go` for Kubernetes API interactions | ||
| - Prefer informers over direct API calls for watching resources | ||
| - Use `envtest` for testing Kubernetes controllers (not fake clients) | ||
| - Implement proper shutdown handling with context cancellation | ||
|
|
||
| #### Testing Requirements | ||
| - Use `testify/assert` and `testify/require` for assertions | ||
| - Write table-driven tests when testing multiple scenarios | ||
| - Use `envtest` for integration tests with real Kubernetes API | ||
| - Test coverage: aim for >80% on critical paths | ||
| - Name tests descriptively: `TestFunctionName_Scenario_ExpectedBehavior` | ||
|
|
||
| ### Python Code Guidelines | ||
| - Use Poetry for dependency management | ||
| - Follow PEP 8 style guide | ||
| - Use Black for formatting | ||
| - Type hints required for all functions | ||
| - Use dataclasses for structured data | ||
|
|
||
| ### Protobuf Guidelines | ||
| - Define messages in `data-models/protobufs/` | ||
| - Use semantic versioning for breaking changes | ||
| - Include comprehensive comments for all fields | ||
| - Generate code with: `make protos-generate` | ||
|
|
||
| ## Development Workflows | ||
|
|
||
| ### Building & Testing | ||
| ```bash | ||
| # Lint and test all modules | ||
| make lint-test-all | ||
| ## Code Review Guidelines | ||
|
|
||
| # Lint specific module | ||
| cd labeler && make lint | ||
| ### Protocol Buffers | ||
|
|
||
| # Test specific module | ||
| cd health-events-analyzer && make test | ||
| When reviewing `.proto` files: | ||
|
|
||
| # Build container images (uses ko) | ||
| make images | ||
| - **Field naming**: Use `snake_case` for field names | ||
| - **Message/Service naming**: Use `CamelCase` | ||
| - **Documentation**: Every message, field, and RPC must have documentation | ||
| comments explaining purpose and usage | ||
| - **Field numbers**: Never reuse or change field numbers in existing messages | ||
| - **Backwards compatibility**: New fields should be optional; avoid breaking | ||
| changes to existing APIs | ||
| - **Style**: Follow [Google's Protocol Buffer Style Guide](https://protobuf.dev/programming-guides/style/) | ||
|
|
||
| # View all make targets | ||
| make help | ||
| ``` | ||
| ### Generated Code | ||
|
|
||
| ### Version Management | ||
| - All tool versions centralized in `.versions.yaml` | ||
| - Use `yq` to read versions in scripts | ||
| - Update versions in one place, propagates everywhere | ||
| - Check versions: `make show-versions` | ||
| - Files in `api/gen/` are **auto-generated** — do not suggest edits to these | ||
| - If generated code appears outdated, suggest running `make protos-generate` | ||
|
|
||
| ### Go Code | ||
|
|
||
| - Use `gofmt` formatting (enforced by golangci-lint) | ||
| - Documentation comments should wrap at 80 characters | ||
| - Follow standard Go idioms and error handling patterns | ||
| - No manual edits to `*.pb.go` or `*_grpc.pb.go` files | ||
|
|
||
| ## Commit Standards | ||
|
|
||
| ### Conventional Commits | ||
|
|
||
| All commits must follow conventional commit format: | ||
|
|
||
| ### Local Development with Tilt | ||
| ```bash | ||
| cd tilt | ||
| tilt up # Start local development environment | ||
| ``` | ||
| feat: add new GPU condition type | ||
| fix: correct timestamp handling in conditions | ||
| docs: update API documentation | ||
| chore: update protoc-gen-go version | ||
| ``` | ||
|
|
||
| ### DCO Sign-off | ||
|
|
||
| All commits **must** include a DCO sign-off line: | ||
|
|
||
| ## Kata Containers Detection | ||
|
|
||
| The labeler implements Kata Containers detection: | ||
|
|
||
| ### Detection Architecture | ||
| - **Input labels** (on nodes): `katacontainers.io/kata-runtime` (default) + optional custom label | ||
| - **Output label** (set by labeler): `nvsentinel.dgxc.nvidia.com/kata.enabled: "true"|"false"` | ||
| - **Truthy values**: `"true"`, `"enabled"`, `"1"`, `"yes"` (case-insensitive) | ||
| - **Lifecycle separation**: Pod events → DCGM/driver labels, Node events → kata labels | ||
|
|
||
| ### DaemonSet Variants | ||
| - Separate DaemonSets for kata vs regular nodes | ||
| - Selection via `nodeAffinity` based on kata.enabled label | ||
| - Different volume mounts: | ||
| - Regular: `/var/log` (file-based logs) | ||
| - Kata: `/run/log/journal` and `/var/log/journal` (systemd journal) | ||
|
|
||
| ## Important Patterns | ||
|
|
||
| ### Error Handling in Retry Loops | ||
| ```go | ||
| // ✅ CORRECT - Return error as-is for retry logic | ||
| err := retry.RetryOnConflict(retry.DefaultBackoff, func() error { | ||
| _, err := client.Update(ctx, obj, metav1.UpdateOptions{}) | ||
| return err // Don't wrap! | ||
| }) | ||
|
|
||
| // ❌ WRONG - Wrapping breaks retry detection | ||
| err := retry.RetryOnConflict(retry.DefaultBackoff, func() error { | ||
| _, err := client.Update(ctx, obj, metav1.UpdateOptions{}) | ||
| return fmt.Errorf("failed: %w", err) // Breaks retry! | ||
| }) | ||
| ``` | ||
| Signed-off-by: Name <email@example.com> | ||
| ``` | ||
|
|
||
| Use `git commit -s` to add this automatically. | ||
|
|
||
| ## License Headers | ||
|
|
||
| All source files must include the Apache 2.0 license header: | ||
|
|
||
| ### Informer Usage | ||
| ```go | ||
| // Use separate informers for different resource types | ||
| podInformer := factory.Core().V1().Pods().Informer() | ||
| nodeInformer := factory.Core().V1().Nodes().Informer() | ||
|
|
||
| // Extract event handler setup into helper methods | ||
| func (l *Labeler) registerPodEventHandlers() error { | ||
| _, err := l.podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{ | ||
| // handlers... | ||
| }) | ||
| return err | ||
| } | ||
| ``` | ||
| Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved. | ||
|
|
||
| ### Testing with envtest | ||
| ```go | ||
| // Use real Kubernetes API server for tests | ||
| testEnv := &envtest.Environment{ | ||
| CRDDirectoryPaths: []string{}, | ||
| } | ||
| cfg, err := testEnv.Start() | ||
| // Create clients from cfg | ||
| // Run tests | ||
| testEnv.Stop() | ||
| Licensed under the Apache License, Version 2.0 (the "License"); | ||
| ... | ||
| ``` | ||
|
|
||
| ## Documentation Standards | ||
|
|
||
| ### Code Comments | ||
| - Package-level godoc for all packages | ||
| - Function comments for exported functions | ||
| - Inline comments for complex logic only | ||
| - TODO comments should reference issues | ||
|
|
||
| ### Helm Chart Documentation | ||
| - Document all values in `values.yaml` with inline comments | ||
| - Include examples for non-obvious configurations | ||
| - Note truthy value requirements where applicable | ||
| - Explain DaemonSet variant selection logic | ||
|
|
||
| ## Security & Compliance | ||
|
|
||
| ### Licensing | ||
| - All files must include Apache 2.0 license header | ||
| - Use `make license-headers-add` to add headers | ||
| - No GPL or viral licenses in dependencies | ||
|
|
||
| ### Security | ||
| - Report security issues via SECURITY.md process | ||
| - Never commit secrets or credentials | ||
| - Use Kubernetes secrets for sensitive data | ||
| - Scan dependencies: automated via CI | ||
|
|
||
| ## CI/CD Pipeline | ||
|
|
||
| ### GitHub Actions Workflows | ||
| - **PR Checks**: lint, test, build for all modules | ||
| - **Release**: automated image builds and Helm chart publishing | ||
| - **Vulnerability Scanning**: daily Trivy scans uploaded to Security tab | ||
| - **Provenance**: SLSA attestation for all artifacts | ||
|
|
||
| ### Container Images | ||
| - Built with `ko` (Go) and `docker` (Python) | ||
| - Multi-architecture support (amd64, arm64) | ||
| - Published to `ghcr.io/nvidia/nvsentinel/*` | ||
| - Signed with Sigstore cosign | ||
|
|
||
| ## Common Tasks | ||
|
|
||
| ### Adding a New Health Monitor | ||
| 1. Create directory in `health-monitors/` | ||
| 2. Implement gRPC service from `data-models/protobufs/` | ||
| 3. Add Makefile with standard targets | ||
| 4. Create Helm chart in `distros/kubernetes/nvsentinel/charts/` | ||
| 5. Register in platform-connectors for CSP integration | ||
| 6. Add to CI pipeline workflows | ||
|
|
||
| ### Updating Dependencies | ||
| ```bash | ||
| # Update Go dependencies | ||
| make gomod-tidy | ||
| ## What to Flag in Reviews | ||
|
|
||
| # Update Python dependencies | ||
| cd health-monitors/gpu-health-monitor | ||
| poetry update | ||
| ``` | ||
| 1. **Breaking API changes** without migration path | ||
| 2. **Missing documentation** on proto messages/fields/RPCs | ||
| 3. **Manual edits** to generated files | ||
| 4. **Missing license headers** | ||
| 5. **Unsigned commits** (missing DCO) | ||
| 6. **Non-conventional commit messages** | ||
|
|
||
| ### Adding New Node Labels | ||
| 1. Update labeler logic in `pkg/labeler/labeler.go` | ||
| 2. Add tests in `labeler_test.go` | ||
| 3. Document in Helm chart values | ||
| 4. Update KATA_TESTING.md if kata-related | ||
| ## What NOT to Flag | ||
|
|
||
| ## Debugging Tips | ||
| 1. Style issues in generated `*.pb.go` files | ||
| 2. Import ordering in generated code | ||
| 3. Line length in generated code | ||
|
|
||
| ### Local Development | ||
| - Use Tilt for rapid iteration | ||
| - Check logs: `kubectl logs -n nvsentinel <pod>` | ||
| - Port-forward for direct access: `kubectl port-forward -n nvsentinel svc/janitor 8080:8080` | ||
| - MongoDB Compass for event store inspection | ||
| ## Build Commands | ||
|
|
||
| ### Common Issues | ||
| - **Informer not syncing**: Check RBAC permissions | ||
| - **Labels not updating**: Verify labeler is running and has node permissions | ||
| - **DaemonSet not scheduling**: Check node selectors and labels | ||
| - **Kata detection failing**: Verify input labels on nodes | ||
| ```bash | ||
| make protos-generate # Regenerate Go code from .proto files | ||
| make build # Build the Go module | ||
| make lint # Run go vet | ||
| make test # Run tests | ||
| ``` | ||
|
|
||
| ## References | ||
| ## Tool Versions | ||
|
|
||
| - [DEVELOPMENT.md](../DEVELOPMENT.md) - Detailed development guide | ||
| - [CONTRIBUTING.md](../CONTRIBUTING.md) - Contribution guidelines | ||
| - [README.md](../README.md) - Project overview | ||
| - [docs/](../docs/) - Architecture documentation | ||
| Tool versions are centralized in `.versions.yaml`. When suggesting dependency | ||
| updates, reference this file. |
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add Apache 2.0 license header to this file.
The new
.gitattributesfile is missing the required Apache 2.0 license header.📝 Suggested header addition
Based on learnings: All files must include Apache 2.0 license header.
📝 Committable suggestion
🤖 Prompt for AI Agents