Skip to content
Draft
Show file tree
Hide file tree
Changes from 48 commits
Commits
Show all changes
58 commits
Select commit Hold shift + click to select a range
006c572
chore - Add small set of GitHub checks and copilot rules (#691)
ArangoGutierrez Jan 14, 2026
35c2bbd
feat: introduce k8s-idiomatic Go SDK for Device API (#692)
pteranodan Jan 14, 2026
bf8db76
chore: update documentation (#693)
pteranodan Jan 15, 2026
85db526
api: add ProviderService proto for device-api-server
ArangoGutierrez Jan 21, 2026
9d40418
feat: add device-api-server with NVML fallback provider
ArangoGutierrez Jan 21, 2026
f28013e
fix(security): bind gRPC TCP listener to localhost by default
ArangoGutierrez Jan 22, 2026
ef3c5a8
docs(provider): clarify Heartbeat RPC is reserved for future use
ArangoGutierrez Jan 22, 2026
6fe8f63
feat(consumer): populate ListMeta.ResourceVersion in ListGpus response
ArangoGutierrez Jan 22, 2026
76fb570
docs: document module structure and naming conventions
ArangoGutierrez Jan 22, 2026
69c9efd
fix: align module paths to github.com/nvidia/nvsentinel
ArangoGutierrez Jan 22, 2026
c9360cd
refactor: consolidate to unified GpuService with standard CRUD methods
ArangoGutierrez Jan 28, 2026
3820941
fix(build): use Go 1.25 for container builds
ArangoGutierrez Jan 29, 2026
a9f9432
fix(nvml-provider): parse command-line flags before returning config
ArangoGutierrez Jan 29, 2026
da15558
fix(helm): remove invalid --provider-address flag from server args
ArangoGutierrez Jan 29, 2026
7f07d0b
fix(helm): correct sidecar test values for cluster deployment
ArangoGutierrez Jan 29, 2026
15efc6e
feat(demo): improve cross-platform builds and idempotency
ArangoGutierrez Jan 29, 2026
996aed0
fix(demo): remove unreliable metrics check from verify_gpu_registration
ArangoGutierrez Jan 29, 2026
af47a00
fix(demo): use correct container name 'nvsentinel' instead of 'device…
ArangoGutierrez Jan 29, 2026
7b65691
fix(ci): update protoc version to v33.4 to match generated files
ArangoGutierrez Jan 29, 2026
196c165
docs: add hybrid device-apiserver design for PR #718 + #720 merge
ArangoGutierrez Feb 4, 2026
4f236c4
chore: add .worktrees/ and temp docs to .gitignore
ArangoGutierrez Feb 4, 2026
4457249
feat: implement cloud-native GPU health event management
ArangoGutierrez Feb 4, 2026
1a8dd5d
docs: add AGENTS.md for AI assistant context persistence
ArangoGutierrez Feb 4, 2026
7e73ba3
docs(AGENTS.md): add E2E test migration plan
ArangoGutierrez Feb 4, 2026
5d15de0
docs(AGENTS.md): complete Phase 1 E2E test audit
ArangoGutierrez Feb 4, 2026
6288d60
feat(tests): add HealthEvent CRD test helpers
ArangoGutierrez Feb 4, 2026
bd3fc9b
docs: update AGENTS.md with Phase 2 completion
ArangoGutierrez Feb 4, 2026
a7f0996
test(e2e): migrate smoke_test.go to CRD-based flow
ArangoGutierrez Feb 4, 2026
ac9f421
docs: update AGENTS.md with Phase 3 progress
ArangoGutierrez Feb 4, 2026
30dad6e
test(e2e): migrate fault_quarantine_test.go to CRD-based flow
ArangoGutierrez Feb 4, 2026
2385015
test(e2e): migrate node_drainer_test.go to CRD-based flow
ArangoGutierrez Feb 4, 2026
3c9ad1a
test(e2e): migrate fault_remediation_test.go to CRD-based flow
ArangoGutierrez Feb 4, 2026
b46933b
docs: complete Phase 3 E2E test migration
ArangoGutierrez Feb 4, 2026
a7f3cc7
fix(tests): register HealthEvent types and fix context keys
ArangoGutierrez Feb 4, 2026
abcad80
docs: document Phase 4 validation results
ArangoGutierrez Feb 4, 2026
0efe0dc
fix: restore complete .versions.yaml from main
ArangoGutierrez Feb 4, 2026
f10a510
fix: restore complete Makefile from main
ArangoGutierrez Feb 4, 2026
c825c99
ci: remove legacy MongoDB/PostgreSQL E2E tests
ArangoGutierrez Feb 4, 2026
43070a1
fix(deviceapiserver): migrate to metadata.resource_version after prot…
ArangoGutierrez Feb 5, 2026
b22c586
fix(ci): add missing strategy.matrix to go job
ArangoGutierrez Feb 5, 2026
c63e1f3
fix(health-provider): remove duplicate flag.Parse() calls
ArangoGutierrez Feb 5, 2026
a5f10ac
fix(controller-test): use metricsAddr flag in manager options
ArangoGutierrez Feb 5, 2026
ad0e661
fix(api): remove duplicate resource_version and clarify CreateGpu docs
ArangoGutierrez Feb 5, 2026
4e0c60d
chore(deps): update go.mod for codegen dependencies
ArangoGutierrez Feb 5, 2026
fc9cfa9
fix(codegen): use Function() for fmt.Errorf and fix connection leak
ArangoGutierrez Feb 5, 2026
dbd2156
fix(docs): correct import path, remove duplicates, add license
ArangoGutierrez Feb 5, 2026
ec1d459
fix(client-go): add nil guards and fix resource leak
ArangoGutierrez Feb 5, 2026
f56b1fc
style: fix lint issues in api and client-go
ArangoGutierrez Feb 5, 2026
ff326b7
fix: address PR review findings across cache, controllers, and codegen
ArangoGutierrez Feb 5, 2026
260812f
fix: drain controller accepts Draining phase on re-entry
ArangoGutierrez Feb 6, 2026
33209ed
fix: remediation controller requeues while reboot job is running
ArangoGutierrez Feb 6, 2026
bde3e4d
fix: executeGPUReset uses Patch instead of Update
ArangoGutierrez Feb 6, 2026
a952254
fix: quarantine controller cancels event when node not found
ArangoGutierrez Feb 6, 2026
6d78945
fix: bind gRPC to localhost in static manifest
ArangoGutierrez Feb 6, 2026
213dbc6
fix: correct Dockerfile EXPOSE ports to match actual defaults
ArangoGutierrez Feb 6, 2026
e2ad468
chore: remove unused Prometheus metrics from healthevents controllers
ArangoGutierrez Feb 6, 2026
36782b1
fix: reject malformed resource_version with InvalidArgument
ArangoGutierrez Feb 6, 2026
d677a61
perf: broadcast events outside write lock in MarkProviderGPUsUnknown
ArangoGutierrez Feb 6, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
35 changes: 35 additions & 0 deletions .gitattributes
Original file line number Diff line number Diff line change
@@ -0,0 +1,35 @@
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# ==============================================================================
# GIT ATTRIBUTES
# ==============================================================================
# Use 'linguist-generated=true' to hide generated code from GitHub PR diffs.
# ==============================================================================

# Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc)
zz_generated.*.go linguist-generated=true

# Hide generated Protobuf Go bindings
*.pb.go linguist-generated=true

# Hide generated client library
client-go/client/** linguist-generated=true
client-go/listers/** linguist-generated=true
client-go/informers/** linguist-generated=true

# Hide copied, unmodified upstream code
code-generator/cmd/client-gen/types/** linguist-generated=true
code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true
code-generator/cmd/client-gen/generators/util/** linguist-generated=true
Comment on lines +15 to +35

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Add Apache 2.0 license header to this file.
The new .gitattributes file is missing the required Apache 2.0 license header.

📝 Suggested header addition
+# Copyright (c) 2026, NVIDIA CORPORATION.  All rights reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
 # ==============================================================================
 # GIT ATTRIBUTES
 # ==============================================================================

Based on learnings: All files must include Apache 2.0 license header.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# ==============================================================================
# GIT ATTRIBUTES
# ==============================================================================
# Use 'linguist-generated=true' to hide generated code from GitHub PR diffs.
# ==============================================================================
# Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc)
zz_generated.*.go linguist-generated=true
# Hide generated Protobuf Go bindings
*.pb.go linguist-generated=true
# Hide generated client library
client-go/client/** linguist-generated=true
client-go/listers/** linguist-generated=true
client-go/informers/** linguist-generated=true
# Hide copied, unmodified upstream code
code-generator/cmd/client-gen/types/** linguist-generated=true
code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true
code-generator/cmd/client-gen/generators/util/** linguist-generated=true
# Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
# GIT ATTRIBUTES
# ==============================================================================
# Use 'linguist-generated=true' to hide generated code from GitHub PR diffs.
# ==============================================================================
# Hide Kubernetes generated helpers (DeepCopy, Defaults, Conversions, OpenAPI, etc)
zz_generated.*.go linguist-generated=true
# Hide generated Protobuf Go bindings
*.pb.go linguist-generated=true
# Hide generated client library
client-go/client/** linguist-generated=true
client-go/listers/** linguist-generated=true
client-go/informers/** linguist-generated=true
# Hide copied, unmodified upstream code
code-generator/cmd/client-gen/types/** linguist-generated=true
code-generator/cmd/client-gen/generators/scheme/** linguist-generated=true
code-generator/cmd/client-gen/generators/util/** linguist-generated=true
🤖 Prompt for AI Agents
In @.gitattributes around lines 1 - 21, Add the Apache 2.0 license header to the
top of .gitattributes: insert the standard multi-line Apache-2.0 header text
(including copyright line, SPDX identifier or full license notice and year/owner
as per project convention) as the first lines of the file so the file contains
the required license header before any content; ensure formatting matches other
repository files' headers.

317 changes: 86 additions & 231 deletions .github/copilot-instructions.md
Original file line number Diff line number Diff line change
@@ -1,260 +1,115 @@
# GitHub Copilot Instructions for NVSentinel
# Copilot Instructions for NVSentinel

## Project Overview

NVSentinel is a GPU Node Resilience System for Kubernetes that automatically detects, classifies, and remediates hardware and software faults in GPU nodes. It's designed for high-performance computing environments running NVIDIA GPUs.
This file provides repository-level instructions for GitHub Copilot to improve
code reviews and suggestions.

**Status**: Experimental/Preview Release - APIs and configurations may change.
## Project Overview

## Architecture & Technologies
NVSentinel is the NVIDIA Device API — a Protocol Buffer and gRPC-based API for
GPU device management. The codebase consists primarily of `.proto` definitions
and their generated Go code.

### Core Technologies
- **Language**: Go 1.25+ (primary), Python 3.10+ (monitoring tools)
- **Container Platform**: Kubernetes 1.25+
- **Deployment**: Helm 3.0+, Tilt (development)
- **Storage**: MongoDB (event store with change streams)
- **Communication**: gRPC with Protocol Buffers
- **GPU Monitoring**: NVIDIA DCGM (Data Center GPU Manager)
## Repository Structure

### Project Structure
```
├── health-monitors/ # Pluggable health detection modules
│ ├── gpu-health-monitor/ # DCGM-based GPU monitoring (Python)
│ ├── csp-health-monitor/ # Cloud provider health checks (Go)
│ └── syslog-health-monitor/ # System log analysis (Go)
├── health-events-analyzer/ # Event classification and routing
├── fault-quarantine/ # Node isolation (cordon)
├── node-drainer/ # Workload eviction
├── fault-remediation/ # Break-fix automation
├── labeler/ # Node labeling (DCGM version, driver status, Kata detection)
├── janitor/ # State cleanup and maintenance
├── platform-connectors/ # CSP integration (GCP, AWS, Azure)
├── commons/ # Shared utilities
├── data-models/ # Protocol Buffer definitions
├── store-client/ # MongoDB client library
└── distros/kubernetes/ # Helm charts
api/
├── proto/ # Protocol Buffer source definitions (edit these)
│ └── device/v1alpha1/
│ └── gpu.proto
├── gen/go/ # Generated Go code (DO NOT edit manually)
│ └── device/v1alpha1/
│ ├── gpu.pb.go
│ └── gpu_grpc.pb.go
├── go.mod
└── Makefile
```

## Coding Standards

### Go Code Guidelines

#### Module Organization
- Each service is a separate Go module with its own `go.mod`
- Use semantic import versioning
- Keep dependencies minimal and up-to-date
- Use `commons/` for shared utilities across modules

#### Code Style
- Follow standard Go conventions (gofmt, golint)
- Use structured logging via `log/slog`
- Error handling: wrap errors with context using `fmt.Errorf("context: %w", err)`
- Within `retry.RetryOnConflict` blocks, return errors **without wrapping** to preserve retry behavior
- Use meaningful variable names (`synced` over `ok` for cache sync checks)

#### Kubernetes Integration
- Use `client-go` for Kubernetes API interactions
- Prefer informers over direct API calls for watching resources
- Use `envtest` for testing Kubernetes controllers (not fake clients)
- Implement proper shutdown handling with context cancellation

#### Testing Requirements
- Use `testify/assert` and `testify/require` for assertions
- Write table-driven tests when testing multiple scenarios
- Use `envtest` for integration tests with real Kubernetes API
- Test coverage: aim for >80% on critical paths
- Name tests descriptively: `TestFunctionName_Scenario_ExpectedBehavior`

### Python Code Guidelines
- Use Poetry for dependency management
- Follow PEP 8 style guide
- Use Black for formatting
- Type hints required for all functions
- Use dataclasses for structured data

### Protobuf Guidelines
- Define messages in `data-models/protobufs/`
- Use semantic versioning for breaking changes
- Include comprehensive comments for all fields
- Generate code with: `make protos-generate`

## Development Workflows

### Building & Testing
```bash
# Lint and test all modules
make lint-test-all
## Code Review Guidelines

# Lint specific module
cd labeler && make lint
### Protocol Buffers

# Test specific module
cd health-events-analyzer && make test
When reviewing `.proto` files:

# Build container images (uses ko)
make images
- **Field naming**: Use `snake_case` for field names
- **Message/Service naming**: Use `CamelCase`
- **Documentation**: Every message, field, and RPC must have documentation
comments explaining purpose and usage
- **Field numbers**: Never reuse or change field numbers in existing messages
- **Backwards compatibility**: New fields should be optional; avoid breaking
changes to existing APIs
- **Style**: Follow [Google's Protocol Buffer Style Guide](https://protobuf.dev/programming-guides/style/)

# View all make targets
make help
```
### Generated Code

### Version Management
- All tool versions centralized in `.versions.yaml`
- Use `yq` to read versions in scripts
- Update versions in one place, propagates everywhere
- Check versions: `make show-versions`
- Files in `api/gen/` are **auto-generated** — do not suggest edits to these
- If generated code appears outdated, suggest running `make protos-generate`

### Go Code

- Use `gofmt` formatting (enforced by golangci-lint)
- Documentation comments should wrap at 80 characters
- Follow standard Go idioms and error handling patterns
- No manual edits to `*.pb.go` or `*_grpc.pb.go` files

## Commit Standards

### Conventional Commits

All commits must follow conventional commit format:

### Local Development with Tilt
```bash
cd tilt
tilt up # Start local development environment
```
feat: add new GPU condition type
fix: correct timestamp handling in conditions
docs: update API documentation
chore: update protoc-gen-go version
```

### DCO Sign-off

All commits **must** include a DCO sign-off line:

## Kata Containers Detection

The labeler implements Kata Containers detection:

### Detection Architecture
- **Input labels** (on nodes): `katacontainers.io/kata-runtime` (default) + optional custom label
- **Output label** (set by labeler): `nvsentinel.dgxc.nvidia.com/kata.enabled: "true"|"false"`
- **Truthy values**: `"true"`, `"enabled"`, `"1"`, `"yes"` (case-insensitive)
- **Lifecycle separation**: Pod events → DCGM/driver labels, Node events → kata labels

### DaemonSet Variants
- Separate DaemonSets for kata vs regular nodes
- Selection via `nodeAffinity` based on kata.enabled label
- Different volume mounts:
- Regular: `/var/log` (file-based logs)
- Kata: `/run/log/journal` and `/var/log/journal` (systemd journal)

## Important Patterns

### Error Handling in Retry Loops
```go
// ✅ CORRECT - Return error as-is for retry logic
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
_, err := client.Update(ctx, obj, metav1.UpdateOptions{})
return err // Don't wrap!
})

// ❌ WRONG - Wrapping breaks retry detection
err := retry.RetryOnConflict(retry.DefaultBackoff, func() error {
_, err := client.Update(ctx, obj, metav1.UpdateOptions{})
return fmt.Errorf("failed: %w", err) // Breaks retry!
})
```
Signed-off-by: Name <email@example.com>
```

Use `git commit -s` to add this automatically.

## License Headers

All source files must include the Apache 2.0 license header:

### Informer Usage
```go
// Use separate informers for different resource types
podInformer := factory.Core().V1().Pods().Informer()
nodeInformer := factory.Core().V1().Nodes().Informer()

// Extract event handler setup into helper methods
func (l *Labeler) registerPodEventHandlers() error {
_, err := l.podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
// handlers...
})
return err
}
```
Copyright (c) 2026, NVIDIA CORPORATION. All rights reserved.

### Testing with envtest
```go
// Use real Kubernetes API server for tests
testEnv := &envtest.Environment{
CRDDirectoryPaths: []string{},
}
cfg, err := testEnv.Start()
// Create clients from cfg
// Run tests
testEnv.Stop()
Licensed under the Apache License, Version 2.0 (the "License");
...
```

## Documentation Standards

### Code Comments
- Package-level godoc for all packages
- Function comments for exported functions
- Inline comments for complex logic only
- TODO comments should reference issues

### Helm Chart Documentation
- Document all values in `values.yaml` with inline comments
- Include examples for non-obvious configurations
- Note truthy value requirements where applicable
- Explain DaemonSet variant selection logic

## Security & Compliance

### Licensing
- All files must include Apache 2.0 license header
- Use `make license-headers-add` to add headers
- No GPL or viral licenses in dependencies

### Security
- Report security issues via SECURITY.md process
- Never commit secrets or credentials
- Use Kubernetes secrets for sensitive data
- Scan dependencies: automated via CI

## CI/CD Pipeline

### GitHub Actions Workflows
- **PR Checks**: lint, test, build for all modules
- **Release**: automated image builds and Helm chart publishing
- **Vulnerability Scanning**: daily Trivy scans uploaded to Security tab
- **Provenance**: SLSA attestation for all artifacts

### Container Images
- Built with `ko` (Go) and `docker` (Python)
- Multi-architecture support (amd64, arm64)
- Published to `ghcr.io/nvidia/nvsentinel/*`
- Signed with Sigstore cosign

## Common Tasks

### Adding a New Health Monitor
1. Create directory in `health-monitors/`
2. Implement gRPC service from `data-models/protobufs/`
3. Add Makefile with standard targets
4. Create Helm chart in `distros/kubernetes/nvsentinel/charts/`
5. Register in platform-connectors for CSP integration
6. Add to CI pipeline workflows

### Updating Dependencies
```bash
# Update Go dependencies
make gomod-tidy
## What to Flag in Reviews

# Update Python dependencies
cd health-monitors/gpu-health-monitor
poetry update
```
1. **Breaking API changes** without migration path
2. **Missing documentation** on proto messages/fields/RPCs
3. **Manual edits** to generated files
4. **Missing license headers**
5. **Unsigned commits** (missing DCO)
6. **Non-conventional commit messages**

### Adding New Node Labels
1. Update labeler logic in `pkg/labeler/labeler.go`
2. Add tests in `labeler_test.go`
3. Document in Helm chart values
4. Update KATA_TESTING.md if kata-related
## What NOT to Flag

## Debugging Tips
1. Style issues in generated `*.pb.go` files
2. Import ordering in generated code
3. Line length in generated code

### Local Development
- Use Tilt for rapid iteration
- Check logs: `kubectl logs -n nvsentinel <pod>`
- Port-forward for direct access: `kubectl port-forward -n nvsentinel svc/janitor 8080:8080`
- MongoDB Compass for event store inspection
## Build Commands

### Common Issues
- **Informer not syncing**: Check RBAC permissions
- **Labels not updating**: Verify labeler is running and has node permissions
- **DaemonSet not scheduling**: Check node selectors and labels
- **Kata detection failing**: Verify input labels on nodes
```bash
make protos-generate # Regenerate Go code from .proto files
make build # Build the Go module
make lint # Run go vet
make test # Run tests
```

## References
## Tool Versions

- [DEVELOPMENT.md](../DEVELOPMENT.md) - Detailed development guide
- [CONTRIBUTING.md](../CONTRIBUTING.md) - Contribution guidelines
- [README.md](../README.md) - Project overview
- [docs/](../docs/) - Architecture documentation
Tool versions are centralized in `.versions.yaml`. When suggesting dependency
updates, reference this file.
Loading
Loading