Adding support for TLM profiling metrics for partitioned GPU devices #314

Open
davidmirror-ops wants to merge 1 commit into main from fix/dcgm-mig-partition-metrics
davidmirror-ops (Contributor) commented Mar 31, 2026

  • jan/wip-selfhosted - ⚠️ No PR associated with branch
    • update #323
      • Adding support for TLM profiling metrics for partitioned GPU devices 👈

aviator-app (Bot, Contributor) commented Mar 31, 2026

Current Aviator status

Aviator will automatically update this comment as the status of the PR changes.
Comment /aviator refresh to force Aviator to re-examine your PR (or learn about other /aviator commands).

This pull request is currently open (not queued).

How to merge

To merge this PR, comment /aviator merge or add the mergequeue label.
See the real-time status of this PR on the Aviator webapp.
Use the Aviator Chrome Extension to see the status of your PR within GitHub.

- Add relabel_configs and metric_relabel_configs to the gpu-metrics
  Prometheus scrape job so pod/namespace/node labels are propagated;
  without these Union TLM cannot correlate GPU metrics to task pods
- Make dcgm-exporter namespace configurable (default: kube-system) so
  deployments via GPU Operator in a separate namespace work out of the box
- Add NVIDIA_MIG_MONITOR_DEVICES=all extraEnv so DCGM can access
  per-partition profiling counters (DCGM_FI_PROF_*) on A100/H100 MIG slices

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
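
For reviewers, the three changes in the commit message could look roughly like the sketch below. The diff itself is not shown on this page, so this is only an illustration under stated assumptions: the job name `gpu-metrics`, the `kube-system` default, and `NVIDIA_MIG_MONITOR_DEVICES=all` come from the commit message, while the service-name regex, the choice of `role: endpoints`, the `exported_*` label handling, and the `dcgm-exporter.namespace` values key are hypothetical and may not match the chart.

```yaml
# --- Sketch 1: Prometheus scrape job (hypothetical layout, not the PR's actual diff) ---
scrape_configs:
  - job_name: gpu-metrics
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system          # default from the commit message; override for GPU Operator installs
    relabel_configs:
      # Keep only endpoints backing the dcgm-exporter service (service name is an assumption).
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: .*dcgm-exporter.*
      # Attach the node, namespace, and pod of the scraped exporter as target labels.
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: node
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod
    metric_relabel_configs:
      # With honor_labels left at its default (false), a scraped label that collides
      # with a target label is renamed to exported_<name>. Assuming the exporter emits
      # the workload's pod and namespace, copy them back so metrics map to task pods.
      - source_labels: [exported_pod]
        action: replace
        regex: (.+)
        target_label: pod
      - source_labels: [exported_namespace]
        action: replace
        regex: (.+)
        target_label: namespace
---
# --- Sketch 2: Helm values for the exporter (key names are assumptions) ---
dcgm-exporter:
  namespace: kube-system           # hypothetical key for the configurable namespace
  extraEnv:
    # Lets DCGM read per-partition DCGM_FI_PROF_* counters on MIG slices.
    - name: NVIDIA_MIG_MONITOR_DEVICES
      value: "all"
```

The `regex: (.+)` guard in `metric_relabel_configs` means the target-level `pod` and `namespace` are only overwritten when the exporter actually supplies a workload label, so idle GPUs keep the exporter's own identity.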