Adding support for TLM profiling metrics for partitioned GPU devices by davidmirror-ops · Pull Request #314 · unionai/helm-charts

davidmirror-ops · 2026-03-31T12:53:36Z

jan/wip-selfhosted - ⚠️ No PR associated with branch
- update #323
  - Adding support for TLM profiling metrics for partitioned GPU devices 👈

aviator-app · 2026-03-31T12:53:39Z

Current Aviator status

Aviator will automatically update this comment as the status of the PR changes.
Comment /aviator refresh to force Aviator to re-examine your PR (or learn about other /aviator commands).

This pull request is currently open (not queued).

How to merge

To merge this PR, comment /aviator merge or add the mergequeue label.

See the real-time status of this PR on the Aviator webapp.

Use the Aviator Chrome Extension to see the status of your PR within GitHub.

- Add relabel_configs and metric_relabel_configs to the gpu-metrics Prometheus scrape job so pod/namespace/node labels are propagated; without these, Union TLM cannot correlate GPU metrics to task pods
- Make dcgm-exporter namespace configurable (default: kube-system) so deployments via GPU Operator in a separate namespace work out of the box
- Add NVIDIA_MIG_MONITOR_DEVICES=all extraEnv so DCGM can access per-partition profiling counters (DCGM_FI_PROF_*) on A100/H100 MIG slices

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
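For readers unfamiliar with the pieces involved, the three changes above can be sketched roughly as follows. This is an illustrative fragment and not the chart's actual template: only the job name `gpu-metrics`, the default namespace `kube-system`, and the `NVIDIA_MIG_MONITOR_DEVICES=all` variable come from the description. The service-name regex, the scrape interval, and the specific relabel rules are assumptions, chosen to show one common way of surfacing the workload pod and namespace that dcgm-exporter reports.

```yaml
# Sketch 1 (assumed layout): a Prometheus scrape job for dcgm-exporter.
scrape_configs:
  - job_name: gpu-metrics
    scrape_interval: 30s              # assumption, not stated in the PR
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system             # PR default; override when GPU Operator
                                      # installs dcgm-exporter elsewhere
    relabel_configs:
      # Keep only endpoints that back the exporter service (name pattern assumed).
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: .*dcgm-exporter.*
      # Attach the scrape target's namespace, pod, and node as target labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
    metric_relabel_configs:
      # With honor_labels left at its default (false), a scraped label that
      # clashes with a target label is renamed to exported_<name>. Copy the
      # workload identity back so queries can join on pod and namespace.
      - source_labels: [exported_pod]
        regex: (.+)
        target_label: pod
      - source_labels: [exported_namespace]
        regex: (.+)
        target_label: namespace
      # Drop the now-redundant exported_* copies.
      - action: labeldrop
        regex: exported_(pod|namespace)
---
# Sketch 2 (assumed placement): container environment for dcgm-exporter.
# The PR exposes this through an extraEnv value; the exact values key is not shown here.
env:
  - name: NVIDIA_MIG_MONITOR_DEVICES
    value: "all"
```

In this sketch, a series scraped while no workload holds the GPU keeps the exporter pod's own `pod` and `namespace` labels, because the `(.+)` regex only matches a non-empty `exported_*` value.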

davidmirror-ops force-pushed the fix/dcgm-mig-partition-metrics branch from 8823485 to a928b67 on March 31, 2026 13:53

davidmirror-ops mentioned this pull request Mar 31, 2026

Add Kubernetes event exporter for Pod event visibility #315 (Open, 3 tasks)

github-actions Bot mentioned this pull request Apr 3, 2026

update #323 (Merged)

Adding support for TLM profiling metrics for partitioned GPU devices #314
davidmirror-ops wants to merge 1 commit into main from fix/dcgm-mig-partition-metrics
