Adding support for TLM profiling metrics for partitioned GPU devices #314
Open
davidmirror-ops wants to merge 1 commit into main from
- Add relabel_configs and metric_relabel_configs to the gpu-metrics Prometheus scrape job so pod/namespace/node labels are propagated; without these Union TLM cannot correlate GPU metrics to task pods
- Make dcgm-exporter namespace configurable (default: kube-system) so deployments via GPU Operator in a separate namespace work out of the box
- Add NVIDIA_MIG_MONITOR_DEVICES=all extraEnv so DCGM can access per-partition profiling counters (DCGM_FI_PROF_*) on A100/H100 MIG slices

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
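For reviewers unfamiliar with Prometheus relabeling, the first change can be pictured with a rough sketch like the one below. It is not the configuration from this PR: the service-name regex, the `node` target label, and the hard-coded `kube-system` namespace are illustrative assumptions, and it presumes dcgm-exporter already attaches `pod` and `namespace` labels for the workload using each GPU. Only standard Prometheus fields and Kubernetes service-discovery meta labels are used.

```yaml
# Hypothetical sketch of a gpu-metrics scrape job; the actual job in this PR may differ.
scrape_configs:
  - job_name: gpu-metrics
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - kube-system  # assumed default; the PR makes this namespace configurable
    relabel_configs:
      # Keep only endpoints that belong to a dcgm-exporter service (name pattern is an assumption).
      - source_labels: [__meta_kubernetes_service_name]
        action: keep
        regex: .*dcgm-exporter.*
      # Copy the node that hosts the exporter pod onto every scraped series.
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: node
    metric_relabel_configs:
      # Keep only DCGM series, which includes the DCGM_FI_PROF_* profiling counters.
      - source_labels: [__name__]
        action: keep
        regex: DCGM_FI_.*
      # If Prometheus renamed a conflicting exporter label to exported_pod / exported_namespace
      # (the behavior when honor_labels is false), copy it back so queries can join on the task pod.
      # The rules are no-ops when those labels are absent, because (.+) does not match an empty value.
      - source_labels: [exported_pod]
        regex: (.+)
        target_label: pod
      - source_labels: [exported_namespace]
        regex: (.+)
        target_label: namespace
```

The split matters: `relabel_configs` runs before the scrape and can only see target metadata such as the node name, while `metric_relabel_configs` runs on the scraped samples and is the only place where per-GPU workload labels emitted by the exporter can be rewritten.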
8823485 to a928b67
jan/wip-selfhosted-