Integrate SDK for managed profiler #2544

xibinliu · 2025-10-24T14:48:16Z

Description

Integrate SDK for managed profiler

- include new SDK google-cloud-mldiagnostics
- seed-env: --seed-commit=459cb056418de7a56c9da0a2842406a58b75e4a3
- add new config params
- modify profiler.py to add ML run and profiling
- modify metrics_logger.py to upload metrics

IMPORTANT

Because GCP UI support has not been formally rolled out yet, this feature currently works only in supercomputer-testing / us-central1. Enabling it in other projects or regions will fail.

Tests

Command:

Enable the feature with managed_profiler=True managed_profiler_run_group="<group_name>"

python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run22" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 profiler=xplane managed_profiler=True managed_profiler_run_group="xibin-demo" log_period=5

Enable the feature with managed_profiler=True; the run group defaults to run_name

python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run23" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 profiler=xplane managed_profiler=True  log_period=5

Enable the feature with managed_profiler=True upload_all_profiler_results=True on all TPU devices; run_name becomes <run_name>-0 (JAX device 0), <run_name>-1 (JAX device 1), etc.

python3 -m MaxText.train src/MaxText/configs/base.yml run_name="xibin-run24" model_name="gpt3-52k" base_output_directory=gs://xibin-images/  dataset_type=synthetic steps=22 profiler=xplane managed_profiler=True upload_all_profiler_results=True  log_period=5

See all uploaded runs in the managed profiler GCP UI
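
The naming behavior described by the three commands above can be summarized in a short sketch. The helper name `resolve_managed_profiler_names` and its signature are hypothetical and are not taken from this PR; the sketch also assumes that the run group falls back to the unsuffixed run_name even when upload_all_profiler_results is set, which the description does not state.

```python
def resolve_managed_profiler_names(run_name, run_group="", upload_all=False, process_index=0):
    """Hypothetical helper: return (run_group, run_name) for one process.

    - run_group falls back to run_name when it is not given.
    - with upload_all=True, each process gets a "<run_name>-<index>" suffix.
    """
    # Assumption: the group default uses the original, unsuffixed run name.
    group = run_group or run_name
    name = f"{run_name}-{process_index}" if upload_all else run_name
    return group, name


# Mirrors the three test commands in this section.
print(resolve_managed_profiler_names("xibin-run22", run_group="xibin-demo"))
print(resolve_managed_profiler_names("xibin-run23"))
print(resolve_managed_profiler_names("xibin-run24", upload_all=True, process_index=1))
```

In the actual change the index would presumably come from jax.process_index(), as in the profiler.py diff in this conversation; it is passed in here so the sketch runs without JAX.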

Checklist

Before submitting this PR, please make sure (put X in square brackets):

I have performed a self-review of my code. For an optional AI review, add the gemini-review label.
I have necessary comments in my code, particularly in hard-to-understand areas.
I have run end-to-end tests and provided workload links above if applicable.
I have made or will make corresponding changes to the doc if needed, including adding new documentation pages to the relevant Table of Contents (toctree directive) as explained in our documentation.

bvandermoon

LGTM just a few comments. Thanks @xibinliu, great to see this

src/MaxText/profiler.py

bvandermoon · 2025-10-25T03:54:35Z

src/MaxText/pyconfig.py

+# Don't log the following keys.
+KEYS_NO_LOGGING = ["hf_access_token"]
+


@SamuelMarks FYI we will need this to be compatible with Pydantic in #1836. It's just a constant so should be straightforward

I'm not sure what action this comment calls for, but I changed it to a tuple to make it immutable.

Thanks @xibinliu. No action item on this one, just calling out that this will need to be updated in the other PR as well
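
For reference, a minimal sketch of the tuple form mentioned in the reply above, together with one way such a constant might be used. The helper `loggable_items` is hypothetical; how pyconfig.py actually consumes KEYS_NO_LOGGING is not shown in this conversation.

```python
# Keys whose values should never appear in logs.
# A tuple (note the trailing comma) cannot be mutated by accident.
KEYS_NO_LOGGING = ("hf_access_token",)


def loggable_items(raw_config):
    """Hypothetical helper: drop sensitive keys before logging a config dict."""
    return {k: v for k, v in raw_config.items() if k not in KEYS_NO_LOGGING}


config = {"run_name": "xibin-run22", "hf_access_token": "secret"}
for key, value in loggable_items(config).items():
    print(f"Config param {key}: {value}")
```

Because the value is a plain module-level constant, it should carry over to a Pydantic-based config without depending on the config class itself.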

src/MaxText/profiler.py

SurbhiJainUSC · 2025-10-26T18:21:17Z

src/MaxText/profiler.py

      raise ValueError("Profiling requested but initial profiling step set past training final step")

+    # Set up the managed profiler on the first device, or all devices, depending on the config.
+    proc_id = jax.process_index()


Instead of lines 50-52, can't we just say:

if config.managed_profiler: self.prof = None .....

Hi Surbhi, the logic is needed because:

If upload_all_profiler_results is set, we need to do this on all TPU devices.

If upload_all_profiler_results is not set, we do it only on the first device.

The flag self.managed_profiler is set based on the above conditions.
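
The two conditions in this reply can be written as a single predicate. This is a sketch under stated assumptions, not the code from profiler.py: the function name `managed_profiler_enabled` is hypothetical, and the process index is passed in as an argument (in the diff it comes from jax.process_index()).

```python
def managed_profiler_enabled(managed_profiler, upload_all_profiler_results, process_index):
    """Hypothetical predicate: should this process set up the managed profiler?"""
    if not managed_profiler:
        return False
    # All processes upload when requested; otherwise only the first one does.
    return bool(upload_all_profiler_results) or process_index == 0


# The feature is on: only process 0 uploads unless upload_all is set.
for idx in range(3):
    print(idx, managed_profiler_enabled(True, False, idx), managed_profiler_enabled(True, True, idx))
```

This is why a plain `if config.managed_profiler:` check is not enough: the config flag says the feature is requested, while the per-process flag says whether this particular process takes part.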

bvandermoon

How is the correct GCP project info picked up?

xibinliu · 2025-10-27T18:55:22Z

How is the correct GCP project info picked up?

The ML Run (managed profiler UI) is always created under the project and region where the workload is running.

bvandermoon · 2025-10-28T20:29:21Z

How is the correct GCP project info picked up?

The ML Run (managed profiler UI) is always created under the project and region where the workload is running.

Thanks @xibinliu. In that case, the info is coming from XPK, right?

xibinliu requested review from A9isha, NuojCheng, RissyRan, SurbhiJainUSC, aireenmei, bvandermoon, gagika, gobbleturk, hengtaoguo, jiangjy1982, khatwanimohit, parambole, richjames0, shralex, suexu1025 and vipannalla as code owners October 24, 2025 14:48

xibinliu force-pushed the xibin/diagon_sdk branch 5 times, most recently from 06830de to 48f6fca Compare October 24, 2025 18:54

bvandermoon reviewed Oct 25, 2025

xibinliu force-pushed the xibin/diagon_sdk branch 3 times, most recently from 26d3b89 to 5176ace Compare October 25, 2025 19:29

xibinliu requested a review from shuningjin as a code owner October 25, 2025 19:29

xibinliu force-pushed the xibin/diagon_sdk branch 2 times, most recently from 68150f6 to 35b9724 Compare October 25, 2025 19:35

SurbhiJainUSC reviewed Oct 26, 2025

bvandermoon reviewed Oct 27, 2025

Integrate SDK for managed profiler

117e4c5

- include new SDK google-cloud-mldiagnostics - add new config params - modify profiler.py to add ML run and profiling - modify metrics_logger.py to upload metrics

xibinliu force-pushed the xibin/diagon_sdk branch from 35b9724 to 117e4c5 Compare October 27, 2025 18:48

xibinliu changed the title ~~Integrate Diagon SDK for managed profiler~~ Integrate SDK for managed profiler Oct 27, 2025
