
Commit 203dcf9

dipannita08 authored and copybara-github committed
Update ml-goodput-measurement package for prod PyPi release v0.0.5.
PiperOrigin-RevId: 722810422
1 parent b9c3bf9 commit 203dcf9

File tree

3 files changed: +191 -19 lines changed


CHANGELOG.md

Lines changed: 12 additions & 1 deletion
@@ -21,7 +21,15 @@ To release a new version (e.g. from `1.0.0` -> `2.0.0`):
 
 -->
 
+## [0.0.5] - 2025-02-03
+
+* Goodput Cache and library improvements.
+* Query and Monitor API support for checkpoint save and restore.
+* Interval Query API support.
+* Query and Monitor API support for step time deviation.
+
 ## [0.0.4] - 2024-09-13
+
 * Add Badput breakdown to GoodputMonitor.
 * Add Checkpoint Badput Calculator backend.
 * Return last recorded step from Goodput query API.
@@ -30,6 +38,7 @@ To release a new version (e.g. from `1.0.0` -> `2.0.0`):
 * Fix zero job time issue on long running jobs
 
 ## [0.0.3] - 2024-05-28
+
 * Compute and discount Badput from first step after start or restart.
 * Compute and discount Badput due to anomalous step times (Pathways only).
 * Badput recording APIs
@@ -39,16 +48,18 @@ To release a new version (e.g. from `1.0.0` -> `2.0.0`):
 * Fix Goodput calculation with disruptions
 * Fix some Cloud Logging latency and batching issues.
 
-
 ## [0.0.2] - 2024-02-29
+
 * Bug Fixes
 * Fixes a typing mismatch in total step time calculation.
 * Code and documentation cleanup
 
 ## [0.0.1] - 2024-02-26
+
 * Initial release of ML Goodput Measurement PyPi package
 * Feature: Contains the Goodput module which allows logging and retrieval of training job's overall productive Goodput
 
+[0.0.5]: https://github.com/AI-Hypercomputer/ml-goodput-measurement/compare/v0.0.4...v0.0.5
 [0.0.4]: https://github.com/AI-Hypercomputer/ml-goodput-measurement/compare/v0.0.3...v0.0.4
 [0.0.3]: https://github.com/AI-Hypercomputer/ml-goodput-measurement/compare/v0.0.2...v0.0.3
 [0.0.2]: https://github.com/AI-Hypercomputer/ml-goodput-measurement/compare/v0.0.1...v0.0.2

README.md

Lines changed: 178 additions & 17 deletions
@@ -26,7 +26,8 @@
 workloads and utilization of compute resources.
 
 The package also exposes Goodput Monitoring APIs which allow asynchronous query
-and export of the job's Goodput to Tensorboard with configurable upload interval.
+and export of the job's Goodput, Badput and Step Time Deviation to Tensorboard
+with configurable upload interval.
 
 ## Components
 
@@ -37,6 +38,7 @@
 
 - `GoodputCalculator`
 - `GoodputMonitor`
+- `GoodputCache`
 
 
 The `GoodputRecorder`
@@ -51,13 +53,18 @@
 from the training application, either on a CPU instance or on the users'
 development machine.
 
-The `GoodputMonitor` exposes APIs to query and upload goodput data to
-Tensorboard asynchronously. It does this by instantiating a `GoodputCaluclator`
-under the hood.
+Under the hood, the `GoodputCalculator` uses a `GoodputCache`, an internal
+component that locally caches pre-computations and useful logs so that
+repeated computations are inexpensive.
+
+The `GoodputMonitor` exposes APIs to query and upload goodput and step time
+deviation data to Tensorboard asynchronously. It does this by instantiating a
+`GoodputCalculator` under the hood.
 
 ## Installation
 
-To install the ML Goodput Measurement package, run the following command on the VM:
+To install the ML Goodput Measurement package, run the following command on the
+VM or machine you want to query or monitor your workload from:
 
 ```bash
 pip install ml-goodput-measurement
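After installation, you can optionally sanity-check that the package imports cleanly (a hypothetical one-liner, not part of the README; it reuses the README's own import):

```bash
# Hypothetical check: confirm the goodput module imports after installation.
python -c "from ml_goodput_measurement import goodput; print('ml-goodput-measurement installed')"
```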
@@ -77,17 +84,32 @@ project, then do the following:
 
 3. [Enable](https://console.cloud.google.com/flows/enableapi?apiid=logging.googleapis.com&_ga=2.27841276.1571868865.1726250448-123998259.1726107009) the Cloud Logging API.
 
-To run your training on Cloud accelerator, set up the environment by following
-instructions [here](https://cloud.google.com/tpu/docs/setup-gcp-account).
+To run your training on a Cloud accelerator, set up the environment by following
+the instructions [here](https://cloud.google.com/tpu/docs/setup-gcp-account).
+
+To learn more about Google Cloud Logging, visit this [page](https://cloud.google.com/logging/docs).
+
+### Access Scopes
+
+You will need both read and write access scopes for Cloud Logging on both the
+GPU or TPU node pools and the CPU node pools. Full Cloud Logging access is
+granted by the following access scope during node pool creation:
+
+- `https://www.googleapis.com/auth/cloud-platform`
 
-To learn more about Google Cloud Logging, visit this [page](https://cloud.google.com/logging/docs).
+XPK adds this access scope to the GPU, TPU and CPU node pools, so XPK is the
+recommended method for creating clusters and node pools if you intend to run
+your workloads on GKE.
 
+Instructions on how to create clusters using XPK can be found
+[here](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#cluster-create),
+and how to create workloads using XPK can be found
+[here](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#workload-create).
+
+> **_NOTE:_** Access scopes are immutable and cannot be updated on already
+created clusters; workloads can only be migrated to new node pools that have
+the required access scopes.
 
 ### Import
 
 To use this package, import the `goodput` module:
 
-
 ```python
 from ml_goodput_measurement import goodput
 from ml_goodput_measurement import monitoring
@@ -219,6 +241,13 @@ goodput_logger_name = f'goodput_{config.run_name}' # You can choose your own log
 goodput_calculator = goodput.GoodputCalculator(job_name=config.run_name, logger_name=goodput_logger_name)
 ```
 
+If you want to enable Pathways, turn on the `using_pathways` flag:
+
+```python
+goodput_logger_name = f'goodput_{config.run_name}' # You can choose your own logger name.
+goodput_calculator = goodput.GoodputCalculator(job_name=config.run_name, logger_name=goodput_logger_name, using_pathways=True)
+```
+
 #### Retrieve Goodput
 
 Finally, call the `get_job_goodput` API to retrieve Goodput for the entire job run. This API takes an optional parameter `include_badput_breakdown`, which defaults to `False`.
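The README's full retrieval example is elided from this hunk; as a hedged sketch, assuming `get_job_goodput(include_badput_breakdown=True)` returns the Goodput value, the Badput breakdown and the last recorded step (consistent with the changelog entry "Return last recorded step from Goodput query API" and the print statements further below):

```python
# Hedged sketch, not verbatim from the README: the tuple shape is assumed
# from the changelog and the examples that follow in this diff.
job_goodput, badput_breakdown, last_step = goodput_calculator.get_job_goodput(
    include_badput_breakdown=True
)
print(f"Goodput: {job_goodput:.2f}%, last recorded step: {last_step}")
```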
@@ -244,15 +273,72 @@ Following Badput Breakdown buckets are supported by the library at this time:
 # Supported Badput Types
 class BadputType(enum.Enum):
   """The type of Badput."""
-  TPU_INITIALIZATION = 1
-  TRAINING_PREP = 2
-  PROGRAM_STARTUP = 3
-  DATA_LOADING = 4
-  UNPRODUCTIVE_CHECKPOINTING = 5
-  WASTED_PROGRESS_FROM_DISRUPTION = 6
-  OTHER = 7
+  TPU_INITIALIZATION = 1
+  TRAINING_PREP = 2
+  PROGRAM_STARTUP = 3
+  DATA_LOADING = 4
+  UNPRODUCTIVE_CHECKPOINT_SAVE_TIME = 5
+  UNPRODUCTIVE_CHECKPOINT_RESTORE_TIME = 6
+  WASTED_PROGRESS_FROM_DISRUPTION = 7
+  OTHER = 8
 ```
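For illustration, a hypothetical helper (not part of the library) that pretty-prints a breakdown, assuming the query APIs return it as a dict mapping `BadputType` to a percentage, as the print statements later in this diff suggest:

```python
# Hypothetical helper, not from the README: assumes badput_breakdown is a
# dict of {BadputType: percentage}.
def print_badput_breakdown(badput_breakdown):
    for badput_type, percent in sorted(
        badput_breakdown.items(), key=lambda kv: kv[1], reverse=True
    ):
        print(f"{badput_type.name}: {percent:.2f}%")
```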

+#### Badput Breakdown Details
+
+- Accelerator Initialization Time (TPU_INITIALIZATION)
+
+This is the time spent on device discovery, slice initialization,
+device driver re-initialization and reset, security setup, initialization of
+pre-mapped buffers and more.
+
+- Training Preparation Time (TRAINING_PREP)
+
+This is the time spent on the creation of checkpoint managers, checkpoint
+loading, running mesh and model optimizers and more.
+
+- Program Startup Time (PROGRAM_STARTUP)
+
+This is the time spent on framework-specific function transformations
+(such as JAX tracing), compilation tasks, runtime initialization, etc.
+
+- Data Loading Time (DATA_LOADING)
+
+This is the time spent on loading each batch of data so that training at a
+step can continue. This should be a small contribution to Badput if parallel
+data loading is used.
+
+- Checkpointing Time (UNPRODUCTIVE_CHECKPOINT_SAVE_TIME, UNPRODUCTIVE_CHECKPOINT_RESTORE_TIME)
+
+This is the time spent on saving a checkpoint and restoring a checkpoint.
+
+Depending on the type of checkpointing technology used by the program, there
+could be unproductive time while saving a checkpoint. When checkpointing is
+synchronous, the save operation blocks training progress until it is complete.
+
+During asynchronous checkpointing, the model parameters or weights have to be
+transferred from the device memory to the host memory, which is a blocking
+operation on the device. After the transfer, the device can proceed with model
+training while the CPU saves the checkpoint to storage in the background. The
+first blocking operation contributes to unproductive checkpoint time.
+
+If auto checkpointing is used, the checkpoint save operation is initiated upon
+detection of a planned disruption signal. The save operation in this type of
+checkpointing is synchronous, resulting in time lost to Badput.
+
+- Wasted Progress due to Disruption (WASTED_PROGRESS_FROM_DISRUPTION)
+
+Based on checkpointing frequency, a disruption may result in time lost in the
+form of wasted progress, i.e. time that was spent on productive training but
+lost after restart.
+
+When there is a disruption, Badput is expected to accumulate in
+each of the following buckets after restart:
+
+- Accelerator Initialization
+- Training Preparation
+- Program Startup
+- Wasted Progress due to Disruption
+
 If you are interested in retrieving Badput Breakdown along with Goodput:
 
 ```python
@@ -266,6 +352,31 @@ print(f"Badput due to data loading: {badput_breakdown[goodput.BadputType.DATA_LO
 print(f"Badput due to disruption and wasted progress: {badput_breakdown[goodput.BadputType.WASTED_PROGRESS_FROM_DISRUPTION]:.2f}%")
 ```
 
+#### Interval Query Goodput and Badput
+
+If you are interested in retrieving Goodput and Badput of the workload within a
+specific window of time, the `GoodputCalculator` exposes the
+`get_job_goodput_interval` API which computes metrics between the start and end
+of this window.
+
+This API also returns the last step recorded for the job, the total job time in
+this window and the number of disruptions within the interval window.
+
+> **_IMPORTANT:_** **Use this API if** you know the exact window of time within the workload's total run time that you are interested in.
+
+> **_IMPORTANT:_** **Do NOT use this API if** your workload has been manually disrupted.
+
+> **_IMPORTANT:_** **Do NOT use this API if** you have accidentally re-used a previous `run_name`.
+
+```python
+# Example usage
+start_time_str = "2024-12-16 1:05:00"
+start_time_utc = convert_pst_to_utc(start_time_str)
+end_time_str = "2024-12-17 2:00:00"
+end_time_utc = convert_pst_to_utc(end_time_str)
+current_goodput, badput_breakdown, last_step, total_time, disruptions = goodput_calculator.get_job_goodput_interval(start_time_utc, end_time_utc)
+```
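The example above calls a `convert_pst_to_utc` helper that the README does not define; a minimal sketch of such a helper, assuming the time strings are US/Pacific wall-clock times and that `get_job_goodput_interval` expects timezone-aware UTC datetimes:

```python
# Hypothetical helper, not part of the library: parses a
# 'YYYY-MM-DD H:MM:SS' US/Pacific time string into a UTC datetime.
from datetime import datetime
from zoneinfo import ZoneInfo

def convert_pst_to_utc(time_str: str) -> datetime:
    local = datetime.strptime(time_str, "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=ZoneInfo("US/Pacific")).astimezone(ZoneInfo("UTC"))
```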
+
 ### Monitor Goodput with `GoodputMonitor`
 
 In order to monitor the Goodput of a job run on Tensorboard, all you need to do
@@ -307,11 +418,61 @@ goodput_monitor = monitoring.GoodputMonitor(
 )
 ```
 
+If you want to enable Pathways, turn on the `pathway_enabled` flag:
+
+```python
+goodput_logger_name = f'goodput_{config.run_name}' # You can choose your own logger name.
+goodput_monitoring_enabled = config.monitor_goodput and jax.process_index() == 0 # Check configs for whether or not to enable monitoring.
+
+goodput_monitor = monitoring.GoodputMonitor(
+    job_name=config.run_name,
+    logger_name=goodput_logger_name,
+    tensorboard_dir=config.tensorboard_dir,
+    upload_interval=config.goodput_upload_interval_seconds,
+    monitoring_enabled=goodput_monitoring_enabled,
+    include_badput_breakdown=True,
+    pathway_enabled=True
+)
+```
+
+If you want to monitor Step Time Deviation, configure the `GoodputMonitor` as follows:
+
+```python
+goodput_logger_name = f'goodput_{config.run_name}' # You can choose your own logger name.
+goodput_monitoring_enabled = config.monitor_goodput and jax.process_index() == 0 # Check configs for whether or not to enable monitoring.
+
+goodput_monitor = monitoring.GoodputMonitor(
+    job_name=config.run_name,
+    logger_name=goodput_logger_name,
+    tensorboard_dir=config.tensorboard_dir,
+    upload_interval=config.goodput_upload_interval_seconds,
+    monitoring_enabled=goodput_monitoring_enabled,
+    include_badput_breakdown=True,
+    include_step_deviation=True,
+    configured_ideal_step_time=None # Optional; the library computes the ideal step time if it is not provided.
+)
+```
+
 #### Start asynchronous "query and upload" of Goodput
 
-Call the `start_goodput_uploader` API to spin off a thread which continuously queries and uploads Goodput.
+Call the `start_goodput_uploader` API to spin off a thread which continuously
+queries and uploads Goodput.
 
 ```python
 goodput_monitor.start_goodput_uploader()
 ```
 
+#### Start asynchronous "query and upload" of Step Time Deviation
+
+Call the `start_step_deviation_uploader` API to spin off a thread which
+continuously queries and uploads step time deviation.
+
+```python
+goodput_monitor.start_step_deviation_uploader()
+```
+
+#### Visualize on Tensorboard
+
+1. Make sure you have the `tensorboard-plugin-profile`, `tensorflow` and `tensorboard` packages installed.
+2. Follow the instructions [here](https://cloud.google.com/tpu/docs/profile-tpu-vm#start_profiling_the_model_training) to start the Tensorboard server.
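For reference, a typical local invocation to view the uploaded metrics (a hedged sketch; the log directory placeholder stands in for `config.tensorboard_dir` above and is not from the README):

```bash
# Hypothetical invocation: point Tensorboard at the directory the
# GoodputMonitor uploads to, then open http://localhost:6006.
pip install tensorboard tensorboard-plugin-profile tensorflow
tensorboard --logdir=<your-tensorboard-dir> --port=6006
```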
478+

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 
 [project]
 name = "ml_goodput_measurement"
-version = "0.0.4"
+version = "0.0.5"
 authors = [
   { name="Cloud TPU Team", email="[email protected]" },
 ]
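With this version bump, the release can be installed from PyPI once published (the pinned-version command assumes the v0.0.5 artifact is live, per the commit message):

```bash
pip install ml-goodput-measurement==0.0.5
```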

0 commit comments
