The package also exposes Goodput Monitoring APIs which allow asynchronous query
and export of the job's Goodput, Badput and Step Time Deviation to Tensorboard
with a configurable upload interval.

## Components
- `GoodputRecorder`
- `GoodputCalculator`
- `GoodputMonitor`
- `GoodputCache`

The `GoodputRecorder` exposes APIs for the training application to record key
timestamps (such as job start and end times and step start times) to Google
Cloud Logging as the job makes progress.

The `GoodputCalculator` exposes APIs to compute Goodput from the recorded data.
These computations can be run separately
from the training application, either on a CPU instance or on the users'
development machine.

Under the hood, the `GoodputCalculator` uses a `GoodputCache`, an internal
component that locally caches pre-computations and useful logs so that repeated
computations are inexpensive.

The `GoodputMonitor` exposes APIs to query and upload goodput and step time
deviation data to Tensorboard asynchronously. It does this by instantiating a
`GoodputCalculator` under the hood.
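
A rough sketch of how these components are typically wired together is shown
below; the module layout, constructor parameters and method names are
assumptions for illustration rather than the exact API surface:

```python
from ml_goodput_measurement import goodput, monitoring

# In the training job: record key timestamps (job start/end, step starts) as
# the run makes progress. Arguments here are illustrative assumptions.
recorder = goodput.GoodputRecorder(
    job_name="my-run", logger_name="goodput_my-run", logging_enabled=True
)

# On any machine with access to the job's Cloud Logging data: compute Goodput.
calculator = goodput.GoodputCalculator(
    job_name="my-run", logger_name="goodput_my-run"
)

# Asynchronously query and upload Goodput to Tensorboard on a fixed interval.
monitor = monitoring.GoodputMonitor(
    job_name="my-run",
    logger_name="goodput_my-run",
    tensorboard_dir="gs://my-bucket/tensorboard",  # illustrative location
    upload_interval=30,  # seconds, illustrative
    monitoring_enabled=True,
)
monitor.start_goodput_uploader()
```
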
## Installation
To install the ML Goodput Measurement package, run the following command on the
VM or machine you want to query or monitor your workload from:

```bash
pip install ml-goodput-measurement
```

If you don't have a Google Cloud project with billing enabled, then do the
following:

3. [Enable](https://console.cloud.google.com/flows/enableapi?apiid=logging.googleapis.com&_ga=2.27841276.1571868865.1726250448-123998259.1726107009) the Cloud Logging API.

To learn more about Google Cloud Logging, visit this [page](https://cloud.google.com/logging/docs).

### Access Scopes
You will need both read and write access scopes for cloud logging on both the
GPU or TPU and CPU node pools. Full cloud logging access is granted by the
following access scope during node pool creation:

- `https://www.googleapis.com/auth/cloud-platform`

XPK adds this access scope to the GPU, TPU and CPU node pools, so XPK is the
recommended method to create clusters and node pools if you intend to run your
workloads on GKE.

Instructions on how to create clusters using XPK can be found
[here](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md#cluster-create),
and instructions on how to create workloads using XPK can be found in the same
[XPK README](https://github.com/AI-Hypercomputer/xpk/blob/main/README.md).

Finally, call the `get_job_goodput` API to retrieve Goodput for the entire job run. This API takes an optional parameter `include_badput_breakdown`, which defaults to `False`.
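
As a minimal sketch, assuming a `GoodputCalculator` instance named
`goodput_calculator` has already been created for the job (the unpacked return
values shown here are an assumption for illustration):

```python
# Goodput for the entire job run; include_badput_breakdown defaults to False.
job_goodput, _, last_step = goodput_calculator.get_job_goodput()
print(f"Goodput of the job run at step {last_step}: {job_goodput:.2f}%")

# Request the Badput breakdown along with Goodput.
job_goodput, badput_breakdown, last_step = goodput_calculator.get_job_goodput(
    include_badput_breakdown=True
)
```
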
The following Badput Breakdown buckets are supported by the library at this time:

```python
# Supported Badput Types
class BadputType(enum.Enum):
  """The type of Badput."""

  TPU_INITIALIZATION = 1
  TRAINING_PREP = 2
  PROGRAM_STARTUP = 3
  DATA_LOADING = 4
  UNPRODUCTIVE_CHECKPOINT_SAVE_TIME = 5
  UNPRODUCTIVE_CHECKPOINT_RESTORE_TIME = 6
  WASTED_PROGRESS_FROM_DISRUPTION = 7
  OTHER = 8
```

#### Badput Breakdown Details

- Accelerator Initialization Time (TPU_INITIALIZATION)

  This is the time spent on device discovery, slice initialization, device
  driver re-initialization and reset, security setup, initialization of
  pre-mapped buffers and more.

- Training Preparation Time (TRAINING_PREP)

  This is the time spent on the creation of checkpoint managers, checkpoint
  loading, running mesh and model optimizers and more.

- Program Startup Time (PROGRAM_STARTUP)

  This is the time spent on framework-specific function transformations
  (such as JAX tracing), compilation tasks, runtime initialization etc.

- Data Loading Time (DATA_LOADING)

  This is the time spent on loading each batch of data for the training at a
  step to continue. This should be a small contribution to Badput if parallel
  data loading is used.

- Checkpointing Time (UNPRODUCTIVE_CHECKPOINT_SAVE_TIME, UNPRODUCTIVE_CHECKPOINT_RESTORE_TIME)

  This is the time spent on saving and restoring checkpoints.

  Depending on the type of checkpointing technology used by the program, there
  could be unproductive time while saving a checkpoint. When checkpointing is
  synchronous, the save operation blocks training progress until it is
  complete.

  During asynchronous checkpointing, the model parameters or weights have to be
  transferred from the device memory to the host memory, which is a blocking
  operation on the device. After the transfer, the device can proceed with
  model training while the CPU saves the checkpoint to storage in the
  background. This first blocking operation contributes to unproductive
  checkpoint time.

  If auto checkpointing is used, the checkpoint save operation is initiated
  upon detection of a planned disruption signal. The save operation in this
  type of checkpointing is synchronous, resulting in time lost to Badput.

- Wasted Progress due to Disruption (WASTED_PROGRESS_FROM_DISRUPTION)

  Based on checkpointing frequency, a disruption may result in time lost in the
  form of wasted progress, i.e. time that was spent on productive training but
  lost after restart.

  When there is a disruption, Badput is expected to accumulate in each of the
  following buckets after restart:

  - Accelerator Initialization
  - Training Preparation
  - Program Startup
  - Wasted Progress due to Disruption

If you are interested in retrieving Badput Breakdown along with Goodput:
```python
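# badput_breakdown below is the per-BadputType dictionary returned by
# get_job_goodput(include_badput_breakdown=True).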
print(f"Badput due to data loading: {badput_breakdown[goodput.BadputType.DATA_LOADING]:.2f}%")
print(f"Badput due to disruption and wasted progress: {badput_breakdown[goodput.BadputType.WASTED_PROGRESS_FROM_DISRUPTION]:.2f}%")
```
#### Interval Query Goodput and Badput

If you are interested in retrieving Goodput and Badput of the workload within a
specific window of time, the `GoodputCalculator` exposes the
`get_job_goodput_interval` API which computes metrics between the start and end
of this window.

This API also returns the last step recorded for the job, the total job time in
this window and the number of disruptions within the interval window.

> **_IMPORTANT:_** **Use this API if** you know the exact window of time within the workload's total run time that you are interested in.

> **_IMPORTANT:_** **Do NOT use this API if** your workload has been manually disrupted.

> **_IMPORTANT:_** **Do NOT use this API if** you have accidentally re-used a previous `run_name`.