Commit 9288b6b

slight mod.

1 parent: 36dc576

1 file changed: docs/distrib_optimizer.md (+5, -3)
```diff
@@ -1,3 +1,5 @@
+# Distributed Optimizer
+
 The motivation for the distributed optimizer is to save memory by distributing the optimizer state evenly across data parallel ranks, versus the current method of replicating the optimizer state across data parallel ranks. As described in https://arxiv.org/abs/1910.02054, this branch specifically implements the following:
 
 - [yes] distribute all 'non-overlapping' optimizer state (i.e., model params already in fp32 are NOT distributed)
@@ -24,15 +26,15 @@ The grad buffer is used for performing reduce-scatter and all-gather operations,
 
 The figures below illustrate the grad buffer's sharding scheme, and the key steps of the distributed optimizer's param update:
 
-# Data flow
+## Data flow
 
 ![Data flow](images/distrib_optimizer/data_flow.png)
 
-# Sharding scheme
+## Sharding scheme
 
 ![Sharding scheme](images/distrib_optimizer/sharding_scheme.png)
 
-# Key steps
+## Key steps
 
 _(note: using illustrations above, and assuming fp16 grads)_
 
```
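The second hunk's header quotes the doc's statement that the grad buffer is used for reduce-scatter and all-gather operations. As a rough illustration of that data flow (ZeRO-style, per the paper linked in the diff), the sketch below shows one distributed-optimizer step on a single data-parallel rank: reduce-scatter the grads, run the optimizer on the locally owned fp32 main-param shard, then all-gather the updated params. The names (`distributed_optimizer_step`, `param_buffer`, `grad_buffer`, `main_param_shard`) are illustrative assumptions, not Megatron-LM's actual API, and the buffers are assumed to be padded so they split evenly across data-parallel ranks.

```python
# Hypothetical sketch, not Megatron-LM's real code. Assumes:
#   - `optimizer` was built over [main_param_shard] (fp32, size = buffer size / world size)
#   - `grad_buffer` / `param_buffer` are flat fp16 tensors padded to divide evenly
import torch
import torch.distributed as dist


def distributed_optimizer_step(param_buffer, grad_buffer, main_param_shard, optimizer):
    """One parameter update on a single data-parallel rank (ZeRO stage-1 style)."""
    world = dist.get_world_size()
    rank = dist.get_rank()
    shard_size = grad_buffer.numel() // world

    # 1) Reduce-scatter the grad buffer: afterwards this rank holds the summed
    #    grads only for the shard of the buffer it owns.
    grad_shards = list(grad_buffer.view(world, shard_size).unbind(0))
    reduced_grad_shard = torch.empty_like(grad_shards[rank])
    dist.reduce_scatter(reduced_grad_shard, grad_shards, op=dist.ReduceOp.SUM)
    reduced_grad_shard /= world  # average grads across data-parallel ranks

    # 2) Local optimizer step on the fp32 main params owned by this rank.
    main_param_shard.grad = reduced_grad_shard.float()
    optimizer.step()
    optimizer.zero_grad()

    # 3) All-gather the updated params back into the full fp16 param buffer,
    #    so every rank sees the complete updated model params.
    param_shards = list(param_buffer.view(world, shard_size).unbind(0))
    dist.all_gather(param_shards, main_param_shard.detach().to(param_buffer.dtype))
```

With fp16 grads, as in the doc's note, each rank keeps fp32 optimizer state and main params only for its own shard, which is where the memory saving over replicating that state on every data-parallel rank comes from.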
