Fixed typos in readme files of docs folder. (#2329)
rishit5 authored Jan 12, 2024
1 parent 6ec92cf commit fce61a9
Showing 9 changed files with 22 additions and 22 deletions.
4 changes: 2 additions & 2 deletions docs/source/basic_tutorials/notebook.md
@@ -186,7 +186,7 @@ Here is a basic training loop for the animal classification problem:

<Tip>

-The code has been split up to allow for explainations on each section. A full version that can be copy and pasted will be available at the end
+The code has been split up to allow for explanations on each section. A full version that can be copy and pasted will be available at the end

</Tip>

@@ -344,7 +344,7 @@ def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
mean = mean.to(accelerator.device)
std = std.to(accelerator.device)

-# Intantiate the optimizer
+# Instantiate the optimizer
optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25)

# Instantiate the learning rate scheduler
8 changes: 4 additions & 4 deletions docs/source/concept_guides/low_precision_training.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Low Precision Training Methods

The release of new kinds of hardware led to the emergence of new training paradigms that better utilize them. Currently, this is in the form of training
-in 8-bit precision using packages such as [TranformersEngine](https://github.com/NVIDIA/TransformerEngine) (TE) or [MS-AMP](https://github.com/Azure/MS-AMP/tree/main).
+in 8-bit precision using packages such as [TransformersEngine](https://github.com/NVIDIA/TransformerEngine) (TE) or [MS-AMP](https://github.com/Azure/MS-AMP/tree/main).

For an introduction to the topics discussed today, we recommend reviewing the [low-precision usage guide](../usage_guides/low_precision_training.md) as this documentation will reference it regularly.

@@ -34,9 +34,9 @@ MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16

## `TransformersEngine`

-`TranformersEngine` is the first solution to trying to train in 8-bit floating point. It works by using drop-in replacement layers for certain ones in a model that utilize their FP8-engine to reduce the number of bits (such as 32 to 8) without degrading the final accuracy of the model.
+`TransformersEngine` is the first solution to trying to train in 8-bit floating point. It works by using drop-in replacement layers for certain ones in a model that utilize their FP8-engine to reduce the number of bits (such as 32 to 8) without degrading the final accuracy of the model.

-Specifically, 🤗 Accelerate will find and replace the following layers with `TranformersEngine` versions:
+Specifically, 🤗 Accelerate will find and replace the following layers with `TransformersEngine` versions:

* `nn.LayerNorm` for `te.LayerNorm`
* `nn.Linear` for `te.Linear`
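
To make the drop-in idea concrete, below is a minimal sketch of what swapping `nn.Linear` for `te.Linear` looks like; the helper is illustrative only (Accelerate performs its own replacement internally), and it assumes `transformer_engine` is installed on a supported GPU.

```python
import torch.nn as nn
import transformer_engine.pytorch as te


def swap_linear_for_te(module: nn.Module) -> nn.Module:
    # Recursively replace nn.Linear children with te.Linear, copying the weights over.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(child.in_features, child.out_features, bias=child.bias is not None)
            te_linear.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                te_linear.bias.data.copy_(child.bias.data)
            setattr(module, name, te_linear)
        else:
            swap_linear_for_te(child)
    return module
```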
@@ -65,7 +65,7 @@ MS-AMP takes a different approach to `TransformersEngine` by providing three dif

* The base optimization level (`O1`), passes communications of the weights (such as in DDP) in FP8, stores the weights of the model in FP16, and leaves the optimizer states in FP32. The main benefit of this optimization level is that we can reduce the communication bandwidth by essentially half. Additionally, more GPU memory is saved due to 1/2 of everything being cast in FP8, and the weights being cast to FP16. Notably, both the optimizer states remain in FP32.

-* The second optimization level (`O2`) improves upon this by also reducing the precision of the optimizer states. One is in FP8 while the other is in FP16. Generally it's been shown that this will only provide a net-gain of no degredated end accuracy, increased training speed, and reduced memory as now every state is either in FP16 or FP8.
+* The second optimization level (`O2`) improves upon this by also reducing the precision of the optimizer states. One is in FP8 while the other is in FP16. Generally it's been shown that this will only provide a net-gain of no degraded end accuracy, increased training speed, and reduced memory as now every state is either in FP16 or FP8.

* Finally, MS-AMP has a third optimization level (`O3`) which helps during DDP scenarios such as DeepSpeed. The weights of the model in memory are fully cast to FP8, and the master weights are now stored in FP16. This fully reduces memory by the highest factor as now not only is almost everything in FP8, only two states are left in FP16. Currently, only DeepSpeed versions up through 0.9.2 are supported, so this capability is not included in the 🤗 Accelerate integration

2 changes: 1 addition & 1 deletion docs/source/package_reference/cli.md
@@ -218,7 +218,7 @@ The following arguments are only useful when `use_megatron_lm` is passed or Mega
* `--megatron_lm_num_micro_batches` (``) -- Megatron-LM's number of micro batches when PP degree > 1.
* `--megatron_lm_sequence_parallelism` (``) -- Decides Whether (true|false) to enable Sequence Parallelism when TP degree > 1.
* `--megatron_lm_recompute_activations` (``) -- Decides Whether (true|false) to enable Selective Activation Recomputation.
-* `--megatron_lm_use_distributed_optimizer` (``) -- Decides Whether (true|false) to use distributed optimizer which shards optimizer state and gradients across Data Pralellel (DP) ranks.
+* `--megatron_lm_use_distributed_optimizer` (``) -- Decides Whether (true|false) to use distributed optimizer which shards optimizer state and gradients across Data Parallel (DP) ranks.
* `--megatron_lm_gradient_clipping` (``) -- Megatron-LM's gradient clipping value based on global L2 Norm (0 to disable).

**AWS SageMaker Arguments**:
4 changes: 2 additions & 2 deletions docs/source/package_reference/utilities.md
@@ -60,7 +60,7 @@ These are standalone dataclasses used for checks, such as the type of distribute

### Kwargs

-These are configurable arguemnts for specific interactions throughout the PyTorch ecosystem that Accelerate handles under the hood.
+These are configurable arguments for specific interactions throughout the PyTorch ecosystem that Accelerate handles under the hood.
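
For example, a handler is constructed with the desired options and handed to the [`Accelerator`] at creation time (a minimal sketch; the specific options shown are illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import AutocastKwargs, DistributedDataParallelKwargs

# Each kwargs dataclass tweaks one underlying PyTorch interaction; Accelerate
# applies it when building the corresponding object.
handlers = [
    AutocastKwargs(cache_enabled=True),
    DistributedDataParallelKwargs(find_unused_parameters=True),
]
accelerator = Accelerator(kwargs_handlers=handlers)
```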


[[autodoc]] utils.AutocastKwargs
@@ -77,7 +77,7 @@ These are configurable arguemnts for specific interactions throughout the PyTorc
## Plugins

These are plugins that can be passed to the [`Accelerator`] object. While they are defined elsewhere in the documentation,
-for convience all of them are available to see here:
+for convenience all of them are available to see here:

[[autodoc]] utils.DeepSpeedPlugin

4 changes: 2 additions & 2 deletions docs/source/usage_guides/big_modeling.md
@@ -52,7 +52,7 @@ will attempt to fill all the space in your GPU(s), then loading them to the CPU,

<Tip>

-For more details on desigining your own device map, see this section of the [concept guide](../concept_guide/big_model_inference#designing-a-device-map)
+For more details on designing your own device map, see this section of the [concept guide](../concept_guide/big_model_inference#designing-a-device-map)

</Tip>
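
As a minimal sketch, a hand-written device map is simply a dictionary from module names to devices; the module names and checkpoint path below are illustrative, not taken from the guide:

```python
from accelerate import load_checkpoint_and_dispatch

# `model` is assumed to be an already-instantiated (e.g. meta-device) model.
device_map = {
    "transformer.block_0": 0,      # keep on GPU 0
    "transformer.block_1": 0,
    "transformer.block_2": "cpu",  # offload to CPU RAM
    "lm_head": "disk",             # offload to disk
}
model = load_checkpoint_and_dispatch(model, "path/to/checkpoint", device_map=device_map)
```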

@@ -90,7 +90,7 @@ What will happen now is each time the input gets passed through a layer, it will

<Tip>

-Multiple GPUs can be utilized, however this is considered "model parallism" and as a result only one GPU will be active at a given moment, waiting for the prior one to send it the output. You should launch your script normally with `python`
+Multiple GPUs can be utilized, however this is considered "model parallelism" and as a result only one GPU will be active at a given moment, waiting for the prior one to send it the output. You should launch your script normally with `python`
and not need `torchrun`, `accelerate launch`, etc.

</Tip>
10 changes: 5 additions & 5 deletions docs/source/usage_guides/deepspeed.md
@@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
4. Custom mixed precision training handling
5. A range of fast CUDA-extension-based optimizers
6. ZeRO-Offload to CPU and Disk/NVMe
-7. Heirarchical partitioning of model parameters (ZeRO++)
+7. Hierarchical partitioning of model parameters (ZeRO++)

ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).
@@ -61,7 +61,7 @@ Below is a short description of Data Parallelism using ZeRO - Zero Redundancy Op

e. **Param Offload**: Offloads the model parameters to CPU/Disk building on top of ZERO Stage 3

-f. **Heirarchical Paritioning**: Enables efficient multi-node training with data-parallel training across nodes and ZeRO-3 sharding within a node, built on top of ZeRO Stage 3.
+f. **Hierarchical Partitioning**: Enables efficient multi-node training with data-parallel training across nodes and ZeRO-3 sharding within a node, built on top of ZeRO Stage 3.

<u>Note</u>: With respect to Disk Offload, the disk should be an NVME for decent speed but it technically works on any Disk
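
As a minimal sketch of how these stages map onto the 🤗 Accelerate API, a plugin can also be built in Python instead of running `accelerate config`; the argument values here are illustrative:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO Stage 3 with optimizer and parameter offload to CPU.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_clipping=1.0,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```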

@@ -371,7 +371,7 @@ You can use the the features of ZeRO++ by using the appropriate config parameter
}
```

-For heirarchical partitioning, the partition size `zero_hpz_partition_size` should ideally be set to the number of GPUs per node. (For example, the above config file assumes 8 GPUs per node)
+For hierarchical partitioning, the partition size `zero_hpz_partition_size` should ideally be set to the number of GPUs per node. (For example, the above config file assumes 8 GPUs per node)

**Important code changes when using DeepSpeed Config File**

@@ -383,7 +383,7 @@ We will look at the changes needed in the code when using these.
In this situation, those will be used and the user has to use `accelerate.utils.DummyOptim` and `accelerate.utils.DummyScheduler` to replace the PyTorch/Custom optimizers and schedulers in their code.
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
-# Creates Dummy Optimizer if `optimizer` was spcified in the config file else creates Adam Optimizer
+# Creates Dummy Optimizer if `optimizer` was specified in the config file else creates Adam Optimizer
optimizer_cls = (
torch.optim.AdamW
if accelerator.state.deepspeed_plugin is None
@@ -392,7 +392,7 @@ We will look at the changes needed in the code when using these.
)
optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)

-# Creates Dummy Scheduler if `scheduler` was spcified in the config file else creates `args.lr_scheduler_type` Scheduler
+# Creates Dummy Scheduler if `scheduler` was specified in the config file else creates `args.lr_scheduler_type` Scheduler
if (
accelerator.state.deepspeed_plugin is None
or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
6 changes: 3 additions & 3 deletions docs/source/usage_guides/fsdp.md
@@ -85,11 +85,11 @@ Currently, `Accelerate` supports the following config through the CLI:

`fsdp_backward_prefetch_policy`: [1] BACKWARD_PRE, [2] BACKWARD_POST, [3] NO_PREFETCH

-`fsdp_forward_prefetch`: if True, then FSDP explicitly prefetches the next upcoming all-gather while executing in the forward pass. Should only be used for static-graph models since the prefetching follows the first iteration’s execution order. i.e., if the sub-modules' order changes dynamically during the model's executation do not enable this feature.
+`fsdp_forward_prefetch`: if True, then FSDP explicitly prefetches the next upcoming all-gather while executing in the forward pass. Should only be used for static-graph models since the prefetching follows the first iteration’s execution order. i.e., if the sub-modules' order changes dynamically during the model's execution do not enable this feature.

`fsdp_state_dict_type`: [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT

-`fsdp_use_orig_params`: If True, allows non-uniform `requires_grad` during init, which means support for interspersed frozen and trainable paramteres. This setting is useful in cases such as parameter-efficient fine-tuning as discussed in [this post](https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019). This option also allows one to have multiple optimizer param groups. This should be `True` when creating an optimizer before preparing/wrapping the model with FSDP.
+`fsdp_use_orig_params`: If True, allows non-uniform `requires_grad` during init, which means support for interspersed frozen and trainable parameters. This setting is useful in cases such as parameter-efficient fine-tuning as discussed in [this post](https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019). This option also allows one to have multiple optimizer param groups. This should be `True` when creating an optimizer before preparing/wrapping the model with FSDP.

`fsdp_cpu_ram_efficient_loading`: Only applicable for 🤗 Transformers models. If True, only the first process loads the pretrained model checkpoint while all other processes have empty weights. This should be set to False if you experience errors when loading the pretrained 🤗 Transformers model via `from_pretrained` method. When this setting is True `fsdp_sync_module_states` also must to be True, otherwise all the processes except the main process would have random weights leading to unexpected behaviour during training.
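
To illustrate the `fsdp_use_orig_params` note above, here is a minimal sketch (with `model` and `train_dataloader` assumed to already exist) of creating the optimizer before the model is wrapped:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP settings are read from `accelerate config`

# With `fsdp_use_orig_params: true`, the optimizer can be built from the raw,
# unwrapped parameters and prepared together with the model.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
```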

@@ -123,7 +123,7 @@ Below is the code snippet to save using `save_state` utility of accelerate.
accelerator.save_state("ckpt")
```

-Inspect the ckeckpoint folder to see model and optimizer as shards per process:
+Inspect the checkpoint folder to see model and optimizer as shards per process:
```
ls ckpt
# optimizer_0 pytorch_model_0 random_states_0.pkl random_states_1.pkl scheduler.bin
4 changes: 2 additions & 2 deletions docs/source/usage_guides/low_precision_training.md
@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.

## What training on FP8 means

-To explore more of the nitty-gritty in traninig in FP8 with PyTorch and 🤗 Accelerate, check out the [concept_guide](../concept_guides/low_precision_training.md) on why this can be difficult. But essentially rather than training in BF16, some (or all) aspects of training a model can be performed using 8 bits instead of 16. The challenge is doing so without degrading final performance.
+To explore more of the nitty-gritty in training in FP8 with PyTorch and 🤗 Accelerate, check out the [concept_guide](../concept_guides/low_precision_training.md) on why this can be difficult. But essentially rather than training in BF16, some (or all) aspects of training a model can be performed using 8 bits instead of 16. The challenge is doing so without degrading final performance.
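
In 🤗 Accelerate terms, opting in is a small change to how the [`Accelerator`] is created (a minimal sketch, assuming a supported GPU and one of the backends discussed below installed):

```python
from accelerate import Accelerator

# Request FP8 mixed precision; the rest of the training loop stays the same.
accelerator = Accelerator(mixed_precision="fp8")
```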

This is only enabled on specific NVIDIA hardware, namely:

@@ -57,7 +57,7 @@ Of the two, `MS-AMP` is traditionally the easier one to configure as there is on
Currently two levels of optimization are supported in the 🤗 Accelerate integration, `"O1"` and `"O2"` (using the letter 'o', not zero).

* `"O1"` will cast the weight gradients and `all_reduce` communications to happen in 8-bit, while the rest are done in 16 bit. This reduces the general GPU memory usage and speeds up communication bandwidths.
* `"O2"` will also cast first-order optimizer states into 8 bit, while the second order states are in FP16. (Currently just the `Adam` optimizer is supported). This tries it's best to minimize final accuracy degredation and will save the highest potential memory.
* `"O2"` will also cast first-order optimizer states into 8 bit, while the second order states are in FP16. (Currently just the `Adam` optimizer is supported). This tries it's best to minimize final accuracy degradation and will save the highest potential memory.

To specify an optimization level, pass it to the `FP8KwargsHandler` by setting the `optimization_level` argument:

2 changes: 1 addition & 1 deletion docs/source/usage_guides/megatron_lm.md
@@ -113,7 +113,7 @@ pip install git+https://github.com/huggingface/Megatron-LM.git
## Accelerate Megatron-LM Plugin

Important features are directly supported via the `accelerate config` command.
-An example of thr corresponding questions for using Megatron-LM features is shown below:
+An example of the corresponding questions for using Megatron-LM features is shown below:

```bash
:~$ accelerate config --config_file "megatron_gpt_config.yaml"
