Fixed typos in readme files of docs folder. (#2329)
rishit5 authored Jan 12, 2024
1 parent 6ec92cf commit fce61a9
Showing 9 changed files with 22 additions and 22 deletions.
4 changes: 2 additions & 2 deletions docs/source/basic_tutorials/notebook.md
@@ -186,7 +186,7 @@ Here is a basic training loop for the animal classification problem:

<Tip>

-The code has been split up to allow for explainations on each section. A full version that can be copy and pasted will be available at the end
+The code has been split up to allow for explanations on each section. A full version that can be copy and pasted will be available at the end

</Tip>

@@ -344,7 +344,7 @@ def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
mean = mean.to(accelerator.device)
std = std.to(accelerator.device)

-# Intantiate the optimizer
+# Instantiate the optimizer
optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25)

# Instantiate the learning rate scheduler
8 changes: 4 additions & 4 deletions docs/source/concept_guides/low_precision_training.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
# Low Precision Training Methods

The release of new kinds of hardware led to the emergence of new training paradigms that better utilize them. Currently, this is in the form of training
-in 8-bit precision using packages such as [TranformersEngine](https://github.com/NVIDIA/TransformerEngine) (TE) or [MS-AMP](https://github.com/Azure/MS-AMP/tree/main).
+in 8-bit precision using packages such as [TransformersEngine](https://github.com/NVIDIA/TransformerEngine) (TE) or [MS-AMP](https://github.com/Azure/MS-AMP/tree/main).

For an introduction to the topics discussed today, we recommend reviewing the [low-precision usage guide](../usage_guides/low_precision_training.md) as this documentation will reference it regularly.

@@ -34,9 +34,9 @@ MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16

## `TransformersEngine`

-`TranformersEngine` is the first solution to trying to train in 8-bit floating point. It works by using drop-in replacement layers for certain ones in a model that utilize their FP8-engine to reduce the number of bits (such as 32 to 8) without degrading the final accuracy of the model.
+`TransformersEngine` is the first solution to trying to train in 8-bit floating point. It works by using drop-in replacement layers for certain ones in a model that utilize their FP8-engine to reduce the number of bits (such as 32 to 8) without degrading the final accuracy of the model.

-Specifically, 🤗 Accelerate will find and replace the following layers with `TranformersEngine` versions:
+Specifically, 🤗 Accelerate will find and replace the following layers with `TransformersEngine` versions:

* `nn.LayerNorm` for `te.LayerNorm`
* `nn.Linear` for `te.Linear`
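
To make the drop-in idea concrete, below is a minimal sketch of what swapping `nn.Linear` for `te.Linear` looks like; the helper is illustrative only (Accelerate performs its own replacement internally), and it assumes `transformer_engine` is installed on a supported GPU.

```python
import torch.nn as nn
import transformer_engine.pytorch as te


def swap_linear_for_te(module: nn.Module) -> nn.Module:
    # Recursively replace nn.Linear children with te.Linear, copying the weights over.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            te_linear = te.Linear(child.in_features, child.out_features, bias=child.bias is not None)
            te_linear.weight.data.copy_(child.weight.data)
            if child.bias is not None:
                te_linear.bias.data.copy_(child.bias.data)
            setattr(module, name, te_linear)
        else:
            swap_linear_for_te(child)
    return module
```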
@@ -65,7 +65,7 @@ MS-AMP takes a different approach to `TransformersEngine` by providing three dif

* The base optimization level (`O1`), passes communications of the weights (such as in DDP) in FP8, stores the weights of the model in FP16, and leaves the optimizer states in FP32. The main benefit of this optimization level is that we can reduce the communication bandwidth by essentially half. Additionally, more GPU memory is saved due to 1/2 of everything being cast in FP8, and the weights being cast to FP16. Notably, both the optimizer states remain in FP32.

-* The second optimization level (`O2`) improves upon this by also reducing the precision of the optimizer states. One is in FP8 while the other is in FP16. Generally it's been shown that this will only provide a net-gain of no degredated end accuracy, increased training speed, and reduced memory as now every state is either in FP16 or FP8.
+* The second optimization level (`O2`) improves upon this by also reducing the precision of the optimizer states. One is in FP8 while the other is in FP16. Generally it's been shown that this will only provide a net-gain of no degraded end accuracy, increased training speed, and reduced memory as now every state is either in FP16 or FP8.

* Finally, MS-AMP has a third optimization level (`O3`) which helps during DDP scenarios such as DeepSpeed. The weights of the model in memory are fully cast to FP8, and the master weights are now stored in FP16. This fully reduces memory by the highest factor as now not only is almost everything in FP8, only two states are left in FP16. Currently, only DeepSpeed versions up through 0.9.2 are supported, so this capability is not included in the 🤗 Accelerate integration

2 changes: 1 addition & 1 deletion docs/source/package_reference/cli.md
@@ -218,7 +218,7 @@ The following arguments are only useful when `use_megatron_lm` is passed or Mega
* `--megatron_lm_num_micro_batches` (``) -- Megatron-LM's number of micro batches when PP degree > 1.
* `--megatron_lm_sequence_parallelism` (``) -- Decides Whether (true|false) to enable Sequence Parallelism when TP degree > 1.
* `--megatron_lm_recompute_activations` (``) -- Decides Whether (true|false) to enable Selective Activation Recomputation.
-* `--megatron_lm_use_distributed_optimizer` (``) -- Decides Whether (true|false) to use distributed optimizer which shards optimizer state and gradients across Data Pralellel (DP) ranks.
+* `--megatron_lm_use_distributed_optimizer` (``) -- Decides Whether (true|false) to use distributed optimizer which shards optimizer state and gradients across Data Parallel (DP) ranks.
* `--megatron_lm_gradient_clipping` (``) -- Megatron-LM's gradient clipping value based on global L2 Norm (0 to disable).

**AWS SageMaker Arguments**:
4 changes: 2 additions & 2 deletions docs/source/package_reference/utilities.md
@@ -60,7 +60,7 @@ These are standalone dataclasses used for checks, such as the type of distribute

### Kwargs

-These are configurable arguemnts for specific interactions throughout the PyTorch ecosystem that Accelerate handles under the hood.
+These are configurable arguments for specific interactions throughout the PyTorch ecosystem that Accelerate handles under the hood.
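
For example, a handler is constructed with the desired options and handed to the [`Accelerator`] at creation time (a minimal sketch; the specific options shown are illustrative):

```python
from accelerate import Accelerator
from accelerate.utils import AutocastKwargs, DistributedDataParallelKwargs

# Each kwargs dataclass tweaks one underlying PyTorch interaction; Accelerate
# applies it when building the corresponding object.
handlers = [
    AutocastKwargs(cache_enabled=True),
    DistributedDataParallelKwargs(find_unused_parameters=True),
]
accelerator = Accelerator(kwargs_handlers=handlers)
```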


[[autodoc]] utils.AutocastKwargs
@@ -77,7 +77,7 @@ These are configurable arguemnts for specific interactions throughout the PyTorc
## Plugins

These are plugins that can be passed to the [`Accelerator`] object. While they are defined elsewhere in the documentation,
-for convience all of them are available to see here:
+for convenience all of them are available to see here:

[[autodoc]] utils.DeepSpeedPlugin

4 changes: 2 additions & 2 deletions docs/source/usage_guides/big_modeling.md
@@ -52,7 +52,7 @@ will attempt to fill all the space in your GPU(s), then loading them to the CPU,

<Tip>

-For more details on desigining your own device map, see this section of the [concept guide](../concept_guide/big_model_inference#designing-a-device-map)
+For more details on designing your own device map, see this section of the [concept guide](../concept_guide/big_model_inference#designing-a-device-map)

</Tip>
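
As a minimal sketch, a hand-written device map is simply a dictionary from module names to devices; the module names and checkpoint path below are illustrative, not taken from the guide:

```python
from accelerate import load_checkpoint_and_dispatch

# `model` is assumed to be an already-instantiated (e.g. meta-device) model.
device_map = {
    "transformer.block_0": 0,      # keep on GPU 0
    "transformer.block_1": 0,
    "transformer.block_2": "cpu",  # offload to CPU RAM
    "lm_head": "disk",             # offload to disk
}
model = load_checkpoint_and_dispatch(model, "path/to/checkpoint", device_map=device_map)
```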

@@ -90,7 +90,7 @@ What will happen now is each time the input gets passed through a layer, it will

<Tip>

-Multiple GPUs can be utilized, however this is considered "model parallism" and as a result only one GPU will be active at a given moment, waiting for the prior one to send it the output. You should launch your script normally with `python`
+Multiple GPUs can be utilized, however this is considered "model parallelism" and as a result only one GPU will be active at a given moment, waiting for the prior one to send it the output. You should launch your script normally with `python`
and not need `torchrun`, `accelerate launch`, etc.

</Tip>
10 changes: 5 additions & 5 deletions docs/source/usage_guides/deepspeed.md
@@ -23,7 +23,7 @@ rendered properly in your Markdown viewer.
4. Custom mixed precision training handling
5. A range of fast CUDA-extension-based optimizers
6. ZeRO-Offload to CPU and Disk/NVMe
-7. Heirarchical partitioning of model parameters (ZeRO++)
+7. Hierarchical partitioning of model parameters (ZeRO++)

ZeRO-Offload has its own dedicated paper: [ZeRO-Offload: Democratizing Billion-Scale Model Training](https://arxiv.org/abs/2101.06840). And NVMe-support is described in the paper [ZeRO-Infinity: Breaking the GPU
Memory Wall for Extreme Scale Deep Learning](https://arxiv.org/abs/2104.07857).
@@ -61,7 +61,7 @@ Below is a short description of Data Parallelism using ZeRO - Zero Redundancy Op

e. **Param Offload**: Offloads the model parameters to CPU/Disk building on top of ZERO Stage 3

-f. **Heirarchical Paritioning**: Enables efficient multi-node training with data-parallel training across nodes and ZeRO-3 sharding within a node, built on top of ZeRO Stage 3.
+f. **Hierarchical Partitioning**: Enables efficient multi-node training with data-parallel training across nodes and ZeRO-3 sharding within a node, built on top of ZeRO Stage 3.

<u>Note</u>: With respect to Disk Offload, the disk should be an NVME for decent speed but it technically works on any Disk
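
As a minimal sketch of how these stages map onto the 🤗 Accelerate API, a plugin can also be built in Python instead of running `accelerate config`; the argument values here are illustrative:

```python
from accelerate import Accelerator, DeepSpeedPlugin

# ZeRO Stage 3 with optimizer and parameter offload to CPU.
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=3,
    gradient_clipping=1.0,
    offload_optimizer_device="cpu",
    offload_param_device="cpu",
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```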

@@ -371,7 +371,7 @@ You can use the the features of ZeRO++ by using the appropriate config parameter
}
```

-For heirarchical partitioning, the partition size `zero_hpz_partition_size` should ideally be set to the number of GPUs per node. (For example, the above config file assumes 8 GPUs per node)
+For hierarchical partitioning, the partition size `zero_hpz_partition_size` should ideally be set to the number of GPUs per node. (For example, the above config file assumes 8 GPUs per node)

**Important code changes when using DeepSpeed Config File**

@@ -383,7 +383,7 @@ We will look at the changes needed in the code when using these.
In this situation, those will be used and the user has to use `accelerate.utils.DummyOptim` and `accelerate.utils.DummyScheduler` to replace the PyTorch/Custom optimizers and schedulers in their code.
Below is the snippet from `examples/by_feature/deepspeed_with_config_support.py` showing this:
```python
-# Creates Dummy Optimizer if `optimizer` was spcified in the config file else creates Adam Optimizer
+# Creates Dummy Optimizer if `optimizer` was specified in the config file else creates Adam Optimizer
optimizer_cls = (
torch.optim.AdamW
if accelerator.state.deepspeed_plugin is None
@@ -392,7 +392,7 @@ We will look at the changes needed in the code when using these.
)
optimizer = optimizer_cls(optimizer_grouped_parameters, lr=args.learning_rate)

-# Creates Dummy Scheduler if `scheduler` was spcified in the config file else creates `args.lr_scheduler_type` Scheduler
+# Creates Dummy Scheduler if `scheduler` was specified in the config file else creates `args.lr_scheduler_type` Scheduler
if (
accelerator.state.deepspeed_plugin is None
or "scheduler" not in accelerator.state.deepspeed_plugin.deepspeed_config
6 changes: 3 additions & 3 deletions docs/source/usage_guides/fsdp.md
@@ -85,11 +85,11 @@ Currently, `Accelerate` supports the following config through the CLI:

`fsdp_backward_prefetch_policy`: [1] BACKWARD_PRE, [2] BACKWARD_POST, [3] NO_PREFETCH

-`fsdp_forward_prefetch`: if True, then FSDP explicitly prefetches the next upcoming all-gather while executing in the forward pass. Should only be used for static-graph models since the prefetching follows the first iteration’s execution order. i.e., if the sub-modules' order changes dynamically during the model's executation do not enable this feature.
+`fsdp_forward_prefetch`: if True, then FSDP explicitly prefetches the next upcoming all-gather while executing in the forward pass. Should only be used for static-graph models since the prefetching follows the first iteration’s execution order. i.e., if the sub-modules' order changes dynamically during the model's execution do not enable this feature.

`fsdp_state_dict_type`: [1] FULL_STATE_DICT, [2] LOCAL_STATE_DICT, [3] SHARDED_STATE_DICT

-`fsdp_use_orig_params`: If True, allows non-uniform `requires_grad` during init, which means support for interspersed frozen and trainable paramteres. This setting is useful in cases such as parameter-efficient fine-tuning as discussed in [this post](https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019). This option also allows one to have multiple optimizer param groups. This should be `True` when creating an optimizer before preparing/wrapping the model with FSDP.
+`fsdp_use_orig_params`: If True, allows non-uniform `requires_grad` during init, which means support for interspersed frozen and trainable parameters. This setting is useful in cases such as parameter-efficient fine-tuning as discussed in [this post](https://dev-discuss.pytorch.org/t/rethinking-pytorch-fully-sharded-data-parallel-fsdp-from-first-principles/1019). This option also allows one to have multiple optimizer param groups. This should be `True` when creating an optimizer before preparing/wrapping the model with FSDP.

`fsdp_cpu_ram_efficient_loading`: Only applicable for 🤗 Transformers models. If True, only the first process loads the pretrained model checkpoint while all other processes have empty weights. This should be set to False if you experience errors when loading the pretrained 🤗 Transformers model via `from_pretrained` method. When this setting is True `fsdp_sync_module_states` also must to be True, otherwise all the processes except the main process would have random weights leading to unexpected behaviour during training.
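
To illustrate the `fsdp_use_orig_params` note above, here is a minimal sketch (with `model` and `train_dataloader` assumed to already exist) of creating the optimizer before the model is wrapped:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # FSDP settings are read from `accelerate config`

# With `fsdp_use_orig_params: true`, the optimizer can be built from the raw,
# unwrapped parameters and prepared together with the model.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)
```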

@@ -123,7 +123,7 @@ Below is the code snippet to save using `save_state` utility of accelerate.
accelerator.save_state("ckpt")
```

-Inspect the ckeckpoint folder to see model and optimizer as shards per process:
+Inspect the checkpoint folder to see model and optimizer as shards per process:
```
ls ckpt
# optimizer_0 pytorch_model_0 random_states_0.pkl random_states_1.pkl scheduler.bin
4 changes: 2 additions & 2 deletions docs/source/usage_guides/low_precision_training.md
@@ -19,7 +19,7 @@ rendered properly in your Markdown viewer.

## What training on FP8 means

-To explore more of the nitty-gritty in traninig in FP8 with PyTorch and 🤗 Accelerate, check out the [concept_guide](../concept_guides/low_precision_training.md) on why this can be difficult. But essentially rather than training in BF16, some (or all) aspects of training a model can be performed using 8 bits instead of 16. The challenge is doing so without degrading final performance.
+To explore more of the nitty-gritty in training in FP8 with PyTorch and 🤗 Accelerate, check out the [concept_guide](../concept_guides/low_precision_training.md) on why this can be difficult. But essentially rather than training in BF16, some (or all) aspects of training a model can be performed using 8 bits instead of 16. The challenge is doing so without degrading final performance.
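
In 🤗 Accelerate terms, opting in is a small change to how the [`Accelerator`] is created (a minimal sketch, assuming a supported GPU and one of the backends discussed below installed):

```python
from accelerate import Accelerator

# Request FP8 mixed precision; the rest of the training loop stays the same.
accelerator = Accelerator(mixed_precision="fp8")
```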

This is only enabled on specific NVIDIA hardware, namely:

@@ -57,7 +57,7 @@ Of the two, `MS-AMP` is traditionally the easier one to configure as there is on
Currently two levels of optimization are supported in the 🤗 Accelerate integration, `"O1"` and `"O2"` (using the letter 'o', not zero).

* `"O1"` will cast the weight gradients and `all_reduce` communications to happen in 8-bit, while the rest are done in 16 bit. This reduces the general GPU memory usage and speeds up communication bandwidths.
* `"O2"` will also cast first-order optimizer states into 8 bit, while the second order states are in FP16. (Currently just the `Adam` optimizer is supported). This tries it's best to minimize final accuracy degredation and will save the highest potential memory.
* `"O2"` will also cast first-order optimizer states into 8 bit, while the second order states are in FP16. (Currently just the `Adam` optimizer is supported). This tries it's best to minimize final accuracy degradation and will save the highest potential memory.

To specify an optimization level, pass it to the `FP8KwargsHandler` by setting the `optimization_level` argument:

2 changes: 1 addition & 1 deletion docs/source/usage_guides/megatron_lm.md
@@ -113,7 +113,7 @@ pip install git+https://github.com/huggingface/Megatron-LM.git
## Accelerate Megatron-LM Plugin

Important features are directly supported via the `accelerate config` command.
-An example of thr corresponding questions for using Megatron-LM features is shown below:
+An example of the corresponding questions for using Megatron-LM features is shown below:

```bash
:~$ accelerate config --config_file "megatron_gpt_config.yaml"
