huggingface · stevhliu · Mar 3, 2026 · Mar 3, 2026 · Apr 23, 2026 · Apr 24, 2026
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
@@ -219,8 +219,10 @@
       title: Accelerator selection
     - local: accelerate
       title: Accelerate
+    - local: ddp
+      title: DDP
     - local: fsdp
-      title: FullyShardedDataParallel
+      title: FSDP2
     - local: deepspeed
       title: DeepSpeed ZeRO
     - local: deepspeed_alst
@@ -243,7 +245,7 @@
     - local: perf_hardware
       title: Building a GPU workstation
     - local: model_memory_anatomy
-      title: Model training anatomy
+      title: GPU memory usage
     title: Hardware
   title: Training
 - isExpanded: false

diff --git a/docs/source/en/accelerate.md b/docs/source/en/accelerate.md
@@ -16,150 +16,108 @@ rendered properly in your Markdown viewer.
 
 # Accelerate
 
-[Accelerate](https://hf.co/docs/accelerate/index) is a library designed to simplify distributed training on any type of setup with PyTorch by uniting the most common frameworks ([Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) and [DeepSpeed](https://www.deepspeed.ai/)) for it into a single interface. [`Trainer`] is powered by Accelerate under the hood, enabling loading big models and distributed training.
+[Accelerate](https://hf.co/docs/accelerate/index) provides a unified interface for distributed training backends like [FSDP](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) or [DeepSpeed](https://www.deepspeed.ai/). It detects your environment (number of GPUs, distributed backend, mixed precision, etc.) and automatically configures training, whether you're on a single GPU or on 8 GPUs with FSDP.
 
-This guide will show you two ways to use Accelerate with Transformers, using FSDP as the backend. The first method demonstrates distributed training with [`Trainer`], and the second method demonstrates adapting a PyTorch training loop. For more detailed information about Accelerate, please refer to the [documentation](https://hf.co/docs/accelerate/index).
+Accelerate wraps the model in the appropriate distributed wrapper, moves it to the correct device, and wraps the optimizer so it is compatible with the distributed setup. During training, Accelerate uses its own [`~accelerate.Accelerator.backward`] method to handle gradient scaling for mixed precision. [`Trainer`] calls the appropriate Accelerate APIs and delegates all distributed mechanics to Accelerate.
 
-```bash
-pip install accelerate
-```
-
-Start by running [accelerate config](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-config) in the command line to answer a series of prompts about your training system. This creates and saves a configuration file to help Accelerate correctly set up training based on your setup.
+Configure Accelerate for [`Trainer`] with either an Accelerate config file or [`TrainingArguments`].
 
-```bash
-accelerate config
-```
+## Accelerate config file
 
-Depending on your setup and the answers you provide, an example configuration file for distributing training with FSDP on one machine with two GPUs may look like the following.
+Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config) command and answer questions about your hardware and training setup. This creates a `default_config.yaml` file in your cache. The example below is for FSDP.
 
 ```yaml
 compute_environment: LOCAL_MACHINE
-debug: false
 distributed_type: FSDP
-downcast_bf16: 'no'
 fsdp_config:
+  fsdp_version: 2
+  fsdp_reshard_after_forward: true
+  fsdp_cpu_offload: false
   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  fsdp_backward_prefetch_policy: BACKWARD_PRE
-  fsdp_forward_prefetch: false
   fsdp_cpu_ram_efficient_loading: true
-  fsdp_offload_params: false
-  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_activation_checkpointing: false
   fsdp_state_dict_type: SHARDED_STATE_DICT
-  fsdp_sync_module_states: true
-  fsdp_transformer_layer_cls_to_wrap: BertLayer
-  fsdp_use_orig_params: true
-machine_rank: 0
-main_training_function: main
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
 mixed_precision: bf16
 num_machines: 1
-num_processes: 2
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
+num_processes: 4
 ```
 
-## Trainer
+Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script, and Accelerate reads the config file to set up training. The [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) and [deepspeed](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.deepspeed) args are unnecessary because the Accelerate config file covers the same settings.
+
+```cli
+accelerate launch train.py
+```
 
-Pass the path to the saved configuration file to [`TrainingArguments`], and from there, pass your [`TrainingArguments`] to [`Trainer`].
+The [accelerator_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config) argument accepts Accelerate settings that don't have dedicated top-level arguments in [`TrainingArguments`]. For example, set `non_blocking=True` together with [`~TrainingArguments.dataloader_pin_memory`] to overlap data transfer with compute for higher GPU throughput.
 
 ```py
-from transformers import TrainingArguments, Trainer
-
-training_args = TrainingArguments(
-    output_dir="your-model",
-    learning_rate=2e-5,
-    per_device_train_batch_size=16,
-    per_device_eval_batch_size=16,
-    num_train_epochs=2,
-    fsdp_config="path/to/fsdp_config",
-    fsdp="full_shard",
-    weight_decay=0.01,
-    eval_strategy="epoch",
-    save_strategy="epoch",
-    load_best_model_at_end=True,
-    push_to_hub=True,
+from transformers import TrainingArguments
+
+TrainingArguments(
+    ...,
+    dataloader_pin_memory=True,
+    accelerator_config={
+        "non_blocking": True,
+    },
 )
+```
 
-trainer = Trainer(
-    model=model,
-    args=training_args,
-    train_dataset=dataset["train"],
-    eval_dataset=dataset["test"],
-    processing_class=tokenizer,
-    data_collator=data_collator,
-    compute_metrics=compute_metrics,
-)
+## TrainingArguments
 
-trainer.train()
-```
+Pass a backend-specific config to [`TrainingArguments`]. The [`~Trainer.create_accelerator_and_postprocess`] method reads the settings and configures training.
 
-## Native PyTorch
+<hfoptions id="backend">
+<hfoption id="FSDP">
 
-Accelerate can also be added to any PyTorch training loop to enable distributed training. The [`~accelerate.Accelerator`] is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don't need to explicitly place your model on a device because [`~accelerate.Accelerator`] knows which device to move your model to.
+Pass a JSON config file or dict to [`~TrainingArguments.fsdp_config`]. See [FSDP](./fsdp) for a full guide and config reference.
 
 ```py
-from accelerate import Accelerator
+from transformers import TrainingArguments
 
-accelerator = Accelerator()
-device = accelerator.device
+TrainingArguments(
+    ...,
+    fsdp=True,
+    fsdp_config="path/to/fsdp.json",
+)
 ```
 
-All PyTorch objects (model, optimizer, scheduler, dataloaders) should be passed to the [`~accelerate.Accelerator.prepare`] method now. This method moves your model to the appropriate device or devices, adapts the optimizer and scheduler to use [`~accelerate.optimizer.AcceleratedOptimizer`] and [`~accelerate.scheduler.AcceleratedScheduler`], and creates a new shardable dataloader.
+</hfoption>
+<hfoption id="DeepSpeed">
+
+Pass a JSON config file or dict to [`~TrainingArguments.deepspeed`]. See [DeepSpeed](./deepspeed) for a full guide and config reference.
 
 ```py
-train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
-    train_dataloader, eval_dataloader, model, optimizer
+from transformers import TrainingArguments
+
+TrainingArguments(
+    ...,
+    deepspeed="path/to/ds_config.json",
 )
 ```
 
-Replace `loss.backward` in your training loop with Accelerates [`~accelerate.Accelerator.backward`] method to scale the gradients and determine the appropriate `backward` method to use depending on your framework (for example, DeepSpeed or Megatron).
-
-```py
-for epoch in range(num_epochs):
-    for batch in train_dataloader:
-        outputs = model(**batch)
-        loss = outputs.loss
-        accelerator.backward(loss)
-        optimizer.step()
-        lr_scheduler.step()
-        optimizer.zero_grad()
-        progress_bar.update(1)
-```
+</hfoption>
+<hfoption id="DDP">
 
-Combine everything into a function and make it callable as a script.
+DDP is configured directly through [`TrainingArguments`] fields. See [DDP](./ddp) for details.
 
 ```py
-from accelerate import Accelerator
-
-def main():
-  accelerator = Accelerator()
-
-  model, optimizer, training_dataloader, scheduler = accelerator.prepare(
-      model, optimizer, training_dataloader, scheduler
-  )
-
-  for batch in training_dataloader:
-      optimizer.zero_grad()
-      inputs, targets = batch
-      outputs = model(inputs)
-      loss = loss_function(outputs, targets)
-      accelerator.backward(loss)
-      optimizer.step()
-      scheduler.step()
-
-if __name__ == "__main__":
-    main()
+from transformers import TrainingArguments
+
+TrainingArguments(
+    ...,
+    ddp_backend="nccl",
+    ddp_find_unused_parameters=False,
+    ddp_bucket_cap_mb=25,
+    ddp_timeout=1800,
+)
 ```
 
-From the command line, call [accelerate launch](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-launch) to run your training script. Any additional arguments or parameters can be passed here as well.
+</hfoption>
+</hfoptions>
 
-To launch your training script on two GPUs, add the `--num_processes` argument.
-
-```bash
-accelerate launch --num_processes=2 your_script.py
-```
+## Next steps
 
-Refer to the [Launching Accelerate scripts](https://hf.co/docs/accelerate/main/en/basic_tutorials/launch) for more details.
+- See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
+- See [FSDP](./fsdp) for sharding parameters, gradients, and optimizer states across GPUs.
+- See [DeepSpeed](./deepspeed) for ZeRO optimization and offloading.
diff --git a/docs/source/en/accelerator_selection.md b/docs/source/en/accelerator_selection.md
@@ -16,108 +16,70 @@ rendered properly in your Markdown viewer.
 
 # Accelerator selection
 
-During distributed training, you can specify the number and order of accelerators (CUDA, XPU, MPS, HPU, etc.) to use. This can be useful when you have accelerators with different computing power and you want to use the faster accelerator first. Or you could only use a subset of the available accelerators. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
-
-This guide will show you how to select the number of accelerators to use and the order to use them in.
-
-## Number of accelerators
-
-For example, if there are 4 accelerators and you only want to use the first 2, run the command below.
-
-<hfoptions id="select-accelerator">
-<hfoption id="torchrun">
-
-Use the `--nproc_per_node` to select how many accelerators to use.
-
-```bash
-torchrun --nproc_per_node=2  trainer-program.py ...
-```
-
-</hfoption>
-<hfoption id="Accelerate">
-
-Use `--num_processes` to select how many accelerators to use.
-
-```bash
-accelerate launch --num_processes 2 trainer-program.py ...
-```
-
-</hfoption>
-<hfoption id="DeepSpeed">
-
-Use `--num_gpus` to select how many GPUs to use.
-
-```bash
-deepspeed --num_gpus 2 trainer-program.py ...
-```
-
-</hfoption>
-</hfoptions>
+You can control which accelerators (CUDA, XPU, MPS, HPU, etc.) PyTorch sees and in what order during distributed training. This lets you prioritize faster devices or limit training to a subset of the available hardware. Accelerator selection works with both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html), and doesn't require Accelerate or the [DeepSpeed integration](./main_classes/deepspeed).
 
 ## Order of accelerators
 
-To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
+Use the hardware-specific environment variable to select accelerators and set their order. Set it on the command line per run, or add it to `~/.bashrc` or another startup config file.
+
+> [!WARNING]
+> Avoid exporting these environment variables. If you forget that a variable is still set, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run instead.
 
-For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
+For example, to select accelerators 0 and 2 out of four:
 
 <hfoptions id="accelerator-type">
 <hfoption id="CUDA">
 
-```bash
+```cli
 CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
 ```
 
-Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.  
-To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
+PyTorch sees only GPUs 0 and 2, which are mapped to `cuda:0` and `cuda:1`. To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
 
-```bash
+```cli
 CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
 ```
 
 To run without any GPUs:
 
-```bash
+```cli
 CUDA_VISIBLE_DEVICES= python trainer-program.py ...
 ```
 
-You can also control the order of CUDA devices using `CUDA_DEVICE_ORDER`:
+Control the order of CUDA devices with `CUDA_DEVICE_ORDER`.
 
 - Order by PCIe bus ID (matches `nvidia-smi`):
 
-    ```bash
+    ```cli
     export CUDA_DEVICE_ORDER=PCI_BUS_ID
     ```
 
 - Order by compute capability (fastest first):
 
-    ```bash
+    ```cli
     export CUDA_DEVICE_ORDER=FASTEST_FIRST
     ```
 
 </hfoption>
 <hfoption id="Intel XPU">
 
-```bash
+```cli
 ZE_AFFINITY_MASK=0,2 torchrun trainer-program.py ...
 ```
 
-Only XPUs 0 and 2 are "visible" to PyTorch and are mapped to `xpu:0` and `xpu:1` respectively.  
-To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
+PyTorch sees only XPUs 0 and 2, which are mapped to `xpu:0` and `xpu:1`. To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
 
-```bash
+```cli
 ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
 ```
 
-You can also control the order of Intel XPUs with:
+Control the order of Intel XPUs with:
 
-```bash
+```cli
 export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
 ```
 
-For more information about device enumeration and sorting on Intel XPU, please refer to the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
+For more on device enumeration and sorting on Intel XPU, see the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
 
 </hfoption>
 </hfoptions>
-
-> [!WARNING]
-> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.