From fbe9a444ccebb2b3c4428e6d3be9a3fd04a2887c Mon Sep 17 00:00:00 2001
From: stevhliu <steven.liu@huggingface.co>
Date: Tue, 3 Mar 2026 10:37:07 -0800
Subject: [PATCH 1/7] accelerate, ddp, fsdp

---
 docs/source/en/_toctree.yml             |   4 +-
 docs/source/en/accelerate.md            | 163 ++++++++------------
 docs/source/en/accelerator_selection.md |  60 ++------
 docs/source/en/ddp.md                   |  82 ++++++++++
 docs/source/en/fsdp.md                  | 190 ++++++++++++------------
 5 files changed, 249 insertions(+), 250 deletions(-)
 create mode 100644 docs/source/en/ddp.md

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 340a37ff37d2..6ccc32cf7cf2 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -219,8 +219,10 @@
       title: Accelerator selection
     - local: accelerate
       title: Accelerate
+    - local: ddp
+      title: DDP
     - local: fsdp
-      title: FullyShardedDataParallel
+      title: FSDP
     - local: deepspeed
       title: DeepSpeed ZeRO
     - local: deepspeed_alst
diff --git a/docs/source/en/accelerate.md b/docs/source/en/accelerate.md
index a18436889e03..7bf28fba5f24 100644
--- a/docs/source/en/accelerate.md
+++ b/docs/source/en/accelerate.md
@@ -16,150 +16,111 @@ rendered properly in your Markdown viewer.
 
 # Accelerate
 
-[Accelerate](https://hf.co/docs/accelerate/index) is a library designed to simplify distributed training on any type of setup with PyTorch by uniting the most common frameworks ([Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) and [DeepSpeed](https://www.deepspeed.ai/)) for it into a single interface. [`Trainer`] is powered by Accelerate under the hood, enabling loading big models and distributed training.
+[Accelerate](https://hf.co/docs/accelerate/index) provides a unified interface for distributed training backends like [FSDP](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) or [DeepSpeed](https://www.deepspeed.ai/). It detects your environment (number of GPUs, distributed backend, mixed precision, etc.) and automatically configures training, whether you're on a single GPU or on 8 GPUs with FSDP.
 
-This guide will show you two ways to use Accelerate with Transformers, using FSDP as the backend. The first method demonstrates distributed training with [`Trainer`], and the second method demonstrates adapting a PyTorch training loop. For more detailed information about Accelerate, please refer to the [documentation](https://hf.co/docs/accelerate/index).
+Accelerate wraps the model in the appropriate distributed wrapper, moves it to the correct device, and adapts the optimizer to work with the chosen backend. During training, [`Trainer`] calls Accelerate's [`~accelerate.Accelerator.backward`] method instead of `loss.backward()` so that gradient scaling for mixed precision is handled for you. [`Trainer`] calls the appropriate Accelerate APIs and delegates all distributed mechanics to Accelerate.
 
-```bash
-pip install accelerate
-```
-
-Start by running [accelerate config](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-config) in the command line to answer a series of prompts about your training system. This creates and saves a configuration file to help Accelerate correctly set up training based on your setup.
+Configure Accelerate for [`Trainer`] with either an Accelerate config file or [`TrainingArguments`].
 
-```bash
-accelerate config
-```
+## Accelerate config file
 
-Depending on your setup and the answers you provide, an example configuration file for distributing training with FSDP on one machine with two GPUs may look like the following.
+Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config) command and answer questions about your hardware and training setup. This creates a `default_config.yaml` file in your cache. The example below is for FSDP.
 
 ```yaml
 compute_environment: LOCAL_MACHINE
-debug: false
 distributed_type: FSDP
-downcast_bf16: 'no'
 fsdp_config:
   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
   fsdp_backward_prefetch_policy: BACKWARD_PRE
-  fsdp_forward_prefetch: false
   fsdp_cpu_ram_efficient_loading: true
   fsdp_offload_params: false
   fsdp_sharding_strategy: FULL_SHARD
   fsdp_state_dict_type: SHARDED_STATE_DICT
   fsdp_sync_module_states: true
-  fsdp_transformer_layer_cls_to_wrap: BertLayer
+  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
   fsdp_use_orig_params: true
-machine_rank: 0
-main_training_function: main
 mixed_precision: bf16
 num_machines: 1
-num_processes: 2
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
+num_processes: 4
 ```
 
-## Trainer
+Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script, and Accelerate reads the config file to set up training. The [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) and [deepspeed](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.deepspeed) arguments are unnecessary because the Accelerate config file covers the same settings.
+
+```cli
+accelerate launch train.py
+```
 
-Pass the path to the saved configuration file to [`TrainingArguments`], and from there, pass your [`TrainingArguments`] to [`Trainer`].
+## TrainingArguments
+
+Pass a backend-specific config to [`TrainingArguments`]. The [`~Trainer.create_accelerator_and_postprocess`] method reads the settings and configures training.
+
+<hfoptions id="backend">
+<hfoption id="FSDP">
+
+Pass a JSON config file or dict to [`fsdp_config`]. See [FSDP](./fsdp) for a full guide and config reference.
 
 ```py
-from transformers import TrainingArguments, Trainer
-
-training_args = TrainingArguments(
-    output_dir="your-model",
-    learning_rate=2e-5,
-    per_device_train_batch_size=16,
-    per_device_eval_batch_size=16,
-    num_train_epochs=2,
-    fsdp_config="path/to/fsdp_config",
-    fsdp="full_shard",
-    weight_decay=0.01,
-    eval_strategy="epoch",
-    save_strategy="epoch",
-    load_best_model_at_end=True,
-    push_to_hub=True,
-)
+from transformers import TrainingArguments
 
-trainer = Trainer(
-    model=model,
-    args=training_args,
-    train_dataset=dataset["train"],
-    eval_dataset=dataset["test"],
-    processing_class=tokenizer,
-    data_collator=data_collator,
-    compute_metrics=compute_metrics,
+TrainingArguments(
+    ...,
+    fsdp="full_shard auto_wrap",
+    fsdp_config="path/to/fsdp.json",
 )
-
-trainer.train()
 ```
 
-## Native PyTorch
+</hfoption>
+<hfoption id="DeepSpeed">
 
-Accelerate can also be added to any PyTorch training loop to enable distributed training. The [`~accelerate.Accelerator`] is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don't need to explicitly place your model on a device because [`~accelerate.Accelerator`] knows which device to move your model to.
+Pass a JSON config file or dict to [`deepspeed`]. See [DeepSpeed](./deepspeed) for a full guide and config reference.
 
 ```py
-from accelerate import Accelerator
+from transformers import TrainingArguments
 
-accelerator = Accelerator()
-device = accelerator.device
+TrainingArguments(
+    ...,
+    deepspeed="path/to/ds_config.json",
+)
 ```
 
-All PyTorch objects (model, optimizer, scheduler, dataloaders) should be passed to the [`~accelerate.Accelerator.prepare`] method now. This method moves your model to the appropriate device or devices, adapts the optimizer and scheduler to use [`~accelerate.optimizer.AcceleratedOptimizer`] and [`~accelerate.scheduler.AcceleratedScheduler`], and creates a new shardable dataloader.
+</hfoption>
+<hfoption id="DDP">
+
+DDP is configured directly through [`TrainingArguments`] fields. See [DDP](./ddp) for details.
 
 ```py
-train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
-    train_dataloader, eval_dataloader, model, optimizer
+from transformers import TrainingArguments
+
+TrainingArguments(
+    ...,
+    ddp_backend="nccl",
+    ddp_find_unused_parameters=False,
+    ddp_bucket_cap_mb=25,
+    ddp_timeout=1800,
 )
 ```
 
-Replace `loss.backward` in your training loop with Accelerates [`~accelerate.Accelerator.backward`] method to scale the gradients and determine the appropriate `backward` method to use depending on your framework (for example, DeepSpeed or Megatron).
+</hfoption>
+</hfoptions>
 
-```py
-for epoch in range(num_epochs):
-    for batch in train_dataloader:
-        outputs = model(**batch)
-        loss = outputs.loss
-        accelerator.backward(loss)
-        optimizer.step()
-        lr_scheduler.step()
-        optimizer.zero_grad()
-        progress_bar.update(1)
-```
+## Accelerate training settings
 
-Combine everything into a function and make it callable as a script.
+The [accelerator_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config) accepts settings that don't have dedicated top-level arguments. For example, set `non_blocking=True` together with [`~TrainingArguments.dataloader_pin_memory`] to overlap data transfer with compute for higher GPU throughput.
 
 ```py
-from accelerate import Accelerator
-  
-def main():
-  accelerator = Accelerator()
-
-  model, optimizer, training_dataloader, scheduler = accelerator.prepare(
-      model, optimizer, training_dataloader, scheduler
-  )
-
-  for batch in training_dataloader:
-      optimizer.zero_grad()
-      inputs, targets = batch
-      outputs = model(inputs)
-      loss = loss_function(outputs, targets)
-      accelerator.backward(loss)
-      optimizer.step()
-      scheduler.step()
-
-if __name__ == "__main__":
-    main()
+from transformers import TrainingArguments
+
+TrainingArguments(
+    ...,
+    dataloader_pin_memory=True,
+    accelerator_config={
+        "non_blocking": True,
+    },
+)
 ```
 
-From the command line, call [accelerate launch](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-launch) to run your training script. Any additional arguments or parameters can be passed here as well.
-
-To launch your training script on two GPUs, add the `--num_processes` argument.
-
-```bash
-accelerate launch --num_processes=2 your_script.py
-```
+## Next steps
 
-Refer to the [Launching Accelerate scripts](https://hf.co/docs/accelerate/main/en/basic_tutorials/launch) for more details.
+- See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
+- See [FSDP](./fsdp) for sharding parameters, gradients, and optimizer states across GPUs.
+- See [DeepSpeed](./deepspeed) for ZeRO optimization and offloading.
diff --git a/docs/source/en/accelerator_selection.md b/docs/source/en/accelerator_selection.md
index 8b2e315706ad..ea2d55cef855 100644
--- a/docs/source/en/accelerator_selection.md
+++ b/docs/source/en/accelerator_selection.md
@@ -16,49 +16,16 @@ rendered properly in your Markdown viewer.
 
 # Accelerator selection
 
-During distributed training, you can specify the number and order of accelerators (CUDA, XPU, MPS, HPU, etc.) to use. This can be useful when you have accelerators with different computing power and you want to use the faster accelerator first. Or you could only use a subset of the available accelerators. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
-
-This guide will show you how to select the number of accelerators to use and the order to use them in.
-
-## Number of accelerators
-
-For example, if there are 4 accelerators and you only want to use the first 2, run the command below.
-
-<hfoptions id="select-accelerator">
-<hfoption id="torchrun">
-
-Use the `--nproc_per_node` to select how many accelerators to use.
-
-```bash
-torchrun --nproc_per_node=2  trainer-program.py ...
-```
-
-</hfoption>
-<hfoption id="Accelerate">
-
-Use `--num_processes` to select how many accelerators to use.
-
-```bash
-accelerate launch --num_processes 2 trainer-program.py ...
-```
-
-</hfoption>
-<hfoption id="DeepSpeed">
-
-Use `--num_gpus` to select how many GPUs to use.
-
-```bash
-deepspeed --num_gpus 2 trainer-program.py ...
-```
-
-</hfoption>
-</hfoptions>
+You can control which accelerators (CUDA, XPU, MPS, HPU, etc.) PyTorch sees and in what order during distributed training. This lets you prioritize faster devices or limit training to a subset of the available hardware. Device selection works with both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html), and doesn't require Accelerate or the [DeepSpeed integration](./main_classes/deepspeed).
 
 ## Order of accelerators
 
-To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
+Use the hardware-specific environment variable to select accelerators and set their order. Set it on the command line per run, or add it to `~/.bashrc` or another startup config file.
+
+> [!WARNING]
+> Avoid exporting these environment variables. An exported value persists for the rest of the shell session, so if you forget it is set, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run instead.
 
-For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
+For example, to select accelerators 0 and 2 out of four:
 
 <hfoptions id="accelerator-type">
 <hfoption id="CUDA">
@@ -67,8 +34,7 @@ For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run a
 CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
 ```
 
-Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.  
-To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
+PyTorch sees only GPUs 0 and 2, which are mapped to `cuda:0` and `cuda:1`. To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
 
 ```bash
 CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
@@ -80,7 +46,7 @@ To run without any GPUs:
 CUDA_VISIBLE_DEVICES= python trainer-program.py ...
 ```
 
-You can also control the order of CUDA devices using `CUDA_DEVICE_ORDER`:
+Control the order of CUDA devices with `CUDA_DEVICE_ORDER`.
 
 - Order by PCIe bus ID (matches `nvidia-smi`):
 
@@ -101,23 +67,19 @@ You can also control the order of CUDA devices using `CUDA_DEVICE_ORDER`:
 ZE_AFFINITY_MASK=0,2 torchrun trainer-program.py ...
 ```
 
-Only XPUs 0 and 2 are "visible" to PyTorch and are mapped to `xpu:0` and `xpu:1` respectively.  
-To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
+PyTorch sees only XPUs 0 and 2, which are mapped to `xpu:0` and `xpu:1`. To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
 
 ```bash
 ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
 ```
 
-You can also control the order of Intel XPUs with:
+Control the order of Intel XPUs with:
 
 ```bash
 export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
 ```
 
-For more information about device enumeration and sorting on Intel XPU, please refer to the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
+For more on device enumeration and sorting on Intel XPU, see the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
 
 </hfoption>
 </hfoptions>
-
-> [!WARNING]
-> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
diff --git a/docs/source/en/ddp.md b/docs/source/en/ddp.md
new file mode 100644
index 000000000000..ce4832904ea1
--- /dev/null
+++ b/docs/source/en/ddp.md
@@ -0,0 +1,82 @@
+<!--Copyright 2026 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# DDP
+
+[DistributedDataParallel (DDP)](https://docs.pytorch.org/tutorials/beginner/ddp_series_theory.html) maintains a full copy of the model on each GPU. Each GPU runs a forward and backward pass on its own non-overlapping shard of the data. Before the optimizer step, an all-reduce averages the gradients across all GPUs, so every copy applies the same update and the GPUs stay in sync. With gradient accumulation, [`Trainer`] skips the all-reduce on intermediate micro-batches and runs it only on the final one. Use DDP when your model fits on a single GPU.
+
+```text
+                         ┌─────────────────┐
+                         │  training data  │
+                         └────────┬────────┘
+               ┌──────────────────┼──────────────────┐
+               │ shard 0          │ shard 1          │ shard 2
+               ▼                  ▼                  ▼
+        ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
+        │   model     │    │   model     │    │   model     │
+        │  (copy 0)   │    │  (copy 1)   │    │  (copy 2)   │
+        │   GPU 0     │    │   GPU 1     │    │   GPU 2     │
+        └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
+               │ grads            │ grads            │ grads
+               └──────────────────┼──────────────────┘
+                               all-reduce
+                          (average gradients)
+               ┌──────────────────┼──────────────────┐
+               ▼                  ▼                  ▼
+        ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
+        │  optimizer  │    │  optimizer  │    │  optimizer  │
+        │    step     │    │    step     │    │    step     │
+        └─────────────┘    └─────────────┘    └─────────────┘
+          (identical)        (identical)        (identical)
+```
+
+DDP activates automatically when you launch with a multi-process launcher like [Accelerate](./accelerate).
+
+```cli
+# 4 GPUs on one machine
+accelerate launch --num_processes 4 train.py
+```
+
+## Configure DDP
+
+Pass these [`TrainingArguments`] to control DDP behavior.
+
+- [`~TrainingArguments.gradient_accumulation_steps`] determines when to perform the all-reduce. For example, with `gradient_accumulation_steps=4`, the all-reduce runs every 4 backward passes. This is a general [`TrainingArguments`] setting that interacts with DDP.
+- [`ddp_find_unused_parameters`] traverses the autograd graph from the forward pass outputs to find parameters that won't receive a gradient and marks them as ready so they don't block the all-reduce. Don't use with [`~TrainingArguments.gradient_checkpointing`], because gradient checkpointing discards intermediate activations and recomputes them during the backward pass, which can conflict with this bookkeeping.
+- [`ddp_bucket_cap_mb`] is the bucket size, in megabytes, for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.
+- [`ddp_broadcast_buffers`] synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don't use with [`~TrainingArguments.gradient_checkpointing`].
+- [`ddp_backend`] sets the communication backend. Use `"nccl"` for NVIDIA GPUs (default and fastest), `"gloo"` for CPU training or debugging, and `"xccl"`, `"hccl"`, or `"cncl"` for other hardware.
+- [`~TrainingArguments.ddp_timeout`] sets the time limit, in seconds, for distributed operations (all-reduce, broadcast) to complete across all processes. If a process stalls, for example while slowly loading a large model, the timeout raises an error instead of blocking indefinitely.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+    ...,
+    gradient_accumulation_steps=4,
+    ddp_backend="nccl",
+    ddp_find_unused_parameters=False,
+    ddp_bucket_cap_mb=25,
+    ddp_broadcast_buffers=True,
+    ddp_timeout=1800,
+)
+```
+
+## Next steps
+
+- See [FSDP](./fsdp) for training models too large to fit on a single GPU.
+- See [DeepSpeed](./deepspeed) for ZeRO optimization and offloading.
+- Read the [Data Parallelism](https://nanotron-ultrascale-playbook.static.hf.space/index.html#data_parallelism) chapter from The Ultra-Scale Playbook for more information about how DDP works.
diff --git a/docs/source/en/fsdp.md b/docs/source/en/fsdp.md
index 944c5a18e109..07aa6d49b28f 100644
--- a/docs/source/en/fsdp.md
+++ b/docs/source/en/fsdp.md
@@ -14,132 +14,124 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# FullyShardedDataParallel
-
-[Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a [parallelism](./perf_train_gpu_many) method that combines the advantages of data and model parallelism for distributed training.
-
-Unlike [DistributedDataParallel (DDP)](./perf_train_gpu_many#distributeddataparallel), FSDP saves more memory because it doesn't replicate a model on each GPU. It shards the models parameters, gradients and optimizer states across GPUs. Each model shard processes a portion of the data and the results are synchronized to speed up training.
-
-This guide covers how to set up training a model with FSDP and [Accelerate](https://hf.co/docs/accelerate/index), a library for managing distributed training.
-
-```bash
-pip install accelerate
+# FSDP
+
+[Fully Sharded Data Parallel (FSDP)](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) shards the model parameters, gradients, and optimizer states across GPUs. Before each computation, every GPU all-gathers the full parameters it needs from the other shards, then frees them afterward. Sharding lets you train models larger than a single GPU's memory, at the cost of more communication than [DDP](./ddp). Use FSDP when your model or optimizer states don't fit on a single GPU.
+
+```text
+                      ┌─────────────────┐
+                      │  training data  │
+                      └────────┬────────┘
+            ┌──────────────────┼──────────────────┐
+            │ shard 0          │ shard 1          │ shard 2
+            ▼                  ▼                  ▼
+     ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
+     │  param      │    │  param      │    │  param      │
+     │  shard 0    │    │  shard 1    │    │  shard 2    │
+     │  GPU 0      │    │  GPU 1      │    │  GPU 2      │
+     └──────┬──────┘    └──────┬──────┘    └──────┬──────┘
+            │                  │                  │
+            └──────── all-gather (params) ────────┘
+                               │
+                    full params on each GPU
+                               │
+            ┌──────────────────┼──────────────────┐
+            ▼                  ▼                  ▼
+    forward/backward   forward/backward   forward/backward
+            │                  │                  │
+            └───── reduce-scatter (grads) ────────┘
+                               │
+            ┌──────────────────┼──────────────────┐
+            ▼                  ▼                  ▼
+     grad shard 0       grad shard 1       grad shard 2
+     optim shard 0      optim shard 1      optim shard 2
+        step               step               step
 ```
 
-## Configuration options
+## Sharding strategies
 
-Always start by running the [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config) command to help Accelerate set up the correct distributed training environment.
+Pass one of the sharding strategies below to [fsdp](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp).
 
-```bash
-accelerate config
-```
+| strategy | description |
+|---|---|
+| `full_shard` | shard parameters, gradients, and optimizer states |
+| `shard_grad_op` | shard gradients and optimizer states |
+| `no_shard` | no sharding (equivalent to DDP) |
+| `hybrid_shard` | full shard within a node, replicate across nodes |
+| `hybrid_shard_zero2` | shard gradients and optimizer states within a node, replicate across nodes |
+| `offload` | CPU offload (combine with `full_shard` or `shard_grad_op`) |
 
-The section below discusses some of the more important FSDP configuration options. Learn more about other available options in the [fsdp_config](https://hf.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter.
+Always combine a sharding strategy with `auto_wrap` to enable the auto-wrapping policy, for example `fsdp="full_shard auto_wrap"`. Without `auto_wrap`, the entire model is one FSDP unit and you lose the memory benefit of sharding.
 
-### Sharding strategy
+## Configure FSDP
 
-FSDP offers several sharding strategies to distribute a model. Refer to the table below to help you choose the best strategy for your setup. Specify a strategy with the `fsdp_sharding_strategy` parameter in the configuration file.
+These fields control how FSDP wraps and loads the model.
 
-| sharding strategy | description | parameter value |
-|---|---|---|
-| `FULL_SHARD` | shards model parameters, gradients, and optimizer states | `1` |
-| `SHARD_GRAD_OP` | shards gradients and optimizer states | `2` |
-| `NO_SHARD` | don't shard the model | `3` |
-| `HYBRID_SHARD` | shards model parameters, gradients, and optimizer states within each GPU | `4` |
-| `HYBRID_SHARD_ZERO2` | shards gradients and optimizer states within each GPU | `5` |
+- `transformer_layer_cls_to_wrap` defines the transformer layer to wrap into an FSDP unit. Each unit manages its own gather and scatter ops. Only the current unit's parameters are gathered during the forward pass. The previous units' parameters are released to save memory.
 
-### CPU offload
+  Wrapping only the top-level model yields no GPU memory savings. Wrapping every individual `Linear` layer makes inter-unit communication very expensive. If you leave this field empty, FSDP reads the value from the model definition.
 
-Offload model parameters and gradients when they aren't being used to the CPU to save additional GPU memory. This is useful for scenarios where a model is too large even with FSDP.
+- `backward_prefetch` determines when to start the all-gather for the next FSDP unit during the backward pass. The default `"backward_pre"` prefetches before the current unit's backward to overlap communication with compute.
 
-Specify `fsdp_offload_params: true` in the configuration file to enable offloading.
+- `forward_prefetch` prefetches the next FSDP unit during the forward pass, improving throughput at the cost of higher peak memory.
 
-### Wrapping policy
+- `limit_all_gathers` adds a CPU synchronization point to prevent too many simultaneous all-gathers, reducing peak memory at the cost of slightly lower throughput.
 
-FSDP is applied by wrapping each layer in the network. The wrapping is usually applied in a nested way where the full weights are discarded after each forward pass to save memory for the next layer.
+- `cpu_ram_efficient_loading` loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM. Use with `sync_module_states` to broadcast the parameters from rank 0 to other processes.
 
-There are several wrapping policies available, but the *auto wrapping* policy is the simplest and doesn't require any changes to your code. Specify `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` to wrap a Transformer layer and `fsdp_transformer_layer_cls_to_wrap` to determine which layer to wrap (for example, `BertLayer`).
+- `sync_module_states` broadcasts rank 0's parameters to all other ranks after wrapping. Required when `cpu_ram_efficient_loading` is enabled. Without it, non-rank-0 processes train on uninitialized weights.
 
-Size-based wrapping is also available. If a layer exceeds a certain number of parameters, it is wrapped. Specify `fsdp_wrap_policy: SIZED_BASED_WRAP` and `min_num_param` to set the minimum number of parameters for a layer to be wrapped.
+- `use_orig_params` preserves the original parameter structure, allowing non-uniform `requires_grad` within an FSDP unit. Required for parameter-efficient fine-tuning (PEFT/LoRA) where only adapter layers are trainable.
 
-### Checkpoints
+- `activation_checkpointing` recomputes activations during the backward pass instead of storing them. Use this instead of [gradient checkpointing](./grad_checkpointing) in [`TrainingArguments`]. Setting both raises an error.
 
-Intermediate checkpoints should be saved as a sharded state dict because saving the full state dict - even with CPU offloading - is time consuming and can cause `NCCL Timeout` errors due to indefinite hanging during broadcasting.
+Configure FSDP training with either an [Accelerate config file](./accelerate#accelerate-config-file) or an FSDP config file passed to [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config).
 
-Specify `fsdp_state_dict_type: SHARDED_STATE_DICT` in the configuration file to save the sharded state dict. Now you can resume training from the sharded state dict with [`~accelerate.Accelerator.load_state`].
+<hfoptions id="launch">
+<hfoption id="Accelerate config file">
 
-```py
-accelerator.load_state("directory/containing/checkpoints")
-```
+Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config) command and answer questions about your hardware and training setup. This creates a `default_config.yaml` file in your cache.
 
-Once training is complete though, you should save the full state dict because the sharded state dict is only compatible with FSDP.
-
-```py
-if trainer.is_fsdp_enabled:
-  trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
+Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script. Passing [`fsdp_config`] is unnecessary because the Accelerate config file covers the same settings.
 
-trainer.save_model(script_args.output_dir)
+```cli
+accelerate launch train.py
 ```
 
-### TPU
-
-[PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html), a package for running PyTorch on XLA devices, enables FSDP on TPUs. Modify the configuration file to include the parameters below. Refer to the [xla_fsdp_settings](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128) parameter for additional XLA-specific parameters you can configure for FSDP.
-
-```yaml
-xla: True # must be set to True to enable PyTorch/XLA
-xla_fsdp_settings: # XLA specific FSDP parameters
-xla_fsdp_grad_ckpt: True # enable gradient checkpointing
-```
-
-## Training
-
-After running [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config), your configuration file should be ready. An example configuration file is shown below that fully shards the parameter, gradient and optimizer states on two GPUs. Your file may look different depending on how you set up your configuration.
-
-```yaml
-compute_environment: LOCAL_MACHINE
-debug: false
-distributed_type: FSDP
-downcast_bf16: 'no'
-fsdp_config:
-  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  fsdp_backward_prefetch_policy: BACKWARD_PRE
-  fsdp_cpu_ram_efficient_loading: true
-  fsdp_forward_prefetch: false
-  fsdp_offload_params: true
-  fsdp_sharding_strategy: 1
-  fsdp_state_dict_type: SHARDED_STATE_DICT
-  fsdp_sync_module_states: true
-  fsdp_transformer_layer_cls_to_wrap: BertLayer
-  fsdp_use_orig_params: true
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 2
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
+</hfoption>
+<hfoption id="FSDP config file">
+
+Pass an FSDP config file to [`fsdp_config`]. All fields are optional except for the sharding strategy in `fsdp`.
+
+```json
+{
+  "version": 1,
+  "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
+  "backward_prefetch": "backward_pre",
+  "forward_prefetch": false,
+  "limit_all_gathers": true,
+  "use_orig_params": true,
+  "sync_module_states": true,
+  "cpu_ram_efficient_loading": true,
+  "activation_checkpointing": true
+}
 ```
 
-Run the [accelerate launch](https://hf.co/docs/accelerate/package_reference/cli#accelerate-launch) command to launch a training script with the FSDP configurations you chose in the configuration file.
-
-```bash
-accelerate launch my-training-script.py
-```
-
-It is also possible to directly specify some of the FSDP arguments in the command line.
+```py
+from transformers import TrainingArguments
 
-```bash
-accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/" my-training-script.py
+TrainingArguments(
+    ...,
+    fsdp="full_shard auto_wrap",
+    fsdp_config="path/to/fsdp.json",
+)
 ```
 
-## Resources
+</hfoption>
+</hfoptions>
 
-FSDP is a powerful tool for training large models with fewer GPUs compared to other parallelism strategies. Refer to the following resources below to learn even more about FSDP.
+## Next steps
 
-- Follow along with the more in-depth Accelerate guide for [FSDP](https://hf.co/docs/accelerate/usage_guides/fsdp).
-- Read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) blog post.
-- Read the [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) blog post.
+- See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
+- See [DeepSpeed](./deepspeed) for ZeRO optimization and NVMe offloading.
+- Read the [FSDP chapter](https://nanotron-ultrascale-playbook.static.hf.space/index.html#zero-3:_adding_parameter_partitioning_(fsdp)) from The Ultra-Scale Playbook for more information about how FSDP works.

From 582578d05d0a7b3abee5a459a549ec926318c496 Mon Sep 17 00:00:00 2001
From: stevhliu <steven.liu@huggingface.co>
Date: Tue, 3 Mar 2026 14:35:53 -0800
Subject: [PATCH 2/7] update

---
 docs/source/en/accelerator_selection.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/accelerator_selection.md b/docs/source/en/accelerator_selection.md
index ea2d55cef855..44afc50032ca 100644
--- a/docs/source/en/accelerator_selection.md
+++ b/docs/source/en/accelerator_selection.md
@@ -23,7 +23,7 @@ You can control which accelerators (CUDA, XPU, MPS, HPU, etc.) PyTorch sees and
 Use the hardware-specific environment variable to select accelerators and set their order. Set it on the command line per run, or add it to `~/.bashrc` or another startup config file.
 
 > [!WARNING]
-> Avoid exporting environment variables instead of setting them on the command line. If you forget a previously exported value, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run.
+> Avoid exporting environment variables because if you forget a previously exported value, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run.
 
 For example, to select accelerators 0 and 2 out of four:
 

From cddecaa4f00f73c28df6a7de53205d7463e78b8d Mon Sep 17 00:00:00 2001
From: stevhliu <steven.liu@huggingface.co>
Date: Thu, 23 Apr 2026 09:44:42 -0700
Subject: [PATCH 3/7] feedback

---
 docs/source/en/accelerate.md | 1 +
 docs/source/en/fsdp.md       | 7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/accelerate.md b/docs/source/en/accelerate.md
index 7bf28fba5f24..59e57fe2448a 100644
--- a/docs/source/en/accelerate.md
+++ b/docs/source/en/accelerate.md
@@ -64,6 +64,7 @@ from transformers import TrainingArguments
 
 TrainingArguments(
     ...,
+    fsdp=True,
     fsdp="full_shard auto_wrap",
     fsdp_config="path/to/fsdp.json",
 )
diff --git a/docs/source/en/fsdp.md b/docs/source/en/fsdp.md
index 07aa6d49b28f..6c6163fb53b1 100644
--- a/docs/source/en/fsdp.md
+++ b/docs/source/en/fsdp.md
@@ -101,11 +101,9 @@ accelerate launch train.py
 </hfoption>
 <hfoption id="FSDP config file">
 
-Pass an FSDP config file to [`fsdp_config`]. All fields are optional except for the sharding strategy in `fsdp`.
-
 ```json
 {
-  "version": 1,
+  "version": 2,
   "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
   "backward_prefetch": "backward_pre",
   "forward_prefetch": false,
@@ -117,11 +115,14 @@ Pass an FSDP config file to [`fsdp_config`]. All fields are optional except for
 }
 ```
 
+Set `fsdp=True`, a sharding strategy, and pass the FSDP config file to [`fsdp_config`].
+
 ```py
 from transformers import TrainingArguments
 
 TrainingArguments(
     ...,
+    fsdp=True,
     fsdp="full_shard auto_wrap",
     fsdp_config="path/to/fsdp.json",
 )

From 6d4f030b4120cc15754fb988639d79495ad3ad81 Mon Sep 17 00:00:00 2001
From: stevhliu <steven.liu@huggingface.co>
Date: Fri, 24 Apr 2026 13:28:17 -0700
Subject: [PATCH 4/7] feedback

---
 docs/source/en/_toctree.yml  |  2 +-
 docs/source/en/accelerate.md | 10 +++-----
 docs/source/en/fsdp.md       | 48 ++++++++++++++++--------------------
 3 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 6ccc32cf7cf2..8a391a30c071 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -245,7 +245,7 @@
     - local: perf_hardware
       title: Building a GPU workstation
     - local: model_memory_anatomy
-      title: Model training anatomy
+      title: GPU memory usage
     title: Hardware
   title: Training
 - isExpanded: false
diff --git a/docs/source/en/accelerate.md b/docs/source/en/accelerate.md
index 59e57fe2448a..2b6737302de9 100644
--- a/docs/source/en/accelerate.md
+++ b/docs/source/en/accelerate.md
@@ -30,15 +30,14 @@ Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_re
 compute_environment: LOCAL_MACHINE
 distributed_type: FSDP
 fsdp_config:
+  fsdp_version: 2
+  fsdp_reshard_after_forward: true
+  fsdp_cpu_offload: false
   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
-  fsdp_backward_prefetch_policy: BACKWARD_PRE
   fsdp_cpu_ram_efficient_loading: true
-  fsdp_offload_params: false
-  fsdp_sharding_strategy: FULL_SHARD
+  fsdp_activation_checkpointing: false
   fsdp_state_dict_type: SHARDED_STATE_DICT
-  fsdp_sync_module_states: true
   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
-  fsdp_use_orig_params: true
 mixed_precision: bf16
 num_machines: 1
 num_processes: 4
@@ -65,7 +64,6 @@ from transformers import TrainingArguments
 TrainingArguments(
     ...,
     fsdp=True,
-    fsdp="full_shard auto_wrap",
     fsdp_config="path/to/fsdp.json",
 )
 ```
diff --git a/docs/source/en/fsdp.md b/docs/source/en/fsdp.md
index 6c6163fb53b1..325da73705e7 100644
--- a/docs/source/en/fsdp.md
+++ b/docs/source/en/fsdp.md
@@ -14,9 +14,9 @@ rendered properly in your Markdown viewer.
 
 -->
 
-# FSDP
+# FSDP2
 
-[Fully Sharded Data Parallel (FSDP)](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) shards the model, gradients, and optimizer states across GPUs. Before computation, each GPU gathers a complete set of parameters from all shards, then frees them afterward. Sharding lets you train models larger than a single GPU's memory, at the cost of more communication than [DDP](./ddp). Use FSDP when your model or optimizer states don't fit on a single GPU.
+[Fully Sharded Data Parallel (FSDP2)](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) shards the model, gradients, and optimizer states across GPUs. Before computation, each GPU gathers a complete set of parameters from all shards, then frees them afterward. Sharding lets you train models larger than a single GPU's memory, at the cost of more communication than [DDP](./ddp). Use FSDP when your model or optimizer states don't fit on a single GPU.
 
 ```text
                       ┌─────────────────┐
@@ -50,38 +50,34 @@ rendered properly in your Markdown viewer.
 
 ## Sharding strategies
 
-Pass one of the sharding strategies below to [fsdp](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp).
+FSDP2 controls sharding with [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config). Set `fsdp=True` to enable FSDP, and set `reshard_after_forward` in the FSDP config to choose the memory and throughput tradeoff.
 
-| strategy | description |
+| `reshard_after_forward` | behavior |
 |---|---|
-| `full_shard` | shard parameters, gradients, and optimizer states |
-| `shard_grad_op` | shard gradients and optimizer states |
-| `no_shard` | DDP |
-| `hybrid_shard` | full shard within a node, replicate across nodes |
-| `hybrid_shard_zero2` | shard gradients and optimizer states within a node, replicate across nodes |
-| `offload` | CPU offload (combine with `full_shard` or `shard_grad_op`) |
+| `true` | reshard parameters after the forward pass to save more memory |
+| `false` | keep parameters gathered between forward and backward to avoid the re-all-gather, at the cost of higher peak memory |
 
-Always combine a sharding strategy with `auto_wrap` to enable the auto-wrapping policy like `fsdp="full_shard auto_wrap"`. Without `auto_wrap`, the entire model is one FSDP unit and you lose the memory benefit of sharding.
+Use `auto_wrap_policy` to enable wrapping. Without wrapping, the entire model is one FSDP unit and you lose the memory benefit of sharding.
 
 ## Configure FSDP
 
-These fields control how FSDP wraps and loads the model.
+These fields control how FSDP2 wraps, shards, and loads the model.
 
-- `transformer_layer_cls_to_wrap` defines the transformer layer to wrap into an FSDP unit. Each unit manages its own gather and scatter ops. Only the current unit's parameters are gathered during the forward pass. The previous units' parameters are released to save memory.
+- `reshard_after_forward` determines whether to reshard parameters after the forward pass. The default `true` saves more memory. Set it to `false` to keep parameters gathered between the forward and backward passes and reduce communication at the cost of higher peak memory.
 
-  Wrapping only the top-level model yields no GPU memory savings. Wrapping every individual `Linear` layer makes inter-unit communication very expensive. Leave this field empty and FSDP reads the value from the model definition.
+- `cpu_offload` offloads parameters and gradients to CPU when they aren't in use to save GPU memory.
 
-- `backward_prefetch` determines when to start the all-gather for the next FSDP unit during the backward pass. The default `"backward_pre"` prefetches before the current unit's backward to overlap communication with compute.
+- `auto_wrap_policy` determines how modules are wrapped into FSDP units. Use `"TRANSFORMER_BASED_WRAP"` for transformer layers, `"SIZE_BASED_WRAP"` for modules above a parameter-count threshold, or `"NO_WRAP"` to disable auto wrapping.
 
-- `forward_prefetch` prefetches the next FSDP unit during the forward pass, improving throughput at the cost of higher peak memory.
+- `transformer_layer_cls_to_wrap` defines the transformer layer to wrap into an FSDP unit when `auto_wrap_policy` is `"TRANSFORMER_BASED_WRAP"`. Each unit manages its own gather and scatter ops. Only the current unit's parameters are gathered during the forward pass. The previous units' parameters are released to save memory.
 
-- `limit_all_gathers` adds a CPU synchronization point to prevent too many simultaneous all-gathers, reducing peak memory at the cost of slightly lower throughput.
+  Wrapping only the top-level model yields no GPU memory savings. Wrapping every individual `Linear` layer makes inter-unit communication very expensive. Leave this field empty and FSDP reads the value from the model definition.
 
-- `cpu_ram_efficient_loading` loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM. Use with `sync_module_states` to broadcast the parameters from rank 0 to other processes.
+- `min_num_params` sets the minimum number of parameters per module for size-based wrapping. It is only used when `auto_wrap_policy` is `"SIZE_BASED_WRAP"`.
 
-- `sync_module_states` broadcasts rank 0's parameters to all other ranks after wrapping. Required when `cpu_ram_efficient_loading` is enabled. Without it, non-rank-0 processes train on uninitialized weights.
+- `state_dict_type` controls the checkpoint format. Use `"FULL_STATE_DICT"` for a single Transformers-compatible checkpoint or `"SHARDED_STATE_DICT"` for one checkpoint file per rank.
 
-- `use_orig_params` preserves the original parameter structure, allowing non-uniform `requires_grad` within an FSDP unit. Required for parameter-efficient fine-tuning (PEFT/LoRA) where only adapter layers are trainable.
+- `cpu_ram_efficient_loading` loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM.
 
 - `activation_checkpointing` recomputes activations during the backward pass instead of storing them. Use this instead of [gradient checkpointing](./grad_checkpointing) in [`TrainingArguments`]. Setting both raises an error.
 
@@ -104,18 +100,17 @@ accelerate launch train.py
 ```json
 {
   "version": 2,
+  "reshard_after_forward": true,
+  "cpu_offload": false,
+  "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
   "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
-  "backward_prefetch": "backward_pre",
-  "forward_prefetch": false,
-  "limit_all_gathers": true,
-  "use_orig_params": true,
-  "sync_module_states": true,
+  "state_dict_type": "FULL_STATE_DICT",
   "cpu_ram_efficient_loading": true,
   "activation_checkpointing": true
 }
 ```
 
-Set `fsdp=True`, a sharding strategy, and pass the FSDP config file to [`fsdp_config`].
+Set `fsdp=True` and pass the FSDP config file to [`fsdp_config`].
 
 ```py
 from transformers import TrainingArguments
@@ -123,7 +118,6 @@ from transformers import TrainingArguments
 TrainingArguments(
     ...,
     fsdp=True,
-    fsdp="full_shard auto_wrap",
     fsdp_config="path/to/fsdp.json",
 )
 ```

From 7d48af85f2d5013a60e4c9f841eda9927123c0a0 Mon Sep 17 00:00:00 2001
From: stevhliu <steven.liu@huggingface.co>
Date: Fri, 24 Apr 2026 13:28:57 -0700
Subject: [PATCH 5/7] fix

---
 docs/source/en/_toctree.yml | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 8a391a30c071..858da6b58d05 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -222,7 +222,7 @@
     - local: ddp
       title: DDP
     - local: fsdp
-      title: FSDP
+      title: FSDP2
     - local: deepspeed
       title: DeepSpeed ZeRO
     - local: deepspeed_alst

From 69088411179b0a00c86d515b0af46487a4fa521f Mon Sep 17 00:00:00 2001
From: stevhliu <steven.liu@huggingface.co>
Date: Wed, 20 May 2026 15:21:51 +0900
Subject: [PATCH 6/7] polish

---
 docs/source/en/fsdp.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/en/fsdp.md b/docs/source/en/fsdp.md
index 325da73705e7..a679145fa77c 100644
--- a/docs/source/en/fsdp.md
+++ b/docs/source/en/fsdp.md
@@ -57,7 +57,7 @@ FSDP2 controls sharding with [fsdp_config](https://huggingface.co/docs/transform
 | `true` | reshard parameters after the forward pass to save more memory |
 | `false` | keep parameters gathered between forward and backward to avoid the re-all-gather, at the cost of higher peak memory |
 
-Use `auto_wrap_policy` to enable wrapping. Without wrapping, the entire model is one FSDP unit and you lose the memory benefit of sharding.
+`auto_wrap_policy` controls how modules are wrapped into FSDP units. It defaults to `"TRANSFORMER_BASED_WRAP"`, which wraps the model's transformer layers. Without wrapping (`"NO_WRAP"`), the entire model is one FSDP unit and you lose the memory benefit of sharding.
 
 ## Configure FSDP
 
@@ -67,7 +67,7 @@ These fields control how FSDP2 wraps, shards, and loads the model.
 
 - `cpu_offload` offloads parameters and gradients to CPU when they aren't in use to save GPU memory.
 
-- `auto_wrap_policy` determines how modules are wrapped into FSDP units. Use `"TRANSFORMER_BASED_WRAP"` for transformer layers, `"SIZE_BASED_WRAP"` for modules above a parameter-count threshold, or `"NO_WRAP"` to disable auto wrapping.
+- `auto_wrap_policy` determines how modules are wrapped into FSDP units. Defaults to `"TRANSFORMER_BASED_WRAP"`, which wraps transformer layers. Use `"SIZE_BASED_WRAP"` for modules above a parameter-count threshold, or `"NO_WRAP"` to disable auto wrapping.
 
 - `transformer_layer_cls_to_wrap` defines the transformer layer to wrap into an FSDP unit when `auto_wrap_policy` is `"TRANSFORMER_BASED_WRAP"`. Each unit manages its own gather and scatter ops. Only the current unit's parameters are gathered during the forward pass. The previous units' parameters are released to save memory.
 
@@ -75,7 +75,7 @@ These fields control how FSDP2 wraps, shards, and loads the model.
 
 - `min_num_params` sets the minimum number of parameters per module for size-based wrapping. It is only used when `auto_wrap_policy` is `"SIZE_BASED_WRAP"`.
 
-- `state_dict_type` controls the checkpoint format. Use `"FULL_STATE_DICT"` for a single Transformers-compatible checkpoint or `"SHARDED_STATE_DICT"` for one checkpoint file per rank.
+- `state_dict_type` controls the checkpoint format. Defaults to `"FULL_STATE_DICT"` for a single Transformers-compatible checkpoint. Use `"SHARDED_STATE_DICT"` for one checkpoint file per rank.
 
 - `cpu_ram_efficient_loading` loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM.
 

From 6f26fe4ab4bab382fca60387dc9610646bf681c0 Mon Sep 17 00:00:00 2001
From: stevhliu <steven.liu@huggingface.co>
Date: Tue, 23 Jun 2026 13:58:45 -0700
Subject: [PATCH 7/7] update

---
 docs/source/en/accelerate.md            | 34 ++++++++++++-------------
 docs/source/en/accelerator_selection.md | 18 ++++++-------
 docs/source/en/ddp.md                   | 12 ++++-----
 docs/source/en/fsdp.md                  | 17 +++++--------
 4 files changed, 38 insertions(+), 43 deletions(-)

diff --git a/docs/source/en/accelerate.md b/docs/source/en/accelerate.md
index 2b6737302de9..51e26308c823 100644
--- a/docs/source/en/accelerate.md
+++ b/docs/source/en/accelerate.md
@@ -49,6 +49,20 @@ Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_refere
 accelerate launch train.py
 ```
 
+The [accelerator_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config) accepts settings that don't have dedicated top-level arguments. For example, set `non_blocking=True` together with [`~TrainingArguments.dataloader_pin_memory`] to overlap data transfer with compute for higher GPU throughput.
+
+```py
+from transformers import TrainingArguments
+
+TrainingArguments(
+    ...,
+    dataloader_pin_memory=True,
+    accelerator_config={
+        "non_blocking": True,
+    },
+)
+```
+
 ## TrainingArguments
 
 Pass a backend-specific config to [`TrainingArguments`]. The [`~Trainer.create_accelerator_and_postprocess`] method reads the settings and configures training.
@@ -56,7 +70,7 @@ Pass a backend-specific config to [`TrainingArguments`]. The [`~Trainer.create_a
 <hfoptions id="backend">
 <hfoption id="FSDP">
 
-Pass a JSON config file or dict to [`fsdp_config`]. See [FSDP](./fsdp) for a full guide and config reference.
+Pass a JSON config file or dict to [`~TrainingArguments.fsdp_config`]. See [FSDP](./fsdp) for a full guide and config reference.
 
 ```py
 from transformers import TrainingArguments
@@ -71,7 +85,7 @@ TrainingArguments(
 </hfoption>
 <hfoption id="DeepSpeed">
 
-Pass a JSON config file or dict to [`deepspeed`]. See [DeepSpeed](./deepspeed) for a full guide and config reference.
+Pass a JSON config file or dict to [`~TrainingArguments.deepspeed`]. See [DeepSpeed](./deepspeed) for a full guide and config reference.
 
 ```py
 from transformers import TrainingArguments
@@ -102,22 +116,6 @@ TrainingArguments(
 </hfoption>
 </hfoptions>
 
-## Accelerate training settings
-
-The [accelerator_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config) accepts settings that don't have dedicated top-level arguments. For example, set `non_blocking=True` together with [`~TrainingArguments.dataloader_pin_memory`] to overlap data transfer with compute for higher GPU throughput.
-
-```py
-from transformers import TrainingArguments
-
-TrainingArguments(
-    ...,
-    dataloader_pin_memory=True,
-    accelerator_config={
-        "non_blocking": True,
-    },
-)
-```
-
 ## Next steps
 
 - See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
diff --git a/docs/source/en/accelerator_selection.md b/docs/source/en/accelerator_selection.md
index 44afc50032ca..425a34f16e7f 100644
--- a/docs/source/en/accelerator_selection.md
+++ b/docs/source/en/accelerator_selection.md
@@ -23,26 +23,26 @@ You can control which accelerators (CUDA, XPU, MPS, HPU, etc.) PyTorch sees and
 Use the hardware-specific environment variable to select accelerators and set their order. Set it on the command line per run, or add it to `~/.bashrc` or another startup config file.
 
 > [!WARNING]
-> Avoid exporting environment variables because if you forget a previously exported value, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run.
+> Avoid exporting environment variables. If you forget how a variable was set, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run.
 
 For example, to select accelerators 0 and 2 out of four:
 
 <hfoptions id="accelerator-type">
 <hfoption id="CUDA">
 
-```bash
+```cli
 CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
 ```
 
 PyTorch sees only GPUs 0 and 2, which are mapped to `cuda:0` and `cuda:1`. To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
 
-```bash
+```cli
 CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
 ```
 
 To run without any GPUs:
 
-```bash
+```cli
 CUDA_VISIBLE_DEVICES= python trainer-program.py ...
 ```
 
@@ -50,32 +50,32 @@ Control the order of CUDA devices with `CUDA_DEVICE_ORDER`.
 
 - Order by PCIe bus ID (matches `nvidia-smi`):
 
-    ```bash
+    ```cli
     export CUDA_DEVICE_ORDER=PCI_BUS_ID
     ```
 
 - Order by compute capability (fastest first):
 
-    ```bash
+    ```cli
     export CUDA_DEVICE_ORDER=FASTEST_FIRST
     ```
 
 </hfoption>
 <hfoption id="Intel XPU">
 
-```bash
+```cli
 ZE_AFFINITY_MASK=0,2 torchrun trainer-program.py ...
 ```
 
 PyTorch sees only XPUs 0 and 2, which are mapped to `xpu:0` and `xpu:1`. To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
 
-```bash
+```cli
 ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
 ```
 
 Control the order of Intel XPUs with:
 
-```bash
+```cli
 export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
 ```
 
diff --git a/docs/source/en/ddp.md b/docs/source/en/ddp.md
index ce4832904ea1..7154269ac2a2 100644
--- a/docs/source/en/ddp.md
+++ b/docs/source/en/ddp.md
@@ -16,7 +16,7 @@ rendered properly in your Markdown viewer.
 
 # DDP
 
-[DistributedDataParallel (DDP)](https://docs.pytorch.org/tutorials/beginner/ddp_series_theory.html) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs. The all-reduce runs on the final micro-batch. [`Trainer`] skips the all-reduce on intermediate gradient accumulation steps, keeping all GPUs in sync after every update. Use DDP when your model fits on a single GPU.
+[DistributedDataParallel (DDP)](https://docs.pytorch.org/tutorials/beginner/ddp_series_theory.html) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.
 
 ```text
                          ┌─────────────────┐
@@ -54,11 +54,11 @@ accelerate launch --num_processes 4 train.py
 
 Pass these [`TrainingArguments`] to control DDP behavior.
 
-- [`~TrainingArguments.gradient_accumulation_steps`] determines when to perform the all-reduce. For example, with `gradient_accumulation_steps=4`, the all-reduce runs every 4 backward passes. This is a general [`TrainingArguments`] setting that interacts with DDP.
-- [`ddp_find_unused_parameters`] searches the full graph at the *start* of the backward pass for parameters that won't receive a gradient and marks them as ready so they don't block the all-reduce. Don't use with [`~TrainingArguments.gradient_checkpointing`] because gradient checkpointing discards intermediate activations and recomputes them on the fly.
-- [`ddp_bucket_cap_mb`] is the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.
-- [`ddp_broadcast_buffers`] synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don't use with [`~TrainingArguments.gradient_checkpointing`].
-- [`ddp_backend`] sets the communication backend. Use `"nccl"` for NVIDIA GPUs (default and fastest), `"gloo"` for CPU training or debugging, and `"xccl"`, `"hccl"`, or `"cncl"` for other hardware.
+- [`~TrainingArguments.gradient_accumulation_steps`] determines when to perform the all-reduce. [`Trainer`] skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, with `gradient_accumulation_steps=4`, the all-reduce runs every 4 backward passes.
+- [`~TrainingArguments.ddp_find_unused_parameters`] traverses the autograd graph at the end of the forward pass to find parameters that won't receive a gradient and marks them as ready so they don't block the all-reduce. Don't use with [`~TrainingArguments.gradient_checkpointing`] because gradient checkpointing discards intermediate activations and recomputes them on the fly.
+- [`~TrainingArguments.ddp_bucket_cap_mb`] is the bucket size for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead.
+- [`~TrainingArguments.ddp_broadcast_buffers`] synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable if your model only uses LayerNorm. Don't use with [`~TrainingArguments.gradient_checkpointing`].
+- [`~TrainingArguments.ddp_backend`] sets the communication backend. Use `"nccl"` for NVIDIA GPUs (default and fastest), `"gloo"` for CPU training or debugging, and `"xccl"`, `"hccl"`, or `"cncl"` for other hardware.
 - [`~TrainingArguments.ddp_timeout`] sets the time limit for all processes and operations (all-reduce, broadcast) to complete. If a process hangs, like when loading a large model slowly, the timeout raises an error instead of blocking indefinitely.
 
 ```py
diff --git a/docs/source/en/fsdp.md b/docs/source/en/fsdp.md
index a679145fa77c..4b9314fe25ef 100644
--- a/docs/source/en/fsdp.md
+++ b/docs/source/en/fsdp.md
@@ -50,7 +50,7 @@ rendered properly in your Markdown viewer.
 
 ## Sharding strategies
 
-FSDP2 controls sharding with [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config). Set `fsdp=True` to enable FSDP, and set `reshard_after_forward` in the FSDP config to choose the memory and throughput tradeoff.
+FSDP2 controls sharding with [`~TrainingArguments.fsdp_config`]. Set `fsdp=True` to enable FSDP, and set `reshard_after_forward` in the FSDP config to choose the memory and throughput tradeoff.
 
 | `reshard_after_forward` | behavior |
 |---|---|
@@ -61,34 +61,30 @@ FSDP2 controls sharding with [fsdp_config](https://huggingface.co/docs/transform
 
 ## Configure FSDP
 
-These fields control how FSDP2 wraps, shards, and loads the model.
-
-- `reshard_after_forward` determines whether to reshard parameters after the forward pass. The default `true` saves more memory. Set it to `false` to keep parameters gathered between the forward and backward passes and reduce communication at the cost of higher peak memory.
+These fields control how FSDP2 wraps, shards, and loads the model. `reshard_after_forward` and `auto_wrap_policy` are covered in [Sharding strategies](#sharding-strategies).
 
 - `cpu_offload` offloads parameters and gradients to CPU when they aren't in use to save GPU memory.
 
-- `auto_wrap_policy` determines how modules are wrapped into FSDP units. Defaults to `"TRANSFORMER_BASED_WRAP"`, which wraps transformer layers. Use `"SIZE_BASED_WRAP"` for modules above a parameter-count threshold, or `"NO_WRAP"` to disable auto wrapping.
-
 - `transformer_layer_cls_to_wrap` defines the transformer layer to wrap into an FSDP unit when `auto_wrap_policy` is `"TRANSFORMER_BASED_WRAP"`. Each unit manages its own gather and scatter ops. Only the current unit's parameters are gathered during the forward pass. The previous units' parameters are released to save memory.
 
   Wrapping only the top-level model yields no GPU memory savings. Wrapping every individual `Linear` layer makes inter-unit communication very expensive. Leave this field empty and FSDP reads the value from the model definition.
 
 - `min_num_params` sets the minimum number of parameters per module for size-based wrapping. It is only used when `auto_wrap_policy` is `"SIZE_BASED_WRAP"`.
 
-- `state_dict_type` controls the checkpoint format. Defaults to `"FULL_STATE_DICT"` for a single Transformers-compatible checkpoint. Use `"SHARDED_STATE_DICT"` for one checkpoint file per rank.
+- `state_dict_type` controls the checkpoint format. Defaults to `"FULL_STATE_DICT"` for a single Transformers-compatible checkpoint. Use `"SHARDED_STATE_DICT"` for one checkpoint file per rank, which is faster to save for large models. Sharded checkpoints only load back into FSDP, so save a `"FULL_STATE_DICT"` for the final checkpoint you want to share or load outside FSDP.
 
 - `cpu_ram_efficient_loading` loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM.
 
 - `activation_checkpointing` recomputes activations during the backward pass instead of storing them. Use this instead of [gradient checkpointing](./grad_checkpointing) in [`TrainingArguments`]. Setting both raises an error.
 
-Configure FSDP training with either an [Accelerate config file](./accelerate#accelerate-config-file) or an FSDP config file passed to [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config).
+Configure FSDP training with either an [Accelerate config file](./accelerate#accelerate-config-file) or an FSDP config file passed to `fsdp_config`.
 
 <hfoptions id="launch">
 <hfoption id="Accelerate config file">
 
 Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config) command and answer questions about your hardware and training setup. This creates a `default_config.yaml` file in your cache.
 
-Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script. The [`fsdp_config`] is unnecessary because the Accelerate config file covers the same settings.
+Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script. The `fsdp_config` argument is unnecessary here because the Accelerate config file covers the same settings.
 
 ```cli
 accelerate launch train.py
@@ -110,7 +106,7 @@ accelerate launch train.py
 }
 ```
 
-Set `fsdp=True` and pass the FSDP config file to [`fsdp_config`].
+Set `fsdp=True` and pass the FSDP config file to `fsdp_config`.
 
 ```py
 from transformers import TrainingArguments
@@ -129,4 +125,5 @@ TrainingArguments(
 
 - See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
 - See [DeepSpeed](./deepspeed) for ZeRO optimization and NVMe offloading.
+- For FSDP on TPUs with PyTorch/XLA, set `xla`, `xla_fsdp_settings`, and `xla_fsdp_grad_ckpt` in [`~TrainingArguments.fsdp_config`].
 - Read the [FSDP chapter](https://nanotron-ultrascale-playbook.static.hf.space/index.html#zero-3:_adding_parameter_partitioning_(fsdp)) from The Ultra-Scale Playbook for more information about how FSDP works.