diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
index 340a37ff37d2..858da6b58d05 100644
--- a/docs/source/en/_toctree.yml
+++ b/docs/source/en/_toctree.yml
@@ -219,8 +219,10 @@
title: Accelerator selection
- local: accelerate
title: Accelerate
+ - local: ddp
+ title: DDP
- local: fsdp
- title: FullyShardedDataParallel
+ title: FSDP2
- local: deepspeed
title: DeepSpeed ZeRO
- local: deepspeed_alst
@@ -243,7 +245,7 @@
- local: perf_hardware
title: Building a GPU workstation
- local: model_memory_anatomy
- title: Model training anatomy
+ title: GPU memory usage
title: Hardware
title: Training
- isExpanded: false
diff --git a/docs/source/en/accelerate.md b/docs/source/en/accelerate.md
index a18436889e03..51e26308c823 100644
--- a/docs/source/en/accelerate.md
+++ b/docs/source/en/accelerate.md
@@ -16,150 +16,108 @@ rendered properly in your Markdown viewer.
# Accelerate
-[Accelerate](https://hf.co/docs/accelerate/index) is a library designed to simplify distributed training on any type of setup with PyTorch by uniting the most common frameworks ([Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) and [DeepSpeed](https://www.deepspeed.ai/)) for it into a single interface. [`Trainer`] is powered by Accelerate under the hood, enabling loading big models and distributed training.
+[Accelerate](https://hf.co/docs/accelerate/index) provides a unified interface for distributed training backends like [FSDP](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) or [DeepSpeed](https://www.deepspeed.ai/). It detects your environment (number of GPUs, distributed backend, mixed precision, and so on) and automatically configures training, whether you're on a single GPU or 8 GPUs with FSDP.
-This guide will show you two ways to use Accelerate with Transformers, using FSDP as the backend. The first method demonstrates distributed training with [`Trainer`], and the second method demonstrates adapting a PyTorch training loop. For more detailed information about Accelerate, please refer to the [documentation](https://hf.co/docs/accelerate/index).
+Accelerate wraps the model in the appropriate distributed wrapper, moves it to the correct device, and creates a compatible optimizer. During training, Accelerate uses its own [`~accelerate.Accelerator.backward`] method to handle gradient scaling for mixed precision. [`Trainer`] calls the appropriate Accelerate APIs and delegates all distributed mechanics to Accelerate.
-```bash
-pip install accelerate
-```
-
-Start by running [accelerate config](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-config) in the command line to answer a series of prompts about your training system. This creates and saves a configuration file to help Accelerate correctly set up training based on your setup.
+Configure Accelerate for [`Trainer`] with either an Accelerate config file or [`TrainingArguments`].
-```bash
-accelerate config
-```
+## Accelerate config file
-Depending on your setup and the answers you provide, an example configuration file for distributing training with FSDP on one machine with two GPUs may look like the following.
+Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config) command and answer questions about your hardware and training setup. This creates a `default_config.yaml` file in your cache. The example below is for FSDP.
```yaml
compute_environment: LOCAL_MACHINE
-debug: false
distributed_type: FSDP
-downcast_bf16: 'no'
fsdp_config:
+ fsdp_version: 2
+ fsdp_reshard_after_forward: true
+ fsdp_cpu_offload: false
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
- fsdp_backward_prefetch_policy: BACKWARD_PRE
- fsdp_forward_prefetch: false
fsdp_cpu_ram_efficient_loading: true
- fsdp_offload_params: false
- fsdp_sharding_strategy: FULL_SHARD
+ fsdp_activation_checkpointing: false
fsdp_state_dict_type: SHARDED_STATE_DICT
- fsdp_sync_module_states: true
- fsdp_transformer_layer_cls_to_wrap: BertLayer
- fsdp_use_orig_params: true
-machine_rank: 0
-main_training_function: main
+ fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
mixed_precision: bf16
num_machines: 1
-num_processes: 2
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
+num_processes: 4
```
-## Trainer
+Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script, and Accelerate reads the config file to set up training. The [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) and [deepspeed](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.deepspeed) arguments are unnecessary because the Accelerate config file covers the same settings.
+
+```cli
+accelerate launch train.py
+```
-Pass the path to the saved configuration file to [`TrainingArguments`], and from there, pass your [`TrainingArguments`] to [`Trainer`].
+The [accelerator_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config) argument accepts settings that don't have dedicated top-level arguments. For example, set `non_blocking=True` together with [`~TrainingArguments.dataloader_pin_memory`] to overlap host-to-device data transfer with compute for higher GPU throughput.
```py
-from transformers import TrainingArguments, Trainer
-
-training_args = TrainingArguments(
- output_dir="your-model",
- learning_rate=2e-5,
- per_device_train_batch_size=16,
- per_device_eval_batch_size=16,
- num_train_epochs=2,
- fsdp_config="path/to/fsdp_config",
- fsdp="full_shard",
- weight_decay=0.01,
- eval_strategy="epoch",
- save_strategy="epoch",
- load_best_model_at_end=True,
- push_to_hub=True,
+from transformers import TrainingArguments
+
+TrainingArguments(
+ ...,
+ dataloader_pin_memory=True,
+ accelerator_config={
+ "non_blocking": True,
+ },
)
+```
-trainer = Trainer(
- model=model,
- args=training_args,
- train_dataset=dataset["train"],
- eval_dataset=dataset["test"],
- processing_class=tokenizer,
- data_collator=data_collator,
- compute_metrics=compute_metrics,
-)
+## TrainingArguments
-trainer.train()
-```
+Pass a backend-specific config to [`TrainingArguments`]. The [`~Trainer.create_accelerator_and_postprocess`] method reads the settings and configures training.
-## Native PyTorch
+
+
-Accelerate can also be added to any PyTorch training loop to enable distributed training. The [`~accelerate.Accelerator`] is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don't need to explicitly place your model on a device because [`~accelerate.Accelerator`] knows which device to move your model to.
+Pass a JSON config file or dict to [`~TrainingArguments.fsdp_config`]. See [FSDP](./fsdp) for a full guide and config reference.
```py
-from accelerate import Accelerator
+from transformers import TrainingArguments
-accelerator = Accelerator()
-device = accelerator.device
+TrainingArguments(
+ ...,
+ fsdp=True,
+ fsdp_config="path/to/fsdp.json",
+)
```
-All PyTorch objects (model, optimizer, scheduler, dataloaders) should be passed to the [`~accelerate.Accelerator.prepare`] method now. This method moves your model to the appropriate device or devices, adapts the optimizer and scheduler to use [`~accelerate.optimizer.AcceleratedOptimizer`] and [`~accelerate.scheduler.AcceleratedScheduler`], and creates a new shardable dataloader.
+
+
+
+Pass a JSON config file or dict to [`~TrainingArguments.deepspeed`]. See [DeepSpeed](./deepspeed) for a full guide and config reference.
```py
-train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
- train_dataloader, eval_dataloader, model, optimizer
+from transformers import TrainingArguments
+
+TrainingArguments(
+ ...,
+ deepspeed="path/to/ds_config.json",
)
```
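+
+As a rough sketch, a minimal `ds_config.json` for ZeRO stage 2 could look like the fragment below. The stage and the choice of fields are illustrative assumptions rather than a recommended setup. The `"auto"` values ask [`Trainer`] to fill them in from the matching [`TrainingArguments`].
+
+```json
+{
+  "zero_optimization": {
+    "stage": 2
+  },
+  "bf16": {
+    "enabled": "auto"
+  },
+  "train_micro_batch_size_per_gpu": "auto",
+  "gradient_accumulation_steps": "auto"
+}
+```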
-Replace `loss.backward` in your training loop with Accelerates [`~accelerate.Accelerator.backward`] method to scale the gradients and determine the appropriate `backward` method to use depending on your framework (for example, DeepSpeed or Megatron).
-
-```py
-for epoch in range(num_epochs):
- for batch in train_dataloader:
- outputs = model(**batch)
- loss = outputs.loss
- accelerator.backward(loss)
- optimizer.step()
- lr_scheduler.step()
- optimizer.zero_grad()
- progress_bar.update(1)
-```
+
+
-Combine everything into a function and make it callable as a script.
+DDP is configured directly through [`TrainingArguments`] fields. See [DDP](./ddp) for details.
```py
-from accelerate import Accelerator
-
-def main():
- accelerator = Accelerator()
-
- model, optimizer, training_dataloader, scheduler = accelerator.prepare(
- model, optimizer, training_dataloader, scheduler
- )
-
- for batch in training_dataloader:
- optimizer.zero_grad()
- inputs, targets = batch
- outputs = model(inputs)
- loss = loss_function(outputs, targets)
- accelerator.backward(loss)
- optimizer.step()
- scheduler.step()
-
-if __name__ == "__main__":
- main()
+from transformers import TrainingArguments
+
+TrainingArguments(
+ ...,
+ ddp_backend="nccl",
+ ddp_find_unused_parameters=False,
+ ddp_bucket_cap_mb=25,
+ ddp_timeout=1800,
+)
```
-From the command line, call [accelerate launch](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-launch) to run your training script. Any additional arguments or parameters can be passed here as well.
+
+
-To launch your training script on two GPUs, add the `--num_processes` argument.
-
-```bash
-accelerate launch --num_processes=2 your_script.py
-```
+## Next steps
-Refer to the [Launching Accelerate scripts](https://hf.co/docs/accelerate/main/en/basic_tutorials/launch) for more details.
+- See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
+- See [FSDP](./fsdp) for sharding parameters, gradients, and optimizer states across GPUs.
+- See [DeepSpeed](./deepspeed) for ZeRO optimization and offloading.
diff --git a/docs/source/en/accelerator_selection.md b/docs/source/en/accelerator_selection.md
index 8b2e315706ad..425a34f16e7f 100644
--- a/docs/source/en/accelerator_selection.md
+++ b/docs/source/en/accelerator_selection.md
@@ -16,108 +16,70 @@ rendered properly in your Markdown viewer.
# Accelerator selection
-During distributed training, you can specify the number and order of accelerators (CUDA, XPU, MPS, HPU, etc.) to use. This can be useful when you have accelerators with different computing power and you want to use the faster accelerator first. Or you could only use a subset of the available accelerators. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).
-
-This guide will show you how to select the number of accelerators to use and the order to use them in.
-
-## Number of accelerators
-
-For example, if there are 4 accelerators and you only want to use the first 2, run the command below.
-
-
-
-
-Use the `--nproc_per_node` to select how many accelerators to use.
-
-```bash
-torchrun --nproc_per_node=2 trainer-program.py ...
-```
-
-
-
-
-Use `--num_processes` to select how many accelerators to use.
-
-```bash
-accelerate launch --num_processes 2 trainer-program.py ...
-```
-
-
-
-
-Use `--num_gpus` to select how many GPUs to use.
-
-```bash
-deepspeed --num_gpus 2 trainer-program.py ...
-```
-
-
-
+You can control which accelerators (CUDA, XPU, MPS, HPU, etc.) PyTorch sees and in what order during distributed training. Prioritize faster devices or limit training to a subset of available hardware. It works with both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html), and doesn't require Accelerate or the [DeepSpeed integration](./main_classes/deepspeed).
## Order of accelerators
-To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
+Use the hardware-specific environment variable to select accelerators and set their order. Set it on the command line per run, or add it to `~/.bashrc` or another startup config file.
+
+> [!WARNING]
+> Avoid exporting these environment variables. An exported value persists for the rest of the shell session, so if you forget it is set, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run instead.
-For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
+For example, to select accelerators 0 and 2 out of four:
-```bash
+```cli
CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
```
-Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.
-To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
+PyTorch sees only GPUs 0 and 2, which are mapped to `cuda:0` and `cuda:1` respectively. To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
-```bash
+```cli
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```
To run without any GPUs:
-```bash
+```cli
CUDA_VISIBLE_DEVICES= python trainer-program.py ...
```
-You can also control the order of CUDA devices using `CUDA_DEVICE_ORDER`:
+Control the order of CUDA devices with `CUDA_DEVICE_ORDER`.
- Order by PCIe bus ID (matches `nvidia-smi`):
- ```bash
+ ```cli
export CUDA_DEVICE_ORDER=PCI_BUS_ID
```
- Order by compute capability (fastest first):
- ```bash
+ ```cli
export CUDA_DEVICE_ORDER=FASTEST_FIRST
```
-```bash
+```cli
ZE_AFFINITY_MASK=0,2 torchrun trainer-program.py ...
```
-Only XPUs 0 and 2 are "visible" to PyTorch and are mapped to `xpu:0` and `xpu:1` respectively.
-To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
+PyTorch sees only XPUs 0 and 2, which are mapped to `xpu:0` and `xpu:1` respectively. To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
-```bash
+```cli
ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
```
-You can also control the order of Intel XPUs with:
+Control the order of Intel XPUs with:
-```bash
+```cli
export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
```
-For more information about device enumeration and sorting on Intel XPU, please refer to the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
+For more on device enumeration and sorting on Intel XPU, see the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
-
-> [!WARNING]
-> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.
diff --git a/docs/source/en/ddp.md b/docs/source/en/ddp.md
new file mode 100644
index 000000000000..7154269ac2a2
--- /dev/null
+++ b/docs/source/en/ddp.md
@@ -0,0 +1,82 @@
+
+
+# DDP
+
+[DistributedDataParallel (DDP)](https://docs.pytorch.org/tutorials/beginner/ddp_series_theory.html) maintains a full copy of a model on each GPU. Each GPU processes a non-overlapping shard of data with a forward and backward pass. Before the optimizer step, an all-reduce averages gradients across all GPUs so every model copy stays identical. Use DDP when your model fits on a single GPU.
+
+```text
+ ┌─────────────────┐
+ │ training data │
+ └────────┬────────┘
+ ┌──────────────────┼──────────────────┐
+ │ shard 0 │ shard 1 │ shard 2
+ ▼ ▼ ▼
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+ │ model │ │ model │ │ model │
+ │ (copy 0) │ │ (copy 1) │ │ (copy 2) │
+ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │
+ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
+ │ grads │ grads │ grads
+ └──────────────────┼──────────────────┘
+ all-reduce
+ (average gradients)
+ ┌──────────────────┼──────────────────┐
+ ▼ ▼ ▼
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+ │ optimizer │ │ optimizer │ │ optimizer │
+ │ step │ │ step │ │ step │
+ └─────────────┘ └─────────────┘ └─────────────┘
+ (identical) (identical) (identical)
+```
+
+DDP activates automatically when you launch with a multi-process launcher like [Accelerate](./accelerate).
+
+```cli
+# 4 GPUs on one machine
+accelerate launch --num_processes 4 train.py
+```
+
+## Configure DDP
+
+Pass these [`TrainingArguments`] to control DDP behavior.
+
+- [`~TrainingArguments.gradient_accumulation_steps`] determines when to perform the all-reduce. [`Trainer`] skips the all-reduce on intermediate accumulation steps and runs it only on the final micro-batch. For example, with `gradient_accumulation_steps=4`, the all-reduce runs every 4 backward passes.
+- [`~TrainingArguments.ddp_find_unused_parameters`] makes DDP traverse the autograd graph at the end of the forward pass to find parameters that won't receive a gradient and mark them as ready so they don't block the all-reduce. The traversal adds overhead, so enable it only when some parameters go unused. Don't use it with [`~TrainingArguments.gradient_checkpointing`] because gradient checkpointing discards intermediate activations and recomputes them during the backward pass, which can conflict with how DDP tracks ready parameters.
+- [`~TrainingArguments.ddp_bucket_cap_mb`] sets the bucket size, in MB, for batching gradients into a single all-reduce during the backward pass. A larger bucket means fewer all-reduce calls and less launch overhead, but less overlap between communication and the backward computation.
+- [`~TrainingArguments.ddp_broadcast_buffers`] synchronizes model buffers (such as BatchNorm running statistics) from rank 0 to all other ranks at the start of every forward pass. Disable it if your model only uses LayerNorm, which has no buffers to synchronize. Don't use it with [`~TrainingArguments.gradient_checkpointing`].
+- [`~TrainingArguments.ddp_backend`] sets the communication backend. Use `"nccl"` for NVIDIA GPUs (default and fastest), `"gloo"` for CPU training or debugging, and `"xccl"`, `"hccl"`, or `"cncl"` for other hardware.
+- [`~TrainingArguments.ddp_timeout`] sets the time limit, in seconds, for distributed operations (all-reduce, broadcast) to complete. If a process hangs or runs slowly, such as when loading a large model, the timeout raises an error instead of blocking indefinitely.
+
+```py
+from transformers import TrainingArguments
+
+args = TrainingArguments(
+ ...,
+ gradient_accumulation_steps=4,
+ ddp_backend="nccl",
+ ddp_find_unused_parameters=False,
+ ddp_bucket_cap_mb=25,
+ ddp_broadcast_buffers=True,
+ ddp_timeout=1800,
+)
+```
+
+## Next steps
+
+- See [FSDP](./fsdp) for training models too large to fit on a single GPU.
+- See [DeepSpeed](./deepspeed) for ZeRO optimization and offloading.
+- Read the [Data Parallelism](https://nanotron-ultrascale-playbook.static.hf.space/index.html#data_parallelism) chapter from The Ultra-Scale Playbook for more information about how DDP works.
diff --git a/docs/source/en/fsdp.md b/docs/source/en/fsdp.md
index 944c5a18e109..4b9314fe25ef 100644
--- a/docs/source/en/fsdp.md
+++ b/docs/source/en/fsdp.md
@@ -14,132 +14,116 @@ rendered properly in your Markdown viewer.
-->
-# FullyShardedDataParallel
-
-[Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a [parallelism](./perf_train_gpu_many) method that combines the advantages of data and model parallelism for distributed training.
-
-Unlike [DistributedDataParallel (DDP)](./perf_train_gpu_many#distributeddataparallel), FSDP saves more memory because it doesn't replicate a model on each GPU. It shards the models parameters, gradients and optimizer states across GPUs. Each model shard processes a portion of the data and the results are synchronized to speed up training.
-
-This guide covers how to set up training a model with FSDP and [Accelerate](https://hf.co/docs/accelerate/index), a library for managing distributed training.
-
-```bash
-pip install accelerate
+# FSDP2
+
+[Fully Sharded Data Parallel (FSDP2)](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) shards model parameters, gradients, and optimizer states across GPUs. Before computation, each GPU all-gathers the full parameters it needs from all shards, then frees them afterward. Sharding lets you train models larger than a single GPU's memory, at the cost of more communication than [DDP](./ddp). Use FSDP when your model or optimizer states don't fit on a single GPU.
+
+```text
+ ┌─────────────────┐
+ │ training data │
+ └────────┬────────┘
+ ┌──────────────────┼──────────────────┐
+ │ shard 0 │ shard 1 │ shard 2
+ ▼ ▼ ▼
+ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
+ │ param │ │ param │ │ param │
+ │ shard 0 │ │ shard 1 │ │ shard 2 │
+ │ GPU 0 │ │ GPU 1 │ │ GPU 2 │
+ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
+ │ │ │
+ └──────── all-gather (params) ────────┘
+ │
+ full params on each GPU
+ │
+ ┌──────────────────┼──────────────────┐
+ ▼ ▼ ▼
+ forward forward forward
+ │ │ │
+ └───── reduce-scatter (grads) ────────┘
+ │
+ ┌──────────────────┼──────────────────┐
+ ▼ ▼ ▼
+ grad shard 0 grad shard 1 grad shard 2
+ optim shard 0 optim shard 1 optim shard 2
+ step step step
```
-## Configuration options
-
-Always start by running the [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config) command to help Accelerate set up the correct distributed training environment.
-
-```bash
-accelerate config
-```
-
-The section below discusses some of the more important FSDP configuration options. Learn more about other available options in the [fsdp_config](https://hf.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameter.
-
-### Sharding strategy
+## Sharding strategies
-FSDP offers several sharding strategies to distribute a model. Refer to the table below to help you choose the best strategy for your setup. Specify a strategy with the `fsdp_sharding_strategy` parameter in the configuration file.
+FSDP2 controls sharding with [`~TrainingArguments.fsdp_config`]. Set `fsdp=True` to enable FSDP, and set `reshard_after_forward` in the FSDP config to choose the memory and throughput tradeoff.
-| sharding strategy | description | parameter value |
-|---|---|---|
-| `FULL_SHARD` | shards model parameters, gradients, and optimizer states | `1` |
-| `SHARD_GRAD_OP` | shards gradients and optimizer states | `2` |
-| `NO_SHARD` | don't shard the model | `3` |
-| `HYBRID_SHARD` | shards model parameters, gradients, and optimizer states within each GPU | `4` |
-| `HYBRID_SHARD_ZERO2` | shards gradients and optimizer states within each GPU | `5` |
+| `reshard_after_forward` | behavior |
+|---|---|
+| `true` | reshard parameters after the forward pass to save more memory |
+| `false` | keep parameters gathered between forward and backward to skip the second all-gather in the backward pass, at the cost of higher peak memory |
-### CPU offload
+`auto_wrap_policy` controls how modules are wrapped into FSDP units. It defaults to `"TRANSFORMER_BASED_WRAP"`, which wraps the model's transformer layers. Without wrapping (`"NO_WRAP"`), the entire model is one FSDP unit and you lose the memory benefit of sharding.
-Offload model parameters and gradients when they aren't being used to the CPU to save additional GPU memory. This is useful for scenarios where a model is too large even with FSDP.
+## Configure FSDP
-Specify `fsdp_offload_params: true` in the configuration file to enable offloading.
+These fields control how FSDP2 wraps, shards, and loads the model. `reshard_after_forward` and `auto_wrap_policy` are covered in [Sharding strategies](#sharding-strategies).
-### Wrapping policy
+- `cpu_offload` offloads parameters and gradients to CPU when they aren't in use to save GPU memory.
-FSDP is applied by wrapping each layer in the network. The wrapping is usually applied in a nested way where the full weights are discarded after each forward pass to save memory for the next layer.
+- `transformer_layer_cls_to_wrap` defines the transformer layer to wrap into an FSDP unit when `auto_wrap_policy` is `"TRANSFORMER_BASED_WRAP"`. Each unit manages its own gather and scatter ops. Only the current unit's parameters are gathered during the forward pass. The previous units' parameters are released to save memory.
-There are several wrapping policies available, but the *auto wrapping* policy is the simplest and doesn't require any changes to your code. Specify `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` to wrap a Transformer layer and `fsdp_transformer_layer_cls_to_wrap` to determine which layer to wrap (for example, `BertLayer`).
+ Wrapping only the top-level model yields no GPU memory savings. Wrapping every individual `Linear` layer makes inter-unit communication very expensive. Leave this field empty and FSDP reads the value from the model definition.
-Size-based wrapping is also available. If a layer exceeds a certain number of parameters, it is wrapped. Specify `fsdp_wrap_policy: SIZED_BASED_WRAP` and `min_num_param` to set the minimum number of parameters for a layer to be wrapped.
+- `min_num_params` sets the minimum number of parameters per module for size-based wrapping. It is only used when `auto_wrap_policy` is `"SIZE_BASED_WRAP"`.
-### Checkpoints
+- `state_dict_type` controls the checkpoint format. Defaults to `"FULL_STATE_DICT"` for a single Transformers-compatible checkpoint. Use `"SHARDED_STATE_DICT"` for one checkpoint file per rank, which is faster for large models. Sharded checkpoints only load back into FSDP, so save a `"FULL_STATE_DICT"` for the final checkpoint you want to share or load outside FSDP.
-Intermediate checkpoints should be saved as a sharded state dict because saving the full state dict - even with CPU offloading - is time consuming and can cause `NCCL Timeout` errors due to indefinite hanging during broadcasting.
+- `cpu_ram_efficient_loading` loads the checkpoint from disk on rank 0 only. Other GPUs initialize an empty model and receive the weights by broadcast, avoiding multiple processes loading a large model into CPU RAM.
-Specify `fsdp_state_dict_type: SHARDED_STATE_DICT` in the configuration file to save the sharded state dict. Now you can resume training from the sharded state dict with [`~accelerate.Accelerator.load_state`].
+- `activation_checkpointing` recomputes activations during the backward pass instead of storing them. Use this instead of [gradient checkpointing](./grad_checkpointing) in [`TrainingArguments`]. Setting both raises an error.
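+
+As an illustration of the fields above, a size-based variant of the FSDP config might look like the following sketch. The `min_num_params` threshold of 100 million is an arbitrary assumption, not a recommended value, so tune it for your model.
+
+```json
+{
+  "version": 2,
+  "reshard_after_forward": true,
+  "auto_wrap_policy": "SIZE_BASED_WRAP",
+  "min_num_params": 100000000,
+  "state_dict_type": "SHARDED_STATE_DICT",
+  "cpu_ram_efficient_loading": true
+}
+```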
-```py
-accelerator.load_state("directory/containing/checkpoints")
-```
+Configure FSDP training with either an [Accelerate config file](./accelerate#accelerate-config-file) or an FSDP config file passed to `fsdp_config`.
-Once training is complete though, you should save the full state dict because the sharded state dict is only compatible with FSDP.
+
+
-```py
-if trainer.is_fsdp_enabled:
- trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")
-
-trainer.save_model(script_args.output_dir)
-```
+Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config) command and answer questions about your hardware and training setup. This creates a `default_config.yaml` file in your cache.
-### TPU
+Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script. The `fsdp_config` argument is unnecessary because the Accelerate config file covers the same settings.
-[PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html), a package for running PyTorch on XLA devices, enables FSDP on TPUs. Modify the configuration file to include the parameters below. Refer to the [xla_fsdp_settings](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128) parameter for additional XLA-specific parameters you can configure for FSDP.
-
-```yaml
-xla: True # must be set to True to enable PyTorch/XLA
-xla_fsdp_settings: # XLA specific FSDP parameters
-xla_fsdp_grad_ckpt: True # enable gradient checkpointing
+```cli
+accelerate launch train.py
```
-## Training
-
-After running [accelerate config](https://hf.co/docs/accelerate/package_reference/cli#accelerate-config), your configuration file should be ready. An example configuration file is shown below that fully shards the parameter, gradient and optimizer states on two GPUs. Your file may look different depending on how you set up your configuration.
-
-```yaml
-compute_environment: LOCAL_MACHINE
-debug: false
-distributed_type: FSDP
-downcast_bf16: 'no'
-fsdp_config:
- fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
- fsdp_backward_prefetch_policy: BACKWARD_PRE
- fsdp_cpu_ram_efficient_loading: true
- fsdp_forward_prefetch: false
- fsdp_offload_params: true
- fsdp_sharding_strategy: 1
- fsdp_state_dict_type: SHARDED_STATE_DICT
- fsdp_sync_module_states: true
- fsdp_transformer_layer_cls_to_wrap: BertLayer
- fsdp_use_orig_params: true
-machine_rank: 0
-main_training_function: main
-mixed_precision: bf16
-num_machines: 1
-num_processes: 2
-rdzv_backend: static
-same_network: true
-tpu_env: []
-tpu_use_cluster: false
-tpu_use_sudo: false
-use_cpu: false
+
+
+
+```json
+{
+ "version": 2,
+ "reshard_after_forward": true,
+ "cpu_offload": false,
+ "auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
+ "transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
+ "state_dict_type": "FULL_STATE_DICT",
+ "cpu_ram_efficient_loading": true,
+ "activation_checkpointing": true
+}
```
-Run the [accelerate launch](https://hf.co/docs/accelerate/package_reference/cli#accelerate-launch) command to launch a training script with the FSDP configurations you chose in the configuration file.
-
-```bash
-accelerate launch my-training-script.py
-```
+Set `fsdp=True` and pass the FSDP config file to `fsdp_config`.
-It is also possible to directly specify some of the FSDP arguments in the command line.
+```py
+from transformers import TrainingArguments
-```bash
-accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/" my-training-script.py
+TrainingArguments(
+ ...,
+ fsdp=True,
+ fsdp_config="path/to/fsdp.json",
+)
```
-## Resources
+
+
-FSDP is a powerful tool for training large models with fewer GPUs compared to other parallelism strategies. Refer to the following resources below to learn even more about FSDP.
+## Next steps
-- Follow along with the more in-depth Accelerate guide for [FSDP](https://hf.co/docs/accelerate/usage_guides/fsdp).
-- Read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) blog post.
-- Read the [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) blog post.
+- See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
+- See [DeepSpeed](./deepspeed) for ZeRO optimization and NVMe offloading.
+- For FSDP on TPUs with PyTorch/XLA, set `xla`, `xla_fsdp_settings`, and `xla_fsdp_grad_ckpt` in [`~TrainingArguments.fsdp_config`].
+- Read the [FSDP chapter](https://nanotron-ultrascale-playbook.static.hf.space/index.html#zero-3:_adding_parameter_partitioning_(fsdp)) from The Ultra-Scale Playbook for more information about how FSDP works.