6 changes: 4 additions & 2 deletions docs/source/en/_toctree.yml
@@ -219,8 +219,10 @@
title: Accelerator selection
- local: accelerate
title: Accelerate
- local: ddp
title: DDP
- local: fsdp
title: FullyShardedDataParallel
title: FSDP2
- local: deepspeed
title: DeepSpeed ZeRO
- local: deepspeed_alst
@@ -243,7 +245,7 @@
- local: perf_hardware
title: Building a GPU workstation
- local: model_memory_anatomy
title: Model training anatomy
title: GPU memory usage
title: Hardware
title: Training
- isExpanded: false
170 changes: 64 additions & 106 deletions docs/source/en/accelerate.md
@@ -16,150 +16,108 @@ rendered properly in your Markdown viewer.

# Accelerate

[Accelerate](https://hf.co/docs/accelerate/index) is a library designed to simplify distributed training on any type of setup with PyTorch by uniting the most common frameworks ([Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) and [DeepSpeed](https://www.deepspeed.ai/)) for it into a single interface. [`Trainer`] is powered by Accelerate under the hood, enabling loading big models and distributed training.
[Accelerate](https://hf.co/docs/accelerate/index) provides a unified interface for distributed training backends like [FSDP](https://docs.pytorch.org/tutorials/intermediate/FSDP_tutorial.html) or [DeepSpeed](https://www.deepspeed.ai/). It detects your environment (number of GPUs, distributed backend, mixed precision, etc.) and automatically configures training, whether you're on 1 GPU with DDP or 8 GPUs with FSDP.

This guide will show you two ways to use Accelerate with Transformers, using FSDP as the backend. The first method demonstrates distributed training with [`Trainer`], and the second method demonstrates adapting a PyTorch training loop. For more detailed information about Accelerate, please refer to the [documentation](https://hf.co/docs/accelerate/index).
Accelerate wraps the model in the appropriate distributed wrapper, moves it to the correct device, and creates a compatible optimizer. During training, Accelerate uses its own [`~accelerate.Accelerator.backward`] method to handle gradient scaling for mixed precision. [`Trainer`] calls the appropriate Accelerate APIs and delegates all distributed mechanics to Accelerate.

```bash
pip install accelerate
```

Start by running [accelerate config](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-config) in the command line to answer a series of prompts about your training system. This creates and saves a configuration file to help Accelerate correctly set up training based on your setup.
Configure Accelerate for [`Trainer`] with either an Accelerate config file or [`TrainingArguments`].

```bash
accelerate config
```
## Accelerate config file

Depending on your setup and the answers you provide, an example configuration file for distributing training with FSDP on one machine with two GPUs may look like the following.
Run the [accelerate config](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-config) command and answer questions about your hardware and training setup. This creates a `default_config.yaml` file in your cache. The example below is for FSDP.

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
fsdp_version: 2
fsdp_reshard_after_forward: true
fsdp_cpu_offload: false
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch_policy: BACKWARD_PRE
fsdp_forward_prefetch: false
fsdp_cpu_ram_efficient_loading: true
fsdp_offload_params: false
fsdp_sharding_strategy: FULL_SHARD
fsdp_activation_checkpointing: false
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_transformer_layer_cls_to_wrap: BertLayer
fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
num_processes: 4
```

## Trainer
Run [accelerate launch](https://huggingface.co/docs/accelerate/en/package_reference/cli#accelerate-launch) with a [`Trainer`]-based script, and Accelerate reads the config file to set up training. The [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) and [deepspeed](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.deepspeed) args are unnecessary because the Accelerate config file covers the same settings.

```cli
accelerate launch train.py
```

Pass the path to the saved configuration file to [`TrainingArguments`], and from there, pass your [`TrainingArguments`] to [`Trainer`].
The [accelerator_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.accelerator_config) accepts settings that don't have dedicated top-level arguments. For example, set `non_blocking=True` together with [`~TrainingArguments.dataloader_pin_memory`] to overlap data transfer with compute for higher GPU throughput.

```py
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
output_dir="your-model",
learning_rate=2e-5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
num_train_epochs=2,
fsdp_config="path/to/fsdp_config",
fsdp="full_shard",
weight_decay=0.01,
eval_strategy="epoch",
save_strategy="epoch",
load_best_model_at_end=True,
push_to_hub=True,
from transformers import TrainingArguments

TrainingArguments(
...,
dataloader_pin_memory=True,
accelerator_config={
"non_blocking": True,
},
)
```

trainer = Trainer(
model=model,
args=training_args,
train_dataset=dataset["train"],
eval_dataset=dataset["test"],
processing_class=tokenizer,
data_collator=data_collator,
compute_metrics=compute_metrics,
)
## TrainingArguments

trainer.train()
```
Pass a backend-specific config to [`TrainingArguments`]. The [`~Trainer.create_accelerator_and_postprocess`] method reads the settings and configures training.

## Native PyTorch
<hfoptions id="backend">
<hfoption id="FSDP">

Accelerate can also be added to any PyTorch training loop to enable distributed training. The [`~accelerate.Accelerator`] is the main entry point for adapting your PyTorch code to work with Accelerate. It automatically detects your distributed training setup and initializes all the necessary components for training. You don't need to explicitly place your model on a device because [`~accelerate.Accelerator`] knows which device to move your model to.
Pass a JSON config file or dict to [`~TrainingArguments.fsdp_config`]. See [FSDP](./fsdp) for a full guide and config reference.

```py
from accelerate import Accelerator
from transformers import TrainingArguments

accelerator = Accelerator()
device = accelerator.device
TrainingArguments(
...,
fsdp=True,
fsdp_config="path/to/fsdp.json",
)
```
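
As a minimal sketch of producing the JSON file referenced above, the snippet below writes a settings dict to `fsdp.json` with the standard library. The keys are copied from the Accelerate YAML example earlier in this guide, and treating them as valid `fsdp_config` keys is an assumption; check the [FSDP](./fsdp) guide for the accepted names. The temporary directory is only a stand-in for a real project path.

```py
import json
import tempfile
from pathlib import Path

# Keys copied from the Accelerate YAML example in this guide.
# Assumption: verify against the FSDP guide which names fsdp_config accepts.
fsdp_settings = {
    "fsdp_version": 2,
    "fsdp_reshard_after_forward": True,
    "fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP",
    "fsdp_transformer_layer_cls_to_wrap": "LlamaDecoderLayer",
    "fsdp_cpu_ram_efficient_loading": True,
}

# Write the settings to a JSON file (temporary directory used for illustration).
config_path = Path(tempfile.mkdtemp()) / "fsdp.json"
config_path.write_text(json.dumps(fsdp_settings, indent=2))

# Read the file back to confirm it round-trips unchanged.
loaded = json.loads(config_path.read_text())
print(loaded == fsdp_settings)
```

Either `str(config_path)` or the `fsdp_settings` dict itself could then be passed as `fsdp_config`, since the text above says both forms are accepted.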

All PyTorch objects (model, optimizer, scheduler, dataloaders) should be passed to the [`~accelerate.Accelerator.prepare`] method now. This method moves your model to the appropriate device or devices, adapts the optimizer and scheduler to use [`~accelerate.optimizer.AcceleratedOptimizer`] and [`~accelerate.scheduler.AcceleratedScheduler`], and creates a new shardable dataloader.
</hfoption>
<hfoption id="DeepSpeed">

Pass a JSON config file or dict to [`~TrainingArguments.deepspeed`]. See [DeepSpeed](./deepspeed) for a full guide and config reference.

```py
train_dataloader, eval_dataloader, model, optimizer = accelerator.prepare(
train_dataloader, eval_dataloader, model, optimizer
from transformers import TrainingArguments

TrainingArguments(
...,
deepspeed="path/to/ds_config.json",
)
```

Replace `loss.backward` in your training loop with Accelerate's [`~accelerate.Accelerator.backward`] method to scale the gradients and determine the appropriate `backward` method to use depending on your framework (for example, DeepSpeed or Megatron).

```py
for epoch in range(num_epochs):
for batch in train_dataloader:
outputs = model(**batch)
loss = outputs.loss
accelerator.backward(loss)
optimizer.step()
lr_scheduler.step()
optimizer.zero_grad()
progress_bar.update(1)
```
</hfoption>
<hfoption id="DDP">

Combine everything into a function and make it callable as a script.
DDP is configured directly through [`TrainingArguments`] fields. See [DDP](./ddp) for details.

```py
from accelerate import Accelerator

def main():
accelerator = Accelerator()

model, optimizer, training_dataloader, scheduler = accelerator.prepare(
model, optimizer, training_dataloader, scheduler
)

for batch in training_dataloader:
optimizer.zero_grad()
inputs, targets = batch
outputs = model(inputs)
loss = loss_function(outputs, targets)
accelerator.backward(loss)
optimizer.step()
scheduler.step()

if __name__ == "__main__":
main()
from transformers import TrainingArguments

TrainingArguments(
...,
ddp_backend="nccl",
ddp_find_unused_parameters=False,
ddp_bucket_cap_mb=25,
ddp_timeout=1800,
)
```

From the command line, call [accelerate launch](https://hf.co/docs/accelerate/main/en/package_reference/cli#accelerate-launch) to run your training script. Any additional arguments or parameters can be passed here as well.
</hfoption>
</hfoptions>

To launch your training script on two GPUs, add the `--num_processes` argument.

```bash
accelerate launch --num_processes=2 your_script.py
```
## Next steps

Refer to the [Launching Accelerate scripts](https://hf.co/docs/accelerate/main/en/basic_tutorials/launch) for more details.
- See [DDP](./ddp) for data-parallel training when your model fits on one GPU.
- See [FSDP](./fsdp) for sharding parameters, gradients, and optimizer states across GPUs.
- See [DeepSpeed](./deepspeed) for ZeRO optimization and offloading.
76 changes: 19 additions & 57 deletions docs/source/en/accelerator_selection.md
@@ -16,108 +16,70 @@ rendered properly in your Markdown viewer.

# Accelerator selection

During distributed training, you can specify the number and order of accelerators (CUDA, XPU, MPS, HPU, etc.) to use. This can be useful when you have accelerators with different computing power and you want to use the faster accelerator first. Or you could only use a subset of the available accelerators. The selection process works for both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html). You don't need Accelerate or [DeepSpeed integration](./main_classes/deepspeed).

This guide will show you how to select the number of accelerators to use and the order to use them in.

## Number of accelerators

For example, if there are 4 accelerators and you only want to use the first 2, run the command below.

<hfoptions id="select-accelerator">
<hfoption id="torchrun">

Use `--nproc_per_node` to select how many accelerators to use.

```bash
torchrun --nproc_per_node=2 trainer-program.py ...
```

</hfoption>
<hfoption id="Accelerate">

Use `--num_processes` to select how many accelerators to use.

```bash
accelerate launch --num_processes 2 trainer-program.py ...
```

</hfoption>
<hfoption id="DeepSpeed">

Use `--num_gpus` to select how many GPUs to use.

```bash
deepspeed --num_gpus 2 trainer-program.py ...
```

</hfoption>
</hfoptions>
You can control which accelerators (CUDA, XPU, MPS, HPU, etc.) PyTorch sees and in what order during distributed training. Prioritize faster devices or limit training to a subset of available hardware. It works with both [DistributedDataParallel](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html) and [DataParallel](https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html), and doesn't require Accelerate or the [DeepSpeed integration](./main_classes/deepspeed).

## Order of accelerators

To select specific accelerators to use and their order, use the environment variable appropriate for your hardware. This is often set on the command line for each run, but can also be added to your `~/.bashrc` or other startup config file.
Use the hardware-specific environment variable to select accelerators and set their order. Set it on the command line per run, or add it to `~/.bashrc` or another startup config file.

> [!WARNING]
> Avoid exporting environment variables because if you forget how an environment variable was set up, you may silently train on the wrong accelerators. Set the environment variable on the same command line as the training run.

For example, if there are 4 accelerators (0, 1, 2, 3) and you only want to run accelerators 0 and 2:
For example, to select accelerators 0 and 2 out of four:

<hfoptions id="accelerator-type">
<hfoption id="CUDA">

```bash
```cli
CUDA_VISIBLE_DEVICES=0,2 torchrun trainer-program.py ...
```

Only GPUs 0 and 2 are "visible" to PyTorch and are mapped to `cuda:0` and `cuda:1` respectively.
To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):
PyTorch sees only GPUs 0 and 2, which are mapped to `cuda:0` and `cuda:1`. To reverse the order (use GPU 2 as `cuda:0` and GPU 0 as `cuda:1`):

```bash
```cli
CUDA_VISIBLE_DEVICES=2,0 torchrun trainer-program.py ...
```
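
The remapping described above can be illustrated with a small pure-Python sketch. `visible_device_map` is a hypothetical helper, not part of PyTorch or CUDA, and it only handles comma-separated integer indices. The real runtime also accepts device UUIDs and has its own handling of invalid entries.

```py
def visible_device_map(mask: str, prefix: str = "cuda") -> dict[str, int]:
    """Map logical device names to physical indices for an index-only mask.

    Hypothetical helper for illustration; it mimics how the position of an
    index in the mask becomes the logical device number.
    """
    physical_ids = [int(part) for part in mask.split(",") if part.strip()]
    return {f"{prefix}:{logical}": physical for logical, physical in enumerate(physical_ids)}


print(visible_device_map("0,2"))  # {'cuda:0': 0, 'cuda:1': 2}
print(visible_device_map("2,0"))  # {'cuda:0': 2, 'cuda:1': 0}
print(visible_device_map(""))     # {}
```

The position in the list, not the physical index, determines the logical device number, which is why reversing the list swaps `cuda:0` and `cuda:1`.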

To run without any GPUs:

```bash
```cli
CUDA_VISIBLE_DEVICES= python trainer-program.py ...
```

You can also control the order of CUDA devices using `CUDA_DEVICE_ORDER`:
Control the order of CUDA devices with `CUDA_DEVICE_ORDER`.

- Order by PCIe bus ID (matches `nvidia-smi`):

```bash
```cli
export CUDA_DEVICE_ORDER=PCI_BUS_ID
```

- Order by compute capability (fastest first):

```bash
```cli
export CUDA_DEVICE_ORDER=FASTEST_FIRST
```

</hfoption>
<hfoption id="Intel XPU">

```bash
```cli
ZE_AFFINITY_MASK=0,2 torchrun trainer-program.py ...
```

Only XPUs 0 and 2 are "visible" to PyTorch and are mapped to `xpu:0` and `xpu:1` respectively.
To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):
PyTorch sees only XPUs 0 and 2, which are mapped to `xpu:0` and `xpu:1`. To reverse the order (use XPU 2 as `xpu:0` and XPU 0 as `xpu:1`):

```bash
```cli
ZE_AFFINITY_MASK=2,0 torchrun trainer-program.py ...
```

You can also control the order of Intel XPUs with:
Control the order of Intel XPUs with:

```bash
```cli
export ZE_ENABLE_PCI_ID_DEVICE_ORDER=1
```

For more information about device enumeration and sorting on Intel XPU, please refer to the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.
For more on device enumeration and sorting on Intel XPU, see the [Level Zero](https://github.com/oneapi-src/level-zero/blob/master/README.md?plain=1#L87) documentation.

</hfoption>
</hfoptions>
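
When a run is started from Python instead of a shell, the same per-run scoping can be sketched with `subprocess`: the variable is set only in the child's environment, so nothing is exported in the parent. `run_with_devices` is a hypothetical helper, and the child command here is a stand-in that prints the mask it sees rather than a real training script.

```py
import os
import subprocess
import sys


def run_with_devices(command, devices, variable="CUDA_VISIBLE_DEVICES"):
    """Run `command` with the device mask set only for that child process."""
    env = {**os.environ, variable: devices}  # copy; the parent environment is untouched
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


# Stand-in for a training script: report the mask visible to the child.
child = [sys.executable, "-c", "import os; print(os.environ['CUDA_VISIBLE_DEVICES'])"]
result = run_with_devices(child, "0,2")
print(result.stdout.strip())  # 0,2
```

Passing `variable="ZE_AFFINITY_MASK"` would scope the Intel XPU mask the same way.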

> [!WARNING]
> Environment variables can be exported instead of being added to the command line. This is not recommended because it can be confusing if you forget how the environment variable was set up and you end up using the wrong accelerators. Instead, it is common practice to set the environment variable for a specific training run on the same command line.