distributedconfig api

stevhliu · stevhliu · commit ed89ef161778 · 2026-06-17T09:17:48.000-07:00
diff --git a/.claude/skills b/.claude/skills
@@ -0,0 +1 @@
+../.ai/skills
diff --git a/docs/source/en/_toctree.yml b/docs/source/en/_toctree.yml
@@ -219,6 +219,8 @@
       title: Accelerator selection
     - local: accelerate
       title: Accelerate
+    - local: distributed_config
+      title: DistributedConfig
     - local: fsdp
       title: FullyShardedDataParallel
     - local: deepspeed
diff --git a/docs/source/en/distributed_config.md b/docs/source/en/distributed_config.md
@@ -0,0 +1,133 @@
+<!--Copyright 2026 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License.
+
+⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
+rendered properly in your Markdown viewer.
+
+-->
+
+# DistributedConfig
+
+[`DistributedConfig`] shards a model across GPUs directly through [`~PreTrainedModel.from_pretrained`]. It supports [FSDP2](./fsdp), [tensor parallelism](./tensor_parallelism), and [N-D parallelism](./perf_train_gpu_many).
+
+Pass a [`DistributedConfig`] to [`~PreTrainedModel.from_pretrained`] and Transformers builds the device mesh and shards the supported layers for you.
+
+The fields below control how the model is sharded.
+
+| field | description |
+|---|---|
+| `tp_size` | Number of devices for tensor parallelism. Defaults to 1 when only `fsdp_size` is set. |
+| `tp_plan` | Tensor parallel sharding plan. Leave as `None` to use the model's default plan. |
+| `fsdp_size` | Number of devices for FSDP2. Defaults to 1 when only `tp_size` is set. |
+| `fsdp_cpu_offload` | Offload parameters and gradients to CPU to save GPU memory. Defaults to `False`. |
+| `fsdp_mixed_precision` | Compute in `bfloat16` and reduce gradients in `float32`. Defaults to `False`. |
+| `enable_expert_parallel` | Shard mixture-of-experts layers across devices. See [Expert parallelism](./expert_parallelism). |
+
+The product of `tp_size` and `fsdp_size` must equal the number of devices you launch with.
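+
+As a minimal sketch of this constraint, the check below compares the two sizes against the number of launched processes. It assumes the script is started with torchrun, which exports the `WORLD_SIZE` environment variable. The helper name `check_mesh_sizes` is hypothetical and isn't part of Transformers.
+
+```py
+import os
+from typing import Optional
+
+
+def check_mesh_sizes(tp_size: int, fsdp_size: int, world_size: Optional[int] = None) -> int:
+    """Return the world size if it equals tp_size * fsdp_size, otherwise raise.
+
+    Hypothetical helper for illustration, not a Transformers API.
+    """
+    if world_size is None:
+        # torchrun exports WORLD_SIZE to every process it launches; fall back to 1
+        world_size = int(os.environ.get("WORLD_SIZE", "1"))
+    expected = tp_size * fsdp_size
+    if expected != world_size:
+        raise ValueError(
+            f"tp_size * fsdp_size = {expected}, but {world_size} processes were launched"
+        )
+    return world_size
+```
+
+Calling a check like this before [`~PreTrainedModel.from_pretrained`] surfaces a size mismatch early, before any weights are loaded.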
+
+## FSDP2
+
+[FSDP2](./fsdp) shards parameters, gradients, and optimizer states across GPUs. Set `fsdp_size` to the number of devices to shard across.
+
+```py
+import torch
+from transformers import AutoModelForCausalLM
+from transformers.distributed.configuration_utils import DistributedConfig
+
+distributed_config = DistributedConfig(fsdp_size=4)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-0.6B",
+    distributed_config=distributed_config,
+)
+```
+
+Transformers wraps each layer according to the model's `base_model_fsdp_plan`. Check whether a model declares one before sharding.
+
+```py
+from transformers import AutoConfig
+
+config = AutoConfig.from_pretrained("Qwen/Qwen3-0.6B")
+print(config.base_model_fsdp_plan)
+```
+
+The plan maps modules to a sharding strategy. `free_full_weight` reshards a module after the forward pass to save memory, and `keep_full_weight` keeps it gathered to avoid a second all-gather during the backward pass.
+
+```py
+{
+    "embed_tokens": "free_full_weight",
+    "layers.*": "free_full_weight",
+    "norm": "keep_full_weight",
+}
+```
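+
+To build intuition for how a pattern such as `layers.*` could select modules, the sketch below resolves module names against the example plan with glob-style matching. The matching rule and the `resolve_strategy` helper are assumptions made for illustration; Transformers may resolve patterns differently.
+
+```py
+from fnmatch import fnmatchcase
+from typing import Dict, Optional
+
+# Example plan copied from above
+example_plan = {
+    "embed_tokens": "free_full_weight",
+    "layers.*": "free_full_weight",
+    "norm": "keep_full_weight",
+}
+
+
+def resolve_strategy(module_name: str, plan: Dict[str, str]) -> Optional[str]:
+    """Return the strategy of the first pattern that matches, or None.
+
+    Hypothetical helper: glob-style matching is an assumption, not the library's rule.
+    """
+    for pattern, strategy in plan.items():
+        if fnmatchcase(module_name, pattern):
+            return strategy
+    return None
+
+
+for name in ("embed_tokens", "layers.0", "layers.11", "norm", "lm_head"):
+    print(name, "->", resolve_strategy(name, example_plan))
+```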
+
+Set `fsdp_mixed_precision=True` to compute in `bfloat16` while reducing gradients in `float32`, and set `fsdp_cpu_offload=True` to move parameters and gradients to CPU when they aren't in use.
+
+```py
+distributed_config = DistributedConfig(
+    fsdp_size=4,
+    fsdp_mixed_precision=True,
+    fsdp_cpu_offload=True,
+)
+```
+
+## Tensor parallelism
+
+[Tensor parallelism](./tensor_parallelism) splits weight matrices across GPUs. Set `tp_size` to shard the model's supported layers.
+
+```py
+import torch
+from transformers import AutoModelForCausalLM
+from transformers.distributed.configuration_utils import DistributedConfig
+
+distributed_config = DistributedConfig(tp_size=4)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-0.6B",
+    distributed_config=distributed_config,
+)
+```
+
+Transformers shards according to the model's `base_model_tp_plan`. Pass `tp_plan` to override the layout, for example `{"model.layers.*.self_attn.q_proj": "colwise"}`.
+
+## N-D parallelism
+
+Combine FSDP2 and tensor parallelism by setting both sizes. The example below runs on 4 GPUs: tensor parallelism splits weights within each group of 2 GPUs, and FSDP2 shards across the 2 groups.
+
+```py
+import torch
+from transformers import AutoModelForCausalLM
+from transformers.distributed.configuration_utils import DistributedConfig
+
+distributed_config = DistributedConfig(tp_size=2, fsdp_size=2)
+
+model = AutoModelForCausalLM.from_pretrained(
+    "Qwen/Qwen3-0.6B",
+    dtype=torch.bfloat16,
+    distributed_config=distributed_config,
+)
+```
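+
+To picture the resulting 2 x 2 arrangement, the sketch below maps each global rank to a pair of mesh coordinates. It assumes a row-major layout with FSDP2 as the outer dimension and tensor parallelism as the inner one, a common convention that this page doesn't confirm. The `mesh_layout` helper is hypothetical.
+
+```py
+from typing import Dict, Tuple
+
+
+def mesh_layout(tp_size: int, fsdp_size: int) -> Dict[int, Tuple[int, int]]:
+    """Map each global rank to (fsdp_index, tp_index) in a row-major mesh.
+
+    Hypothetical helper: the dimension order is an assumption for illustration.
+    """
+    return {rank: divmod(rank, tp_size) for rank in range(tp_size * fsdp_size)}
+
+
+layout = mesh_layout(tp_size=2, fsdp_size=2)
+for rank, (fsdp_index, tp_index) in layout.items():
+    print(f"rank {rank}: fsdp index {fsdp_index}, tp index {tp_index}")
+```
+
+Under this assumption, ranks that share an `fsdp_index` form one tensor-parallel group, and ranks that share a `tp_index` shard parameters with each other through FSDP2.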
+
+## Launch
+
+Launch your script with [torchrun](https://pytorch.org/docs/stable/elastic/run.html) and set `--nproc-per-node` to the total number of devices, equal to `tp_size * fsdp_size`.
+
+```shell
+torchrun --nproc-per-node 4 train.py
+```
+
+## Next steps
+
+- See [FSDP2](./fsdp) for sharded training.
+- See [Tensor parallelism](./tensor_parallelism) for more details on partitioning strategies and manual plans.
+- See [Expert parallelism](./expert_parallelism) for sharding mixture-of-experts models.
+- See [N-D parallelism](./perf_train_gpu_many) for combining parallelism strategies.
+- Read [The Ultra-Scale Playbook](https://huggingface.co/spaces/nanotron/ultrascale-playbook) for a deeper look at how these strategies work.
diff --git a/utils/.checkers_cache.json b/utils/.checkers_cache.json
@@ -0,0 +1,20 @@
+{
+  "add_dates": "2c0ef6a3ec2eb2a3cce73c6a74c0564b219e698b6f5946270661116133c9a07c",
+  "auto_mappings": "7e19518867242e07284c2f62b21f1195b762d47e1693cee727842b880493aceb",
+  "config_attributes": "b050cab4ee3f179c4d19dc8856da37a1f09385fb2b29df5c58e045ee0ec29b40",
+  "config_docstrings": "fbc006ec716f7421d51c9b245fe496184b1e7ce50627ce4e75a2ac0836d95346",
+  "copies": "28b52d9a0d557147c611907c7c5177ca3219708aa6be6f4fdbb28fd2a93f0ea3",
+  "deps_table": "dd2c3dd9c20aba4869ced10b5dbfa9dcc443b0981aace7b2a0fbf4b5e5cec2c1",
+  "docstrings": "5ba8326a194c9606de1424f1f5c1e20077545c53a9ed6b768a6f8a2e1870f7e6",
+  "doctest_list": "98897e42dabaed5c666734f12a5049f4327fc89bad4621819adf55ce3e9c2a66",
+  "dummies": "8b9eb0f2047c2e692adba8e01f4207370ffe3b4de8b83482b98cea4630b3e2ef",
+  "imports": "4e8c8768fc924f3f530debaf287bad4bb9d267e7c86728450cb63e9b7c201376",
+  "init_isort": "1d049dc690b05fad7209f1e3ccb49ebce51db3fef94b63e140ce5b69c1ab24af",
+  "inits": "13852b590793c350372c94fdedb7f16b0e081bf61ec4ed83fae13304b19e837f",
+  "modular_conversion": "8e778ff2f66849bb611c594bcdcb2be8125b467e32a2537ebbf37467f1943422",
+  "pipeline_typing": "3cb9d37a9d033222ad798914141cd056e264f5158754fb590580e6ac85128f72",
+  "ruff_check": "0bacd4bcbd205e1611d816882ed10a719f77761c3950fd6d831899c267055a23",
+  "ruff_format": "0bacd4bcbd205e1611d816882ed10a719f77761c3950fd6d831899c267055a23",
+  "sort_auto_mappings": "3d98987835c97d17679c4732a38fce3bd46edd3dc5e9f09dc659d74cc4fca3c9",
+  "update_metadata": "10a0fc570ecb47b9be79a682a831a2a67ab0cb7067cec849d1985493f969e371"
+}