tensor parallel training bug #36296

iMountTai · 2025-02-20T08:15:10Z

System Info

transformers: 4.45.dev0
python: 3.11
linux

Who can help?

#34194

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

torchrun --nnodes 1 --nproc_per_node 2 --master_port 27654 run_clm.py \
--model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--tp_size 2 \
--output_dir /tmp/test-clm

Unexpected behavior:
RuntimeError: aten._foreach_norm_Scalar: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators.

Expected behavior

Training with autoTP runs without errors.

Rocketknight1 · 2025-02-20T13:48:05Z

cc @kmehant @SunMarc @muellerzr

bursteratom · 2025-02-20T19:28:44Z

@iMountTai for now we will need to set max_grad_norm=-1
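
For context, a minimal sketch of why a non-positive value sidesteps the error: as far as I understand the Trainer, gradient clipping only runs when max_grad_norm is set and greater than zero, so -1 skips the norm computation that ends up mixing torch.Tensor and DTensor gradients. The helper name should_clip_gradients is hypothetical and only mirrors that guard; it is not a transformers API.

```python
from typing import Optional


def should_clip_gradients(max_grad_norm: Optional[float]) -> bool:
    """Hypothetical helper mirroring the guard described above."""
    # Clipping is skipped when the value is unset, zero, or negative.
    return max_grad_norm is not None and max_grad_norm > 0


# Show which settings would enable clipping under this assumption.
for value in (1.0, 0.0, -1, None):
    print(value, should_clip_gradients(value))
```

Note that this is a workaround rather than a fix: with clipping disabled, gradients are left unclipped for the whole run.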

iMountTai · 2025-02-21T02:17:33Z

Thank you very much for your prompt reply. The issue was resolved after setting max_grad_norm to -1. However, when using TP + DDP, the following error occurs.

torchrun --nnodes 1 --nproc_per_node 4 --master_port 27654 run_clm.py \
--model_name_or_path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--do_train \
--do_eval \
--tp_size 2 \
--max_grad_norm -1 \
--output_dir /tmp/test-clm

in torch/distributed/device_mesh.py, line 721, in get_group
_find_pg_by_ranks_and_tag(*self._dim_group_infos[mesh_dim][:2]) # type: ignore[index]
IndexError: list index out of range
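
To make the intended layout concrete, here is a small sketch of how 4 processes with tp_size 2 would usually be split into a 2 x 2 grid of data-parallel replicas and tensor-parallel groups. The function rank_layout and the row-major ordering are assumptions for illustration only, not what Trainer or DeviceMesh does internally.

```python
from typing import List


def rank_layout(world_size: int, tp_size: int) -> List[List[int]]:
    """Hypothetical helper: group ranks into rows of tp_size (row-major)."""
    if world_size <= 0 or tp_size <= 0 or world_size % tp_size != 0:
        raise ValueError("world_size must be a positive multiple of tp_size")
    dp_size = world_size // tp_size
    # Each row is one tensor-parallel group; the rows are data-parallel replicas.
    return [list(range(r * tp_size, (r + 1) * tp_size)) for r in range(dp_size)]


layout = rank_layout(world_size=4, tp_size=2)
print(layout)  # [[0, 1], [2, 3]]
```

With --nproc_per_node 2 and --tp_size 2 the same helper yields a single row, which corresponds to the first run where only TP is in play.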

Additionally, does TP training currently support the following setups?

TP + DDP
LoRA training on unquantized models
LoRA training on GPTQ models

github-actions · 2025-03-23T08:03:33Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

iMountTai added the bug label Feb 20, 2025

kmehant mentioned this issue Feb 20, 2025

fix: support grad clipping for TP through replicating non-sharded modules #36132

Open

5 tasks

iMountTai mentioned this issue Feb 24, 2025

TP + DP training error huggingface/peft#2394

Closed

4 tasks

github-actions bot closed this as completed Mar 31, 2025
