[XPU] Support XCCL on deepspeed side #7113

Closed · wants to merge 59 commits
Changes from all commits
Commits (59)
51f174e
support XCCL on deepspeed side
ys950902 Mar 6, 2025
eca4150
Update gaudi2 nightly,ci to latest 1.20.0 build (#7093)
raza-sikander Mar 7, 2025
0d64032
fix keep_module_on_host (#7112)
inkcherry Mar 10, 2025
64791b9
Add sequential pytest mark to TestNVMeCheckpointing to resolve pytest…
loadams Mar 11, 2025
c424ed8
Training multiple models (#7018)
tjruwase Mar 11, 2025
0f27e9c
Update CONTRIBUTING.md to reflect changes from CLA to DCO (#7135)
loadams Mar 14, 2025
2e60410
Avoid missing attr error (#7133)
tjruwase Mar 14, 2025
b1bdfb9
Add conditional expression (#7119)
A-transformer Mar 14, 2025
ffbc40b
Unpin transformers version for most workflows (#7139)
loadams Mar 14, 2025
3acf2ce
Conditionally quote env vars (#7071)
saurabhkoshatwar Mar 17, 2025
4fe3efe
Correct the BACKWARD_PREFETCH_SUBMIT mismatch (#7120)
A-transformer Mar 17, 2025
86e3a51
Enhance Gaudi2 CI/Nightly Coverage with Model Parallelism and Linear …
raza-sikander Mar 18, 2025
9557466
Update container version that runs on A6000 tests. (#7153)
loadams Mar 19, 2025
7da6def
fix leak of z3 buffer
tohtana Mar 20, 2025
92e1668
hf tp+zero training doc. (#7151)
inkcherry Mar 20, 2025
7ff9b3f
Avoid graph break by removing redundant requires_grad attr change (#7…
deepcharm Mar 24, 2025
add65b2
Add destroy to tests to free memory (#7160)
tohtana Mar 24, 2025
6fc960c
[NFC] Typo fix in SP layer. (#7152)
c8ef Mar 24, 2025
86f2e31
Link AutoTP blog in the front page (#7167)
hwchen2017 Mar 25, 2025
39a219a
fix `seq_parallel_communication_data_type` constant. (#7175)
stas00 Mar 26, 2025
86c1d9d
Fix typos in GDS blog (#7177)
loadams Mar 26, 2025
24e32b1
Variable batch size and LR scheduler (#7104)
bm-synth Mar 27, 2025
1b0f96f
Update version.txt after 0.16.5 release (#7180)
loadams Mar 27, 2025
2208e9b
Cross layer overlapping for domino (#7178)
hwchen2017 Mar 28, 2025
435439e
async tp allreduce (#7115)
inkcherry Mar 28, 2025
3927096
Fix issue #5242 grad_norm and loss is nan (#7171)
Glaceon-Hyy Mar 29, 2025
eda5079
Add qwen3 autotp support (#7187)
Yejing-Lai Apr 1, 2025
53d03d0
Update to new torch grad hook API: BF16Optimizer and Stage2 (#7189)
deepcharm Apr 1, 2025
0481596
Reland perf fix for nan inf check (#7184)
nelyahu Apr 2, 2025
90abe89
Update to fix pydantic warning (#7193)
loadams Apr 3, 2025
5d6f160
update dependencies version info (#7206)
inkcherry Apr 8, 2025
6374ccd
HPU accelerator memory mapping is broken because of torch fill uninit…
oelayan7 Apr 8, 2025
7a4d298
Support complicated use cases with TiedLayerSpec (#7208)
limjcst Apr 9, 2025
03ea7da
Add defence for offload_states and reload_states w/o optimizer (#7211)
HollowMan6 Apr 10, 2025
fee32e6
DeepCompile for enhanced compiler integration (#7154)
tohtana Apr 16, 2025
ac9da77
Update version.txt after 0.16.6 release (#7218)
loadams Apr 16, 2025
c662e15
Fix release links (#7219)
tjruwase Apr 16, 2025
b3f9adf
Fix pass for z3 and profiler (#7222)
tohtana Apr 17, 2025
c0cd426
Fix build on AMD GPUs (related to DeepCompile) (#7224)
HollowMan6 Apr 17, 2025
bcd899d
Add defence for DeepCompile w/o optimizer (#7225)
HollowMan6 Apr 17, 2025
fa1f688
Pass `with_cuda` arg for jit_load in OpBuilder (#7226)
HollowMan6 Apr 17, 2025
eb37c20
Make sure it's not None before offloading contiguous_grad_buffer (#7227)
HollowMan6 Apr 18, 2025
22b46cf
Update version.txt after 0.16.7 release (#7232)
loadams Apr 18, 2025
d87acac
Recommend using latest (#7233)
tohtana Apr 18, 2025
e0ee4ea
[NFC] Fix comment related to SP group (#7234)
c8ef Apr 21, 2025
0343a57
Add cpu accelerator fp16 dtype support (#7207)
Yejing-Lai Apr 21, 2025
49c6937
Update torch cpu test version
loadams Apr 23, 2025
862d4a2
Revert "Update torch cpu test version"
loadams Apr 23, 2025
86f44d4
Update CPU torch version to 2.7 (#7241)
loadams Apr 23, 2025
7a81da3
Update README.md (#7246)
jizhang02 Apr 25, 2025
a523e29
Fix compile error for nv_bloat162 (#7248)
loscrossos Apr 27, 2025
8a0e979
add `Makefile` to ease maintenance (#7267)
stas00 May 7, 2025
0e737f6
Fix fp8 gemm (#7265)
RezaYazdaniAminabadi May 8, 2025
f407e05
[XPU] update xpu-max1100 CI workflow to torch 2.7 (#7284)
Liangliang-Ma May 15, 2025
ad4dc62
Fix issues XPU tests hit with extra-index-url (#7291)
loadams May 17, 2025
c4edbba
Temporarily skip AIO tests due to an issue with runners (#7288)
loadams May 18, 2025
16c6f44
rollback #6726 (#7258)
delock May 19, 2025
e702877
Update patch version after 0.16.8 release (#7296)
loadams May 19, 2025
49e1407
fix non-torch failure, if the torch version is too old
ys950902 May 21, 2025
Files changed
6 changes: 3 additions & 3 deletions .github/workflows/cpu-torch-latest.yml
@@ -42,7 +42,7 @@ jobs:
git clone https://github.com/huggingface/transformers
cd transformers
# if needed switch to the last known good SHA until transformers@master is fixed
-git checkout 981c276
+# git checkout 981c276
git rev-parse --short HEAD
pip install .

@@ -59,5 +59,5 @@ jobs:
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
-HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.6"
-HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -m 'sequential' unit/ --torch_ver="2.6"
+HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -n 4 unit/ --torch_ver="2.7"
+HF_HOME=/tmp/hf_home/ pytest $PYTEST_OPTS -m 'sequential' unit/ --torch_ver="2.7"
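For context: the two passes above first run the suite in parallel under pytest-xdist (`-n 4`) and then re-run only tests carrying the `sequential` marker, which serializes tests that contend for devices or disk. A sketch of the intent, assuming the default marker expression (e.g. in pytest.ini addopts) excludes sequential tests from the parallel pass:

```
# Hypothetical equivalent of the workflow's two passes:
cd tests
HF_HOME=/tmp/hf_home/ pytest -n 4 -m 'not sequential' unit/ --torch_ver="2.7"
HF_HOME=/tmp/hf_home/ pytest -m 'sequential' unit/ --torch_ver="2.7"
```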
4 changes: 3 additions & 1 deletion .github/workflows/hpu-gaudi2-nightly.yml
@@ -21,7 +21,7 @@ jobs:
# The type of runner that the job will run on
runs-on: [self-hosted, intel, gaudi2]
container:
-image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
+image: vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
ports:
- 80
options: --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice
@@ -45,6 +45,8 @@ jobs:
test_zero_leaf_module.py
test_zero_offloadpp.py
test_zero_tiled.py
+test_autotp_training.py
+test_ulysses.py

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
6 changes: 4 additions & 2 deletions .github/workflows/hpu-gaudi2.yml
@@ -39,7 +39,7 @@ jobs:
# The type of runner that the job will run on
runs-on: [self-hosted, intel, gaudi2]
container:
-image: vault.habana.ai/gaudi-docker/1.19.0/ubuntu22.04/habanalabs/pytorch-installer-2.5.1:latest
+image: vault.habana.ai/gaudi-docker/1.20.0/ubuntu22.04/habanalabs/pytorch-installer-2.6.0:latest
ports:
- 80
options: --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice
@@ -94,6 +94,8 @@ jobs:
test_zero_nesting_init.py
test_zeropp.py
(test_zero.py and (TestZero3ParamPartitioningLargeParam or TestZero3ParamPartitioningLargeParam))
+(test_linear.py and (TestLoRALinear or TestBasicLinear))
+(test_ctx.py and TestEngine)

# Steps represent a sequence of tasks that will be executed as part of the job
steps:
@@ -112,7 +114,7 @@
git clone https://github.com/huggingface/transformers
cd transformers
# if needed switch to the last known good SHA until transformers@master is fixed
-git checkout 981c276
+# git checkout 981c276
git rev-parse --short HEAD
pip install .

8 changes: 4 additions & 4 deletions .github/workflows/nv-a6000.yml
@@ -23,7 +23,7 @@ jobs:
unit-tests:
runs-on: [self-hosted, nvidia, a6000]
container:
-image: nvcr.io/nvidia/pytorch:24.09-py3
+image: nvcr.io/nvidia/pytorch:24.12-py3
ports:
- 80
options: --gpus all --shm-size "8G"
@@ -43,7 +43,7 @@
git clone https://github.com/huggingface/transformers
cd transformers
# if you need to use an older transformers version temporarily in case of breakage
-git checkout 981c276
+# git checkout 981c276
git rev-parse --short HEAD
python -m pip install .
- name: Install deepspeed
@@ -58,8 +58,8 @@
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
-python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2' unit/ --torch_ver="2.5" --cuda_ver="12"
-python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2_ops' unit/ --torch_ver="2.5" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2' unit/ --torch_ver="2.6" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF -m 'inference_v2_ops' unit/ --torch_ver="2.6" --cuda_ver="12"
- name: MII unit tests
run: |
BRANCH="main"
4 changes: 2 additions & 2 deletions .github/workflows/nv-flash-attn.yml
@@ -18,7 +18,7 @@ jobs:
unit-tests:
runs-on: [self-hosted, nvidia, a6000]
container:
-image: nvcr.io/nvidia/pytorch:24.09-py3
+image: nvcr.io/nvidia/pytorch:24.12-py3
ports:
- 80
options: --gpus all --shm-size "8G"
@@ -53,7 +53,7 @@
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
-python -m pytest --color=yes --durations=0 --verbose -rF unit/sequence_parallelism/test_ulysses.py --torch_ver="2.5" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF unit/sequence_parallelism/test_ulysses.py --torch_ver="2.6" --cuda_ver="12"
- name: Open GitHub issue if nightly CI fails
if: ${{ failure() && (github.event_name == 'schedule') }}
uses: JasonEtco/create-an-issue@v2
4 changes: 2 additions & 2 deletions .github/workflows/nv-human-eval.yml
@@ -11,7 +11,7 @@ jobs:
unit-tests:
runs-on: [self-hosted, nvidia, a6000]
container:
-image: nvcr.io/nvidia/pytorch:24.09-py3
+image: nvcr.io/nvidia/pytorch:24.12-py3
ports:
- 80
options: --gpus all --shm-size "8G"
@@ -50,4 +50,4 @@
run: |
unset TORCH_CUDA_ARCH_LIST # only jit compile for current arch
cd tests
-python -m pytest --color=yes --durations=0 --verbose -rF -m 'evaluation' -k "test_human_eval" unit/ --torch_ver="2.5" --cuda_ver="12"
+python -m pytest --color=yes --durations=0 --verbose -rF -m 'evaluation' -k "test_human_eval" unit/ --torch_ver="2.6" --cuda_ver="12"
2 changes: 1 addition & 1 deletion .github/workflows/nv-pre-compile-ops.yml
@@ -36,7 +36,7 @@ jobs:
#python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
- name: Compile DeepSpeed Ops
run: |
-DS_ACCELERATOR=cuda DS_ENABLE_NINJA=1 TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_FP_QUANTIZER=0 DS_BUILD_CUTLASS_OPS=0 DS_BUILD_GDS=0 DS_BUILD_RAGGED_DEVICE_OPS=0 DS_BUILD_EVOFORMER_ATTN=0 pip3 install .
+DS_ACCELERATOR=cuda DS_ENABLE_NINJA=1 TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0" DS_BUILD_OPS=1 DS_BUILD_SPARSE_ATTN=0 DS_BUILD_FP_QUANTIZER=0 DS_BUILD_CUTLASS_OPS=0 DS_BUILD_GDS=0 DS_BUILD_RAGGED_DEVICE_OPS=0 DS_BUILD_EVOFORMER_ATTN=0 DS_BUILD_DEEP_COMPILE=0 pip3 install .
- name: DS Report
run: |
ds_report
2 changes: 1 addition & 1 deletion .github/workflows/nv-torch-latest-v100.yml
@@ -44,7 +44,7 @@

- name: Install deepspeed
run: |
-pip install .[dev,1bit,autotuning]
+pip install .[dev,1bit,autotuning,deepcompile]
ds_report

- name: Python environment
2 changes: 1 addition & 1 deletion .github/workflows/nv-torch-nightly-v100.yml
@@ -37,7 +37,7 @@ jobs:
git clone https://github.com/huggingface/transformers
cd transformers
# if needed switch to the last known good SHA until transformers@master is fixed
-git checkout 981c276
+# git checkout 981c276
git rev-parse --short HEAD
pip install .

4 changes: 3 additions & 1 deletion .github/workflows/setup-venv/action.yml
@@ -6,7 +6,9 @@ runs:
- id: update-env
run: |
sudo apt-get update
-sudo apt-get install -y libaio-dev
+# Temporary disable nvme UTs
+# sudo apt-get install -y libaio-dev
+sudo apt remove -y libaio-dev
python -m pip install --user --upgrade pip
python -m pip install --user --upgrade virtualenv
shell: bash
19 changes: 8 additions & 11 deletions .github/workflows/xpu-max1100.yml
@@ -36,7 +36,7 @@ jobs:
unit-tests:
runs-on: [self-hosted, intel, xpu]
container:
-image: intel/oneapi-basekit:2025.0.1-0-devel-ubuntu24.04
+image: intel/oneapi-basekit:2025.0.2-0-devel-ubuntu22.04
ports:
- 80
options: --privileged -it --rm --device /dev/dri:/dev/dri -v /dev/dri/by-path:/dev/dri/by-path --ipc=host --cap-add=ALL
@@ -47,20 +47,16 @@
shell: bash
run: |
apt-get update
-apt-get install clinfo libaio-dev python3-pip python3.12-venv -y
-python3 -m venv ~/ds_env
-source ~/ds_env/bin/activate
-pip install torch==2.5.1 -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/torch/
-pip install intel-extension-for-pytorch==2.5.10+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/intel-extension-for-pytorch/
-pip install oneccl_bind_pt==2.5.0+xpu -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/oneccl-bind-pt/
-pip install torchvision==0.20.1 -f https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/torchvision/
-pip install py-cpuinfo numpy
+apt-get install -y python3.11 python3.11-dev python3-pip clinfo libaio-dev
+pip install --upgrade pip
+pip install py-cpuinfo
+pip install torch==2.7.0 torchvision==0.22.0 torchaudio==2.7.0 --index-url https://download.pytorch.org/whl/xpu
+pip install intel-extension-for-pytorch==2.7.10+xpu oneccl_bind_pt==2.7.0+xpu --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us
pip install .[dev,autotuning]

- name: Check container state
shell: bash
run: |
-source ~/ds_env/bin/activate
ldd --version
ds_report
python3 -c "import torch; print('torch:', torch.__version__, torch)"
@@ -71,8 +67,9 @@
- name: Unit tests
shell: bash
run: |
-source ~/ds_env/bin/activate
cd tests/unit
+export FI_PROVIDER="tcp"
+export I_MPI_SHM=off
pytest --verbose accelerator/*
pytest --verbose autotuning/*
pytest --verbose checkpoint/test_reshape_checkpoint.py
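For context, the stack installed above is what the PR's XCCL support rides on: the PyTorch 2.7 XPU wheels plus the oneCCL bindings. A minimal smoke-test sketch, assuming a PyTorch 2.7+ XPU build (`is_xccl_available` is expected in that release, but verify against your build):

```
# Sketch: sanity-check the XPU/XCCL stack installed by the steps above.
python3 -c "import torch; print('torch:', torch.__version__)"
python3 -c "import torch; print('xpu available:', torch.xpu.is_available())"
python3 -c "import torch.distributed as dist; print('xccl available:', dist.is_xccl_available())"
```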
28 changes: 19 additions & 9 deletions CONTRIBUTING.md
@@ -19,6 +19,12 @@ If a formatting test fails, it will fix the modified code in place and abort
the `git commit`. After looking over the changes, you can `git add <modified files>`
and then repeat the previous `git commit` command.

+You can also run:
+```
+make format
+```
+which will do the same as above, and it'll also automatically build a `venv` python environment if you
+don't already have one, which will isolate the requirements of this project from requirements of other projects.

## Testing
DeepSpeed tracks two types of tests: unit tests and more costly model convergence tests.
@@ -38,6 +44,11 @@ You can also provide the `-v` flag to `pytest` to see additional information abo
tests. Note that [pytest-forked](https://github.com/pytest-dev/pytest-forked) and the
`--forked` flag are required to test CUDA functionality in distributed tests.

+You can also run:
+```
+make test
+```

### Model Tests
To execute model tests, first [install DeepSpeed](#installation). The
[DeepSpeedExamples](https://github.com/deepspeedai/DeepSpeedExamples/) repository is cloned
@@ -48,16 +59,15 @@ pytest run_sanity_check.py
```
Note that the `--forked` flag is not necessary for the model tests.

-## Contributor License Agreement
-This project welcomes contributions and suggestions. Most contributions require you to
-agree to a Contributor License Agreement (CLA) declaring that you have the right to, and
-actually do, grant us the rights to use your contribution. For details, visit
-https://cla.opensource.microsoft.com.
+## Developer Certificate of Origin
+This project welcomes contributions and suggestions. All contributions to deepspeedai projects
+require commits to be signed off with a [Developer Certificate of Origin](https://en.wikipedia.org/wiki/Developer_Certificate_of_Origin)
+(DCO) declaring that you have the right to, and actually do, grant us the rights to use your contribution.

+When you submit a pull request, the DCO app will check for the presence of signed commits.
+Information about how this check works is here: https://github.com/dcoapp/app?tab=readme-ov-file#how-it-works

-When you submit a pull request, a CLA bot will automatically determine whether you need
-to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply
-follow the instructions provided by the bot. You will only need to do this once across
-all repos using our CLA.
+To sign commits, you will need to include `-s` when running `git commit`. For example, `git commit -s -m "Commit message"`. One note, creating PRs via the GitHub interface do not appear to include this option. If you forget this, clicking on the failing check in your PR will point you to commands you can run to rebase and sign previous commits.

## Code of Conduct
This project has adopted the [Microsoft Open Source Code of
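The DCO flow described above boils down to a few git commands; a quick sketch (the commit message and commit count are placeholders):

```
# Sign off a new commit (adds the Signed-off-by trailer):
git commit -s -m "support XCCL on deepspeed side"
# Retroactively sign the last three commits, then update the PR branch:
git rebase HEAD~3 --signoff
git push --force-with-lease
```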
23 changes: 23 additions & 0 deletions Makefile
@@ -0,0 +1,23 @@
+# usage: make help
+
+.PHONY: help test format
+.DEFAULT_GOAL := help
+
+help: ## this help
+	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n make \033[36m<target>\033[0m\n"} /^[0-9a-zA-Z_-]+:.*?##/ { printf " \033[36m%-22s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)
+	echo $(MAKEFILE_LIST)
+
+test: ## run tests
+	pytest --forked tests/unit/
+
+format: ## fix formatting
+	@if [ ! -d "venv" ]; then \
+		python -m venv venv; \
+		. venv/bin/activate; \
+		pip install pre-commit -U; \
+		pre-commit clean; \
+		pre-commit uninstall; \
+		pre-commit install; \
+		deactivate; \
+	fi
+	. venv/bin/activate && pre-commit run --files $$(git diff --name-only master) && deactivate
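Typical usage of the new targets from the repository root (a sketch; bare `make` prints the help because of `.DEFAULT_GOAL`):

```
make          # show the help screen (default goal)
make format   # create ./venv if missing, then run pre-commit on files changed vs master
make test     # run the unit suite with pytest --forked
```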