
Commit 7828c00

Update base for Update on "[rfc][dynamo] "skip_guard_eval" stance for power users"
# Motivation

We have spent quite some time this year improving guard performance and soundness. Nevertheless, guards STILL take time, and we have seen multiple requests and evidence from power users who want almost 0% guard overhead. We first saw this in vLLM, where even 1% overhead is bad. Recently we saw it in hqq (low-precision LLM generation) - #138386. To put some numbers in perspective: low-precision LLM inference reaches around 250 tokens/second, i.e., each token takes a mere 4 milliseconds. Even 200 us of guard overhead is still 5% of the total. Here, users ask: "We can guarantee that there will be no more recompilations in the steady state - give us the lowest guard overhead."

# Design

A must-have consideration is to support fast inference even when the model has recompiled, i.e., has multiple cache entries for a code object (because of dynamism, or just a tensor dtype change in the case of hqq). So we still have to run guards to figure out which compiled graph to run. What we need is the "minimal set of differentiating guards" - i.e., the minimal set of guards we can run to choose the compiled graph.

Note that this works ONLY under the assumption that users really guarantee no more recompilation scenarios (no more mutations, no more dynamism after the model has warmed up). If a user violates this assumption in a way not covered by the diff guard set, we will choose a wrong compiled graph to run.

When we designed C++ guards, Ed and Voz suggested using a trie structure to directly represent this "diff guard set". Due to complexity, we went with a tree structure instead and relied on a GuardManager state - "fail_count" - to fail fast. I realized that we can rely on this "fail_count" to find the diff guard set: if we recompile, it means that the guard-eval check_fns of all existing cache lines have failed.
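The per-token arithmetic above can be checked directly (a quick sketch, not part of the PR):

```python
# Guard-overhead arithmetic from the motivation: at ~250 tokens/sec each
# token takes 4 ms, so even 200 us of guard evaluation eats 5% of the
# per-token budget.
tokens_per_second = 250
time_per_token = 1.0 / tokens_per_second   # 0.004 s = 4 ms per token
guard_overhead = 200e-6                    # 200 microseconds of guard evaluation
overhead_fraction = guard_overhead / time_per_token
print(f"{time_per_token * 1e3:.0f} ms/token, guard overhead {overhead_fraction:.0%}")
# -> 4 ms/token, guard overhead 5%
```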
Whenever a guard check_fn fails, we increment the counter in the failing node (and propagate it to the root node) so that we fail faster next time. To run the "diff guard set", we only have to run the nodes in the tree that have fail_count > 0. This PR relies on that observation to introduce a new stance - "skip_guard_eval". The idea is that users warm up their model with torch.compile and then run the steady state with this stance. The stance still walks the existing cache lines for the intercepted code object, but runs only the diff guard set, which dramatically reduces guard overhead. If all guards fail, we fall back to eager (although if this happens the user is violating the assumption, so we should perhaps hard error; I need to fix a silly issue with _dynamo.disable to hard error here). A bonus is that this "theoretically" works with graph breaks as well, but I need more testing to convince myself of that.

# Evaluation

I tried the hqq model from #138386. With very small changes in the user code ([hqq PR](mobiusml/hqq#127)), I see throughput increase from **160 tokens/sec to 174 tokens/sec**.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx chenyang78 kadeng chauhang amjames rec

[ghstack-poisoned]
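The fail_count mechanism described above can be sketched in a few lines of Python. This is purely illustrative - the real implementation lives in the C++ GuardManager tree, and all names here (`Guard`, `select_entry`) are invented for the example:

```python
# Illustrative sketch of the fail_count / diff-guard-set idea. NOT the real
# C++ GuardManager; class and function names are invented for this example.

class Guard:
    def __init__(self, name, check):
        self.name = name
        self.check = check        # predicate over the current frame
        self.fail_count = 0       # bumped on every failure, as in GuardManager

    def run(self, frame):
        ok = self.check(frame)
        if not ok:
            self.fail_count += 1  # remember failures so we can fail fast later
        return ok


def select_entry(cache_entries, frame, skip_guard_eval=False):
    """Pick the compiled graph whose guards pass for this frame.

    With skip_guard_eval=True, only guards that have ever failed
    (fail_count > 0) are re-evaluated: the "diff guard set".
    """
    for guards, compiled_fn in cache_entries:
        active = [g for g in guards if g.fail_count > 0] if skip_guard_eval else guards
        if all(g.run(frame) for g in active):
            return compiled_fn
    return None  # every entry failed -> would recompile (or fall back to eager)


# Warm-up: two cache entries that differ only in a dtype guard.
fp16_entry = ([Guard("dtype==fp16", lambda f: f["dtype"] == "fp16"),
               Guard("shape==(4,)", lambda f: f["shape"] == (4,))], "graph_fp16")
int8_entry = ([Guard("dtype==int8", lambda f: f["dtype"] == "int8"),
               Guard("shape==(4,)", lambda f: f["shape"] == (4,))], "graph_int8")
cache = [fp16_entry, int8_entry]

select_entry(cache, {"dtype": "fp16", "shape": (4,)})  # warm-up run 1
select_entry(cache, {"dtype": "int8", "shape": (4,)})  # run 2: fp16 dtype guard fails once

# Steady state: only the dtype guard (fail_count > 0) is re-run.
print(select_entry(cache, {"dtype": "fp16", "shape": (4,)}, skip_guard_eval=True))
# -> graph_fp16
```

Note how the shape guard, which never failed during warm-up, is skipped entirely in the steady state - and also how a steady-state frame that violates the warm-up assumption in a way the diff guard set cannot see would silently match the wrong entry.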
2 parents 2a65052 + 854be65 commit 7828c00

File tree

893 files changed: +23173 −11364 lines


.ci/docker/android/AndroidManifest.xml

-1
This file was deleted.

.ci/docker/android/build.gradle

-66
This file was deleted.

.ci/docker/build.sh

+1-18
```diff
@@ -244,16 +244,6 @@ case "$image" in
     CONDA_CMAKE=yes
     ONNX=yes
     ;;
-  pytorch-linux-focal-py3-clang9-android-ndk-r21e)
-    ANACONDA_PYTHON_VERSION=3.9
-    CLANG_VERSION=9
-    LLVMDEV=yes
-    PROTOBUF=yes
-    ANDROID=yes
-    ANDROID_NDK_VERSION=r21e
-    GRADLE_VERSION=6.8.3
-    NINJA_VERSION=1.9.0
-    ;;
   pytorch-linux-focal-py3.9-clang10)
     ANACONDA_PYTHON_VERSION=3.9
     CLANG_VERSION=10
@@ -275,6 +265,7 @@ case "$image" in
     SWIFTSHADER=yes
     CONDA_CMAKE=yes
     TRITON=yes
+    GRAPHVIZ=yes
     ;;
   pytorch-linux-focal-py3.9-gcc9)
     ANACONDA_PYTHON_VERSION=3.9
@@ -414,9 +405,6 @@ case "$image" in
     DB=yes
     VISION=yes
     CONDA_CMAKE=yes
-    # snadampal: skipping sccache due to the following issue
-    # https://github.com/pytorch/pytorch/issues/121559
-    SKIP_SCCACHE_INSTALL=yes
     # snadampal: skipping llvm src build install because the current version
     # from pytorch/llvm:9.0.1 is x86 specific
     SKIP_LLVM_SRC_BUILD_INSTALL=yes
@@ -429,9 +417,6 @@ case "$image" in
     DB=yes
     VISION=yes
     CONDA_CMAKE=yes
-    # snadampal: skipping sccache due to the following issue
-    # https://github.com/pytorch/pytorch/issues/121559
-    SKIP_SCCACHE_INSTALL=yes
     # snadampal: skipping llvm src build install because the current version
     # from pytorch/llvm:9.0.1 is x86 specific
     SKIP_LLVM_SRC_BUILD_INSTALL=yes
@@ -508,8 +493,6 @@ docker build \
   --build-arg "CUDA_VERSION=${CUDA_VERSION}" \
   --build-arg "CUDNN_VERSION=${CUDNN_VERSION}" \
   --build-arg "TENSORRT_VERSION=${TENSORRT_VERSION}" \
-  --build-arg "ANDROID=${ANDROID}" \
-  --build-arg "ANDROID_NDK=${ANDROID_NDK_VERSION}" \
   --build-arg "GRADLE_VERSION=${GRADLE_VERSION}" \
   --build-arg "VULKAN_SDK_VERSION=${VULKAN_SDK_VERSION}" \
   --build-arg "SWIFTSHADER=${SWIFTSHADER}" \
```
+1-1
```diff
@@ -1 +1 @@
-16b633b4daa7f3d3442be62a3589bd60b2f7fdc7
+91c382df0d2b2ef383d57998a61187cfefcb26e3
```

.ci/docker/common/install_android.sh

-112
This file was deleted.

.ci/docker/common/install_cache.sh

+44-7
```diff
@@ -9,7 +9,12 @@ install_ubuntu() {
   # Instead use lib and headers from OpenSSL1.1 installed in `install_openssl.sh``
   apt-get install -y cargo
   echo "Checking out sccache repo"
-  git clone https://github.com/pytorch/sccache
+  if [ -n "$CUDA_VERSION" ]; then
+    # TODO: Remove this
+    git clone https://github.com/pytorch/sccache
+  else
+    git clone https://github.com/mozilla/sccache -b v0.8.2
+  fi
   cd sccache
   echo "Building sccache"
   cargo build --release
@@ -19,6 +24,10 @@ install_ubuntu() {
   rm -rf sccache
   apt-get remove -y cargo rustc
   apt-get autoclean && apt-get clean
+
+  echo "Downloading old sccache binary from S3 repo for PCH builds"
+  curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /opt/cache/bin/sccache-0.2.14a
+  chmod 755 /opt/cache/bin/sccache-0.2.14a
 }
 
 install_binary() {
@@ -36,18 +45,46 @@ if [ -n "$ROCM_VERSION" ]; then
   curl --retry 3 http://repo.radeon.com/misc/.sccache_amd/sccache -o /opt/cache/bin/sccache
 else
   ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
-  # TODO: Install the pre-built binary from S3 as building from source
-  # https://github.com/pytorch/sccache has started failing mysteriously
-  # in which sccache server couldn't start with the following error:
-  # sccache: error: Invalid argument (os error 22)
-  install_binary
+  if [ -n "$CUDA_VERSION" ]; then
+    # TODO: Install the pre-built binary from S3 as building from source
+    # https://github.com/pytorch/sccache has started failing mysteriously
+    # in which sccache server couldn't start with the following error:
+    # sccache: error: Invalid argument (os error 22)
+    install_binary
+  else
+    install_ubuntu
+  fi
 fi
 chmod a+x /opt/cache/bin/sccache
 
 function write_sccache_stub() {
   # Unset LD_PRELOAD for ps because of asan + ps issues
   # https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
-  printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n  exec sccache $(which $1) \"\$@\"\nelse\n  exec $(which $1) \"\$@\"\nfi" > "/opt/cache/bin/$1"
+  if [ $1 == "gcc" ]; then
+    # Do not call sccache recursively when dumping preprocessor argument
+    # For some reason it's very important for the first cached nvcc invocation
+    cat > "/opt/cache/bin/$1" <<EOF
+#!/bin/sh
+
+if [ "\$1" = "-E" ] || [ "\$2" = "-E" ]; then
+  exec $(which $1) "\$@"
+elif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
+  exec sccache $(which $1) "\$@"
+else
+  exec $(which $1) "\$@"
+fi
+EOF
+  else
+    cat > "/opt/cache/bin/$1" <<EOF
+#!/bin/sh
+
+if [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then
+  exec sccache $(which $1) "\$@"
+else
+  exec $(which $1) "\$@"
+fi
+EOF
+  fi
   chmod a+x "/opt/cache/bin/$1"
 }
 
```
.ci/docker/common/install_graphviz.sh

+16
```diff
@@ -0,0 +1,16 @@
+#!/bin/bash
+
+set -ex
+
+source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
+
+if [ -n "${UBUNTU_VERSION}" ]; then
+  apt update
+  apt-get install -y graphviz
+elif [ -n "${CENTOS_VERSION}" ]; then
+  dnf update
+  dnf install -y graphviz
+else
+  echo "Unsupported Linux distribution"
+  exit 1
+fi
```

.ci/docker/common/install_user.sh

+7
```diff
@@ -2,6 +2,13 @@
 
 set -ex
 
+# Since version 24 the system ships with user 'ubuntu' that has id 1000
+# We need a work-around to enable id 1000 usage for this script
+if [[ $UBUNTU_VERSION == 24.04 ]]; then
+  # touch is used to disable harmless error message
+  touch /var/mail/ubuntu && chown ubuntu /var/mail/ubuntu && userdel -r ubuntu
+fi
+
 # Mirror jenkins user in container
 # jenkins user as ec2-user should have the same user-id
 echo "jenkins:x:1000:1000::/var/lib/jenkins:" >> /etc/passwd
```

.ci/docker/requirements-ci.txt

+6-2
```diff
@@ -128,8 +128,7 @@ numba==0.55.2 ; python_version == "3.10"
 #test_nn.py, test_namedtensor.py, test_linalg.py, test_jit_cuda_fuser.py,
 #test_jit.py, test_indexing.py, test_datapipe.py, test_dataloader.py,
 #test_binary_ufuncs.py
-numpy==1.21.2; python_version == "3.9"
-numpy==1.22.4; python_version == "3.10"
+numpy==1.22.4; python_version == "3.9" or python_version == "3.10"
 numpy==1.26.2; python_version == "3.11" or python_version == "3.12"
 numpy==2.1.2; python_version >= "3.13"
 
@@ -206,6 +205,11 @@ xdoctest==1.1.0
 #Pinned versions: 1.1.0
 #test that import:
 
+pydot==3.0.1
+#Description: Needed for testing FxGraphDrawer
+#Pinned versions:
+#test that import:
+
 pygments==2.15.0
 #Description: support doctest highlighting
 #Pinned versions: 2.12.0
```
