[CI] Add multi_node CI #2749
Conversation
Signed-off-by: wangli <[email protected]>
Code Review
This pull request introduces a multi-node continuous integration (CI) setup using Kubernetes and LeaderWorkerSet for testing vLLM on Ascend NPUs. The changes include new benchmark dependencies, an installer script, launch scripts for leader/worker nodes, and a Kubernetes manifest. While this is a valuable addition for ensuring multi-node stability, my review has identified several critical and high-severity issues. There are critical flaws in the CI logic that prevent it from testing the actual code changes in a pull request. Additionally, a key script for setting up a Ray cluster contains a bug that will cause it to hang. Other issues include hardcoded values that reduce script robustness and a missing readiness probe in the Kubernetes configuration that could lead to flaky tests. These issues should be addressed to make the new CI pipeline reliable and effective.
```sh
checkout_src() {
    echo "====> Checkout source code"
    mkdir -p "$SRC_DIR"

    # vllm-ascend
    if [ ! -d "$SRC_DIR/vllm-ascend" ]; then
        git clone --depth 1 https://github.com/vllm-project/vllm-ascend.git "$SRC_DIR/vllm-ascend"
    fi

    # vllm
    if [ ! -d "$SRC_DIR/vllm" ]; then
        git clone -b v0.10.1.1 https://github.com/vllm-project/vllm.git "$SRC_DIR/vllm"
    fi
}
```
The `checkout_src` function clones `vllm-ascend` from its main repository. When running in a CI environment for a pull request, this ignores the changes from the PR and tests against the `main` branch instead. The CI system should be responsible for checking out the correct version of the code, and these scripts should use that version. This defeats the purpose of running CI on pull requests.
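A hedged sketch of one possible fix: prefer the commit that triggered the CI run. `GITHUB_REPOSITORY` and `GITHUB_SHA` are standard GitHub Actions variables, and `SRC_DIR` is assumed to be set by the surrounding script; this is not the PR's actual implementation.

```sh
# Sketch only: check out the PR's own commit when CI provides one.
# GITHUB_REPOSITORY / GITHUB_SHA come from GitHub Actions; outside CI,
# the function falls back to cloning the main repository as before.
checkout_src() {
    echo "====> Checkout source code"
    mkdir -p "$SRC_DIR"

    if [ ! -d "$SRC_DIR/vllm-ascend" ]; then
        git clone "https://github.com/${GITHUB_REPOSITORY:-vllm-project/vllm-ascend}.git" \
            "$SRC_DIR/vllm-ascend"
        if [ -n "${GITHUB_SHA:-}" ]; then
            git -C "$SRC_DIR/vllm-ascend" fetch origin "$GITHUB_SHA"
            git -C "$SRC_DIR/vllm-ascend" checkout "$GITHUB_SHA"
        fi
    fi

    # vllm is pinned to a release, so a shallow clone is still fine here.
    if [ ! -d "$SRC_DIR/vllm" ]; then
        git clone --depth 1 -b v0.10.1.1 https://github.com/vllm-project/vllm.git "$SRC_DIR/vllm"
    fi
}
```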
```yaml
command:
  - sh
  - -c
  - "bash /root/.cache/tests/multi_node_mp/launch_server_leader.sh"
```
The container's command executes a script from `/root/.cache`, which appears to be a path on a persistent volume. This is a critical flaw in a CI setup, as it means the CI runs a cached version of the script, not the one from the pull request. Changes to the launch scripts in a PR will not be tested. The same issue exists for the worker template (lines 72-75). The command should execute the script from the CI workspace where the PR's source code is checked out.
```yaml
command:
  - sh
  - -c
  - "bash $WORKSPACE/tests/e2e/multi_nodes/multi_node_mp/launch_server_leader.sh"
```
```sh
# Retry until the worker node connects to the head node or the timeout expires.
for (( i=0; i < $ray_init_timeout; i+=5 )); do
    ray start --address=$ray_address:$ray_port --block "${start_params[@]}"
```
The `worker` subcommand uses `ray start --block`, which will cause the script to hang. The `--block` flag prevents the command from returning until the process is manually terminated. This means the retry loop will never continue, and the script will not function as intended.
```diff
-ray start --address=$ray_address:$ray_port --block "${start_params[@]}"
+ray start --address=$ray_address:$ray_port "${start_params[@]}"
```
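A minimal sketch of how the corrected retry loop could look, assuming `ray_address`, `ray_port`, `ray_init_timeout`, and `start_params` are defined by the surrounding script:

```sh
# Sketch only: retry until `ray start` succeeds or the timeout expires.
# Without --block the command returns, so the loop can actually retry.
for (( i=0; i < ray_init_timeout; i+=5 )); do
    if ray start --address="$ray_address:$ray_port" "${start_params[@]}"; then
        echo "Worker joined the Ray cluster at $ray_address:$ray_port"
        break
    fi
    echo "Head node not reachable yet, retrying in 5s (${i}s elapsed)..."
    sleep 5
done
```

Note that once the worker has joined, the container still needs a long-running foreground process (for example, `sleep infinity` or tailing the Ray logs) to keep the pod alive.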
```sh
# this obtained through ifconfig
# nic_name is the network interface name corresponding to local_ip
local_ip=$(hostname -I | awk '{print $1}')
nic_name=eth0
```
The network interface name `nic_name` is hardcoded to `eth0`. This is not guaranteed to be correct in all environments, including Kubernetes pods, which can lead to script failures if the interface has a different name. It's better to determine the interface name dynamically.
```diff
-nic_name=eth0
+nic_name=$(ip -o addr show | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')
```
```sh
#!/bin/bash

local_ip=$(hostname -I | awk '{print $1}')
nic_name=eth0
```
The network interface name `nic_name` is hardcoded to `eth0`. This is brittle and will fail in environments where the interface has a different name. The interface name should be determined dynamically based on the local IP address.
```diff
-nic_name=eth0
+nic_name=$(ip -o addr show | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')
```
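A quick way to sanity-check the dynamic lookup inside a pod; a sketch that only assumes the `ip` and `awk` utilities are available in the container image:

```sh
# Sketch only: print the detected IP/interface pair for verification.
local_ip=$(hostname -I | awk '{print $1}')
nic_name=$(ip -o addr show | awk -v ip="$local_ip" '/inet / && $4 ~ ip {print $2}')
echo "local_ip=$local_ip nic_name=$nic_name"
```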
```yaml
# readinessProbe:
#   tcpSocket:
#     port: 8080
#   initialDelaySeconds: 15
#   periodSeconds: 10
```
The `readinessProbe` for the leader container is commented out. Without it, Kubernetes cannot determine when the server is ready to accept requests. This can lead to flaky CI jobs if subsequent steps try to connect to the server before it's fully initialized. A readiness probe is essential for a reliable service.
```yaml
readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 10
  failureThreshold: 30
```
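For CI steps that run outside Kubernetes' probe machinery, a similar readiness gate can be scripted. A hedged sketch, assuming the leader serves an HTTP API with a `/health` endpoint on port 8080:

```sh
# Sketch only: block until the server answers its health endpoint,
# or give up after ~10 minutes.
for i in $(seq 1 60); do
    if curl -sf http://localhost:8080/health > /dev/null; then
        echo "Server is ready"
        break
    fi
    echo "Waiting for server readiness (attempt $i/60)..."
    sleep 10
done
```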
Signed-off-by: wangli <[email protected]>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
What this PR does / why we need it?
The purpose of this PR is to add multi-node CI integrated with k8s. For now, it uses 2 * 16 A3 nodes to test DeepSeek multi-DP. The main implementation ideas are:
- Run `kubectl apply -f lws.yaml` to launch the server.
- The env `LWS_LEADER_ADDRESS` is injected in advance before the pod starts, so the worker pod can access the leader node's IP (a usage sketch follows).
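A hypothetical sketch of how a worker launch script can consume that injected variable; the Ray GCS port `6379` is an assumption, not taken from this PR:

```sh
# Sketch only: LWS_LEADER_ADDRESS is injected by LeaderWorkerSet;
# the worker uses it to join the leader's Ray cluster.
ray start --address="${LWS_LEADER_ADDRESS}:6379"
```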
Does this PR introduce any user-facing change?
How was this patch tested?