Add Mistral Large 3 to nightly CI tests #14459
Add nightly performance test for the mistralai/Mistral-Large-3-675B-Instruct-2512 model with:
- TP=8
- trtllm_mla attention backend
- Multithread loading enabled
- Mistral chat template
- SGLANG_ENABLE_JIT_DEEPGEMM=0
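As a rough illustration, the settings listed above could be assembled into a server launch command along these lines. This is only a sketch: the `python3 -m sglang.launch_server` entry point and the `--model-path` and `--tp` flag spellings are assumptions not stated in this PR, and `build_launch_command` and `build_env` are hypothetical helper names. The script only builds and prints the command; it does not start a server.

```python
import json
import os
import shlex

MODEL = "mistralai/Mistral-Large-3-675B-Instruct-2512"


def build_launch_command():
    """Assemble the argv list for the configuration described in the PR."""
    # Serialize the loader config so the JSON quoting is always valid.
    extra_config = json.dumps({"enable_multithread_load": True})
    return [
        "python3", "-m", "sglang.launch_server",  # assumed entry point
        "--model-path", MODEL,                     # assumed flag name
        "--tp", "8",                               # assumed flag name for TP=8
        "--attention-backend", "trtllm_mla",
        "--model-loader-extra-config", extra_config,
        "--chat-template", "mistral",
    ]


def build_env():
    """Copy the current environment and disable JIT DeepGEMM for the child process."""
    env = dict(os.environ)
    env["SGLANG_ENABLE_JIT_DEEPGEMM"] = "0"
    return env


if __name__ == "__main__":
    # Print a shell-quoted version of the command for inspection.
    print(shlex.join(build_launch_command()))
```

Passing the environment as a copy (for example to `subprocess.Popen(..., env=build_env())`) keeps the variable scoped to the server process, so it does not leak into other tests in the same run.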
Testing at: https://github.com/sgl-project/sglang/actions/runs/19956632530/job/57227329802
Testing (with accuracy test) at: https://github.com/sgl-project/sglang/actions/runs/19961215892/job/57241902940
- Move Mistral test to run immediately after dependency installation
- Add `if: always()` to DeepSeek v3.1 since it's no longer first
- Remove duplicate Mistral test from end of workflow

This allows faster debugging of the Mistral test and ensures other tests continue running regardless of the Mistral test outcome.
Multiple TP ranks calling find_local_hf_snapshot_dir simultaneously would all detect corrupted/missing files and race to delete the same cache directory. One rank would succeed, others would fail with "No such file or directory" errors. Fix by wrapping the validation and cleanup logic in a file lock (using suffix "-validation" to avoid conflict with download lock). After acquiring the lock, re-check if the snapshot dir still exists since another process may have already cleaned it up.
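The lock-then-re-check pattern described above can be sketched in isolation as follows. This is not the SGLang implementation: it uses the standard-library `fcntl.flock` (POSIX only) in place of the project's `get_lock` helper, and `validation_lock`, `is_corrupted`, `validate_and_cleanup`, `simulate_ranks`, and the `weights.ok` marker file are all hypothetical names chosen for illustration. Threads stand in for TP ranks.

```python
import fcntl
import os
import shutil
import tempfile
import threading
from contextlib import contextmanager


@contextmanager
def validation_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path (POSIX only)."""
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def is_corrupted(snapshot_dir):
    # Hypothetical check: a snapshot without this marker file counts as corrupted.
    return not os.path.exists(os.path.join(snapshot_dir, "weights.ok"))


def validate_and_cleanup(snapshot_dir, lock_path):
    """Return True if this caller removed a corrupted snapshot."""
    with validation_lock(lock_path):
        # Re-check after acquiring the lock: another rank may already have cleaned up.
        if not os.path.isdir(snapshot_dir):
            return False
        if is_corrupted(snapshot_dir):
            shutil.rmtree(snapshot_dir)
            return True
        return False


def simulate_ranks(num_ranks=8):
    """Run the cleanup concurrently from num_ranks threads and collect outcomes."""
    root = tempfile.mkdtemp()
    snapshot_dir = os.path.join(root, "snapshot")
    os.makedirs(snapshot_dir)  # no marker file, so it looks corrupted
    lock_path = os.path.join(root, "snapshot-validation.lock")
    results, errors = [], []

    def rank():
        try:
            results.append(validate_and_cleanup(snapshot_dir, lock_path))
        except OSError as exc:
            errors.append(exc)

    threads = [threading.Thread(target=rank) for _ in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    shutil.rmtree(root, ignore_errors=True)
    return results, errors
```

With the lock in place, exactly one caller deletes the directory and the rest see that it is already gone and return without error. Using a separate lock file for validation mirrors the "-validation" suffix in the commit message, which keeps this lock from contending with the download lock.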
Download the model before running the test to avoid server launch timeout issues. The 120-minute timeout allows for full download on first run. Subsequent runs will use the cached model and skip quickly.
        if "SGLANG_ENABLE_JIT_DEEPGEMM" in os.environ:
            del os.environ["SGLANG_ENABLE_JIT_DEEPGEMM"]

    def test_bench_one_batch(self):
Can we also add an accuracy test for it?
Yes, I can do that. What score threshold should we expect? How do we determine this value?
Just added a `test_accuracy_mgsm` method using the same eval framework as other text models. Set the initial threshold to 0.90 (placeholder). Will calibrate the threshold after observing actual model performance.
MGSM accuracy for mistralai/Mistral-Large-3-675B-Instruct-2512: 0.972
https://github.com/sgl-project/sglang/actions/runs/19961215892/job/57241902940#step:5:2471
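For context, a threshold check of this kind can be sketched as a plain `unittest` case. The names below are hypothetical: `run_mgsm_eval` is a stand-in that simply returns the score reported in this thread rather than calling the project's eval framework, and `MGSM_THRESHOLD` holds the 0.90 placeholder mentioned earlier.

```python
import unittest

# Placeholder threshold from the review thread; to be calibrated later.
MGSM_THRESHOLD = 0.90


def run_mgsm_eval():
    # Hypothetical stand-in for the real eval call; returns the score
    # reported for mistralai/Mistral-Large-3-675B-Instruct-2512 in CI.
    return 0.972


class TestMistralLarge3Accuracy(unittest.TestCase):
    def test_accuracy_mgsm(self):
        score = run_mgsm_eval()
        # Fail the nightly run if accuracy drops below the threshold.
        self.assertGreaterEqual(score, MGSM_THRESHOLD)


if __name__ == "__main__":
    unittest.main()
```

The reported 0.972 clears the 0.90 placeholder with room to spare, so a tighter calibrated threshold would catch regressions that the placeholder would miss.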
# validating and cleaning up the same model cache simultaneously.
# This prevents race conditions where multiple ranks detect corruption
# and try to delete the same files at the same time.
with get_lock(model_name_or_path, cache_dir, suffix="-validation"):
Why do we need this change for the Mistral Large 3 model?
It's a general bug fix discovered while debugging the Mistral test. When model weights are corrupted or missing, all 8 TP ranks detect the issue simultaneously and race to delete the cache directory. One rank succeeds; the others fail with "No such file or directory" errors. This was causing CI failures.
The fix adds a file lock so only one rank performs validation and cleanup at a time. Other models just happened to have complete caches, so the bug wasn't triggered.
Added test_accuracy_mgsm method using the same eval framework as other text models. Set initial threshold to 0.90 (placeholder). Will calibrate the threshold after observing actual model performance.
Summary
- mistralai/Mistral-Large-3-675B-Instruct-2512
- --attention-backend trtllm_mla
- --model-loader-extra-config '{"enable_multithread_load": true}'
- --chat-template mistral
- SGLANG_ENABLE_JIT_DEEPGEMM=0
Test plan