Fix safetensors validation to catch corruption after download #14465
## Problem

The safetensors validation for corrupted files only ran:

1. When `SGLANG_IS_IN_CI=true` was set (missing from GPU workflows)
2. Only for cached files, not for freshly downloaded files

This caused CI failures like:

```
safetensors_rust.SafetensorError: Error while deserializing header: invalid JSON in header: EOF while parsing a value at line 1 column 0
```

## Solution

1. **Always validate local cache first** - Removed the `is_in_ci()` check around `find_local_hf_snapshot_dir()` so validation runs regardless of environment
2. **Add post-download validation** - New `_validate_weights_after_download()` function validates safetensors files immediately after `snapshot_download()` completes, catching truncated downloads or network corruption
3. **Add SGLANG_IS_IN_CI to GPU workflows** - Added the environment variable to pr-test.yml and nightly-test-nvidia.yml for consistency with NPU workflows

## Performance Impact

Minimal - validation only reads safetensors headers (few KB), not tensor data. For a 19-shard model, validation takes ~1-2 seconds.
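To illustrate why a header-only check is cheap, here is a minimal sketch of one, assuming the documented safetensors layout (an 8-byte little-endian header length, a JSON header, then tensor data). It is not the PR's `_validate_weights_after_download()`; the names `check_safetensors_header`, `find_corrupted_shards`, and `MAX_HEADER_BYTES` are hypothetical, and it uses only the standard library rather than the `safetensors` package.

```python
import json
import struct
from pathlib import Path

# Hypothetical cap, used only to reject obviously bogus header lengths.
MAX_HEADER_BYTES = 100 * 1024 * 1024


def check_safetensors_header(path):
    """Return None if the header looks sane, else a short error string.

    Hypothetical helper: reads only the length prefix and the JSON header,
    never the tensor data itself.
    """
    path = Path(path)
    file_size = path.stat().st_size
    with path.open("rb") as f:
        prefix = f.read(8)
        if len(prefix) < 8:
            return "file too short for the 8-byte header length"
        (header_len,) = struct.unpack("<Q", prefix)
        if header_len > MAX_HEADER_BYTES or 8 + header_len > file_size:
            return "declared header length exceeds file size"
        raw = f.read(header_len)
    try:
        header = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        return f"invalid JSON in header: {exc}"
    if not isinstance(header, dict):
        return "header is not a JSON object"
    data_end = 0
    for name, info in header.items():
        if name == "__metadata__":  # optional free-form metadata entry
            continue
        offsets = info.get("data_offsets") if isinstance(info, dict) else None
        if not (isinstance(offsets, list) and len(offsets) == 2
                and all(isinstance(o, int) for o in offsets)):
            return f"tensor {name!r} has malformed data_offsets"
        data_end = max(data_end, offsets[1])
    # Offsets are relative to the first byte after the header.
    if 8 + header_len + data_end > file_size:
        return "tensor data is truncated"
    return None


def find_corrupted_shards(snapshot_dir):
    """Map shard file name to error string for every shard that fails."""
    bad = {}
    for shard in sorted(Path(snapshot_dir).glob("*.safetensors")):
        err = check_safetensors_header(shard)
        if err is not None:
            bad[shard.name] = err
    return bad
```

A caller could run `find_corrupted_shards()` on the snapshot directory right after a download finishes and treat a non-empty result as a reason to re-download; what the PR actually does on failure is not shown here.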
4748d17 to 4e5b4be
Local Test Results

Tested both BEFORE download (cache validation) and AFTER download (post-download validation) scenarios, using a local test for the safetensors validation logic.

The validation correctly catches the exact error seen in CI.
Summary
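- Always validate the local cache (removed the `is_in_ci()` gate around validation)
- Add the `_validate_weights_after_download()` function
- Add `SGLANG_IS_IN_CI=true` to GPU CI workflows (pr-test.yml, nightly-test-nvidia.yml)

Problem

The safetensors validation only ran when `SGLANG_IS_IN_CI=true` was set, but this env var was missing from GPU workflows. Additionally, validation only checked cached files; freshly downloaded files were never validated.

This caused CI failures such as `safetensors_rust.SafetensorError: Error while deserializing header: invalid JSON in header: EOF while parsing a value at line 1 column 0`.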
Related CI failure: https://github.com/sgl-project/sglang/actions/runs/19948236909/job/57203231359
Test plan