Commit 0c81b6f
Fix safetensors validation to catch corruption after download
## Problem
The safetensors validation for corrupted files only ran:
1. When `SGLANG_IS_IN_CI=true` was set (missing from GPU workflows)
2. Only for cached files, not for freshly downloaded files
This caused CI failures like:
```
safetensors_rust.SafetensorError: Error while deserializing header:
invalid JSON in header: EOF while parsing a value at line 1 column 0
```
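To make the failure mode concrete, here is a minimal sketch using only the Python standard library. It relies on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header). The `parse_header` helper is hypothetical and is not part of SGLang or the safetensors package. The zero-filled input is an assumed cause chosen for illustration; the commit does not say how the CI files were corrupted.

```python
import json
import struct


def parse_header(blob: bytes) -> dict:
    """Parse the JSON header of a safetensors-style byte string (illustrative only)."""
    if len(blob) < 8:
        raise ValueError("file too small to hold the 8-byte header length")
    # The first 8 bytes are an unsigned little-endian 64-bit header length.
    (header_len,) = struct.unpack("<Q", blob[:8])
    header_bytes = blob[8 : 8 + header_len]
    if len(header_bytes) < header_len:
        raise ValueError("header is truncated")
    return json.loads(header_bytes.decode("utf-8"))


# A zero-filled file declares a header length of 0, so the JSON parser
# receives empty input and fails at the very first character.
zero_filled = bytes(64)
failure = None
try:
    parse_header(zero_filled)
except json.JSONDecodeError as exc:
    failure = str(exc)

# A well-formed (empty) header parses without error.
empty_header = json.dumps({}).encode("utf-8")
parsed = parse_header(struct.pack("<Q", len(empty_header)) + empty_header)
```

Python's `json` module reports the empty input as "Expecting value", which is the same class of failure as the "EOF while parsing a value" message above.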
## Solution
1. **Always validate local cache first** - Removed the `is_in_ci()` check
around `find_local_hf_snapshot_dir()` so validation runs regardless
of environment
2. **Add post-download validation** - New `_validate_weights_after_download()`
function validates safetensors files immediately after `snapshot_download()`
completes, catching truncated downloads or network corruption
3. **Add SGLANG_IS_IN_CI to GPU workflows** - Added the environment variable
to pr-test.yml and nightly-test-nvidia.yml for consistency with NPU workflows
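As a rough illustration of what a header-only check could look like, the sketch below validates every `*.safetensors` file in a directory without reading tensor data. It is not the code from this commit: `check_safetensors_file` and `find_corrupted` are hypothetical names, and the size check assumes the documented safetensors layout (8-byte length, JSON header, then a data buffer whose extent is given by `data_offsets`).

```python
import json
import os
import struct
import tempfile
from pathlib import Path
from typing import List, Optional


def check_safetensors_file(path: Path) -> Optional[str]:
    """Return None if the file looks intact, else a short reason string."""
    size = os.path.getsize(path)
    if size < 8:
        return "file shorter than the 8-byte header length field"
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        if 8 + header_len > size:
            return "declared header length exceeds file size"
        try:
            header = json.loads(f.read(header_len).decode("utf-8"))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            return f"invalid JSON header: {exc}"
    if not isinstance(header, dict):
        return "header is not a JSON object"
    # The largest end offset tells us how long the data buffer must be.
    data_end = 0
    for name, info in header.items():
        if name == "__metadata__":
            continue
        try:
            data_end = max(data_end, int(info["data_offsets"][1]))
        except (KeyError, IndexError, TypeError, ValueError):
            return f"malformed entry for tensor {name!r}"
    if 8 + header_len + data_end != size:
        return "file size does not match tensor offsets (truncated download?)"
    return None


def find_corrupted(snapshot_dir: Path) -> List[Path]:
    """List safetensors shards in snapshot_dir that fail the header check."""
    return sorted(
        p
        for p in snapshot_dir.glob("*.safetensors")
        if check_safetensors_file(p) is not None
    )


# Demo: one intact shard and one shard truncated by 4 bytes.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    header = json.dumps(
        {"w": {"dtype": "F32", "shape": [2], "data_offsets": [0, 8]}}
    ).encode("utf-8")
    intact = struct.pack("<Q", len(header)) + header + bytes(8)
    (root / "model-00001-of-00002.safetensors").write_bytes(intact)
    (root / "model-00002-of-00002.safetensors").write_bytes(intact[:-4])
    corrupted = [p.name for p in find_corrupted(root)]
```

A check like this could run both on a cached snapshot directory and on the directory returned by a fresh download. What to do with the shards it flags (for example, deleting and re-fetching them) is left to the caller.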
## Performance Impact
Minimal - validation only reads safetensors headers (few KB), not tensor data.
For a 19-shard model, validation takes ~1-2 seconds.

1 parent: 4c5074e
3 files changed (+70, -9 lines) in:
- .github/workflows
- python/sglang/srt/model_loader