Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels#28214
Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels#28214titaiwangms wants to merge 10 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
This PR completes the RotaryEmbedding position_ids bounds-checking security fix for the WebGPU EP and the JS WebGPU implementation, aligning them with the earlier CPU/CUDA hardening work.
Changes:
- Add host-side
position_idsbounds validation to WebGPU RotaryEmbedding kernels (ONNX + contrib/MS domain), plus shader-side pass-through on OOB as defense-in-depth. - Add TypeScript-side
position_idsbounds validation for the JS WebGPU RotaryEmbedding op. - Add new WebGPU-focused unit tests covering negative and out-of-range
position_idscases.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc | Adds host-side position_ids bounds validation and forces CPU input for validation. |
| onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc | Adds host-side bounds validation, marks position_ids as CPU input, and adds shader-side OOB pass-through. |
| js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts | Adds JS-side bounds checks for position_ids using BigInt64 reads. |
| onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc | Adds ONNX-domain WebGPU EP failure tests for OOB/negative position_ids. |
| onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc | Adds contrib/MS-domain WebGPU EP failure tests for OOB/negative position_ids. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
tianleiwu
left a comment
There was a problem hiding this comment.
Review Summary
The validation logic and shader-side defense-in-depth checks are well-designed and mirror the CPU/CUDA approach correctly. The test coverage is thorough. However, both kernel registrations have a critical issue that will cause a runtime crash.
Critical: InputMemoryType(OrtMemTypeCPUInput) + AddInputs() incompatibility
Both kernel registrations add .InputMemoryType(OrtMemTypeCPUInput, ...) to keep position_ids on CPU for host-side validation. However, ComputeInternal still passes position_ids to AddInputs() for the shader. The WebGPU program execution path in webgpu_context.cc (around L464) does:
bind_buffers.push_back(reinterpret_cast<WGPUBuffer>(const_cast<void*>(inputs[i].tensor->DataRaw())));This assumes all program inputs have GPU buffer handles in DataRaw(). A CPU-resident tensor's DataRaw() returns a heap pointer, producing an invalid WGPUBuffer handle that will crash or corrupt at dispatch.
No existing WebGPU kernel combines InputMemoryType(OrtMemTypeCPUInput) with AddInputs() on the same tensor. Existing kernels (Reduce, Pad, Resize, GQA, etc.) read CPU-pinned inputs for host-side configuration only and never submit them as shader bindings.
Recommended fix
The simplest approach is to remove both InputMemoryType directives and the host-side C++ validation, relying solely on the shader-side defense-in-depth checks (which are already correct in this PR). This matches CUDA's approach: device-side bounds checking with pass-through on OOB. The TS validation in rotary-embedding.ts can remain since it runs on CPU during validateInputs() before shader dispatch.
If host-side rejection is desired, the kernel would need to explicitly copy position_ids from GPU to CPU (via DataTransferManager) for validation, then pass the original GPU tensor to AddInputs().
What's good
- Shader-side OOB checks:
raw_pos < 0catches negative i32 from truncated int64;position_id >= max_positioncatches OOB after u32 conversion + offset. Pass-through behavior matches CUDA. FusedQKRotaryEmbeddingProgrambounds check on computedposition_idcorrectly handles both Q and K outputs, including the conditional K-head guard.- TS validation using
getBigInt64Array()avoids int64 truncation issues. - 7 new tests cover format-0, format-1, negative, and batch OOB comprehensively.
d1cc33c to
5920aa7
Compare
tianleiwu
left a comment
There was a problem hiding this comment.
Round-2 review (head 9744bfb).
Summary
The high-severity concern from the prior round — InputMemoryType(OrtMemTypeCPUInput) + AddInputs(position_ids) would crash because the same tensor cannot be both CPU-resident and bound as a WebGPU storage buffer — has been resolved. The author removed the host-side C++ value scanning and InputMemoryType directives and now relies on shader-side defense-in-depth (OOB → pass-through input) for both contrib and ONNX-domain WebGPU kernels, plus host-side validation on the JS path via getBigInt64Array() in validateInputs. Tests were converted to kExpectSuccess with output == input. The pass-through semantics now match the CUDA kernel for cross-EP consistency.
Resolving my two prior threads:
contrib_ops/webgpu/bert/rotary_embedding.cc(shader as primary protection) — addressed.test/contrib_ops/rotary_embedding_op_test.cc(tests must verify pass-through, not failure) — addressed.
Remaining items (all nits, non-blocking)
1. Shader comment understates what raw_pos < 0 catches — onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc
The WebGPU shader storage representation truncates int64 to i32 (see ProgramVariableDataType::Int64 → value type i32 in core/providers/webgpu/shader_variable.cc). Any positive int64 value greater than INT32_MAX will alias to a negative i32 and be caught by the raw_pos < 0 branch. That is safe (still pass-through), but the inline comment only mentions "negative position_ids". Consider expanding the comment to note that very large positive int64 values also fall into this branch by virtue of i32 truncation, so a future maintainer doesn't think the check is incomplete.
2. JSEP getBigInt64Array alignment fallback returns a read-only copy — js/web/lib/wasm/jsep/init.ts and js/web/lib/wasm/jsep/tensor-view.ts
The inline comment in init.ts correctly notes that the unaligned path returns a read-only copy (mutations don't propagate to the WASM heap). All current call sites (pad.ts, split.ts, reduce.ts, slice.ts, tile.ts, expand.ts, cumsum.ts, resize.ts, gather-block-quantized.ts, the new rotary-embedding.ts) are read-only iterations, so this is safe today. Consider promoting that read-only caveat into the TensorView.getBigInt64Array() JSDoc in tensor-view.ts so future consumers don't accidentally write through the returned view expecting WASM-heap propagation.
3. Format-0 detection in TS is contrib-op-specific — js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts
positionIdsElementCount === 1 triggers the format-0 base-offset branch. The ONNX-domain RotaryEmbedding schema requires rank-2 position_ids (no base-offset mode). If this validateInputs is shared with the ONNX-domain op (now or later), the format-0 branch should be guarded by op identity. A short comment confirming the JSEP path scope would prevent silent acceptance of malformed shapes against the ONNX-domain op.
4. Pre-existing related gap (informational) — separately noted by the Copilot reviewer at L159 of contrib_ops/webgpu/bert/rotary_embedding.cc: when position_ids is provided, CPU/CUDA reject sequence_length > max_sequence_length with NOT_IMPLEMENTED, while the WebGPU contrib path (after this PR) silently produces pass-through for the OOB tokens. This was the case before this PR too. Worth tracking as a follow-up if cross-EP behavior parity is desired here, but it does not need to block this security fix.
Verdict
Looks good. The shader bounds checks are correctly designed (both for the standalone RotaryEmbeddingProgram and the FusedQKRotaryEmbeddingProgram Q+K pass-through path), tests are appropriate, and the JS validation is functionally correct.
Add shader-side bounds checks to the WebGPU RotaryEmbedding and FusedQKRotaryEmbedding GPU shaders to prevent out-of-bounds reads from cos_cache/sin_cache when position_ids values exceed the cache dimensions. For RotaryEmbeddingProgram: - Check raw_pos < 0 to catch negative position_ids (i32 from truncated int64 avoids u32 wraparound bypass) - Check position_id >= cos_cache_shape[0] after u32 conversion and sequence offset addition - On OOB, pass through input unchanged (matches CUDA kernel behavior) For FusedQKRotaryEmbeddingProgram: - Check position_id >= cos_cache_shape[0] before accessing cos/sin cache - On OOB, pass through both Q and K inputs unchanged This complements the CPU and CUDA fixes from PR #27597 (commit 056bab3) which missed the WebGPU execution provider. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>
Add host-side validation of position_ids values before shader dispatch
in all three WebGPU RotaryEmbedding implementations. This prevents
out-of-bounds reads from cos_cache/sin_cache when position_ids values
exceed the cache dimensions.
Changes:
1. contrib_ops/webgpu/bert/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 1) to keep position_ids
on CPU for validation
- Add bounds checking in ComputeInternal() before shader dispatch:
format 0 (scalar): base_pos in [0, max_seq_len - seq_len]
format 1 (2D array): each value in [0, max_sequence_length)
- Returns INVALID_ARGUMENT error on violation
- Shader-side bounds checks remain as defense-in-depth
2. core/providers/webgpu/llm/rotary_embedding.cc:
- Add InputMemoryType(OrtMemTypeCPUInput, 3) for optional
position_ids input
- Add bounds checking in the position_ids != nullptr branch
- Returns INVALID_ARGUMENT error on violation
3. js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts:
- Add value validation in validateInputs() using getBigInt64Array()
- Validates both format 0 (scalar offset) and format 1 (2D array)
- Throws Error with descriptive message on violation
All three implementations follow the same validation pattern as the
CPU contrib fix (PR #27597), returning errors rather than silently
passing through.
Co-authored-by: Copilot <[email protected]>
Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
Add WebGPU-targeted OOB unit tests to both contrib (kMSDomain) and ONNX domain test files. Tests verify that out-of-bounds, negative, and format-0 overflow position_ids values are rejected with INVALID_ARGUMENT, matching the host-side validation added to the WebGPU RotaryEmbedding kernels. Tests gracefully skip via GTEST_SKIP() when WebGPU EP is not available. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>
Address readability review feedback: - FusedQK shader: clarify why no negative check (position_id derived from past_seqlen + sequence_idx, always non-negative) - ONNX domain kernel: clarify why no format 0 check (ONNX RotaryEmbedding always uses explicit position_ids, no base-offset mode) Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>
Address code review finding: the shared shader treats single-element position_ids as format 0 (base offset + sequence_idx), so the ONNX domain host-side validation must also check that base_pos + sequence_length - 1 < max_sequence_length. Also add corresponding format-0 OOB WebGPU test case. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>
Fix Web CI precheck: run prettier on the TypeScript file to match the project's JS/TS formatting requirements. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>
TensorViewImpl.getBigInt64Array() threw RangeError when the WASM heap offset was not 8-byte aligned. Fix by detecting unaligned offsets and copying bytes into an aligned buffer before creating the BigInt64Array. This fixes the Web CI failure where all RotaryEmbedding tests failed with 'start offset of BigInt64Array should be a multiple of 8'. Co-authored-by: Copilot <[email protected]>
Remove InputMemoryType(OrtMemTypeCPUInput) from both WebGPU kernel registrations (contrib and ONNX domain) and the associated host-side position_ids value scanning. InputMemoryType is incompatible with AddInputs() — a CPU tensor's DataRaw() would be cast to WGPUBuffer, causing a crash at dispatch time. Defense strategy is now: - Shader-side: WGSL bounds checks pass through input unchanged on OOB (same as CUDA kernel behavior) - JSEP/browser: TypeScript validation in rotary-embedding.ts catches OOB before shader dispatch - init.ts: getBigInt64Array() handles unaligned WASM heap offsets WebGPU OOB tests changed from kExpectFailure to kExpectSuccess, verifying pass-through behavior (output equals input on OOB). ONNX domain tests updated to use rank-2 position_ids for cross-EP consistency. TS validation reordered per Copilot review: sequence_length check before per-value bounds validation. Co-authored-by: Copilot <[email protected]>
position_ids tensor data is GPU-resident in the WebGPU/JSEP path — its TensorView.data field is a GPU buffer ID, not a WASM heap pointer. Calling getBigInt64Array() reads from a random WASM heap location, returning garbage values that cause ALL existing RotaryEmbedding JSEP tests to fail validation. Other ops (pad, split, tile, etc.) can safely call getBigInt64Array() because their int64 inputs use InputMemoryType(OrtMemTypeCPU) which keeps data on CPU with valid WASM heap pointers. RotaryEmbedding cannot use InputMemoryType for position_ids because AddInputs() requires GPU buffers. Defense strategy: shader-side WGSL bounds checks handle OOB position_ids (pass-through behavior), which works regardless of CPU/GPU residency. Co-authored-by: Copilot <[email protected]>
5c2800a to
e5a824e
Compare
This PR adds position_ids bounds checking to WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR #27597 (commit 056bab3) which covered CPU and CUDA.
Problem
The
com.microsoft::RotaryEmbeddingkernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR #27597 fixed CPU and CUDA paths, WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads).Changes
Security
Addresses the same vulnerability as #27597 (OOB read via position_ids, CVSS 7.5-9.1) for WebGPU/JS execution providers.
Testing