Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels by titaiwangms · Pull Request #28214 · microsoft/onnxruntime

titaiwangms · 2026-04-23T20:38:08Z

This PR adds position_ids bounds checking to WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR #27597 (commit 056bab3) which covered CPU and CUDA.

Problem

The com.microsoft::RotaryEmbedding kernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR #27597 fixed CPU and CUDA paths, WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads).

Changes

contrib_ops/webgpu/bert/rotary_embedding.cc: Host-side validation (ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
core/providers/webgpu/llm/rotary_embedding.cc: Host-side validation with format-0 awareness
js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts: TypeScript validation using getBigInt64Array
7 new C++ OOB test cases across contrib and ONNX domains targeting WebGPU EP

Security

Addresses the same vulnerability as #27597 (OOB read via position_ids, CVSS 7.5-9.1) for WebGPU/JS execution providers.

Testing

7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when WebGPU EP unavailable
JS/TS error tests not feasible with current JSONC test format (documented)
Build environment lacks C++20/emsdk for full compilation verification; validated structurally

github-actions

You can commit the suggested changes from lintrunner.

Copilot

Pull request overview

This PR completes the RotaryEmbedding position_ids bounds-checking security fix for the WebGPU EP and the JS WebGPU implementation, aligning them with the earlier CPU/CUDA hardening work.

Changes:

Add host-side position_ids bounds validation to WebGPU RotaryEmbedding kernels (ONNX + contrib/MS domain), plus shader-side pass-through on OOB as defense-in-depth.
Add TypeScript-side position_ids bounds validation for the JS WebGPU RotaryEmbedding op.
Add new WebGPU-focused unit tests covering negative and out-of-range position_ids cases.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc	Adds host-side `position_ids` bounds validation and forces CPU input for validation.
onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc	Adds host-side bounds validation, marks `position_ids` as CPU input, and adds shader-side OOB pass-through.
js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts	Adds JS-side bounds checks for `position_ids` using BigInt64 reads.
onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc	Adds ONNX-domain WebGPU EP failure tests for OOB/negative `position_ids`.
onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc	Adds contrib/MS-domain WebGPU EP failure tests for OOB/negative `position_ids`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

tianleiwu

Review Summary

The validation logic and shader-side defense-in-depth checks are well-designed and mirror the CPU/CUDA approach correctly. The test coverage is thorough. However, both kernel registrations have a critical issue that will cause a runtime crash.

Critical: `InputMemoryType(OrtMemTypeCPUInput)` + `AddInputs()` incompatibility

Both kernel registrations add .InputMemoryType(OrtMemTypeCPUInput, ...) to keep position_ids on CPU for host-side validation. However, ComputeInternal still passes position_ids to AddInputs() for the shader. The WebGPU program execution path in webgpu_context.cc (around L464) does:

bind_buffers.push_back(reinterpret_cast<WGPUBuffer>(const_cast<void*>(inputs[i].tensor->DataRaw())));

This assumes all program inputs have GPU buffer handles in DataRaw(). A CPU-resident tensor's DataRaw() returns a heap pointer, producing an invalid WGPUBuffer handle that will crash or corrupt at dispatch.

No existing WebGPU kernel combines InputMemoryType(OrtMemTypeCPUInput) with AddInputs() on the same tensor. Existing kernels (Reduce, Pad, Resize, GQA, etc.) read CPU-pinned inputs for host-side configuration only and never submit them as shader bindings.

Recommended fix

The simplest approach is to remove both InputMemoryType directives and the host-side C++ validation, relying solely on the shader-side defense-in-depth checks (which are already correct in this PR). This matches CUDA's approach: device-side bounds checking with pass-through on OOB. The TS validation in rotary-embedding.ts can remain since it runs on CPU during validateInputs() before shader dispatch.

If host-side rejection is desired, the kernel would need to explicitly copy position_ids from GPU to CPU (via DataTransferManager) for validation, then pass the original GPU tensor to AddInputs().

What's good

Shader-side OOB checks: raw_pos < 0 catches negative i32 from truncated int64; position_id >= max_position catches OOB after u32 conversion + offset. Pass-through behavior matches CUDA.
FusedQKRotaryEmbeddingProgram bounds check on computed position_id correctly handles both Q and K outputs, including the conditional K-head guard.
TS validation using getBigInt64Array() avoids int64 truncation issues.
7 new tests cover format-0, format-1, negative, and batch OOB comprehensively.

tianleiwu

Round-2 review (head 9744bfb).

Summary

The high-severity concern from the prior round — InputMemoryType(OrtMemTypeCPUInput) + AddInputs(position_ids) would crash because the same tensor cannot be both CPU-resident and bound as a WebGPU storage buffer — has been resolved. The author removed the host-side C++ value scanning and InputMemoryType directives and now relies on shader-side defense-in-depth (OOB → pass-through input) for both contrib and ONNX-domain WebGPU kernels, plus host-side validation on the JS path via getBigInt64Array() in validateInputs. Tests were converted to kExpectSuccess with output == input. The pass-through semantics now match the CUDA kernel for cross-EP consistency.

Resolving my two prior threads:

contrib_ops/webgpu/bert/rotary_embedding.cc (shader as primary protection) — addressed.
test/contrib_ops/rotary_embedding_op_test.cc (tests must verify pass-through, not failure) — addressed.

Remaining items (all nits, non-blocking)

1. Shader comment understates what raw_pos < 0 catches — onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc

The WebGPU shader storage representation truncates int64 to i32 (see ProgramVariableDataType::Int64 → value type i32 in core/providers/webgpu/shader_variable.cc). Any positive int64 value greater than INT32_MAX will alias to a negative i32 and be caught by the raw_pos < 0 branch. That is safe (still pass-through), but the inline comment only mentions "negative position_ids". Consider expanding the comment to note that very large positive int64 values also fall into this branch by virtue of i32 truncation, so a future maintainer doesn't think the check is incomplete.

2. JSEP getBigInt64Array alignment fallback returns a read-only copy — js/web/lib/wasm/jsep/init.ts and js/web/lib/wasm/jsep/tensor-view.ts

The inline comment in init.ts correctly notes that the unaligned path returns a read-only copy (mutations don't propagate to the WASM heap). All current call sites (pad.ts, split.ts, reduce.ts, slice.ts, tile.ts, expand.ts, cumsum.ts, resize.ts, gather-block-quantized.ts, the new rotary-embedding.ts) are read-only iterations, so this is safe today. Consider promoting that read-only caveat into the TensorView.getBigInt64Array() JSDoc in tensor-view.ts so future consumers don't accidentally write through the returned view expecting WASM-heap propagation.

3. Format-0 detection in TS is contrib-op-specific — js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts

positionIdsElementCount === 1 triggers the format-0 base-offset branch. The ONNX-domain RotaryEmbedding schema requires rank-2 position_ids (no base-offset mode). If this validateInputs is shared with the ONNX-domain op (now or later), the format-0 branch should be guarded by op identity. A short comment confirming the JSEP path scope would prevent silent acceptance of malformed shapes against the ONNX-domain op.

4. Pre-existing related gap (informational) — separately noted by the Copilot reviewer at L159 of contrib_ops/webgpu/bert/rotary_embedding.cc: when position_ids is provided, CPU/CUDA reject sequence_length > max_sequence_length with NOT_IMPLEMENTED, while the WebGPU contrib path (after this PR) silently produces pass-through for the OOB tokens. This was the case before this PR too. Worth tracking as a follow-up if cross-EP behavior parity is desired here, but it does not need to block this security fix.

Verdict

Looks good. The shader bounds checks are correctly designed (both for the standalone RotaryEmbeddingProgram and the FusedQKRotaryEmbeddingProgram Q+K pass-through path), tests are appropriate, and the JS validation is functionally correct.

Add shader-side bounds checks to the WebGPU RotaryEmbedding and FusedQKRotaryEmbedding GPU shaders to prevent out-of-bounds reads from cos_cache/sin_cache when position_ids values exceed the cache dimensions. For RotaryEmbeddingProgram: - Check raw_pos < 0 to catch negative position_ids (i32 from truncated int64 avoids u32 wraparound bypass) - Check position_id >= cos_cache_shape[0] after u32 conversion and sequence offset addition - On OOB, pass through input unchanged (matches CUDA kernel behavior) For FusedQKRotaryEmbeddingProgram: - Check position_id >= cos_cache_shape[0] before accessing cos/sin cache - On OOB, pass through both Q and K inputs unchanged This complements the CPU and CUDA fixes from PR #27597 (commit 056bab3) which missed the WebGPU execution provider. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

Add host-side validation of position_ids values before shader dispatch in all three WebGPU RotaryEmbedding implementations. This prevents out-of-bounds reads from cos_cache/sin_cache when position_ids values exceed the cache dimensions. Changes: 1. contrib_ops/webgpu/bert/rotary_embedding.cc: - Add InputMemoryType(OrtMemTypeCPUInput, 1) to keep position_ids on CPU for validation - Add bounds checking in ComputeInternal() before shader dispatch: format 0 (scalar): base_pos in [0, max_seq_len - seq_len] format 1 (2D array): each value in [0, max_sequence_length) - Returns INVALID_ARGUMENT error on violation - Shader-side bounds checks remain as defense-in-depth 2. core/providers/webgpu/llm/rotary_embedding.cc: - Add InputMemoryType(OrtMemTypeCPUInput, 3) for optional position_ids input - Add bounds checking in the position_ids != nullptr branch - Returns INVALID_ARGUMENT error on violation 3. js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts: - Add value validation in validateInputs() using getBigInt64Array() - Validates both format 0 (scalar offset) and format 1 (2D array) - Throws Error with descriptive message on violation All three implementations follow the same validation pattern as the CPU contrib fix (PR #27597), returning errors rather than silently passing through. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

Add WebGPU-targeted OOB unit tests to both contrib (kMSDomain) and ONNX domain test files. Tests verify that out-of-bounds, negative, and format-0 overflow position_ids values are rejected with INVALID_ARGUMENT, matching the host-side validation added to the WebGPU RotaryEmbedding kernels. Tests gracefully skip via GTEST_SKIP() when WebGPU EP is not available. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

Address readability review feedback: - FusedQK shader: clarify why no negative check (position_id derived from past_seqlen + sequence_idx, always non-negative) - ONNX domain kernel: clarify why no format 0 check (ONNX RotaryEmbedding always uses explicit position_ids, no base-offset mode) Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

Address code review finding: the shared shader treats single-element position_ids as format 0 (base offset + sequence_idx), so the ONNX domain host-side validation must also check that base_pos + sequence_length - 1 < max_sequence_length. Also add corresponding format-0 OOB WebGPU test case. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

Fix Web CI precheck: run prettier on the TypeScript file to match the project's JS/TS formatting requirements. Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

TensorViewImpl.getBigInt64Array() threw RangeError when the WASM heap offset was not 8-byte aligned. Fix by detecting unaligned offsets and copying bytes into an aligned buffer before creating the BigInt64Array. This fixes the Web CI failure where all RotaryEmbedding tests failed with 'start offset of BigInt64Array should be a multiple of 8'. Co-authored-by: Copilot <[email protected]>

Remove InputMemoryType(OrtMemTypeCPUInput) from both WebGPU kernel registrations (contrib and ONNX domain) and the associated host-side position_ids value scanning. InputMemoryType is incompatible with AddInputs() — a CPU tensor's DataRaw() would be cast to WGPUBuffer, causing a crash at dispatch time. Defense strategy is now: - Shader-side: WGSL bounds checks pass through input unchanged on OOB (same as CUDA kernel behavior) - JSEP/browser: TypeScript validation in rotary-embedding.ts catches OOB before shader dispatch - init.ts: getBigInt64Array() handles unaligned WASM heap offsets WebGPU OOB tests changed from kExpectFailure to kExpectSuccess, verifying pass-through behavior (output equals input on OOB). ONNX domain tests updated to use rank-2 position_ids for cross-EP consistency. TS validation reordered per Copilot review: sequence_length check before per-value bounds validation. Co-authored-by: Copilot <[email protected]>

position_ids tensor data is GPU-resident in the WebGPU/JSEP path — its TensorView.data field is a GPU buffer ID, not a WASM heap pointer. Calling getBigInt64Array() reads from a random WASM heap location, returning garbage values that cause ALL existing RotaryEmbedding JSEP tests to fail validation. Other ops (pad, split, tile, etc.) can safely call getBigInt64Array() because their int64 inputs use InputMemoryType(OrtMemTypeCPU) which keeps data on CPU with valid WASM heap pointers. RotaryEmbedding cannot use InputMemoryType for position_ids because AddInputs() requires GPU buffers. Defense strategy: shader-side WGSL bounds checks handle OOB position_ids (pass-through behavior), which works regardless of CPU/GPU residency. Co-authored-by: Copilot <[email protected]>

github-actions Bot reviewed Apr 23, 2026

View reviewed changes

titaiwangms requested review from Copilot and tianleiwu April 23, 2026 21:14

Copilot started reviewing on behalf of titaiwangms April 23, 2026 21:14 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

Megatron2546-sudo approved these changes Apr 23, 2026

View reviewed changes

tianleiwu requested changes Apr 23, 2026

View reviewed changes

Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc

Comment thread onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc

titaiwangms force-pushed the titaiwang/fix-rotary-embedding-oob-read branch from d1cc33c to 5920aa7 Compare April 23, 2026 22:14

This comment was marked as duplicate.

Sign in to view

tianleiwu reviewed Apr 24, 2026

View reviewed changes

titaiwangms and others added 10 commits April 27, 2026 21:00

Apply lintrunner clang-format auto-fixes

d38b5e3

Co-authored-by: Copilot <[email protected]> Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6] Co-authored-by: Copilot <[email protected]>

titaiwangms force-pushed the titaiwang/fix-rotary-embedding-oob-read branch from 5c2800a to e5a824e Compare April 27, 2026 21:03

guschmue added the ep:WebGPU ort-web webgpu provider label Apr 28, 2026

titaiwangms requested a review from tianleiwu April 28, 2026 16:56

Conversation

titaiwangms commented Apr 23, 2026

Problem

Changes

Security

Testing

Uh oh!

github-actions Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tianleiwu left a comment

Choose a reason for hiding this comment

Review Summary

Critical: InputMemoryType(OrtMemTypeCPUInput) + AddInputs() incompatibility

Recommended fix

What's good

Uh oh!

Uh oh!

Uh oh!

This comment was marked as duplicate.

Uh oh!

tianleiwu left a comment

Choose a reason for hiding this comment

Summary

Remaining items (all nits, non-blocking)

Verdict

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Critical: `InputMemoryType(OrtMemTypeCPUInput)` + `AddInputs()` incompatibility