Skip to content

Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels#28214

Open
titaiwangms wants to merge 10 commits intomainfrom
titaiwang/fix-rotary-embedding-oob-read
Open

Add position_ids bounds validation to WebGPU/JS RotaryEmbedding kernels#28214
titaiwangms wants to merge 10 commits intomainfrom
titaiwang/fix-rotary-embedding-oob-read

Conversation

@titaiwangms
Copy link
Copy Markdown
Contributor

This PR adds position_ids bounds checking to WebGPU and JS RotaryEmbedding implementations, completing the security fix started in PR #27597 (commit 056bab3) which covered CPU and CUDA.

Problem

The com.microsoft::RotaryEmbedding kernel uses position_ids as row indices into cos_cache/sin_cache without bounds validation. While PR #27597 fixed CPU and CUDA paths, WebGPU and JS implementations were still missing bounds checks, which could produce silently wrong results (WebGPU hardware clamps OOB reads).

Changes

  • contrib_ops/webgpu/bert/rotary_embedding.cc: Host-side validation (ORT_MAKE_STATUS) + shader-side defense-in-depth (pass-through on OOB)
  • core/providers/webgpu/llm/rotary_embedding.cc: Host-side validation with format-0 awareness
  • js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts: TypeScript validation using getBigInt64Array
  • 7 new C++ OOB test cases across contrib and ONNX domains targeting WebGPU EP

Security

Addresses the same vulnerability as #27597 (OOB read via position_ids, CVSS 7.5-9.1) for WebGPU/JS execution providers.

Testing

  • 7 new unit tests (3 contrib + 4 ONNX domain) with GTEST_SKIP when WebGPU EP unavailable
  • JS/TS error tests not feasible with current JSONC test format (documented)
  • Build environment lacks C++20/emsdk for full compilation verification; validated structurally

Copy link
Copy Markdown
Contributor

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can commit the suggested changes from lintrunner.

Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR completes the RotaryEmbedding position_ids bounds-checking security fix for the WebGPU EP and the JS WebGPU implementation, aligning them with the earlier CPU/CUDA hardening work.

Changes:

  • Add host-side position_ids bounds validation to WebGPU RotaryEmbedding kernels (ONNX + contrib/MS domain), plus shader-side pass-through on OOB as defense-in-depth.
  • Add TypeScript-side position_ids bounds validation for the JS WebGPU RotaryEmbedding op.
  • Add new WebGPU-focused unit tests covering negative and out-of-range position_ids cases.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc Adds host-side position_ids bounds validation and forces CPU input for validation.
onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Adds host-side bounds validation, marks position_ids as CPU input, and adds shader-side OOB pass-through.
js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts Adds JS-side bounds checks for position_ids using BigInt64 reads.
onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc Adds ONNX-domain WebGPU EP failure tests for OOB/negative position_ids.
onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc Adds contrib/MS-domain WebGPU EP failure tests for OOB/negative position_ids.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc
Comment thread onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc Outdated
Comment thread onnxruntime/test/providers/cpu/llm/rotary_embedding_op_test.cc Outdated
Comment thread js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts Outdated
Comment thread onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc Outdated
Comment thread onnxruntime/core/providers/webgpu/llm/rotary_embedding.cc
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc Outdated
Copy link
Copy Markdown
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review Summary

The validation logic and shader-side defense-in-depth checks are well-designed and mirror the CPU/CUDA approach correctly. The test coverage is thorough. However, both kernel registrations have a critical issue that will cause a runtime crash.

Critical: InputMemoryType(OrtMemTypeCPUInput) + AddInputs() incompatibility

Both kernel registrations add .InputMemoryType(OrtMemTypeCPUInput, ...) to keep position_ids on CPU for host-side validation. However, ComputeInternal still passes position_ids to AddInputs() for the shader. The WebGPU program execution path in webgpu_context.cc (around L464) does:

bind_buffers.push_back(reinterpret_cast<WGPUBuffer>(const_cast<void*>(inputs[i].tensor->DataRaw())));

This assumes all program inputs have GPU buffer handles in DataRaw(). A CPU-resident tensor's DataRaw() returns a heap pointer, producing an invalid WGPUBuffer handle that will crash or corrupt at dispatch.

No existing WebGPU kernel combines InputMemoryType(OrtMemTypeCPUInput) with AddInputs() on the same tensor. Existing kernels (Reduce, Pad, Resize, GQA, etc.) read CPU-pinned inputs for host-side configuration only and never submit them as shader bindings.

Recommended fix

The simplest approach is to remove both InputMemoryType directives and the host-side C++ validation, relying solely on the shader-side defense-in-depth checks (which are already correct in this PR). This matches CUDA's approach: device-side bounds checking with pass-through on OOB. The TS validation in rotary-embedding.ts can remain since it runs on CPU during validateInputs() before shader dispatch.

If host-side rejection is desired, the kernel would need to explicitly copy position_ids from GPU to CPU (via DataTransferManager) for validation, then pass the original GPU tensor to AddInputs().

What's good

  • Shader-side OOB checks: raw_pos < 0 catches negative i32 from truncated int64; position_id >= max_position catches OOB after u32 conversion + offset. Pass-through behavior matches CUDA.
  • FusedQKRotaryEmbeddingProgram bounds check on computed position_id correctly handles both Q and K outputs, including the conditional K-head guard.
  • TS validation using getBigInt64Array() avoids int64 truncation issues.
  • 7 new tests cover format-0, format-1, negative, and batch OOB comprehensively.

Comment thread onnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc
Comment thread onnxruntime/test/contrib_ops/rotary_embedding_op_test.cc
@titaiwangms titaiwangms force-pushed the titaiwang/fix-rotary-embedding-oob-read branch from d1cc33c to 5920aa7 Compare April 23, 2026 22:14
tianleiwu

This comment was marked as duplicate.

Copy link
Copy Markdown
Contributor

@tianleiwu tianleiwu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Round-2 review (head 9744bfb).

Summary

The high-severity concern from the prior round — InputMemoryType(OrtMemTypeCPUInput) + AddInputs(position_ids) would crash because the same tensor cannot be both CPU-resident and bound as a WebGPU storage buffer — has been resolved. The author removed the host-side C++ value scanning and InputMemoryType directives and now relies on shader-side defense-in-depth (OOB → pass-through input) for both contrib and ONNX-domain WebGPU kernels, plus host-side validation on the JS path via getBigInt64Array() in validateInputs. Tests were converted to kExpectSuccess with output == input. The pass-through semantics now match the CUDA kernel for cross-EP consistency.

Resolving my two prior threads:

  • contrib_ops/webgpu/bert/rotary_embedding.cc (shader as primary protection) — addressed.
  • test/contrib_ops/rotary_embedding_op_test.cc (tests must verify pass-through, not failure) — addressed.

Remaining items (all nits, non-blocking)

1. Shader comment understates what raw_pos < 0 catchesonnxruntime/contrib_ops/webgpu/bert/rotary_embedding.cc

The WebGPU shader storage representation truncates int64 to i32 (see ProgramVariableDataType::Int64 → value type i32 in core/providers/webgpu/shader_variable.cc). Any positive int64 value greater than INT32_MAX will alias to a negative i32 and be caught by the raw_pos < 0 branch. That is safe (still pass-through), but the inline comment only mentions "negative position_ids". Consider expanding the comment to note that very large positive int64 values also fall into this branch by virtue of i32 truncation, so a future maintainer doesn't think the check is incomplete.

2. JSEP getBigInt64Array alignment fallback returns a read-only copyjs/web/lib/wasm/jsep/init.ts and js/web/lib/wasm/jsep/tensor-view.ts

The inline comment in init.ts correctly notes that the unaligned path returns a read-only copy (mutations don't propagate to the WASM heap). All current call sites (pad.ts, split.ts, reduce.ts, slice.ts, tile.ts, expand.ts, cumsum.ts, resize.ts, gather-block-quantized.ts, the new rotary-embedding.ts) are read-only iterations, so this is safe today. Consider promoting that read-only caveat into the TensorView.getBigInt64Array() JSDoc in tensor-view.ts so future consumers don't accidentally write through the returned view expecting WASM-heap propagation.

3. Format-0 detection in TS is contrib-op-specificjs/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts

positionIdsElementCount === 1 triggers the format-0 base-offset branch. The ONNX-domain RotaryEmbedding schema requires rank-2 position_ids (no base-offset mode). If this validateInputs is shared with the ONNX-domain op (now or later), the format-0 branch should be guarded by op identity. A short comment confirming the JSEP path scope would prevent silent acceptance of malformed shapes against the ONNX-domain op.

4. Pre-existing related gap (informational) — separately noted by the Copilot reviewer at L159 of contrib_ops/webgpu/bert/rotary_embedding.cc: when position_ids is provided, CPU/CUDA reject sequence_length > max_sequence_length with NOT_IMPLEMENTED, while the WebGPU contrib path (after this PR) silently produces pass-through for the OOB tokens. This was the case before this PR too. Worth tracking as a follow-up if cross-EP behavior parity is desired here, but it does not need to block this security fix.

Verdict

Looks good. The shader bounds checks are correctly designed (both for the standalone RotaryEmbeddingProgram and the FusedQKRotaryEmbeddingProgram Q+K pass-through path), tests are appropriate, and the JS validation is functionally correct.

titaiwangms and others added 10 commits April 27, 2026 21:00
Add shader-side bounds checks to the WebGPU RotaryEmbedding and
FusedQKRotaryEmbedding GPU shaders to prevent out-of-bounds reads
from cos_cache/sin_cache when position_ids values exceed the cache
dimensions.

For RotaryEmbeddingProgram:
- Check raw_pos < 0 to catch negative position_ids (i32 from
  truncated int64 avoids u32 wraparound bypass)
- Check position_id >= cos_cache_shape[0] after u32 conversion
  and sequence offset addition
- On OOB, pass through input unchanged (matches CUDA kernel behavior)

For FusedQKRotaryEmbeddingProgram:
- Check position_id >= cos_cache_shape[0] before accessing cos/sin
  cache
- On OOB, pass through both Q and K inputs unchanged

This complements the CPU and CUDA fixes from PR #27597 (commit
056bab3) which missed the WebGPU execution provider.

Co-authored-by: Copilot <[email protected]>

Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
Add host-side validation of position_ids values before shader dispatch
in all three WebGPU RotaryEmbedding implementations. This prevents
out-of-bounds reads from cos_cache/sin_cache when position_ids values
exceed the cache dimensions.

Changes:

1. contrib_ops/webgpu/bert/rotary_embedding.cc:
   - Add InputMemoryType(OrtMemTypeCPUInput, 1) to keep position_ids
     on CPU for validation
   - Add bounds checking in ComputeInternal() before shader dispatch:
     format 0 (scalar): base_pos in [0, max_seq_len - seq_len]
     format 1 (2D array): each value in [0, max_sequence_length)
   - Returns INVALID_ARGUMENT error on violation
   - Shader-side bounds checks remain as defense-in-depth

2. core/providers/webgpu/llm/rotary_embedding.cc:
   - Add InputMemoryType(OrtMemTypeCPUInput, 3) for optional
     position_ids input
   - Add bounds checking in the position_ids != nullptr branch
   - Returns INVALID_ARGUMENT error on violation

3. js/web/lib/wasm/jsep/webgpu/ops/rotary-embedding.ts:
   - Add value validation in validateInputs() using getBigInt64Array()
   - Validates both format 0 (scalar offset) and format 1 (2D array)
   - Throws Error with descriptive message on violation

All three implementations follow the same validation pattern as the
CPU contrib fix (PR #27597), returning errors rather than silently
passing through.

Co-authored-by: Copilot <[email protected]>

Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
Add WebGPU-targeted OOB unit tests to both contrib (kMSDomain) and ONNX
domain test files. Tests verify that out-of-bounds, negative, and
format-0 overflow position_ids values are rejected with
INVALID_ARGUMENT, matching the host-side validation added to the
WebGPU RotaryEmbedding kernels.

Tests gracefully skip via GTEST_SKIP() when WebGPU EP is not available.

Co-authored-by: Copilot <[email protected]>

Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
Address readability review feedback:
- FusedQK shader: clarify why no negative check (position_id derived from
  past_seqlen + sequence_idx, always non-negative)
- ONNX domain kernel: clarify why no format 0 check (ONNX RotaryEmbedding
  always uses explicit position_ids, no base-offset mode)

Co-authored-by: Copilot <[email protected]>

Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
Address code review finding: the shared shader treats single-element
position_ids as format 0 (base offset + sequence_idx), so the ONNX
domain host-side validation must also check that base_pos +
sequence_length - 1 < max_sequence_length.

Also add corresponding format-0 OOB WebGPU test case.

Co-authored-by: Copilot <[email protected]>

Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>

Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
Fix Web CI precheck: run prettier on the TypeScript file to match
the project's JS/TS formatting requirements.

Co-authored-by: Copilot <[email protected]>

Agent-signed-off: Developer (4fe56e20) [claude-opus-4.6]
Co-authored-by: Copilot <[email protected]>
TensorViewImpl.getBigInt64Array() threw RangeError when the WASM heap
offset was not 8-byte aligned. Fix by detecting unaligned offsets and
copying bytes into an aligned buffer before creating the BigInt64Array.

This fixes the Web CI failure where all RotaryEmbedding tests failed
with 'start offset of BigInt64Array should be a multiple of 8'.

Co-authored-by: Copilot <[email protected]>
Remove InputMemoryType(OrtMemTypeCPUInput) from both WebGPU kernel
registrations (contrib and ONNX domain) and the associated host-side
position_ids value scanning. InputMemoryType is incompatible with
AddInputs() — a CPU tensor's DataRaw() would be cast to WGPUBuffer,
causing a crash at dispatch time.

Defense strategy is now:
- Shader-side: WGSL bounds checks pass through input unchanged on OOB
  (same as CUDA kernel behavior)
- JSEP/browser: TypeScript validation in rotary-embedding.ts catches
  OOB before shader dispatch
- init.ts: getBigInt64Array() handles unaligned WASM heap offsets

WebGPU OOB tests changed from kExpectFailure to kExpectSuccess,
verifying pass-through behavior (output equals input on OOB).
ONNX domain tests updated to use rank-2 position_ids for cross-EP
consistency. TS validation reordered per Copilot review: sequence_length
check before per-value bounds validation.

Co-authored-by: Copilot <[email protected]>
position_ids tensor data is GPU-resident in the WebGPU/JSEP path — its
TensorView.data field is a GPU buffer ID, not a WASM heap pointer.
Calling getBigInt64Array() reads from a random WASM heap location,
returning garbage values that cause ALL existing RotaryEmbedding JSEP
tests to fail validation.

Other ops (pad, split, tile, etc.) can safely call getBigInt64Array()
because their int64 inputs use InputMemoryType(OrtMemTypeCPU) which
keeps data on CPU with valid WASM heap pointers. RotaryEmbedding cannot
use InputMemoryType for position_ids because AddInputs() requires GPU
buffers.

Defense strategy: shader-side WGSL bounds checks handle OOB position_ids
(pass-through behavior), which works regardless of CPU/GPU residency.

Co-authored-by: Copilot <[email protected]>
@titaiwangms titaiwangms force-pushed the titaiwang/fix-rotary-embedding-oob-read branch from 5c2800a to e5a824e Compare April 27, 2026 21:03
@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Apr 28, 2026
@titaiwangms titaiwangms requested a review from tianleiwu April 28, 2026 16:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ep:WebGPU ort-web webgpu provider

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants