
[xpu] Fix conv2d incorrect results and alignment errors for non-64-byte-aligned tensors on XPU.#2

Draft
BBBela wants to merge 92 commits into main from
bbela/issue-2238-part-6

Conversation


@BBBela BBBela commented Mar 19, 2026

This PR is a fix for: intel/torch-xpu-ops#3105

Problem description

oneDNN requires data pointers to be 64-byte aligned due to constraints on Intel Data Center GPUs. When a tensor is freshly allocated, the allocator guarantees this alignment. However, when a tensor view is created (e.g., via slicing A[1], indexing, or any op that produces a non-zero storage_offset), the effective data pointer (base_ptr + offset * element_size) may fall on a non-64-byte boundary.

When the data pointer is not 64-byte aligned, oneDNN conv2d kernels return incorrect results.

Proposed solution

The proposed solution is to apply the fix on the PyTorch side.
This issue was discussed with the oneDNN team (see MFDNN-14791 and MFDNN-14042 Jiras). They are aware of the problem but have converted it to a feature request with no committed timeline — a proper fix inside oneDNN is non-trivial and is not planned for the near future.

Since PyTorch knows the data pointer before invoking oneDNN, the simplest correct fix is to detect misaligned tensors on the PyTorch side and ensure the data is 64-byte aligned before invoking the oneDNN kernel.

Description of changes

The fix introduces two helpers in Utils.cpp/h:

  • is_64_bytes_aligned(tensor) — checks whether tensor.data_ptr() is divisible by 64 using pointer-to-integer arithmetic (reinterpret_cast<uintptr_t>).
  • make_contiguous_and_aligned(tensor, memory_format) — calls .contiguous() first (which may still return a non-zero-offset view if the input was already contiguous), then calls .clone() only if the result has a non-zero storage offset and is not 64-byte aligned. The clone allocates a fresh buffer at a guaranteed-aligned address.
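The alignment check reduces to integer arithmetic on the effective pointer. A minimal plain-Python sketch of the logic (not the actual C++ helper in Utils.cpp; the base address below is made up, and assumes float32 with element_size = 4):

```python
ALIGNMENT = 64

def is_64_bytes_aligned(base_ptr: int, storage_offset: int, element_size: int) -> bool:
    """Return True if the effective data pointer lands on a 64-byte boundary."""
    effective_ptr = base_ptr + storage_offset * element_size
    return effective_ptr % ALIGNMENT == 0

base = 0x7F0000000000                        # allocator-returned, 64-byte-aligned base
assert is_64_bytes_aligned(base, 0, 4)       # freshly allocated tensor: aligned
assert not is_64_bytes_aligned(base, 1, 4)   # A[1]-style view: off by 4 bytes
assert is_64_bytes_aligned(base, 16, 4)      # 16 floats = 64 bytes: aligned again
```

The real helper performs the same modulo check on `reinterpret_cast<uintptr_t>(tensor.data_ptr())`.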

Conv.cpp is updated to call make_contiguous_and_aligned instead of plain .contiguous() for input, weight, and bias in both the forward and backward convolution paths.

The existing alignment guard in is_onednn_matmul_strides() is also simplified to use the new is_64_bytes_aligned helper, removing the inline duplication.

Note: The device-to-device copy introduced by .clone() occurs only in the uncommon case where a tensor slice with a misaligned pointer is passed directly into the operator. Standard use of freshly allocated tensors is unaffected.

pytorchmergebot and others added 30 commits March 19, 2026 14:53
This reverts commit 778cf35.

Reverted pytorch#173783 on behalf of https://github.com/yangw-dev due to sorry internal test error: .cpp:59:51: error: no member named 'BComplex32', please contact internal Meta folks to resolve this ([comment](pytorch#173783 (comment)))
…77308)"

This reverts commit 9a7ae22.

Reverted pytorch#177308 on behalf of https://github.com/yangw-dev due to sorry the pr breaks internal test RiskExtrapolationModuleStabilityTest, please fix it and reland D97174518 ([comment](pytorch#175819 (comment)))
pytorch#175819)"

This reverts commit 3989076.

Reverted pytorch#175819 on behalf of https://github.com/yangw-dev due to sorry the pr breaks internal test RiskExtrapolationModuleStabilityTest, please fix it and reland D97174518 ([comment](pytorch#175819 (comment)))
## Summary
We identified two issues:
  1. aten/src/ATen/native/Onehot.cpp — The Meta/FakeTensor path (used by torch.compile) implements
  one_hot via eq(self.unsqueeze(-1), arange(num_classes)). This comparison-based approach silently
  produces wrong results (all zeros) when indices exceed num_classes, unlike the eager CUDA path which
  uses scatter_ which triggers device-side asserts.
  2. torch/_inductor/scheduler.py — Inductor's dead code elimination removes _assert_async nodes from
  unfused SchedulerNodes because has_side_effects() always returns False on the base class. Only
  FusedSchedulerNode checked for device_assert_async.
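Issue 1 can be reproduced with a plain-Python model of the comparison-based decomposition (`one_hot_eq` is an illustrative stand-in, not PyTorch code):

```python
def one_hot_eq(indices, num_classes):
    # Mimics eq(self.unsqueeze(-1), arange(num_classes)): each index is compared
    # against every class id. An out-of-bounds index matches nothing, so the
    # row comes back all zeros instead of raising.
    return [[1 if i == c else 0 for c in range(num_classes)] for i in indices]

assert one_hot_eq([0, 2], 3) == [[1, 0, 0], [0, 0, 1]]   # valid indices: correct
assert one_hot_eq([5], 3) == [[0, 0, 0]]                  # OOB index: silently wrong
```

The scatter_-based eager path instead triggers a device-side assert for the out-of-bounds case, which is why the decomposition was changed to match it.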

Fixes:
  - Onehot.cpp: Added a decomposition using scatter_ to better match eager path
  - native functions: Updated functional-assert-msg to work with the CompositeAutograd dispatch key (not just CPU), and updated the lowering to return a dep so the assert is not DCE'd
  - test/dynamo/test_repros.py: Added test_one_hot_bounds_check_compiled covering out-of-bounds
  indices, negative indices, and valid input under torch.compile.

### Test Plan
```bash
python test/dynamo/test_misc.py MiscTests.test_dynamic_one_hot
```
and
```bash
python test/inductor/test_torchinductor.py -k test_one_hot
```

Authored with Claude Opus 4.6

Pull Request resolved: pytorch#177160
Approved by: https://github.com/shunting314, https://github.com/mlazos, https://github.com/jansel
After following PR:pytorch#174753
The new approach extracts libtorch from wheels via `.ci/libtorch/extract_libtorch_from_wheel.py` instead of using dedicated Docker images to build libtorch. We don't need these Docker builds anymore.
Pull Request resolved: pytorch#177802
Approved by: https://github.com/huydhn, https://github.com/malfet
…ch#177609)

Fixes pytorch#177140

# Summary
Adds a check to validate that the macOS wheel platform tag matches the actual minos (minimum OS version) in the Mach-O binaries.

For example, given a wheel like:
```
torch-2.12.0.dev20260316-cp313-cp313t-macosx_11_0_arm64.whl
```

The platform tag claims macosx_11_0_arm64, meaning minos should be 11.0. This script runs otool -l on each installed .dylib and verifies the minos from LC_BUILD_VERSION (or LC_VERSION_MIN_MACOSX) matches 11.0.
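Extracting the expected minos from the platform tag is simple string parsing. A hedged sketch (illustrative only, not the actual validation script added by this PR):

```python
import re

def minos_from_platform_tag(wheel_name):
    # Parse "macosx_<major>_<minor>_<arch>" out of the wheel filename and
    # return the claimed minimum OS version as "major.minor".
    m = re.search(r"macosx_(\d+)_(\d+)_", wheel_name)
    return f"{m.group(1)}.{m.group(2)}" if m else None

whl = "torch-2.12.0.dev20260316-cp313-cp313t-macosx_11_0_arm64.whl"
assert minos_from_platform_tag(whl) == "11.0"
```

The real check then compares this value against the minos reported by `otool -l` for each `.dylib`.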

Pull Request resolved: pytorch#177609
Approved by: https://github.com/atalman
pytorch#177745)

1. Validates that some important keys are present in the metadata for kernel events in the output json. Previously we had an issue where "grid" was not being returned in JSON, so we check that all of "device", "stream", "correlation", "grid", "block" are present.
2. Modifies the `payload()` used in the profiler tests to configure tensor size. We were experiencing issues on ROCm `test_disable_external_correlation` because gpu_memcpy events were not showing up in the trace. Half of this was fixed in pytorch/kineto#1295, but gpu_memcpy is only triggered when the kernel being copied is larger than some size, so we increase the tensor size to fix the test.
Pull Request resolved: pytorch#177745
Approved by: https://github.com/jiannanWang, https://github.com/divyanshk
Adds `KinetoEvent::externalId()` that mirrors `ChromeTraceLogger::handleActivity()` "External id" logic, so users can access the same correlation information via the Python `events()` API. Also expose `linked_correlation_id` to `FunctionEvent` to expose to `events()` -- it previously stopped just a layer before at `_KinetoEvent`.

TODO in the near-future is to unify this logic in Kineto (pull it out of the ChromeTraceLogger perhaps) so that we can share the logic.

For a simple mm:

```
import torch
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
	x = torch.randn(32, 32, device="cuda")
	torch.mm(x, x)
	torch.cuda.synchronize()

e = prof.events()

>>> e[6]
<FunctionEvent id=2 name=aten::empty overload_name= device_type=DeviceType.CPU node_id=-1 cpu_time=405.935ms start_us=11045.151 end_us=416980.621 cpu_children=[14, 17, 18] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>
>>> e[6].external_id
2
>>> for i in e:
...     if i.external_id == 2:
...         print(i.name)

aten::empty
cudaDeviceGetStreamPriorityRange
cudaStreamIsCapturing
cudaMalloc
```

In JSON, the same events have the same external id:

```
{
    "ph": "X", "cat": "cpu_op", "name": "aten::empty", "pid": 3487801, "tid": 3487801,
    "ts": 6577507473121.457, "dur": 400427.670,
    "args": {
      "External id": 2, "Record function id": 0, "Ev Idx": 1
    }
},
{
    "ph": "X", "cat": "cuda_runtime", "name": "cudaDeviceGetStreamPriorityRange", "pid": 3487801, "tid": 3487801,
    "ts": 6577507473171.267, "dur": 399812.634,
    "args": {
      "External id": 2, "cbid": 205, "correlation": 14
    }
},
{
    "ph": "X", "cat": "cuda_runtime", "name": "cudaStreamIsCapturing", "pid": 3487801, "tid": 3487801,
    "ts": 6577507873057.320, "dur": 25.941,
    "args": {
      "External id": 2, "cbid": 317, "correlation": 17
    }
},
{
    "ph": "X", "cat": "cpu_op", "name": "aten::empty", "pid": 3487801, "tid": 3487801,
    "ts": 6577507473121.457, "dur": 400427.670,
    "args": {
      "External id": 2, "Record function id": 0, "Ev Idx": 1
    }
}
```
Pull Request resolved: pytorch#177662
Approved by: https://github.com/scotts
…ersions (pytorch#172696)

Fixes pytorch#172684
Updated to use single_dim_strategy.
Type conversion to int/bool on Partial(sum) incorrectly preserved the Partial placement, producing wrong results: `trunc(a + b) != trunc(a) + trunc(b)`.

This adds a custom strategy for _to_copy that checks if the dtype conversion is linear for the reduce operation before preserving Partial.
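The non-linearity is easy to check numerically: truncating each shard's partial value before summing gives a different answer than truncating the full sum.

```python
import math

a = b = 0.6
assert math.trunc(a) + math.trunc(b) == 0   # truncate per-shard, then sum (wrong)
assert math.trunc(a + b) == 1               # sum first, then truncate (correct)
```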

This PR is offered in support of the Partial correctness stabilization efforts.
Pull Request resolved: pytorch#172696
Approved by: https://github.com/wconstab
pytorch#175806)

`test_original_aten_preserved_split_addmm` uses templates and they need a big gpu to run. Otherwise, the next error is raised:
```python
File "/build/work/d72ad510f627427d60bb45dd0704233cdb7a/google3/runfiles/google3/third_party/py/torch/_inductor/select_algorithm.py", [line 3062](https://cs.corp.google.com/piper///depot/google3/third_party/py/torch/_inductor/select_algorithm.py?l=3062&ws=ddelgadovargas/5324&snapshot=30587), in __call__
    raise self.create_no_valid_choices(name, "No choices exist for backend.")
torch._inductor.exc.InductorError: LoweringException: NoValidChoicesError: No choices to select. Provided reason: No choices exist for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.
  target: aten.mm.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float32, size=[16, 24], stride=[24, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[24, 32], stride=[32, 1]))
  ))NoValidChoicesError: No choices to select. Provided reason: No choices exist for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.
  target: aten.mm.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float32, size=[16, 24], stride=[24, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[24, 32], stride=[32, 1]))
  ))
Found from :
   File "/build/work/d72ad510f627427d60bb45dd0704233cdb7a/google3/runfiles/google3/third_party/py/torch/test/inductor/test_pattern_matcher.py", [line 1307](https://cs.corp.google.com/piper///depot/google3/third_party/py/torch/test/inductor/test_pattern_matcher.py?l=1307&ws=ddelgadovargas/5324&snapshot=30587), in fn
    return torch.addmm(z, x, y).sin()
```

Pull Request resolved: pytorch#175806
Approved by: https://github.com/eellison
pytorch#176714 and pytorch#176713 changed some dispatch macro strings which, apparently, is breaking some internal builds (S636314).

This PR reverts the dispatch macro strings to how they were.

Pull Request resolved: pytorch#177835
Approved by: https://github.com/kirklandsign, https://github.com/Skylion007
This reverts commit d183a29.

Reverted pytorch#176567 on behalf of https://github.com/yangw-dev due to sorry, revert this due to there is a diff train conflict, please rebase and reland this ([comment](pytorch#176567 (comment)))
Replace prop_index with a single-dim strategy for aten.index.Tensor.
The new strategy handles sharding values on non-indexed dims, sharding
indices on broadcast dims (with correct per-tensor dim mapping for
different ndims, and Replicate for size-1 broadcast dims), and partial
passthrough from values.

Infrastructure changes to support mixed [OpStrategy, None] lists that
aten.index.Tensor produces (its indices arg is Tensor?[]):
- _op_schema: args_meta/kwargs_meta recurse into lists/tuples
- _sharding_prop: spec_to_strategy handles mixed DTensorSpec/None lists
- single_dim_strategy: _get_unique_placements and _get_num_tensor_inputs
  recurse into lists

Authored with Claude.
Pull Request resolved: pytorch#176038
Approved by: https://github.com/pianpwk
…173877)

This is needed for pytorch#166267.  Currently only autocast is supported, but it's
done in a generic way so other context managers can be easily added

Pull Request resolved: pytorch#173877
Approved by: https://github.com/guilhermeleobas, https://github.com/williamwen42
This adds support for _saved_tensors_hooks_{enable|disable} and torch.cuda._{maybe}_exchange_device.  _exchange device
is an interesting case since it returns the previous device index, which is not normally visible when using the
torch.cuda.device context manager.  I'm treating it as a constant variable, but this bakes the device index into the
graph, which is not the case when using the context manager.

Pull Request resolved: pytorch#176521
Approved by: https://github.com/williamwen42
ghstack dependencies: pytorch#173877
When this was brought back in pytorch#159157 it didn't have the executable bit set like it was previously. This caused macOS builds to fail if `CMAKE_OSX_ARCHITECTURES` was set, even if it's set to the same arch as the host.

Pull Request resolved: pytorch#165991
Approved by: https://github.com/albanD, https://github.com/Skylion007
…gation (pytorch#177798)

The Dijkstra search in sharding propagation finds a different (lower-cost)
strategy than the old full expansion path for mm(Shard(0), Shard(0)). Dijkstra
explores multi-hop redistribution paths (S(0)->R->S(1), cost ~9.3) that avoid
the +1.0 shard-to-shard penalty in the cost model, while the full expansion's
_gen_transform_infos decomposes S(0)->S(1) as a single step that hits the
penalty (cost ~10.3). This causes Dijkstra to select S(1)xS(0)->Partial(sum)
over S(0)xR->S(0), changing the comm op from all_gather to alltoall.

Use use_min_cost_redistribution_plan() in the test so the runtime redistribute
planner uses the same graph-based search as Dijkstra, ensuring the redistribute
path matches the cost model's assumptions.

Copy of pytorch#177669 which got corrupted ghstack state

Authored with Claude.
Pull Request resolved: pytorch#177798
Approved by: https://github.com/zpcore
This reverts commit 879e4c2.

Reverted pytorch#177835 on behalf of https://github.com/yangw-dev due to sorry diff train has some conflict, please rebase and reland this ([comment](pytorch#177835 (comment)))
…el.cpp (pytorch#176714)"

This reverts commit c3ec2b2.

Reverted pytorch#176714 on behalf of https://github.com/yangw-dev due to the pr reverted internally, shipit does not apply revert, revert this manually ([comment](pytorch#176714 (comment)))
…l.cpp (pytorch#176713)"

This reverts commit 5a4d390.

Reverted pytorch#176713 on behalf of https://github.com/yangw-dev due to the pr reverted internally, shipit does not apply revert, revert this manually ([comment](pytorch#176713 (comment)))
Authored with Claude. Previously we were pickling the storage address, which caused different cache keys on different processes.

We should use the `_reduce_override` protocol rather than the `dispatch_table` protocol that AOTAutogradCache uses, because the latter would require us to enumerate all subclass types manually.

@diff-train-skip-merge this change was reverted internally

Pull Request resolved: pytorch#173526
Approved by: https://github.com/aorenste, https://github.com/seemethere
This reverts commit 088c5a7.

Reverted pytorch#173330 on behalf of https://github.com/yangw-dev due to reverted internally, original:D96556656, revert diff: D96725665 ([comment](pytorch#173330 (comment)))
…pytorch#177841)

When `create_fx_graph_from_captured_output` builds the GraphModule, it
previously `.copy()`'d `_parameters` and `_buffers` from the root module.
Although `.copy()` is a shallow copy (tensor objects are shared), the
problem is that `_reparametrize_module` doesn't mutate tensors in-place—it
replaces dict entries (`module._buffers[name] = functional_tensor`). With
separate dicts, that replacement is only visible in the GraphModule's dict.
DynamoBytecodeFlatten replays bytecode that accesses the original module's
dict through a Python closure, so it still sees the raw (non-functional)
tensor. The mismatch causes "mutating a non-functional tensor with a
functional tensor" in aot_export_joint_with_descriptors.

Sharing the dicts instead of copying makes the reparametrization visible to
the closure, since both the GraphModule and the closure reference the same
dict object.
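The pitfall can be reduced to a few lines of plain Python (the string values are illustrative stand-ins for the real buffer tensors):

```python
buffers = {"b": "raw_tensor"}     # the module's _buffers dict

copied = buffers.copy()           # shallow copy: separate dict object
shared = buffers                  # the fix: share the same dict object

# _reparametrize_module REPLACES the entry rather than mutating the tensor:
buffers["b"] = "functional_tensor"

assert copied["b"] == "raw_tensor"          # copy still sees the stale entry
assert shared["b"] == "functional_tensor"   # shared dict sees the replacement
```

Because the replacement is a dict-entry swap and not an in-place tensor mutation, only a shared dict makes it visible to both the GraphModule and the bytecode closure.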

Authored with Claude.

Pull Request resolved: pytorch#177841
Approved by: https://github.com/zhxchen17
…en (pytorch#177795)

Summary: TSIA.

Test Plan: CI.

Reviewed By: malfet

Differential Revision: D95726765

Pull Request resolved: pytorch#177795
Approved by: https://github.com/malfet, https://github.com/Skylion007
…orch#177613)

Previously, Event.synchronize() was traced as a call_method proxy node
which would get silently dropped during AOTAutograd re-tracing since it's
not a dispatched operator. This adds a proper torch.ops.streams.synchronize_event
custom op (following the pattern of record_event/wait_event) and updates
EventVariable.call_method to emit it as a call_function node.

Authored with Claude.

Pull Request resolved: pytorch#177613
Approved by: https://github.com/williamwen42
pytorch#177614)

Add make_fallback for synchronize_event so Inductor can lower it, and
update wrap_all_sync_nodes_with_control_deps to handle the new op. Since
synchronize_event only takes an event_index (no stream arg), the stream
is inferred from the matching record_event via a new event_to_stream map.

In this case, we thread through dependencies that were recorded before the event and used after the synchronize, to prevent reordering across the synchronize that would break correctness.
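A minimal sketch of the event_to_stream bookkeeping described above (function names and stream labels are illustrative, not the actual Inductor API):

```python
event_to_stream = {}

def record_event(event_index, stream):
    # record_event carries both the event and its stream, so remember the pairing
    event_to_stream[event_index] = stream

def stream_for_synchronize(event_index):
    # synchronize_event only receives an event_index; the stream is
    # inferred from the matching record_event
    return event_to_stream[event_index]

record_event(3, "stream_7")
assert stream_for_synchronize(3) == "stream_7"
```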

Authored with Claude.

Pull Request resolved: pytorch#177614
Approved by: https://github.com/aorenste
ghstack dependencies: pytorch#177613
anijain2305 and others added 3 commits March 20, 2026 06:16
Add a `python_value_for_identity()` hook to `VariableTracker` that
returns the underlying Python object for identity (`is`) comparison.
The base implementation delegates to `as_python_constant()`; VTs that
wrap a real Python object but don't support `as_python_constant()`
(e.g. `UserDefinedObjectVariable`, `NNModuleVariable`,
`ObjectVariable`, `StreamVariable`, `EventVariable`) override it to
return `self.value` directly.

This lets `handle_is` compare any two VTs generically — including
cross-type comparisons that previously fell through to a graph break —
and removes 5 redundant type-specific handler entries that each
reimplemented the same pattern with slightly different attribute access
(`.value`, `.fn`, `get_submodule()`).

Also consolidates the duplicate `NO_SUCH_SUBOBJ` sentinel definitions
from `misc.py` and `user_defined.py` into `base.py`.

Authored with Claude.

Pull Request resolved: pytorch#177720
Approved by: https://github.com/guilhermeleobas
…176864)

When creating meta tensors for a parameter's .grad, the param's symbolic_context was reused verbatim. This caused an assertion failure when the param and grad had different view base dimensionalities — e.g. in FSDP2 where param._local_tensor is a view of an N-D padded base but grad._local_tensor is a view of a 1-D flat gradient buffer.

Build a grad-specific symbolic context via all_dynamic_symbolic_context instead of recycling the param's, resolving the TODO that was already in the code.

FIXES pytorch#176667
Pull Request resolved: pytorch#176864
Approved by: https://github.com/Lucaskabela
…ytorch#176675)

On AMD/HIP targets, annotate Triton kernel pointer arguments with tt.pointer_range=32 when the tensor's storage is provably within 2GB. This enables Triton's canonicalize_pointers pass to decompose 64-bit pointer arithmetic into (splat(base), offset) form using 32-bit offsets, which in turn allows ConvertToBufferOps to emit efficient amdgpu.buffer_load/store instructions.

This improves performance across all kernels going through the Triton compilation path with small tensors as it generates efficient buffer ops.

Pull Request resolved: pytorch#176675
Approved by: https://github.com/eellison
@BBBela BBBela changed the title Workaround for oneDNN convolution issue. [xpu] Fix conv2d incorrect results and alignment errors for non-64-byte-aligned tensors on XPU. Mar 20, 2026
pytorchmergebot and others added 25 commits March 20, 2026 12:57
…ytorch#176864)"

This reverts commit 928cada.

Reverted pytorch#176864 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#176864 (comment)))
qwen 1.7b

FA3 H100
<img width="2440" height="470" alt="image" src="https://github.com/user-attachments/assets/abd34080-ff77-4cfc-a108-3e4a0908142c" />

FA4 H100
<img width="2430" height="444" alt="image" src="https://github.com/user-attachments/assets/09048439-5e09-4729-842f-832a8b5e2b2c" />

FA4 B200
<img width="2398" height="490" alt="image" src="https://github.com/user-attachments/assets/97d590ae-47fa-49b5-a4bc-500f052398e5" />

Pull Request resolved: pytorch#177675
Approved by: https://github.com/drisspg
…#157198) (pytorch#176573)

Fixes pytorch#157144

Updated Batch, Instance and Group Norm implementations to match LayerNorm's, in the sense that a `bias: bool` argument controls whether bias is used (only relevant if `affine: bool = True`). The C/C++ implementations already accommodate the case where weight is not None but bias is None, so changing the Python API is sufficient. I have left RMSNorm for now because I can't navigate the C++ files to make all the necessary changes for adding the bias parameter.

This is a duplicate of [https://github.com/pytorch/pytorch/pull/157198](https://github.com/pytorch/pytorch/pull/157198) which went stale, and was eventually closed due to inactivity.
Pull Request resolved: pytorch#176573
Approved by: https://github.com/mikaylagawarecki
…rch#177864)

AOTInductor embeds Triton cubin files into the output .so by converting
each cubin to an ELF object with `ld -r -b binary` + two `objcopy` calls.
For models with many kernels (e.g., MoE with 660+ cubins), this spawns
3 * N subprocesses sequentially, taking ~50 seconds for the conversion
alone. This is a significant portion of the total AOT compilation time.

Add `batch_convert_cubins_to_obj()` which generates a single `.S` file
with `.incbin` directives for all cubins and compiles it with one
compiler invocation. This replaces N * 3 sequential subprocess calls
with a single `gcc -c` call.

Falls back to the original per-cubin `convert_cubin_to_obj()` if the
batched assembly fails (e.g., compiler doesn't support `.incbin`).
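A sketch of the batched `.S` generation under the assumptions above (the symbol-naming scheme here is invented for illustration; the real helper is `batch_convert_cubins_to_obj`):

```python
def batch_incbin_asm(cubin_paths):
    # Emit one assembly file with an .incbin directive per cubin, plus
    # start/end symbols so the runtime can locate each embedded blob.
    # One `gcc -c` on this file replaces 3 subprocess calls per cubin.
    lines = []
    for i, path in enumerate(cubin_paths):
        sym = f"__cubin_{i}"
        lines += [
            f".global {sym}_start",
            f"{sym}_start:",
            f'.incbin "{path}"',
            f"{sym}_end:",
        ]
    return "\n".join(lines) + "\n"

asm = batch_incbin_asm(["k0.cubin", "k1.cubin"])
assert asm.count(".incbin") == 2
assert '.incbin "k1.cubin"' in asm
```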

For a 4-layer MoE model (661 cubins, 14.5 MB):
- Before: 51.4 seconds (1983 subprocess calls)
- After:  0.08 seconds (1 subprocess call)
- Speedup: ~640x for the cubin embedding phase

For large models (40-layer MoE, ~2000+ cubins), this saves several
minutes from the total export time.

No change on Windows (cubin embedding is already skipped via
`if not _IS_WINDOWS` guard).

Bit-exactness verified: extracted rodata from batched .o and
per-cubin .o files across 661 cubins (14.5 MB total). MD5 checksums
match, symbol names (_start, _end, _size) and addresses are identical.
Verified via `objdump -t` and `objcopy -O binary -j .rodata`.

Existing AOTInductor test covers the embed_kernel_binary path:
```
python test/inductor/test_aot_inductor.py -k "test_simple_embed_kernel_binary_True" -v
```

Pull Request resolved: pytorch#177864
Approved by: https://github.com/yushangdi, https://github.com/desertfire
…ading best tensor and int match search (pytorch#174993)

Fix pytorch#166926

### Summary

Fixes graph shadowing bugs where dynamic graphs steal inputs from more specialized graphs in the Dynamo cache, causing correctness issues (activation checkpointing) and performance regressions (less optimized kernels).

### Problem

Dynamo's cache is a linked list where newer entries are checked first. When `automatic_dynamic_shapes` promotes a dimension or scalar from static to dynamic, the new dynamic graph is inserted at the front. Since it accepts a wider range of inputs, it shadows all previously compiled, more specialized graphs — even for inputs those graphs were specifically compiled to handle.

**Example:** A function first called with shape `[3, 4]` compiles a static Graph 0. A second call with `[5, 4]` triggers automatic dynamic and compiles Graph 1 `(s0, 4)`. Now `[3, 4]` hits Graph 1 instead of Graph 0, because Graph 1 is checked first and accepts any `s0`. The static graph becomes dead code. **The graphs have to be free of (data-dependent) branches.**

This matters for:
- **Correctness:** Activation checkpointing relies on specific graphs having specific recompute behavior. Shadowing can route backward passes through the wrong graph.
- **Performance:** Static graphs give the compiler full shape information for better kernel selection, memory planning, and operator fusion. A dynamic graph handling a recurring static shape leaves optimization on the table.
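The shadowing in the example above can be modeled with a few lines of Python (a toy stand-in for Dynamo's C++ cache, not the real implementation):

```python
cache = []  # toy model: newest entry first, like Dynamo's linked list

def add_graph(guard, name):
    cache.insert(0, (guard, name))   # new entries are always prepended

def lookup(shape):
    for guard, name in cache:        # checked front to back
        if guard(shape):
            return name

add_graph(lambda s: s == (3, 4), "graph0_static")    # first compile: static [3, 4]
add_graph(lambda s: s[1] == 4, "graph1_dynamic")     # automatic dynamic: (s0, 4)

assert lookup((5, 4)) == "graph1_dynamic"
assert lookup((3, 4)) == "graph1_dynamic"  # static graph is shadowed: dead code
```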

### Design Decisions

**Why exclusion guards instead of cache reordering?**

The root cause is cache ordering — more specialized graphs should be checked first. Reordering the cache (checking static before dynamic) would eliminate the need for exclusion guards entirely. However, the cache infrastructure lives in C++ (`eval_frame.c`) and always prepends new entries. Exclusion guards are a Python-only fix that achieves the same result without touching the cache insertion logic. This is the minimal, lowest-risk approach. Cache reordering remains a future option that would simplify the guard system further.

**Why reset `excluded_sizes` fresh each merge instead of accumulating?**

Previously, `excluded_sizes` accumulated across successive merges via `__ior__`. When multiple dimensions became dynamic in separate steps, the exclusion guard used AND logic (`not all(x.size(d) == v for d, v in ...)`), requiring an input to match *every* excluded dimension simultaneously to be rejected. This meant inputs could slip past the exclusion by differing on just one dimension. Resetting fresh each merge ensures each graph's exclusion only covers dimensions that changed in that specific step, making each dimension check independent.

**Multiple tensors and scalars at the same time**

All exclusion pairs `(symbol, excluded_value)` across all tensors and scalars are flattened into a single list and guarded with `not all` semantics:
- `Or(Ne(s0, v0), Ne(s1, v1), ...)` — rejects only when ALL excluded values match simultaneously
- If the current concrete values already match every excluded value, the guard is skipped (would fail on creation)
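The "not all" semantics can be sketched as follows (illustrative helper, not the actual guard code; `Or(Ne(...))` over symbols becomes `any(!=)` over concrete values):

```python
def passes_exclusion_guard(current, excluded):
    # Or(Ne(s0, v0), Ne(s1, v1), ...): reject only when ALL excluded
    # values match simultaneously, i.e. accept if ANY value differs.
    return any(c != e for c, e in zip(current, excluded))

assert not passes_exclusion_guard((3, 7), (3, 7))  # all match: routed to static graph
assert passes_exclusion_guard((5, 7), (3, 7))      # differs on one dim: accepted here
```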

### Changes

**`torch/_dynamo/pgo.py`**
- New `excluded_sizes` and `excluded_scalar` fields on `FrameStateSizeEntry`
- `__ior__` records old static values on transition, clears when no transition (prevents stale inheritance)

**`torch/_dynamo/variables/builder.py`**
- Passes `excluded_sizes` to `StatefulSymbolicContext` for tensors
- Passes `excluded_value` to `create_unspecified_symint_and_symbol` for scalars

**`torch/fx/experimental/symbolic_shapes.py`**
- `excluded_sizes` field on `StatefulSymbolicContext`
- `exclusion_constraints` flat list on `ShapeEnv`, populated via `record_exclusion_constraint`
- Tensor exclusion recorded in `_create_symbolic_sizes_strides_storage_offset`
- Scalar exclusion recorded in `create_unspecified_symint_and_symbol`
- Guard emission in `produce_guards_verbose`: flatten, `not all` skip, combined `Or(Ne(...))`

**`torch/_dynamo/config.py`**
- `stable_graph_selection_for_automatic_dynamic = True` (kill switch)

**`torch/utils/_sympy/printers.py`**
- `_print_Or` on `CppPrinter`

### Test Plan

- `test_automatic_dynamic_exclusive_guard_basic` — basic static → dynamic → revert
- `test_accumulated_exclusion_does_not_shadow_intermediate_graph` — tensor accumulation
- `test_4d_progressive_dynamism_cascading` — 4 graphs cascading
- `test_5d_two_rounds_of_dynamism` — non-adjacent dims
- `test_many_entries_wrong_graph_selection` — stress-test routing
- `test_multi_dim_dynamic_and_semantics` — not-all semantics with partial matches
- `test_integer_input_exclusion_basic` — scalar exclusion
- `test_integer_input_exclusion_accumulation` — scalar accumulation
- `test_two_tensor_inputs_exclusion` — multi-tensor combined Or
- `test_multi_tensor_and_scalar_accumulation` — comprehensive mixed test

Pull Request resolved: pytorch#174993
Approved by: https://github.com/laithsakka

Co-authored-by: Laith Sakka <lsakka@meta.com>
…pytorch#177484)

Allow dynamo to trace through stdlib copy.deepcopy by removing the
copy module from BUILTIN_SKIPLIST and copy.deepcopy from
_builtin_function_ids.  Dynamo now inlines the Python implementation
of deepcopy, which works for constant containers (dicts, lists,
tuples).  For tensors, __deepcopy__ triggers a graph break since
FakeTensors cannot be deep-copied.

To support id() on ConstDictVariable, ListVariable, TupleVariable,
ConstantVariable, SymNodeVariable, and SetVariable, introduce
FakeIdVariable — a compile-time-only variable that holds id(vt) and
can participate as a dict key but intentionally blocks reconstruction
across graph breaks to prevent silently baking stale ids into resumed
bytecode.

Authored with Claude.

Pull Request resolved: pytorch#177484
Approved by: https://github.com/Skylion007, https://github.com/guilhermeleobas
ghstack dependencies: pytorch#177720
…ytorch#177389)

Fixes pytorch#174891

PR pytorch#172181 guarded on `ndim == 0 and element_size != itemsize` for 0-d complex tensors. But the deduced size-1 tensor has `ndim=1` so the guard doesn't trigger for non-0-d cases. Dropped the `ndim == 0` check so we bail out whenever element sizes differ [[ref]](pytorch#172181 (comment))

Pull Request resolved: pytorch#177389
Approved by: https://github.com/aorenste
Add c10::metal::xlogy() to special_math.h and wire up the Metal kernel
functor, stub registration, and dispatch entry. Remove the old MPSGraph
implementation from BinaryOps.mm.

Authored with Claude.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull Request resolved: pytorch#177749
Approved by: https://github.com/dcci
…RONMENT exclusions (pytorch#177914)

No need to hardcode the environment: if one can do `sudo`, they should; if one cannot, they should not.
Pull Request resolved: pytorch#177914
Approved by: https://github.com/yangw-dev
Noticed recently that 2 PR tried to pattern-match something in `manylinux1-check.py`, that has been failing for quite some time, as none of PyTorch builds are manylinux1 compatible. For the same reason, deleted check_ssl.py, as all modern Python versions depend on SSL library

Renamed entire `.ci/docker/manywheel/build_scripts` folder to `s390_scripts` as those are only referenced from `Dockerfile_s390`

Co-authored-by: Claude <noreply@anthropic.com>
Pull Request resolved: pytorch#177800
Approved by: https://github.com/seemethere, https://github.com/atalman, https://github.com/Skylion007
  amax backward emits view (to reshape for broadcasting), eq (to build a
  mask of max positions), and unsqueeze (to add broadcast dimensions).
  Add propagation handlers for these ops so the auto_chunker can propagate
  through graphs containing amax, such as the numerically stable softmax
  pattern (amax → unsqueeze → sub → exp).

  The view handler uses a prefix-product approach to find the chunk
  dimension's new position after reshape, and the applier gets
  corresponding shape adjustment support for view nodes.

  Testing: Adding new unit test

<!-- ps-id: 0484ba32-e6b8-4234-ac15-76602582c5d1 -->

Pull Request resolved: pytorch#176505
Approved by: https://github.com/shunting314
Since the where kernel does not require any math operations on the last two inputs, it only needs to support different bitwidths. The where kernel now supports any data type of 1, 2, 4, 8, or 16 bytes in size and up to the same number in alignment.

Pull Request resolved: pytorch#174204
Approved by: https://github.com/ezyang, https://github.com/hameerabbasi, https://github.com/ngimel
ghstack dependencies: pytorch#177796
…ch#171270)"

This reverts commit 44326ea.

Reverted pytorch#171270 on behalf of https://github.com/yangw-dev due to sorry this breaks internal test due to error: undefined symbol: at::getHostAllocator(c10::DeviceType), please contact internal folks to fix this ([comment](pytorch#171270 (comment)))
Fix pytorch#147849

## Summary

1. Root cause
`_force_original_view_tracking` changes the view replay flag in `__init__`, so using it as `@decorator` mutates the global state as soon as the function is defined instead of when the wrapped function runs.

2. Proposed fix
Restore the previous view replay flag in `__call__`, set the requested flag in `__enter__`, and add a regression test that verifies decorator application preserves the ambient state until invocation.

3. Why this is the right long term fix
This follows the existing `set_grad_enabled` decorator pattern, preserves the current context-manager and function forms, and makes `_force_original_view_tracking` consistent with the rest of the grad-mode context managers.
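The fixed pattern from points 2 and 3 can be sketched in plain Python (`force_view_tracking` and the `STATE` dict are illustrative stand-ins, not the actual PyTorch internals):

```python
STATE = {"view_replay": False}   # stand-in for the global view-replay flag

class force_view_tracking:
    def __init__(self, mode):
        self.mode = mode          # only RECORD the request; do not mutate globals here

    def __enter__(self):
        self.prev = STATE["view_replay"]
        STATE["view_replay"] = self.mode   # mutate state only on entry

    def __exit__(self, *exc):
        STATE["view_replay"] = self.prev   # restore ambient state on exit

    def __call__(self, fn):
        # decorator form: re-enter a fresh context per invocation,
        # mirroring the set_grad_enabled decorator pattern
        def wrapper(*args, **kwargs):
            with self.__class__(self.mode):
                return fn(*args, **kwargs)
        return wrapper

@force_view_tracking(True)
def f():
    return STATE["view_replay"]

assert STATE["view_replay"] is False   # decorating f did NOT flip the global flag
assert f() is True                     # the flag is active only during the call
assert STATE["view_replay"] is False   # and restored afterwards
```

The buggy version mutated `STATE` in `__init__`, so merely defining `f` would have flipped the flag.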

Drafted via @codex, published after manual review by @bobrenjc93
Pull Request resolved: pytorch#177676
Approved by: https://github.com/aorenste, https://github.com/soulitzer
Add pointwise tag to elu_, celu_, selu_, hardsigmoid_, hardtanh_, leaky_relu, leaky_relu_, and _conj_physical for consistency with their non-inplace variants.
Pull Request resolved: pytorch#173869
Approved by: https://github.com/isuruf
…ype mappings (pytorch#177529) (pytorch#177529)

Summary:

Register the new MTIA_COUNTERS activity type in PyTorch's kineto_shim.cpp:
- Add to kMtiaTypes set so the counter activities are collected during profiling
- Add to the device type switch so counter events map to DeviceType::MTIA

Depends on D95963886 which adds the MTIA_COUNTERS enum to libkineto.

Test Plan:
```
buck2 build fbcode//caffe2:torch-cpp-cpu
```

Differential Revision: D96575878

Pull Request resolved: pytorch#177529
Approved by: https://github.com/scotts
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
@pytorchmergebot pytorchmergebot force-pushed the bbela/issue-2238-part-6 branch from a08d537 to 7cb8437 Compare March 24, 2026 08:44
