
[xpu] Fix conv2d incorrect results and alignment errors for non-64-byte-aligned tensors on XPU.#2

Draft
BBBela wants to merge 92 commits into main from
bbela/issue-2238-part-6

Conversation


@BBBela BBBela commented Mar 19, 2026

This PR is a fix for: intel/torch-xpu-ops#3105

Problem description

oneDNN requires data pointers to be 64-byte aligned due to constraints on Intel Data Center GPUs. When a tensor is freshly allocated, the allocator guarantees this alignment. However, when a tensor view is created (e.g., via slicing A[1], indexing, or any op that produces a non-zero storage_offset), the effective data pointer (base_ptr + offset * element_size) may fall on a non-64-byte boundary.

When the data pointer is not 64-byte aligned, oneDNN conv2d kernels return incorrect results.

Proposed solution

The proposed solution is to apply the fix on the PyTorch side.
This issue was discussed with the oneDNN team (see MFDNN-14791 and MFDNN-14042 Jiras). They are aware of the problem but have converted it to a feature request with no committed timeline — a proper fix inside oneDNN is non-trivial and is not planned for the near future.

Since PyTorch knows the data pointer before invoking oneDNN, the simplest correct fix is to detect misaligned tensors on the PyTorch side and ensure the data is 64-byte aligned before invoking the oneDNN kernel.

Description of changes

The fix introduces two helpers in Utils.cpp/h:

  • is_64_bytes_aligned(tensor) — checks whether tensor.data_ptr() is divisible by 64 using pointer-to-integer arithmetic (reinterpret_cast<uintptr_t>).
  • make_contiguous_and_aligned(tensor, memory_format) — calls .contiguous() first (which may still return a non-zero-offset view if the input was already contiguous), then calls .clone() only if the result has a non-zero storage offset and is not 64-byte aligned. The clone allocates a fresh buffer at a guaranteed-aligned address.
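The alignment check reduces to integer arithmetic on the effective pointer. A minimal plain-Python sketch of the logic (not the actual C++ helper in Utils.cpp; the base address below is made up, and assumes float32 with element_size = 4):

```python
ALIGNMENT = 64

def is_64_bytes_aligned(base_ptr: int, storage_offset: int, element_size: int) -> bool:
    """Return True if the effective data pointer lands on a 64-byte boundary."""
    effective_ptr = base_ptr + storage_offset * element_size
    return effective_ptr % ALIGNMENT == 0

base = 0x7F0000000000                        # allocator-returned, 64-byte-aligned base
assert is_64_bytes_aligned(base, 0, 4)       # freshly allocated tensor: aligned
assert not is_64_bytes_aligned(base, 1, 4)   # A[1]-style view: off by 4 bytes
assert is_64_bytes_aligned(base, 16, 4)      # 16 floats = 64 bytes: aligned again
```

The real helper performs the same modulo check on `reinterpret_cast<uintptr_t>(tensor.data_ptr())`.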

Conv.cpp is updated to call make_contiguous_and_aligned instead of plain .contiguous() for input, weight, and bias in both the forward and backward convolution paths.

The existing alignment guard in is_onednn_matmul_strides() is also simplified to use the new is_64_bytes_aligned helper, removing the inline duplication.

Note: The device-to-device copy introduced by .clone() occurs only in the uncommon case where a tensor slice with a misaligned pointer is passed directly into the operator. Standard use of freshly allocated tensors is unaffected.

pytorchmergebot and others added 30 commits March 19, 2026 14:53
This reverts commit 778cf35.

Reverted pytorch#173783 on behalf of https://github.com/yangw-dev due to sorry internal test error: .cpp:59:51: error: no member named 'BComplex32', please contact internal Meta folks to resolve this ([comment](pytorch#173783 (comment)))
…77308)"

This reverts commit 9a7ae22.

Reverted pytorch#177308 on behalf of https://github.com/yangw-dev due to sorry the pr breaks internal test RiskExtrapolationModuleStabilityTest, please fix it and reland D97174518 ([comment](pytorch#175819 (comment)))
pytorch#175819)"

This reverts commit 3989076.

Reverted pytorch#175819 on behalf of https://github.com/yangw-dev due to sorry the pr breaks internal test RiskExtrapolationModuleStabilityTest, please fix it and reland D97174518 ([comment](pytorch#175819 (comment)))
## Summary
We identified two issues:
  1. aten/src/ATen/native/Onehot.cpp — The Meta/FakeTensor path (used by torch.compile) implements
  one_hot via eq(self.unsqueeze(-1), arange(num_classes)). This comparison-based approach silently
  produces wrong results (all zeros) when indices exceed num_classes, unlike the eager CUDA path which
  uses scatter_ which triggers device-side asserts.
  2. torch/_inductor/scheduler.py — Inductor's dead code elimination removes _assert_async nodes from
  unfused SchedulerNodes because has_side_effects() always returns False on the base class. Only
  FusedSchedulerNode checked for device_assert_async.
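Issue 1 can be reproduced with a plain-Python model of the comparison-based decomposition (`one_hot_eq` is an illustrative stand-in, not PyTorch code):

```python
def one_hot_eq(indices, num_classes):
    # Mimics eq(self.unsqueeze(-1), arange(num_classes)): each index is compared
    # against every class id. An out-of-bounds index matches nothing, so the
    # row comes back all zeros instead of raising.
    return [[1 if i == c else 0 for c in range(num_classes)] for i in indices]

assert one_hot_eq([0, 2], 3) == [[1, 0, 0], [0, 0, 1]]   # valid indices: correct
assert one_hot_eq([5], 3) == [[0, 0, 0]]                  # OOB index: silently wrong
```

The scatter_-based eager path instead triggers a device-side assert for the out-of-bounds case, which is why the decomposition was changed to match it.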

Fixes:
  - Onehot.cpp: Added a decomposition using scatter_ to better match eager path
  - native functions: Updated functional-assert-msg to work with the CompositeAutograd dispatch key (not just CPU), and updated the lowering to return a dep so the assert is not DCE'd
  - test/dynamo/test_repros.py: Added test_one_hot_bounds_check_compiled covering out-of-bounds
  indices, negative indices, and valid input under torch.compile.

### Test Plan
```bash
python test/dynamo/test_misc.py MiscTests.test_dynamic_one_hot
```
and
```bash
python test/inductor/test_torchinductor.py -k test_one_hot
```

Authored with Claude Opus 4.6

Pull Request resolved: pytorch#177160
Approved by: https://github.com/shunting314, https://github.com/mlazos, https://github.com/jansel
After following PR:pytorch#174753
The new approach extracts libtorch from wheels via `.ci/libtorch/extract_libtorch_from_wheel.py` instead of using dedicated Docker images to build libtorch. We don't need these Docker builds anymore.
Pull Request resolved: pytorch#177802
Approved by: https://github.com/huydhn, https://github.com/malfet
…ch#177609)

Fixes pytorch#177140

# Summary
Adds a check to validate that the macOS wheel platform tag matches the actual minos (minimum OS version) in the Mach-O binaries.

For example, given a wheel like:
```
torch-2.12.0.dev20260316-cp313-cp313t-macosx_11_0_arm64.whl
```

The platform tag claims macosx_11_0_arm64, meaning minos should be 11.0. This script runs otool -l on each installed .dylib and verifies the minos from LC_BUILD_VERSION (or LC_VERSION_MIN_MACOSX) matches 11.0.
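Extracting the expected minos from the platform tag is simple string parsing. A hedged sketch (illustrative only, not the actual validation script added by this PR):

```python
import re

def minos_from_platform_tag(wheel_name):
    # Parse "macosx_<major>_<minor>_<arch>" out of the wheel filename and
    # return the claimed minimum OS version as "major.minor".
    m = re.search(r"macosx_(\d+)_(\d+)_", wheel_name)
    return f"{m.group(1)}.{m.group(2)}" if m else None

whl = "torch-2.12.0.dev20260316-cp313-cp313t-macosx_11_0_arm64.whl"
assert minos_from_platform_tag(whl) == "11.0"
```

The real check then compares this value against the minos reported by `otool -l` for each `.dylib`.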

Pull Request resolved: pytorch#177609
Approved by: https://github.com/atalman
pytorch#177745)

1. Validates that some important keys are present in the metadata for kernel events in the output json. Previously we had an issue where "grid" was not being returned in JSON, so we check that all of "device", "stream", "correlation", "grid", "block" are present.
2. Modifies the `payload()` used in the profiler tests to configure tensor size. We were experiencing issues on ROCm `test_disable_external_correlation` because gpu_memcpy events were not showing up in the trace. Half of this was fixed in pytorch/kineto#1295, but gpu_memcpy is only triggered when the kernel being copied is larger than some size, so we increase the tensor size to fix the test.
Pull Request resolved: pytorch#177745
Approved by: https://github.com/jiannanWang, https://github.com/divyanshk
Adds `KinetoEvent::externalId()` that mirrors `ChromeTraceLogger::handleActivity()` "External id" logic, so users can access the same correlation information via the Python `events()` API. Also expose `linked_correlation_id` to `FunctionEvent` to expose to `events()` -- it previously stopped just a layer before at `_KinetoEvent`.

TODO in the near-future is to unify this logic in Kineto (pull it out of the ChromeTraceLogger perhaps) so that we can share the logic.

For a simple mm:

```
import torch
from torch.profiler import profile, ProfilerActivity
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
	x = torch.randn(32, 32, device="cuda")
	torch.mm(x, x)
	torch.cuda.synchronize()

e = prof.events()

>>> e[6]
<FunctionEvent id=2 name=aten::empty overload_name= device_type=DeviceType.CPU node_id=-1 cpu_time=405.935ms start_us=11045.151 end_us=416980.621 cpu_children=[14, 17, 18] cuda_time=0.000us name=aten::empty thread=1 input_shapes=[] cpu_memory_usage=0 cuda_memory_usage=0 is_async=False is_remote=False seq_nr=-1 is_legacy=False>
>>> e[6].external_id
2
>>> for i in e:
...     if i.external_id == 2:
...         print(i.name)

aten::empty
cudaDeviceGetStreamPriorityRange
cudaStreamIsCapturing
cudaMalloc
```

In JSON, the same events have the same external id:

```
{
    "ph": "X", "cat": "cpu_op", "name": "aten::empty", "pid": 3487801, "tid": 3487801,
    "ts": 6577507473121.457, "dur": 400427.670,
    "args": {
      "External id": 2, "Record function id": 0, "Ev Idx": 1
    }
},
{
    "ph": "X", "cat": "cuda_runtime", "name": "cudaDeviceGetStreamPriorityRange", "pid": 3487801, "tid": 3487801,
    "ts": 6577507473171.267, "dur": 399812.634,
    "args": {
      "External id": 2, "cbid": 205, "correlation": 14
    }
},
{
    "ph": "X", "cat": "cuda_runtime", "name": "cudaStreamIsCapturing", "pid": 3487801, "tid": 3487801,
    "ts": 6577507873057.320, "dur": 25.941,
    "args": {
      "External id": 2, "cbid": 317, "correlation": 17
    }
},
{
    "ph": "X", "cat": "cpu_op", "name": "aten::empty", "pid": 3487801, "tid": 3487801,
    "ts": 6577507473121.457, "dur": 400427.670,
    "args": {
      "External id": 2, "Record function id": 0, "Ev Idx": 1
    }
}
```
Pull Request resolved: pytorch#177662
Approved by: https://github.com/scotts
…ersions (pytorch#172696)

Fixes pytorch#172684
Updated to use single_dim_strategy.
Type conversion to int/bool on Partial(sum) incorrectly preserved the Partial placement, producing wrong results: `trunc(a + b) != trunc(a) + trunc(b)`.

This adds a custom strategy for _to_copy that checks if the dtype conversion is linear for the reduce operation before preserving Partial.
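The non-linearity is easy to check numerically: truncating each shard's partial value before summing gives a different answer than truncating the full sum.

```python
import math

a = b = 0.6
assert math.trunc(a) + math.trunc(b) == 0   # truncate per-shard, then sum (wrong)
assert math.trunc(a + b) == 1               # sum first, then truncate (correct)
```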

This PR is offered in support of the Partial correctness stabilization efforts.
Pull Request resolved: pytorch#172696
Approved by: https://github.com/wconstab
pytorch#175806)

`test_original_aten_preserved_split_addmm` uses templates and they need a big gpu to run. Otherwise, the next error is raised:
```python
File "/build/work/d72ad510f627427d60bb45dd0704233cdb7a/google3/runfiles/google3/third_party/py/torch/_inductor/select_algorithm.py", [line 3062](https://cs.corp.google.com/piper///depot/google3/third_party/py/torch/_inductor/select_algorithm.py?l=3062&ws=ddelgadovargas/5324&snapshot=30587), in __call__
    raise self.create_no_valid_choices(name, "No choices exist for backend.")
torch._inductor.exc.InductorError: LoweringException: NoValidChoicesError: No choices to select. Provided reason: No choices exist for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.
  target: aten.mm.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float32, size=[16, 24], stride=[24, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[24, 32], stride=[32, 1]))
  ))NoValidChoicesError: No choices to select. Provided reason: No choices exist for backend. please consider adding ATEN into max_autotune_gemm_backends config (defined in torch/_inductor/config.py) to allow at least one choice.
  target: aten.mm.default
  args[0]: TensorBox(StorageBox(
    InputBuffer(name='arg1_1', layout=FixedLayout('cuda:0', torch.float32, size=[16, 24], stride=[24, 1]))
  ))
  args[1]: TensorBox(StorageBox(
    InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[24, 32], stride=[32, 1]))
  ))
Found from :
   File "/build/work/d72ad510f627427d60bb45dd0704233cdb7a/google3/runfiles/google3/third_party/py/torch/test/inductor/test_pattern_matcher.py", [line 1307](https://cs.corp.google.com/piper///depot/google3/third_party/py/torch/test/inductor/test_pattern_matcher.py?l=1307&ws=ddelgadovargas/5324&snapshot=30587), in fn
    return torch.addmm(z, x, y).sin()
```

Pull Request resolved: pytorch#175806
Approved by: https://github.com/eellison
pytorch#176714 and pytorch#176713 changed some dispatch macro strings which, apparently, is breaking some internal builds (S636314).

This PR reverts the dispatch macro strings to how they were.

Pull Request resolved: pytorch#177835
Approved by: https://github.com/kirklandsign, https://github.com/Skylion007
This reverts commit d183a29.

Reverted pytorch#176567 on behalf of https://github.com/yangw-dev due to sorry, revert this due to there is a diff train conflict, please rebase and reland this ([comment](pytorch#176567 (comment)))
Replace prop_index with a single-dim strategy for aten.index.Tensor.
The new strategy handles sharding values on non-indexed dims, sharding
indices on broadcast dims (with correct per-tensor dim mapping for
different ndims, and Replicate for size-1 broadcast dims), and partial
passthrough from values.

Infrastructure changes to support mixed [OpStrategy, None] lists that
aten.index.Tensor produces (its indices arg is Tensor?[]):
- _op_schema: args_meta/kwargs_meta recurse into lists/tuples
- _sharding_prop: spec_to_strategy handles mixed DTensorSpec/None lists
- single_dim_strategy: _get_unique_placements and _get_num_tensor_inputs
  recurse into lists

Authored with Claude.
Pull Request resolved: pytorch#176038
Approved by: https://github.com/pianpwk
…173877)

This is needed for pytorch#166267.  Currently only autocast is supported, but it's
done in a generic way so other context managers can be easily added

Pull Request resolved: pytorch#173877
Approved by: https://github.com/guilhermeleobas, https://github.com/williamwen42
This adds support for _saved_tensors_hooks_{enable|disable} and torch.cuda._{maybe}_exchange_device.  _exchange device
is an interesting case since it returns the previous device index, which is not normally visible when using the
torch.cuda.device context manager.  I'm treating it as a constant variable, but this bakes the device index into the
graph, which is not the case when using the context manager.

Pull Request resolved: pytorch#176521
Approved by: https://github.com/williamwen42
ghstack dependencies: pytorch#173877
When this was brought back in pytorch#159157 it didn't have the executable bit set like it was previously. This caused macOS builds to fail if `CMAKE_OSX_ARCHITECTURES` was set, even if it's set to the same arch as the host.

Pull Request resolved: pytorch#165991
Approved by: https://github.com/albanD, https://github.com/Skylion007
…gation (pytorch#177798)

The Dijkstra search in sharding propagation finds a different (lower-cost)
strategy than the old full expansion path for mm(Shard(0), Shard(0)). Dijkstra
explores multi-hop redistribution paths (S(0)->R->S(1), cost ~9.3) that avoid
the +1.0 shard-to-shard penalty in the cost model, while the full expansion's
_gen_transform_infos decomposes S(0)->S(1) as a single step that hits the
penalty (cost ~10.3). This causes Dijkstra to select S(1)xS(0)->Partial(sum)
over S(0)xR->S(0), changing the comm op from all_gather to alltoall.

Use use_min_cost_redistribution_plan() in the test so the runtime redistribute
planner uses the same graph-based search as Dijkstra, ensuring the redistribute
path matches the cost model's assumptions.

Copy of pytorch#177669 which got corrupted ghstack state

Authored with Claude.
Pull Request resolved: pytorch#177798
Approved by: https://github.com/zpcore
This reverts commit 879e4c2.

Reverted pytorch#177835 on behalf of https://github.com/yangw-dev due to sorry diff train has some conflict, please rebase and reland this ([comment](pytorch#177835 (comment)))
…el.cpp (pytorch#176714)"

This reverts commit c3ec2b2.

Reverted pytorch#176714 on behalf of https://github.com/yangw-dev due to the pr reverted internally, shipit does not apply revert, revert this manually ([comment](pytorch#176714 (comment)))
…l.cpp (pytorch#176713)"

This reverts commit 5a4d390.

Reverted pytorch#176713 on behalf of https://github.com/yangw-dev due to the pr reverted internally, shipit does not apply revert, revert this manually ([comment](pytorch#176713 (comment)))
Authored with Claude. Previously we were pickling the storage address, which caused different cache keys on different processes.

We should use the `_reduce_override` protocol rather than the `dispatch_table` protocol that AOTAutogradCache uses, because the latter would require us to enumerate all subclass types manually.

@diff-train-skip-merge this change was reverted internally

Pull Request resolved: pytorch#173526
Approved by: https://github.com/aorenste, https://github.com/seemethere
This reverts commit 088c5a7.

Reverted pytorch#173330 on behalf of https://github.com/yangw-dev due to reverted internally, original:D96556656, revert diff: D96725665 ([comment](pytorch#173330 (comment)))
…pytorch#177841)

When `create_fx_graph_from_captured_output` builds the GraphModule, it
previously `.copy()`'d `_parameters` and `_buffers` from the root module.
Although `.copy()` is a shallow copy (tensor objects are shared), the
problem is that `_reparametrize_module` doesn't mutate tensors in-place—it
replaces dict entries (`module._buffers[name] = functional_tensor`). With
separate dicts, that replacement is only visible in the GraphModule's dict.
DynamoBytecodeFlatten replays bytecode that accesses the original module's
dict through a Python closure, so it still sees the raw (non-functional)
tensor. The mismatch causes "mutating a non-functional tensor with a
functional tensor" in aot_export_joint_with_descriptors.

Sharing the dicts instead of copying makes the reparametrization visible to
the closure, since both the GraphModule and the closure reference the same
dict object.
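The pitfall can be reduced to a few lines of plain Python (the string values are illustrative stand-ins for the real buffer tensors):

```python
buffers = {"b": "raw_tensor"}     # the module's _buffers dict

copied = buffers.copy()           # shallow copy: separate dict object
shared = buffers                  # the fix: share the same dict object

# _reparametrize_module REPLACES the entry rather than mutating the tensor:
buffers["b"] = "functional_tensor"

assert copied["b"] == "raw_tensor"          # copy still sees the stale entry
assert shared["b"] == "functional_tensor"   # shared dict sees the replacement
```

Because the replacement is a dict-entry swap and not an in-place tensor mutation, only a shared dict makes it visible to both the GraphModule and the bytecode closure.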

Authored with Claude.

Pull Request resolved: pytorch#177841
Approved by: https://github.com/zhxchen17
…en (pytorch#177795)

Summary: TSIA.

Test Plan: CI.

Reviewed By: malfet

Differential Revision: D95726765

Pull Request resolved: pytorch#177795
Approved by: https://github.com/malfet, https://github.com/Skylion007
…orch#177613)

Previously, Event.synchronize() was traced as a call_method proxy node
which would get silently dropped during AOTAutograd re-tracing since it's
not a dispatched operator. This adds a proper torch.ops.streams.synchronize_event
custom op (following the pattern of record_event/wait_event) and updates
EventVariable.call_method to emit it as a call_function node.

Authored with Claude.

Pull Request resolved: pytorch#177613
Approved by: https://github.com/williamwen42
pytorch#177614)

Add make_fallback for synchronize_event so Inductor can lower it, and
update wrap_all_sync_nodes_with_control_deps to handle the new op. Since
synchronize_event only takes an event_index (no stream arg), the stream
is inferred from the matching record_event via a new event_to_stream map.

In this case, we thread through dependencies that were recorded before the event and used after the synchronize, to prevent reordering across the synchronize that would break correctness.
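A minimal sketch of the event_to_stream bookkeeping described above (function names and stream labels are illustrative, not the actual Inductor API):

```python
event_to_stream = {}

def record_event(event_index, stream):
    # record_event carries both the event and its stream, so remember the pairing
    event_to_stream[event_index] = stream

def stream_for_synchronize(event_index):
    # synchronize_event only receives an event_index; the stream is
    # inferred from the matching record_event
    return event_to_stream[event_index]

record_event(3, "stream_7")
assert stream_for_synchronize(3) == "stream_7"
```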

Authored with Claude.

Pull Request resolved: pytorch#177614
Approved by: https://github.com/aorenste
ghstack dependencies: pytorch#177613
anijain2305 and others added 3 commits March 20, 2026 06:16
Add a `python_value_for_identity()` hook to `VariableTracker` that
returns the underlying Python object for identity (`is`) comparison.
The base implementation delegates to `as_python_constant()`; VTs that
wrap a real Python object but don't support `as_python_constant()`
(e.g. `UserDefinedObjectVariable`, `NNModuleVariable`,
`ObjectVariable`, `StreamVariable`, `EventVariable`) override it to
return `self.value` directly.

This lets `handle_is` compare any two VTs generically — including
cross-type comparisons that previously fell through to a graph break —
and removes 5 redundant type-specific handler entries that each
reimplemented the same pattern with slightly different attribute access
(`.value`, `.fn`, `get_submodule()`).

Also consolidates the duplicate `NO_SUCH_SUBOBJ` sentinel definitions
from `misc.py` and `user_defined.py` into `base.py`.

Authored with Claude.

Pull Request resolved: pytorch#177720
Approved by: https://github.com/guilhermeleobas
…176864)

When creating meta tensors for a parameter's .grad, the param's symbolic_context was reused verbatim. This caused an assertion failure when the param and grad had different view base dimensionalities — e.g. in FSDP2 where param._local_tensor is a view of an N-D padded base but grad._local_tensor is a view of a 1-D flat gradient buffer.

Build a grad-specific symbolic context via all_dynamic_symbolic_context instead of recycling the param's, resolving the TODO that was already in the code.

FIXES pytorch#176667
Pull Request resolved: pytorch#176864
Approved by: https://github.com/Lucaskabela
…ytorch#176675)

On AMD/HIP targets, annotate Triton kernel pointer arguments with tt.pointer_range=32 when the tensor's storage is provably within 2GB. This enables Triton's canonicalize_pointers pass to decompose 64-bit pointer arithmetic into (splat(base), offset) form using 32-bit offsets, which in turn allows ConvertToBufferOps to emit efficient amdgpu.buffer_load/store instructions.

This improves performance across all kernels going through the Triton compilation path with small tensors as it generates efficient buffer ops.

Pull Request resolved: pytorch#176675
Approved by: https://github.com/eellison
@BBBela BBBela changed the title Workaround for oneDNN convolution issue. [xpu] Fix conv2d incorrect results and alignment errors for non-64-byte-aligned tensors on XPU. Mar 20, 2026
pytorchmergebot and others added 25 commits March 20, 2026 12:57
…ytorch#176864)"

This reverts commit 928cada.

Reverted pytorch#176864 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](pytorch#176864 (comment)))
qwen 1.7b

FA3 H100
<img width="2440" height="470" alt="image" src="https://github.com/user-attachments/assets/abd34080-ff77-4cfc-a108-3e4a0908142c" />

FA4 H100
<img width="2430" height="444" alt="image" src="https://github.com/user-attachments/assets/09048439-5e09-4729-842f-832a8b5e2b2c" />

FA4 B200
<img width="2398" height="490" alt="image" src="https://github.com/user-attachments/assets/97d590ae-47fa-49b5-a4bc-500f052398e5" />

Pull Request resolved: pytorch#177675
Approved by: https://github.com/drisspg
…#157198) (pytorch#176573)

Fixes pytorch#157144

Updated Batch, Instance and Group Norm implementations to match LayerNorm's, in the sense that a `bias: bool` argument controls whether bias is used (only relevant if `affine: bool = True`). The C/C++ implementations already accommodate the case where weight is not None but bias is None, so changing the Python API is sufficient. I have left RMSNorm for now because I can't navigate the C++ files to make all the necessary changes for adding the bias parameter.

This is a duplicate of [https://github.com/pytorch/pytorch/pull/157198](https://github.com/pytorch/pytorch/pull/157198) which went stale, and was eventually closed due to inactivity.
Pull Request resolved: pytorch#176573
Approved by: https://github.com/mikaylagawarecki
…rch#177864)

AOTInductor embeds Triton cubin files into the output .so by converting
each cubin to an ELF object with `ld -r -b binary` + two `objcopy` calls.
For models with many kernels (e.g., MoE with 660+ cubins), this spawns
3 * N subprocesses sequentially, taking ~50 seconds for the conversion
alone. This is a significant portion of the total AOT compilation time.

Add `batch_convert_cubins_to_obj()` which generates a single `.S` file
with `.incbin` directives for all cubins and compiles it with one
compiler invocation. This replaces N * 3 sequential subprocess calls
with a single `gcc -c` call.

Falls back to the original per-cubin `convert_cubin_to_obj()` if the
batched assembly fails (e.g., compiler doesn't support `.incbin`).
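A sketch of the batched `.S` generation under the assumptions above (the symbol-naming scheme here is invented for illustration; the real helper is `batch_convert_cubins_to_obj`):

```python
def batch_incbin_asm(cubin_paths):
    # Emit one assembly file with an .incbin directive per cubin, plus
    # start/end symbols so the runtime can locate each embedded blob.
    # One `gcc -c` on this file replaces 3 subprocess calls per cubin.
    lines = []
    for i, path in enumerate(cubin_paths):
        sym = f"__cubin_{i}"
        lines += [
            f".global {sym}_start",
            f"{sym}_start:",
            f'.incbin "{path}"',
            f"{sym}_end:",
        ]
    return "\n".join(lines) + "\n"

asm = batch_incbin_asm(["k0.cubin", "k1.cubin"])
assert asm.count(".incbin") == 2
assert '.incbin "k1.cubin"' in asm
```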

For a 4-layer MoE model (661 cubins, 14.5 MB):
- Before: 51.4 seconds (1983 subprocess calls)
- After:  0.08 seconds (1 subprocess call)
- Speedup: ~640x for the cubin embedding phase

For large models (40-layer MoE, ~2000+ cubins), this saves several
minutes from the total export time.

No change on Windows (cubin embedding is already skipped via
`if not _IS_WINDOWS` guard).

Bit-exactness verified: extracted rodata from batched .o and
per-cubin .o files across 661 cubins (14.5 MB total). MD5 checksums
match, symbol names (_start, _end, _size) and addresses are identical.
Verified via `objdump -t` and `objcopy -O binary -j .rodata`.

Existing AOTInductor test covers the embed_kernel_binary path:
```
python test/inductor/test_aot_inductor.py -k "test_simple_embed_kernel_binary_True" -v
```

Pull Request resolved: pytorch#177864
Approved by: https://github.com/yushangdi, https://github.com/desertfire
…ading best tensor and int match search (pytorch#174993)

Fix pytorch#166926

### Summary

Fixes graph shadowing bugs where dynamic graphs steal inputs from more specialized graphs in the Dynamo cache, causing correctness issues (activation checkpointing) and performance regressions (less optimized kernels).

### Problem

Dynamo's cache is a linked list where newer entries are checked first. When `automatic_dynamic_shapes` promotes a dimension or scalar from static to dynamic, the new dynamic graph is inserted at the front. Since it accepts a wider range of inputs, it shadows all previously compiled, more specialized graphs — even for inputs those graphs were specifically compiled to handle.

**Example:** A function first called with shape `[3, 4]` compiles a static Graph 0. A second call with `[5, 4]` triggers automatic dynamic and compiles Graph 1 `(s0, 4)`. Now `[3, 4]` hits Graph 1 instead of Graph 0, because Graph 1 is checked first and accepts any `s0`. The static graph becomes dead code. **The graphs have to be free of (data-dependent) branches.**

This matters for:
- **Correctness:** Activation checkpointing relies on specific graphs having specific recompute behavior. Shadowing can route backward passes through the wrong graph.
- **Performance:** Static graphs give the compiler full shape information for better kernel selection, memory planning, and operator fusion. A dynamic graph handling a recurring static shape leaves optimization on the table.
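The shadowing in the example above can be modeled with a few lines of Python (a toy stand-in for Dynamo's C++ cache, not the real implementation):

```python
cache = []  # toy model: newest entry first, like Dynamo's linked list

def add_graph(guard, name):
    cache.insert(0, (guard, name))   # new entries are always prepended

def lookup(shape):
    for guard, name in cache:        # checked front to back
        if guard(shape):
            return name

add_graph(lambda s: s == (3, 4), "graph0_static")    # first compile: static [3, 4]
add_graph(lambda s: s[1] == 4, "graph1_dynamic")     # automatic dynamic: (s0, 4)

assert lookup((5, 4)) == "graph1_dynamic"
assert lookup((3, 4)) == "graph1_dynamic"  # static graph is shadowed: dead code
```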

### Design Decisions

**Why exclusion guards instead of cache reordering?**

The root cause is cache ordering — more specialized graphs should be checked first. Reordering the cache (checking static before dynamic) would eliminate the need for exclusion guards entirely. However, the cache infrastructure lives in C++ (`eval_frame.c`) and always prepends new entries. Exclusion guards are a Python-only fix that achieves the same result without touching the cache insertion logic. This is the minimal, lowest-risk approach. Cache reordering remains a future option that would simplify the guard system further.

**Why reset `excluded_sizes` fresh each merge instead of accumulating?**

Previously, `excluded_sizes` accumulated across successive merges via `__ior__`. When multiple dimensions became dynamic in separate steps, the exclusion guard used AND logic (`not all(x.size(d) == v for d, v in ...)`), requiring an input to match *every* excluded dimension simultaneously to be rejected. This meant inputs could slip past the exclusion by differing on just one dimension. Resetting fresh each merge ensures each graph's exclusion only covers dimensions that changed in that specific step, making each dimension check independent.

**Multiple tensors and scalars at the same time**

All exclusion pairs `(symbol, excluded_value)` across all tensors and scalars are flattened into a single list and guarded with `not all` semantics:
- `Or(Ne(s0, v0), Ne(s1, v1), ...)` — rejects only when ALL excluded values match simultaneously
- If the current concrete values already match every excluded value, the guard is skipped (would fail on creation)
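The "not all" semantics can be sketched as follows (illustrative helper, not the actual guard code; `Or(Ne(...))` over symbols becomes `any(!=)` over concrete values):

```python
def passes_exclusion_guard(current, excluded):
    # Or(Ne(s0, v0), Ne(s1, v1), ...): reject only when ALL excluded
    # values match simultaneously, i.e. accept if ANY value differs.
    return any(c != e for c, e in zip(current, excluded))

assert not passes_exclusion_guard((3, 7), (3, 7))  # all match: routed to static graph
assert passes_exclusion_guard((5, 7), (3, 7))      # differs on one dim: accepted here
```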

### Changes

**`torch/_dynamo/pgo.py`**
- New `excluded_sizes` and `excluded_scalar` fields on `FrameStateSizeEntry`
- `__ior__` records old static values on transition, clears when no transition (prevents stale inheritance)

**`torch/_dynamo/variables/builder.py`**
- Passes `excluded_sizes` to `StatefulSymbolicContext` for tensors
- Passes `excluded_value` to `create_unspecified_symint_and_symbol` for scalars

**`torch/fx/experimental/symbolic_shapes.py`**
- `excluded_sizes` field on `StatefulSymbolicContext`
- `exclusion_constraints` flat list on `ShapeEnv`, populated via `record_exclusion_constraint`
- Tensor exclusion recorded in `_create_symbolic_sizes_strides_storage_offset`
- Scalar exclusion recorded in `create_unspecified_symint_and_symbol`
- Guard emission in `produce_guards_verbose`: flatten, `not all` skip, combined `Or(Ne(...))`

**`torch/_dynamo/config.py`**
- `stable_graph_selection_for_automatic_dynamic = True` (kill switch)

**`torch/utils/_sympy/printers.py`**
- `_print_Or` on `CppPrinter`

### Test Plan

- `test_automatic_dynamic_exclusive_guard_basic` — basic static → dynamic → revert
- `test_accumulated_exclusion_does_not_shadow_intermediate_graph` — tensor accumulation
- `test_4d_progressive_dynamism_cascading` — 4 graphs cascading
- `test_5d_two_rounds_of_dynamism` — non-adjacent dims
- `test_many_entries_wrong_graph_selection` — stress-test routing
- `test_multi_dim_dynamic_and_semantics` — not-all semantics with partial matches
- `test_integer_input_exclusion_basic` — scalar exclusion
- `test_integer_input_exclusion_accumulation` — scalar accumulation
- `test_two_tensor_inputs_exclusion` — multi-tensor combined Or
- `test_multi_tensor_and_scalar_accumulation` — comprehensive mixed test

Pull Request resolved: pytorch#174993
Approved by: https://github.com/laithsakka

Co-authored-by: Laith Sakka <lsakka@meta.com>
…pytorch#177484)

Allow dynamo to trace through stdlib copy.deepcopy by removing the
copy module from BUILTIN_SKIPLIST and copy.deepcopy from
_builtin_function_ids.  Dynamo now inlines the Python implementation
of deepcopy, which works for constant containers (dicts, lists,
tuples).  For tensors, __deepcopy__ triggers a graph break since
FakeTensors cannot be deep-copied.

To support id() on ConstDictVariable, ListVariable, TupleVariable,
ConstantVariable, SymNodeVariable, and SetVariable, introduce
FakeIdVariable — a compile-time-only variable that holds id(vt) and
can participate as a dict key but intentionally blocks reconstruction
across graph breaks to prevent silently baking stale ids into resumed
bytecode.

Authored with Claude.

Pull Request resolved: pytorch#177484
Approved by: https://github.com/Skylion007, https://github.com/guilhermeleobas
ghstack dependencies: pytorch#177720
…ytorch#177389)

Fixes pytorch#174891

PR pytorch#172181 guarded on `ndim == 0 and element_size != itemsize` for 0-d complex tensors. But the deduced size-1 tensor has `ndim=1` so the guard doesn't trigger for non-0-d cases. Dropped the `ndim == 0` check so we bail out whenever element sizes differ [[ref]](pytorch#172181 (comment))

Pull Request resolved: pytorch#177389
Approved by: https://github.com/aorenste
Add c10::metal::xlogy() to special_math.h and wire up the Metal kernel
functor, stub registration, and dispatch entry. Remove the old MPSGraph
implementation from BinaryOps.mm.

Authored with Claude.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull Request resolved: pytorch#177749
Approved by: https://github.com/dcci
…RONMENT exclusions (pytorch#177914)

No need to hardcode the environment: if one can do `sudo`, they should; if one cannot, they should not.
Pull Request resolved: pytorch#177914
Approved by: https://github.com/yangw-dev
Noticed recently that 2 PR tried to pattern-match something in `manylinux1-check.py`, that has been failing for quite some time, as none of PyTorch builds are manylinux1 compatible. For the same reason, deleted check_ssl.py, as all modern Python versions depend on SSL library

Renamed entire `.ci/docker/manywheel/build_scripts` folder to `s390_scripts` as those are only referenced from `Dockerfile_s390`

Co-authored-by: Claude <noreply@anthropic.com>
Pull Request resolved: pytorch#177800
Approved by: https://github.com/seemethere, https://github.com/atalman, https://github.com/Skylion007
  amax backward emits view (to reshape for broadcasting), eq (to build a
  mask of max positions), and unsqueeze (to add broadcast dimensions).
  Add propagation handlers for these ops so the auto_chunker can propagate
  through graphs containing amax, such as the numerically stable softmax
  pattern (amax → unsqueeze → sub → exp).

  The view handler uses a prefix-product approach to find the chunk
  dimension's new position after reshape, and the applier gets
  corresponding shape adjustment support for view nodes.

  Testing: Adding new unit test

<!-- ps-id: 0484ba32-e6b8-4234-ac15-76602582c5d1 -->

Pull Request resolved: pytorch#176505
Approved by: https://github.com/shunting314
Since the where kernel does not require any math operations on the last two inputs, it only needs to support different bitwidths. The where kernel now supports any data type of 1, 2, 4, 8, or 16 bytes in size and up to the same number in alignment.

Pull Request resolved: pytorch#174204
Approved by: https://github.com/ezyang, https://github.com/hameerabbasi, https://github.com/ngimel
ghstack dependencies: pytorch#177796
…ch#171270)"

This reverts commit 44326ea.

Reverted pytorch#171270 on behalf of https://github.com/yangw-dev due to sorry this breaks internal test due to error: undefined symbol: at::getHostAllocator(c10::DeviceType), please contact internal folks to fix this ([comment](pytorch#171270 (comment)))
Fix pytorch#147849

## Summary

1. Root cause
`_force_original_view_tracking` changes the view replay flag in `__init__`, so using it as `@decorator` mutates the global state as soon as the function is defined instead of when the wrapped function runs.

2. Proposed fix
Restore the previous view replay flag in `__call__`, set the requested flag in `__enter__`, and add a regression test that verifies decorator application preserves the ambient state until invocation.

3. Why this is the right long term fix
This follows the existing `set_grad_enabled` decorator pattern, preserves the current context-manager and function forms, and makes `_force_original_view_tracking` consistent with the rest of the grad-mode context managers.
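The fixed pattern from points 2 and 3 can be sketched in plain Python (`force_view_tracking` and the `STATE` dict are illustrative stand-ins, not the actual PyTorch internals):

```python
STATE = {"view_replay": False}   # stand-in for the global view-replay flag

class force_view_tracking:
    def __init__(self, mode):
        self.mode = mode          # only RECORD the request; do not mutate globals here

    def __enter__(self):
        self.prev = STATE["view_replay"]
        STATE["view_replay"] = self.mode   # mutate state only on entry

    def __exit__(self, *exc):
        STATE["view_replay"] = self.prev   # restore ambient state on exit

    def __call__(self, fn):
        # decorator form: re-enter a fresh context per invocation,
        # mirroring the set_grad_enabled decorator pattern
        def wrapper(*args, **kwargs):
            with self.__class__(self.mode):
                return fn(*args, **kwargs)
        return wrapper

@force_view_tracking(True)
def f():
    return STATE["view_replay"]

assert STATE["view_replay"] is False   # decorating f did NOT flip the global flag
assert f() is True                     # the flag is active only during the call
assert STATE["view_replay"] is False   # and restored afterwards
```

The buggy version mutated `STATE` in `__init__`, so merely defining `f` would have flipped the flag.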

Drafted via @codex, published after manual review by @bobrenjc93
Pull Request resolved: pytorch#177676
Approved by: https://github.com/aorenste, https://github.com/soulitzer
Add pointwise tag to elu_, celu_, selu_, hardsigmoid_, hardtanh_, leaky_relu, leaky_relu_, and _conj_physical for consistency with their non-inplace variants.
Pull Request resolved: pytorch#173869
Approved by: https://github.com/isuruf
…ype mappings (pytorch#177529) (pytorch#177529)

Summary:

Register the new MTIA_COUNTERS activity type in PyTorch's kineto_shim.cpp:
- Add to kMtiaTypes set so the counter activities are collected during profiling
- Add to the device type switch so counter events map to DeviceType::MTIA

Depends on D95963886 which adds the MTIA_COUNTERS enum to libkineto.

Test Plan:
```
buck2 build fbcode//caffe2:torch-cpp-cpu
```

Differential Revision: D96575878

Pull Request resolved: pytorch#177529
Approved by: https://github.com/scotts
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
Signed-off-by: Benedykt Bela <benedykt.bela@intel.com>
@pytorchmergebot pytorchmergebot force-pushed the bbela/issue-2238-part-6 branch from a08d537 to 7cb8437 Compare March 24, 2026 08:44
