
Conversation

Contributor

@Stonepia Stonepia commented Nov 14, 2025

This fixes pytorch/pytorch#167253. It does the following:

  1. Use index_t instead of int and dispatch kernels accordingly. (follows [CUDA] Large max pool fix pytorch/pytorch#167427)
  2. Use NHWC when output > INT_MAX (follows cuda max_pool2d: switch to NHWC when output > INT_MAX to avoid overflow pytorch/pytorch#167322)
  3. Change other related dtypes (like num_wg) to index_t to avoid overflow; a minimal sketch of the dispatch follows after this list.
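
A minimal sketch of the index-type dispatch, under the assumption that dispatch keys off the 64-bit output element count (kernel and function names here are illustrative, not the exact ones in this PR):

// Illustrative only: the real kernels take tensor pointers, sizes and strides;
// this stub just reports which index type was selected.
#include <climits>
#include <cstdint>
#include <iostream>

template <typename index_t>
void launch_max_pool2d_kernel(int64_t output_numel) {
  std::cout << "indexing with " << sizeof(index_t) * 8 << "-bit integers for "
            << output_numel << " output elements\n";
}

void max_pool2d_dispatch(int64_t output_numel) {
  if (output_numel <= INT_MAX) {
    launch_max_pool2d_kernel<int32_t>(output_numel);  // cheap 32-bit indexing
  } else {
    launch_max_pool2d_kernel<int64_t>(output_numel);  // avoids int32 overflow
  }
}

int main() {
  max_pool2d_dispatch(int64_t(74) * 32 * 30090 * 40);  // > INT_MAX, picks int64_t
}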

Details

Test case:

import torch

x = torch.zeros(74, 32, 30090, 81, device=torch.device("xpu"), dtype=torch.bfloat16)
torch.nn.functional.max_pool2d(x, kernel_size=(1, 2), stride=(1, 2), ceil_mode=False, padding=0)

It will throw the error:

[MaxPool2d] Input shape: [74, 32, 30090, 81] output: [74, 32, 30090, 40]
[MaxPool2d] Strides: n=77993280 c=1 h=2592 w=32
[MaxPool2d] Memory format: ChannelsLast
[MaxPool2d Forward] ChannelsLast path: numBatch=74 numPlane=32 inputH=30090 inputW=81 outputH=30090 outputW=40 index_t=int64
[MaxPool2d Forward] Using vec_size=1 num_wg=-72057583935701024
Segmentation fault from GPU at 0xff00000c04e33000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
Segmentation fault from GPU at 0xff00000c04e33000, ctx_id: 1 (CCS) type: 0 (NotPresent), level: 1 (PDE), access: 0 (Read), banned: 1, aborting.
Abort was called at 279 line in file:
./shared/source/os_interface/linux/drm_neo.cpp
[1]    77805 IOT instruction (core dumped)  python

As the log above shows, num_wg overflows to a negative value, which causes the segfault.
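
As a rough numerical illustration (the exact expression used to compute num_wg inside the kernel may differ), the output shape [74, 32, 30090, 40] of the test case above already overflows 32-bit indexing:

// Sketch: 74 * 32 * 30090 * 40 = 2,850,124,800 output elements, which does
// not fit in a signed 32-bit integer and wraps to a negative value.
#include <cstdint>
#include <iostream>

int main() {
  int64_t numel64 = int64_t(74) * 32 * 30090 * 40;   // 2850124800
  int32_t numel32 = static_cast<int32_t>(numel64);   // wraps to -1444842496
  std::cout << "int64: " << numel64 << "\n";
  std::cout << "int32: " << numel32 << "\n";
}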

@Stonepia Stonepia marked this pull request as ready for review November 17, 2025 08:24
Copilot AI review requested due to automatic review settings November 17, 2025 08:24
Contributor

Copilot AI left a comment


Pull Request Overview

This PR fixes integer overflow issues in XPU max pooling operations on large tensors. It addresses a segmentation fault that occurred when the output element count exceeded INT_MAX by introducing index-type templating and automatic memory format selection.

Key Changes:

  • Introduced index_t template parameter (int32_t or int64_t) for kernel functors and functions to handle both small and large tensor sizes
  • Added validation functions can_use_int32_nhwc and can_use_int32_nchw to determine when int32 indexing is safe to use (a rough sketch follows after this list)
  • Automatically switches to ChannelsLast memory format when contiguous format would exceed int32 limits
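
A rough sketch of what such an int32-safety guard might look like, using a hypothetical name and a simplified check (the actual functions in the PR take tensor metadata and may bound different quantities, such as stride/size products):

// Sketch only, in the spirit of can_use_int32_nhwc / can_use_int32_nchw.
// Every linear index the kernel computes must fit in int32_t; the real
// checks may examine per-dimension offsets rather than raw element counts.
#include <climits>
#include <cstdint>

bool can_use_int32(int64_t input_numel, int64_t output_numel) {
  return input_numel <= INT_MAX && output_numel <= INT_MAX;
}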


Contributor

@EikanWang EikanWang left a comment


LGTM. Please address the copilot's comments.

  const vec_t* grad_output_vec = reinterpret_cast<const vec_t*>(gradOutput); \
  vec_t* grad_input_vec = reinterpret_cast<vec_t*>(gradInput); \
- auto kfn = MaxPool2dBackwardChannelLastVec<scalar_t, vec_t, vec_size>( \
+ auto kfn = MaxPool2dBackwardChannelLastVec<scalar_t, vec_t, vec_size, index_t>( \
Contributor


Pls. fix the code style.

Contributor Author


Thanks! Added in 474ac9d

@github-actions

Performance outliers, please check!

  • 🟡 [80%, 90%), may be fluctuations
| Category | Model | Target vs. Baseline [Eager] | Target vs. Baseline [Inductor] |
|---|---|---|---|
| torchbench_bfloat16_training | resnext50_32x4d | 0.934002 | 0.822647 |
| torchbench_bfloat16_training | squeezenet1_1 | 1.021592 | 0.832874 |
