Optimize roi_align on BMG #1698

jianyizh · 2025-05-26T02:33:26Z

For input [1, 2048, 50, 75] and rois [1000, 5], roi_align takes 4.7 ms on PVC but 75 ms on BMG. Each ROI has 2048 x output_h x output_w work items that all read the same values from LLC, which is very slow on BMG. After caching those values in shared local memory, PVC takes 4.0 ms and BMG takes 7.5 ms. I also replaced some if/else branches with min/max and fixed a code style issue.
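As a rough illustration of the min/max change mentioned above, the sketch below contrasts a branchy coordinate clamp, in the style commonly used by bilinear-interpolation helpers, with a min/max form. The names Corner, clamp_branchy and clamp_minmax are hypothetical and not taken from this PR, and the actual kernel code may differ. The sketch assumes the coordinate is small enough to convert to int safely.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper type: the two neighbouring sample indices and the
// clamped coordinate used for bilinear interpolation along one axis.
struct Corner {
  int low;
  int high;
  float pos;
};

// Branchy form: clamp the coordinate with explicit if/else.
inline Corner clamp_branchy(float y, int size) {
  assert(size > 0);
  if (y <= 0.f) {
    y = 0.f;
  }
  int low = static_cast<int>(y);
  int high;
  if (low >= size - 1) {
    high = low = size - 1;
    y = static_cast<float>(low);
  } else {
    high = low + 1;
  }
  return Corner{low, high, y};
}

// min/max form: the same result without data-dependent branches.
inline Corner clamp_minmax(float y, int size) {
  assert(size > 0);
  y = std::max(y, 0.f);
  int low = std::min(static_cast<int>(y), size - 1);
  int high = std::min(low + 1, size - 1);
  y = std::min(y, static_cast<float>(size - 1));
  return Corner{low, high, y};
}
```

Both functions agree for coordinates below, inside and above the valid range, so the min/max form can stand in for the branchy one wherever that precondition holds.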

EikanWang · 2025-05-26T13:26:19Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

-    XPU_KERNEL_LOOP(item, index, nthreads_) {
-      // (n, c, ph, pw) is an element in the pooled output
+    auto wg = item.get_group(0);
+    auto idx = item.get_local_id(0);


Please rename this variable, or the variable on line 75: https://github.com/intel/torch-xpu-ops/pull/1698/files#diff-5d6dc19a588e273ebfc8bf9dcc23fdc67ff9e961075e5d8e1385c7e896ef3ce9R75.

I deleted this variable.

EikanWang · 2025-05-26T13:36:46Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

+      int item_per_rois,
+      int wg_per_roi,


Should the variable names be items_per_roi and wgs_per_roi accordingly?

The RoiAlignForwardKernel does not inherit from __SYCL_KER_CONFIG_CONVENTION__. When will the sycl_ker_config_convention be invoked? @xytintel

Renamed the variables and made the kernel inherit from __SYCL_KER_CONFIG_CONVENTION__.

EikanWang · 2025-05-26T13:40:37Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

@@ -160,20 +173,25 @@ struct RoiAlignForwardKernel {
        aligned_(aligned),
        rois_(rois),
        output_(output) {}
+  void sycl_ker_config_convention(sycl::handler& cgh) {
+    cache_roi_ = sycl_local_acc_t<T>(5, cgh);


Please define a named constant for the magic value 5, and add a comment explaining why the value is 5 rather than something else.

I added a comment.

EikanWang · 2025-05-26T13:44:06Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

+          local_range = (item_per_roi + 32 - 1) / 32 *
+              32; // wg can be smaller, but it is better to be a multiple of 32
+        }


Frankly, I don't understand the motivation for 32. If it represents the SIMD length, please define a named constant. Please also elaborate on why a multiple of 32 is better.

It's the SIMD length. @xytintel, can our block size be an arbitrary number?

EikanWang · 2025-05-26T13:56:09Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

+        cache_roi_[3] = offset_rois[3] * spatial_scale_ - offset;
+        cache_roi_[4] = offset_rois[4] * spatial_scale_ - offset;
+      }
+      item.barrier(sycl_local_fence);


The barrier may be bypassed by some work items, which would trigger a hardware hang. Please refine the logic at line 76 (if (index < item_per_roi_)) to ensure that every work item in the group reaches the barrier.
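To make the concern concrete, here is a small host-side model (plain C++, not SYCL) that counts how many work items of one work group would arrive at a group barrier under two control-flow shapes. All names are hypothetical and the loop only stands in for the work items of a group; it is a sketch of the general rule that a barrier must be reached by every work item of the group, not a description of how this PR fixed the issue.

```cpp
#include <cassert>

// Hypothetical tally for one simulated work group.
struct BarrierTally {
  int arrived;  // work items that reach the barrier
  int computed; // work items that do real work after the barrier
};

// Shape A: out-of-range work items leave before the barrier.
// If arrived < local_range, a real group barrier would never complete.
inline BarrierTally early_exit_shape(int local_range, int group_base,
                                     int items_per_roi) {
  BarrierTally t{0, 0};
  for (int lid = 0; lid < local_range; ++lid) {
    int index = group_base + lid;
    if (index >= items_per_roi) {
      continue; // models an early return before the barrier
    }
    ++t.arrived; // models the barrier call
    ++t.computed;
  }
  return t;
}

// Shape B: every work item reaches the barrier; only the work that
// follows it is guarded by the range check.
inline BarrierTally guarded_body_shape(int local_range, int group_base,
                                       int items_per_roi) {
  BarrierTally t{0, 0};
  for (int lid = 0; lid < local_range; ++lid) {
    int index = group_base + lid;
    ++t.arrived; // models the barrier call, reached unconditionally
    if (index < items_per_roi) {
      ++t.computed;
    }
  }
  return t;
}
```

With a group of 64 work items starting at index 64 and 100 items per ROI, shape A has only 36 arrivals, while shape B has all 64 arrive and still computes only the 36 valid items.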

EikanWang · 2025-05-26T14:10:47Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

+  const int item_per_roi_;
+  const int wg_per_roi_;


Comments are required. Please add an informative description for each variable.

EikanWang · 2025-05-26T14:28:53Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

+        int64_t local_range =
+            syclMaxWorkGroupSize<RoiAlignForwardKernel<scalar_t>>();
+        int item_per_roi = pooled_height * pooled_width * channels;
+        if (item_per_roi < local_range) {


The local_range is the maximum number of work items that can be in a single work group. Please assert that local_range is always a multiple of 32. Otherwise, local_range may be adjusted to exceed the maximum number of work items. Or does the SYCL spec guarantee that the value of syclMaxWorkGroupSize is always divisible by 32?

Updated. local_range will no longer be larger than the max work-group size.
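The sizing logic discussed in this thread can be sketched in plain C++ as follows. This is only an illustration under stated assumptions: kSimdLen, choose_local_range and wgs_per_roi are hypothetical names, the value 32 is the SIMD length mentioned above, and the clamp with std::min is one possible way to keep the rounded value within the device limit, not necessarily what the PR does.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumed SIMD length, taken from the discussion above.
constexpr int64_t kSimdLen = 32;

// Round the per-ROI work up to a multiple of the SIMD length, but never
// exceed the maximum work-group size reported for the kernel.
inline int64_t choose_local_range(int64_t items_per_roi, int64_t max_wg_size) {
  assert(items_per_roi > 0 && max_wg_size > 0);
  int64_t rounded = (items_per_roi + kSimdLen - 1) / kSimdLen * kSimdLen;
  return std::min(rounded, max_wg_size);
}

// Number of work groups needed to cover one ROI (ceiling division).
inline int64_t wgs_per_roi(int64_t items_per_roi, int64_t local_range) {
  assert(items_per_roi > 0 && local_range > 0);
  return (items_per_roi + local_range - 1) / local_range;
}
```

For example, 49 items per ROI with a limit of 1024 gives a local range of 64 and one work group per ROI, while a hypothetical limit of 1000 with 1000 items stays at 1000 instead of being rounded up to 1024.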

EikanWang · 2025-05-26T14:29:24Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

@@ -433,10 +447,20 @@ Tensor roi_align_kernel(
      input.scalar_type(),
      "roi_align_forward_kernel_xpu",
      [&] {
+        int64_t local_range =
+            syclMaxWorkGroupSize<RoiAlignForwardKernel<scalar_t>>();
+        int item_per_roi = pooled_height * pooled_width * channels;


item_per_roi -> items_per_roi

EikanWang · 2025-05-26T14:30:29Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

+          local_range = (item_per_roi + 32 - 1) / 32 *
+              32; // wg can be smaller, but it is better to be a multiple of 32
+        }
+        int wg_per_roi = (item_per_roi + local_range - 1) / local_range;


wg_per_roi -> wgs_per_roi.

EikanWang · 2025-05-26T14:32:16Z

src/ATen/native/xpu/sycl/UpSampleBilinear2dKernels.cpp

Only format changes, right?

Yes, only format changes. The auto-lint step changed this during the build. Why is our lint CI green?

Copilot

Pull Request Overview

This PR aims to optimize the roi_align performance on BMG by reducing repeated LLC memory accesses and streamlining conditional execution. Key changes include refactoring boundaries and conditional checks in the upsample kernels, and enhancing workgroup-based caching and indexing in the roi_align implementation.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
src/ATen/native/xpu/sycl/UpSampleBilinear2dKernels.cpp	Refined boundary condition handling and restructured the can_optimize condition
src/ATen/native/xpu/sycl/RoiAlignKernels.cpp	Updated bilinear interpolation clamping and improved ROI workgroup indexing with shared memory caching

Comments suppressed due to low confidence (1)

src/ATen/native/xpu/sycl/UpSampleBilinear2dKernels.cpp:608

Consider refactoring this compound conditional for 'can_optimize' to improve readability and maintainability, perhaps by extracting it into a helper function if it is reused.

can_optimize = can_optimize && (align_corners || (input_width == (rwidth * output_width) &&

Copilot · 2025-05-28T05:50:23Z

src/ATen/native/xpu/sycl/RoiAlignKernels.cpp

-      int roi_batch_ind = offset_rois[0];
+    // each roi will have 5 values, batch_idx,x1,y1,x2,y2
+    constexpr int roi_size = 5;
+    auto wg = item.get_group(0);


Ensure that using the workgroup id divided by wgs_per_roi_ to compute the ROI index accurately reflects the intended work distribution; a clarifying comment here would be helpful.

Suggested change

-    auto wg = item.get_group(0);
+    auto wg = item.get_group(0);
+    // Compute the ROI index (n) by dividing the workgroup ID (wg) by the number of workgroups per ROI (wgs_per_roi_).
+    // This ensures that each ROI is processed by the correct set of workgroups.
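The mapping described in this comment can be sketched on the host as below. The struct OutputCoord and the function decode_work_item are hypothetical, and the (c, ph, pw) ordering inside a ROI is an assumption based on the "(n, c, ph, pw) is an element in the pooled output" comment in the original kernel; the exact index arithmetic in the PR may differ.

```cpp
#include <cassert>

// Hypothetical result type: one element of the pooled output.
struct OutputCoord {
  int n;      // ROI index
  int c;      // channel
  int ph;     // pooled row
  int pw;     // pooled column
  bool valid; // false for padding work items past the end of the ROI
};

// Map (work-group id, local id) to an output element, assuming each ROI
// is covered by wgs_per_roi consecutive work groups of local_range items.
inline OutputCoord decode_work_item(int wg, int local_id, int local_range,
                                    int wgs_per_roi, int channels,
                                    int pooled_h, int pooled_w) {
  assert(local_range > 0 && wgs_per_roi > 0);
  assert(channels > 0 && pooled_h > 0 && pooled_w > 0);
  int items_per_roi = channels * pooled_h * pooled_w;
  int n = wg / wgs_per_roi; // which ROI this group belongs to
  int index = (wg % wgs_per_roi) * local_range + local_id; // offset in ROI
  bool valid = index < items_per_roi;
  int pw = index % pooled_w;
  int ph = (index / pooled_w) % pooled_h;
  int c = index / (pooled_w * pooled_h); // only meaningful when valid
  return OutputCoord{n, c, ph, pw, valid};
}
```

With 3 channels, a 2x2 pooled output, a local range of 8 and 2 work groups per ROI, work group 3 with local id 1 maps to ROI 1, channel 2, row 0, column 1. Local id 5 in the same group falls past the 12 items of the ROI and is marked invalid.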

jianyizh added 2 commits May 24, 2025 23:29

save

a56ddb6

style

3866176

jianyizh added kernel_optimization hw: BMG labels May 26, 2025

small dim

57fbe22

xytintel approved these changes May 26, 2025

Merge branch 'main' into jianyi/roi_align

60a466d

xytintel marked this pull request as ready for review May 26, 2025 04:43

EikanWang requested changes May 26, 2025

jianyizh and others added 5 commits May 26, 2025 23:25

update

378a035

style

2b07b17

fix barrier

8ea16f8

fix

834be4c

Merge branch 'main' into jianyi/roi_align

5153f32

jianyizh requested a review from Copilot May 28, 2025 05:49

Copilot AI reviewed May 28, 2025

jianyizh added 2 commits May 28, 2025 13:28

remove some if branch

a764ebe

style

1775046
