Group Norm Backward Optimization with vectorization and parallel reduction #1652


Merged: yucai-intel merged 17 commits into main from yucai/gn_bw on May 30, 2025

Conversation

yucai-intel
Contributor

@yucai-intel yucai-intel commented May 11, 2025

  • Add vectorization implementations of group norm backward kernels, which increases the bandwidth of data reading and thus improves performance.
  • Optimize GroupReduceSum function with parallel reduction, which improves computational efficiency.
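The two ideas above can be sketched on the host side. This is a plain C++ simulation, not the actual SYCL kernel: `VEC_SIZE`, the lane-style partial sums, and the tree-shaped reduction are simplified stand-ins for the kernel's structure, and all names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int VEC_SIZE = 4;

// Vectorized accumulation: each step consumes VEC_SIZE contiguous elements.
// In the real kernel this maps to wider memory transactions, which is where
// the bandwidth improvement comes from.
float vectorized_sum(const std::vector<float>& x) {
    float partial[VEC_SIZE] = {};
    std::size_t i = 0;
    for (; i + VEC_SIZE <= x.size(); i += VEC_SIZE)
        for (int v = 0; v < VEC_SIZE; ++v)
            partial[v] += x[i + v];          // one "vector lane" per v
    float total = 0.f;
    for (int v = 0; v < VEC_SIZE; ++v) total += partial[v];
    for (; i < x.size(); ++i) total += x[i]; // scalar tail
    return total;
}

// Parallel (tree) reduction over a power-of-two number of values:
// O(log n) combining steps instead of a serial O(n) chain of additions,
// which is the efficiency gain behind the GroupReduceSum change.
float tree_reduce(std::vector<float> vals) {
    for (std::size_t stride = vals.size() / 2; stride > 0; stride /= 2)
        for (std::size_t j = 0; j < stride; ++j)
            vals[j] += vals[j + stride];
    return vals[0];
}
```

In the kernel, the tree steps run concurrently across work-items rather than sequentially as in this host sketch, which is why the depth of the combine (log n) rather than the element count (n) dominates the cost.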

@xytintel
Contributor

Please show the performance impact.

@EikanWang EikanWang requested a review from Copilot May 13, 2025 15:09
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds a vectorized functor version for the Group Norm Backward kernel to improve performance on systems supporting vectorized operations. Key changes include:

  • Addition of ComputeInternalGradientsVectorizedFunctor with vectorized reduction logic.
  • Conditional kernel launch based on vectorization capability.
  • Updated work-group size computation to accommodate the vectorized implementation.

Comment on lines 961 to 962
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);

Copilot AI May 13, 2025


It appears that inside the inner loop the value of sum1_vec[v] is overwritten in each iteration rather than accumulated. Consider using '+=' to aggregate results across iterations if that was the intended behavior.

Suggested change
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);
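Whether accumulation is actually intended depends on the surrounding loop structure, but the distinction the suggestion points at can be reproduced in isolation. A minimal sketch (plain C++, hypothetical helper names) of `=` versus `+=` inside a loop:

```cpp
#include <cassert>

// '=' inside the loop keeps only the value from the last iteration.
int last_only(const int* a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s = a[i];
    return s;
}

// '+=' aggregates contributions across all iterations.
int accumulated(const int* a, int n) {
    int s = 0;
    for (int i = 0; i < n; ++i) s += a[i];
    return s;
}
```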


Comment on lines 961 to 962
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);

Copilot AI May 13, 2025


Similar to the sum1_vec update, sum2_vec[v] is overwritten on each iteration of the inner loop instead of accumulating the results. If accumulation is intended, replace '=' with '+='.

Suggested change
sum1_vec[v] = static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] = static_cast<T_ACC>(vec_dY_[iv]);
sum1_vec[v] += static_cast<T_ACC>(vec_dY_[iv] * vec_X_[iv]);
sum2_vec[v] += static_cast<T_ACC>(vec_dY_[iv]);


@yucai-intel
Contributor Author

The performance is improved by 10%-40% under different shape settings.
[image: performance data table]

@EikanWang
Contributor

Please update the PR description to elaborate on why the changes improve performance, and include the detailed performance data.

Contributor

@EikanWang EikanWang left a comment


An informative PR description and comments are required.

@xytintel xytintel requested a review from EikanWang May 27, 2025 06:44
Contributor

@EikanWang EikanWang left a comment


In general, the optimization looks good to me. However, please address two common issues.

  • Please avoid using non-standard abbreviations.
  • Update the PR description to elaborate on the detailed optimization ideas and the detailed performance improvements.

using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;

[[intel::reqd_sub_group_size(SIMD)]] void operator()(
Contributor


@xytintel , @fengyuan14 , @gujinghui , could you help check the behavior of [[intel::reqd_sub_group_size(SIMD)]] on the latest XE?

Comment on lines 940 to 942
using T_ACC = acc_type_device<T, kXPU>;
using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
Contributor


What is the rule for choosing UPPER_CASE versus lower_case names in these using alias definitions?

Comment on lines 940 to 942
using T_ACC = acc_type_device<T, kXPU>;
using vec_t = memory::aligned_vector<T, VEC_SIZE>;
using vec_td = memory::aligned_vector<T_ACC, VEC_SIZE>;
Contributor


What are the meanings of _t and _td, respectively?

Contributor Author


Renamed to acc_vec_t to align with the overall code. vec_t and acc_vec_t represent vectors created with the corresponding data types.

sycl::nd_item<1> item) const {
vec_td sum1_vec = {};
vec_td sum2_vec = {};
auto g_start = item.get_group(0) * VEC_SIZE;
Contributor


What's the meaning of g_? group or global?

Contributor Author


It means group; renamed to group_start.


#pragma unroll
for (int v = 0; v < VEC_SIZE; ++v) {
const int64_t nc = g_start + v;
Contributor


v is a variable, so why is nc declared as a constant?

Copy link
Contributor


What's the abbreviation of nc?

Copy link
Contributor Author


nc is not an abbreviation; it means n*c in NCHW, and CUDA also uses this variable name in the same context.
Although v is a variable, it does not change within a single loop iteration, so nc can be const.
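For illustration, the fused index can be sketched as follows (plain C++; the helper name and the tensor extents are hypothetical, only the indexing scheme reflects the NCHW convention discussed above):

```cpp
#include <cstdint>

// Flat offset into a contiguous NCHW tensor with extents C, H, W per the
// usual row-major layout. nc fuses the batch index n and channel index c:
// each (n, c) pair addresses one H*W plane.
int64_t nchw_offset(int64_t n, int64_t c, int64_t h, int64_t w,
                    int64_t C, int64_t H, int64_t W) {
    const int64_t nc = n * C + c;  // the "nc" index from the kernel
    return (nc * H + h) * W + w;
}
```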


#pragma unroll
for (int v = 0; v < VEC_SIZE; ++v) {
sum1_vec[v] = GroupReduceSumWithoutBroadcast<T_ACC, SIMD>(
Contributor


GroupReduceSumWithoutBroadcast represents a sum reduction within a subgroup, right? Hence, why has the function been defined as GroupXXX?

Contributor Author


GroupReduceSumWithoutBroadcast represents a sum reduction within a group, and SubgroupReduceSumWithoutBroadcast represents a sum reduction within a subgroup.
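The two-level naming can be illustrated with a host-side simulation (plain C++, hypothetical names; the real code runs this across SYCL work-items): each subgroup reduces its SIMD lanes with a shuffle-style tree, leaving the result in lane 0 only ("WithoutBroadcast"), and the group-level function then combines one partial per subgroup.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int SIMD = 8;  // lanes per subgroup (sub-group size)

// Mirrors a shuffle_down tree: lane j adds lane j + offset at each step.
// Only lane 0 ends up holding the full sum, hence "WithoutBroadcast".
float subgroup_reduce_sum(std::vector<float> lanes) {
    for (int offset = SIMD / 2; offset > 0; offset /= 2)
        for (int j = 0; j < offset; ++j)
            lanes[j] += lanes[j + offset];
    return lanes[0];
}

// Group-level reduction over a buffer whose size is a multiple of SIMD:
// one subgroup partial per SIMD-sized chunk, then a final combine (the
// partials vector stands in for local/shared memory).
float group_reduce_sum(const std::vector<float>& x) {
    std::vector<float> partials;
    for (std::size_t base = 0; base < x.size(); base += SIMD)
        partials.push_back(subgroup_reduce_sum(
            {x.begin() + base, x.begin() + base + SIMD}));
    float total = 0.f;
    for (float p : partials) total += p;
    return total;
}
```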

@yucai-intel yucai-intel changed the title from "Add vectorized functor version for Group Norm Backward" to "Group Norm Backward Optimization with vectorization and parallel reduction" May 27, 2025
@xytintel xytintel requested a review from EikanWang May 28, 2025 01:29
@xytintel xytintel enabled auto-merge May 30, 2025 01:19
@xytintel xytintel dismissed EikanWang’s stale review May 30, 2025 01:29

All the requested changes have been addressed.

@xytintel xytintel added this pull request to the merge queue May 30, 2025
Merged via the queue into main with commit 5907931 May 30, 2025
7 checks passed
@xytintel xytintel deleted the yucai/gn_bw branch May 30, 2025 01:29
4 participants