optimize embedding bag #1726
base: main
Conversation
@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result is for reference only; please note that each unique error message is reported only once:
Triage bot response:

```json
{
  "similar_issue_id": 845,
  "similar_issue_state": "closed",
  "issue_owner": "daisyden",
  "issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
  "root_causes": [
    "Discrepancies in LayerNorm implementation between CUDA and XPU.",
    "Potential differences in precision or kernel behavior affecting tensor outputs.",
    "Misalignment in computation leading to inconsistent gradients."
  ],
  "suggested_solutions": [
    "Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
    "Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
    "Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
  ]
}
```
Pull Request Overview
This PR optimizes the SYCL EmbeddingBag kernel by moving dynamic branches to compile-time template parameters, switching to a 1D launch with direct thread indexing, and improving vectorization and workgroup sizing for performance gains.
- Added `per_sample_weights_defined` and `padding_idx_defined` template booleans to drive compile-time branching (see the sketch after this list).
- Replaced the old `BatchKernelConfig`-driven 2D loop with a 1D kernel using the global linear ID and an explicit `vectorized_feature_dim_len_`.
- Updated the `embedding_bag` setup to compute `vectorized_feature_dim`, `work_group_size`, and `num_work_group`, and expanded the macro calls for each combination of flags.
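A minimal sketch of the approach these bullets describe, covering only a sum-mode bag; every name here (`EmbeddingBagSumSketch`, the member fields, and so on) is an illustrative assumption rather than the PR's actual code, and the vectorized load path is omitted for brevity:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Illustrative only: boolean template parameters resolve the optional-feature
// branches at compile time, and the kernel is indexed by the 1D global linear
// id instead of a BatchKernelConfig-driven 2D loop. Sum mode, no vectorization.
template <
    typename scalar_t,
    typename index_t,
    bool per_sample_weights_defined,
    bool padding_idx_defined>
struct EmbeddingBagSumSketch {
  void operator()(sycl::nd_item<1> item) const {
    const int64_t gid = item.get_global_linear_id();
    if (gid >= bag_num_ * feature_dim_)
      return; // one work-item per (bag, feature) element
    const int64_t bag = gid / feature_dim_;
    const int64_t feat = gid % feature_dim_;

    scalar_t acc = 0;
    const index_t begin = offsets_[bag];
    const index_t end =
        (bag + 1 == bag_num_) ? static_cast<index_t>(index_size_) : offsets_[bag + 1];
    for (index_t i = begin; i < end; ++i) {
      if constexpr (padding_idx_defined) {
        if (indices_[i] == padding_idx_)
          continue; // this branch exists only in the padding_idx instantiation
      }
      scalar_t v = weight_[indices_[i] * feature_dim_ + feat];
      if constexpr (per_sample_weights_defined)
        v *= per_sample_weights_[i];
      acc += v;
    }
    output_[bag * feature_dim_ + feat] = acc;
  }

  scalar_t* output_;
  const scalar_t* weight_;
  const index_t* indices_;
  const index_t* offsets_;
  const scalar_t* per_sample_weights_;
  int64_t index_size_;
  int64_t bag_num_;
  int64_t feature_dim_;
  index_t padding_idx_;
};
```

Because the flags are template parameters, the compiler drops the untaken branch entirely in each instantiation, which is what removes the instruction-fetch stalls mentioned later in the discussion.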
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Converted kernel functor to 1D, added new template parameters, removed old batch loop and config. |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Calculated vectorized dimensions, updated workgroup launch, and expanded instantiation macros. |
Comments suppressed due to low confidence (2)
src/ATen/native/xpu/sycl/EmbeddingBag.cpp:77
- [nitpick] Variable name `num_work_group` could be more descriptive, for example `num_workgroups` or `total_workgroups`, to clearly distinguish it from `work_group_size`.

```cpp
int64_t num_work_group = ceil_div(
```
src/ATen/native/xpu/sycl/EmbeddingBag.h:24
- [nitpick] Field name `vectorized_feature_dim_len_` is quite verbose; consider shortening to `feature_dim_len_` or `vec_feature_len_` for readability and consistency.

```cpp
typename vec_idx_t,
```
```cpp
if (per_sample_weights.defined() && padding_idx != -1)                        \
  embedding_bag<scalar_t, accscalar_t, index_t, mode, vec_size, true, true>(  \
      output.mutable_data_ptr<scalar_t>(),                                    \
      weight.const_data_ptr<scalar_t>(),                                      \
      indices.const_data_ptr<index_t>(),                                      \
      offsets.const_data_ptr<index_t>(),                                      \
      offset2bag.mutable_data_ptr<index_t>(),                                 \
      bag_size.mutable_data_ptr<index_t>(),                                   \
      max_indices.mutable_data_ptr<index_t>(),                                \
      per_sample_weights.const_data_ptr<scalar_t>(),                          \
      index_size,                                                             \
      bag_num,                                                                \
      feature_dim,                                                            \
      padding_idx,                                                            \
      ignore_offsets,                                                         \
      num_row);                                                               \
else if (!per_sample_weights.defined() && padding_idx != -1)                  \
```
The macro expansions for all four combinations of `per_sample_weights` and `padding_idx` introduce significant duplication. Consider refactoring the instantiation logic or using a helper to reduce boilerplate and improve maintainability.
I have removed the batch kernel config so that IGC chooses GRF mode 128 even with the SYCL assert inside the kernel and vec_size = 8. We can now get optimization equivalent to the previous result.
1. Remove SYCL_KERNEL_ASSERT. This changes the GRF mode from 256 to 128, but there is an existing issue #1052, so I did not remove it in this PR; we should add the NDEBUG flag later or use vec_size = 4.
2. I saw instruction fetch stalls caused by the if branches, so I moved them to template parameters.
3. I also fixed the vectorization; previously we did not actually enable it.
4. Previously we used only 256 threads per workgroup, although the maximum workgroup size is 1024 (see the launch-size sketch after this list).
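A hedged sketch of how the launch sizes touched by point 4 could be computed; `make_launch_range` and the device-limit query are my own placeholders, while `ceil_div`, `work_group_size`, and `num_work_group` are names that appear in the PR:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

inline int64_t ceil_div(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Each work-item handles one vec_size-wide chunk of one bag's feature dimension,
// and the work-group size comes from the device limit (e.g. 1024) rather than a
// fixed 256.
sycl::nd_range<1> make_launch_range(
    sycl::queue& q, int64_t bag_num, int64_t feature_dim, int64_t vec_size) {
  const int64_t vectorized_feature_dim = ceil_div(feature_dim, vec_size);
  const int64_t total_items = bag_num * vectorized_feature_dim;

  const int64_t work_group_size = static_cast<int64_t>(
      q.get_device().get_info<sycl::info::device::max_work_group_size>());
  const int64_t num_work_group = ceil_div(total_items, work_group_size);

  return sycl::nd_range<1>(
      sycl::range<1>(num_work_group * work_group_size),
      sycl::range<1>(work_group_size));
}
```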
Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:

- remove sycl assert: 0.10 ms | 0.30 ms
- remove branching: 0.08 ms | 0.28 ms
- tiling: 0.087 ms | 0.22 ms (see the note below)

Note: we were stalled at `vec_t other = w_vec_[i_off];`. When the vector size is 8, the assembly is `load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4]; load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC];`. After the fix, it changes to `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see a 9% speedup.

PVC does not benefit from tiling: in this case there would be 32 workgroups but 64 Xe cores. However, even with vec_size = 4, tiling 2 batches is still a regression. The best config is vec_size = 4 with workgroup size = 512, which reaches 0.71 ms. There is no benefit on BMG to setting a smaller workgroup size.