optimize embedding bag #1726

Open · wants to merge 11 commits into main from jianyi/embed_bag

Conversation


@jianyizh jianyizh commented Jun 9, 2025

I have removed the batch kernel config so that IGC will choose GRF mode 128, even with the SYCL assert inside the kernel and vec size = 8. We can now get optimization equivalent to the previous result.

1. Removing SYCL_KERNEL_ASSERT would change the GRF mode from 256 to 128, but because of the existing issue #1052 I did not remove it in this PR. We should add the NDEBUG flag later, or use vec_size = 4.
2. I saw instruction-fetch stalls caused by the if branches, so I moved them to template parameters.
3. I also fixed the vectorization; previously we did not actually enable it.
4. Previously we used only 256 threads per workgroup, although the maximum workgroup size is 1024.

Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:

|  | PVC | BMG |
| --- | --- | --- |
| main branch | 0.18 ms | 0.43 ms |
| current change | 0.08 ms | 0.23 ms |
| current change + assert removed | 0.07 ms | 0.22 ms |

|  | PVC | BMG |
| --- | --- | --- |
| remove sycl assert | 0.10 ms | 0.30 ms |
| remove branching | 0.08 ms | 0.28 ms |
| tiling | 0.087 ms | 0.22 ms |

Note: we stall at `vec_t other = w_vec_[i_off];`. When the vector size is 8, the assembly is four scalar loads (`load.ugm.d32.a64`, `load.ugm.d32.a64.flat[A+0x4]`, `load.ugm.d32.a64.flat[A+0x8]`, `load.ugm.d32.a64.flat[A+0xC]`); after the fix, it becomes a single `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see a 9% speedup.

PVC does not benefit from tiling: in this case there would be 32 workgroups but 64 Xe cores. However, even if we set vec_size = 4, tiling 2 batches is still a regression. The best config is vec_size = 4 with workgroup size = 512, which reaches 0.71 ms. On BMG there is no benefit from a smaller workgroup size.

@jianyizh jianyizh force-pushed the jianyi/embed_bag branch from eada1a6 to b566ba6 Compare June 9, 2025 04:53
@jianyizh jianyizh requested review from xytintel and EikanWang June 9, 2025 06:36
@jianyizh jianyizh marked this pull request as ready for review June 9, 2025 06:36
@Copilot Copilot AI review requested due to automatic review settings June 9, 2025 06:36

@pytorchxpubot

@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result, for reference only; note that each unique error message is reported only once:
  1. third_party.torch-xpu-ops.test.xpu.test_nn_xpu.TestNN test_LayerNorm_3d_no_affine_large_feature_cuda failed with error message:
     AssertionError: Tensor-likes are not close!

Triage bot response:

{
  "similar_issue_id": 845,
  "similar_issue_state": "closed",
  "issue_owner": "daisyden",
  "issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
  "root_causes": [
    "Discrepancies in LayerNorm implementation between CUDA and XPU.",
    "Potential differences in precision or kernel behavior affecting tensor outputs.",
    "Misalignment in computation leading to inconsistent gradients."
  ],
  "suggested_solutions": [
    "Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
    "Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
    "Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
  ]
}

@jianyizh jianyizh changed the title optimize embedding bag [dont merge] optimize embedding bag Jun 17, 2025
@jianyizh jianyizh changed the title [dont merge] optimize embedding bag optimize embedding bag Jun 19, 2025
@jianyizh jianyizh requested a review from Copilot July 12, 2025 15:11

@Copilot Copilot AI left a comment


Pull Request Overview

This PR optimizes the SYCL EmbeddingBag kernel by moving dynamic branches to compile-time template parameters, switching to a 1D launch with direct thread indexing, and improving vectorization and workgroup sizing for performance gains.

  • Added per_sample_weights_defined and padding_idx_defined template booleans to drive compile-time branching.
  • Replaced the old BatchKernelConfig driven 2D loop with a 1D kernel using global linear ID and explicit vectorized_feature_dim_len_.
  • Updated embedding_bag setup to compute vectorized_feature_dim, work_group_size, and num_work_group, and expanded macro calls for each combination of flags.

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Converted kernel functor to 1D, added new template parameters, removed old batch loop and config. |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Calculated vectorized dimensions, updated workgroup launch, and expanded instantiation macros. |
Comments suppressed due to low confidence (2)

src/ATen/native/xpu/sycl/EmbeddingBag.cpp:77

  • [nitpick] Variable name num_work_group could be more descriptive, for example num_workgroups or total_workgroups, to clearly distinguish it from work_group_size.
  int64_t num_work_group = ceil_div(

src/ATen/native/xpu/sycl/EmbeddingBag.h:24

  • [nitpick] Field name vectorized_feature_dim_len_ is quite verbose; consider shortening to feature_dim_len_ or vec_feature_len_ for readability and consistency.
    typename vec_idx_t,

Comment on lines +126 to +142

```cpp
  if (per_sample_weights.defined() && padding_idx != -1)                       \
    embedding_bag<scalar_t, accscalar_t, index_t, mode, vec_size, true, true>( \
        output.mutable_data_ptr<scalar_t>(),                                   \
        weight.const_data_ptr<scalar_t>(),                                     \
        indices.const_data_ptr<index_t>(),                                     \
        offsets.const_data_ptr<index_t>(),                                     \
        offset2bag.mutable_data_ptr<index_t>(),                                \
        bag_size.mutable_data_ptr<index_t>(),                                  \
        max_indices.mutable_data_ptr<index_t>(),                               \
        per_sample_weights.const_data_ptr<scalar_t>(),                         \
        index_size,                                                            \
        bag_num,                                                               \
        feature_dim,                                                           \
        padding_idx,                                                           \
        ignore_offsets,                                                        \
        num_row);                                                              \
  else if (!per_sample_weights.defined() && padding_idx != -1)                 \
```
Copilot AI Jul 12, 2025


The macro expansions for all four combinations of per_sample_weights and padding_idx introduce significant duplication. Consider refactoring the instantiation logic or using a helper to reduce boilerplate and improve maintainability.

