optimize embedding bag #1726
base: main
Conversation
@sys_pytorchxpubot triage result for run 15561567714. Triage bot UT analysis result is for reference only; please note that each unique error message is reported only once:
Triage bot response:

```json
{
  "similar_issue_id": 845,
  "similar_issue_state": "closed",
  "issue_owner": "daisyden",
  "issue_description": "The test TestNN.test_LayerNorm_3d_no_affine_large_feature_cuda failed with an AssertionError: Tensor-likes are not close! The error suggests a discrepancy in tensor values between CUDA and XPU implementations. The test involves computing outputs and gradients on both devices and asserting their closeness, which failed due to significant differences beyond the allowed tolerance.",
  "root_causes": [
    "Discrepancies in LayerNorm implementation between CUDA and XPU.",
    "Potential differences in precision or kernel behavior affecting tensor outputs.",
    "Misalignment in computation leading to inconsistent gradients."
  ],
  "suggested_solutions": [
    "Investigate and align the LayerNorm implementation across CUDA and XPU to ensure consistent results.",
    "Adjust tolerance levels if the discrepancies are deemed acceptable and not indicative of a broader issue.",
    "Consider skipping the test if the failure is consistent and not resolvable, similar to prior solutions for tensor comparison issues."
  ]
}
```
Pull Request Overview
This PR optimizes the SYCL EmbeddingBag kernel by moving dynamic branches to compile-time template parameters, switching to a 1D launch with direct thread indexing, and improving vectorization and workgroup sizing for performance gains.
- Added `per_sample_weights_defined` and `padding_idx_defined` template booleans to drive compile-time branching (see the sketch after this list).
- Replaced the old `BatchKernelConfig`-driven 2D loop with a 1D kernel using the global linear ID and an explicit `vectorized_feature_dim_len_`.
- Updated the `embedding_bag` setup to compute `vectorized_feature_dim`, `work_group_size`, and `num_work_group`, and expanded the macro calls for each combination of flags.
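A minimal sketch of the approach these bullets describe, covering only a sum-mode bag; every name here (`EmbeddingBagSumSketch`, the member fields, and so on) is an illustrative assumption rather than the PR's actual code, and the vectorized load path is omitted for brevity:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

// Illustrative only: boolean template parameters resolve the optional-feature
// branches at compile time, and the kernel is indexed by the 1D global linear
// id instead of a BatchKernelConfig-driven 2D loop. Sum mode, no vectorization.
template <
    typename scalar_t,
    typename index_t,
    bool per_sample_weights_defined,
    bool padding_idx_defined>
struct EmbeddingBagSumSketch {
  void operator()(sycl::nd_item<1> item) const {
    const int64_t gid = item.get_global_linear_id();
    if (gid >= bag_num_ * feature_dim_)
      return; // one work-item per (bag, feature) element
    const int64_t bag = gid / feature_dim_;
    const int64_t feat = gid % feature_dim_;

    scalar_t acc = 0;
    const index_t begin = offsets_[bag];
    const index_t end =
        (bag + 1 == bag_num_) ? static_cast<index_t>(index_size_) : offsets_[bag + 1];
    for (index_t i = begin; i < end; ++i) {
      if constexpr (padding_idx_defined) {
        if (indices_[i] == padding_idx_)
          continue; // this branch exists only in the padding_idx instantiation
      }
      scalar_t v = weight_[indices_[i] * feature_dim_ + feat];
      if constexpr (per_sample_weights_defined)
        v *= per_sample_weights_[i];
      acc += v;
    }
    output_[bag * feature_dim_ + feat] = acc;
  }

  scalar_t* output_;
  const scalar_t* weight_;
  const index_t* indices_;
  const index_t* offsets_;
  const scalar_t* per_sample_weights_;
  int64_t index_size_;
  int64_t bag_num_;
  int64_t feature_dim_;
  index_t padding_idx_;
};
```

Because the flags are template parameters, the compiler drops the untaken branch entirely in each instantiation, which is what removes the instruction-fetch stalls mentioned later in the discussion.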
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/ATen/native/xpu/sycl/EmbeddingBag.h | Converted kernel functor to 1D, added new template parameters, removed old batch loop and config. |
| src/ATen/native/xpu/sycl/EmbeddingBag.cpp | Calculated vectorized dimensions, updated workgroup launch, and expanded instantiation macros. |
Comments suppressed due to low confidence (2)
src/ATen/native/xpu/sycl/EmbeddingBag.cpp:77
- [nitpick] Variable name `num_work_group` could be more descriptive, for example `num_workgroups` or `total_workgroups`, to clearly distinguish it from `work_group_size`.

```cpp
int64_t num_work_group = ceil_div(
```
src/ATen/native/xpu/sycl/EmbeddingBag.h:24
- [nitpick] Field name `vectorized_feature_dim_len_` is quite verbose; consider shortening to `feature_dim_len_` or `vec_feature_len_` for readability and consistency.

```cpp
typename vec_idx_t,
```
```cpp
if (per_sample_weights.defined() && padding_idx != -1)                        \
  embedding_bag<scalar_t, accscalar_t, index_t, mode, vec_size, true, true>(  \
      output.mutable_data_ptr<scalar_t>(),                                    \
      weight.const_data_ptr<scalar_t>(),                                      \
      indices.const_data_ptr<index_t>(),                                      \
      offsets.const_data_ptr<index_t>(),                                      \
      offset2bag.mutable_data_ptr<index_t>(),                                 \
      bag_size.mutable_data_ptr<index_t>(),                                   \
      max_indices.mutable_data_ptr<index_t>(),                                \
      per_sample_weights.const_data_ptr<scalar_t>(),                          \
      index_size,                                                             \
      bag_num,                                                                \
      feature_dim,                                                            \
      padding_idx,                                                            \
      ignore_offsets,                                                         \
      num_row);                                                               \
else if (!per_sample_weights.defined() && padding_idx != -1)                  \
```
The macro expansions for all four combinations of `per_sample_weights` and `padding_idx` introduce significant duplication. Consider refactoring the instantiation logic or using a helper to reduce boilerplate and improve maintainability.
I have removed the batch kernel config so that IGC chooses GRF mode 128 even with the SYCL assert inside the kernel and vec_size = 8. We can now get optimization equivalent to the previous result.
1. Remove SYCL_KERNEL_ASSERT. This changes the GRF mode from 256 to 128, but there is an existing issue #1052, so I did not remove it in this PR; we should add the NDEBUG flag later or use vec_size = 4.
2. I saw instruction fetch stalls caused by the if branches, so I moved them to template parameters.
3. I also fixed the vectorization; previously we did not actually enable it.
4. Previously we used only 256 threads per workgroup, although the maximum workgroup size is 1024 (see the launch-size sketch after this list).
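A hedged sketch of how the launch sizes touched by point 4 could be computed; `make_launch_range` and the device-limit query are my own placeholders, while `ceil_div`, `work_group_size`, and `num_work_group` are names that appear in the PR:

```cpp
#include <sycl/sycl.hpp>
#include <cstdint>

inline int64_t ceil_div(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Each work-item handles one vec_size-wide chunk of one bag's feature dimension,
// and the work-group size comes from the device limit (e.g. 1024) rather than a
// fixed 256.
sycl::nd_range<1> make_launch_range(
    sycl::queue& q, int64_t bag_num, int64_t feature_dim, int64_t vec_size) {
  const int64_t vectorized_feature_dim = ceil_div(feature_dim, vec_size);
  const int64_t total_items = bag_num * vectorized_feature_dim;

  const int64_t work_group_size = static_cast<int64_t>(
      q.get_device().get_info<sycl::info::device::max_work_group_size>());
  const int64_t num_work_group = ceil_div(total_items, work_group_size);

  return sycl::nd_range<1>(
      sycl::range<1>(num_work_group * work_group_size),
      sycl::range<1>(work_group_size));
}
```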
Performance on input [409581], weight [1000000, 64], offset [4096] (4096 bags), dtype = half, mode = sum:

- remove sycl assert: 0.10 ms | 0.30 ms
- remove branching: 0.08 ms | 0.28 ms
- tiling: 0.087 ms | 0.22 ms (see the note below)

Note: we were stalled at `vec_t other = w_vec_[i_off];`. When the vector size is 8, the assembly is `load.ugm.d32.a64; load.ugm.d32.a64.flat[A+0x4]; load.ugm.d32.a64.flat[A+0x8]; load.ugm.d32.a64.flat[A+0xC];`. After the fix, it changes to `load.ugm.d32x4`. There is no performance change at peak frequency, but when profiling at a lower frequency I see a 9% speedup.

PVC does not benefit from tiling: in this case there would be 32 workgroups but 64 Xe cores. However, even with vec_size = 4, tiling 2 batches is still a regression. The best config is vec_size = 4 with workgroup size = 512, which reaches 0.71 ms. There is no benefit on BMG to setting a smaller workgroup size.