Implement a SYCL Onesweep Radix Sort KT#2575

Open
mmichel11 wants to merge 64 commits into main from
dev/mmichel11/onesweep_kt_coop

Conversation

@mmichel11 mmichel11 commented Feb 6, 2026

This PR adds a SYCL KT based on Onesweep. It adapts the ESIMD implementation with the following changes:

  • The histogram kernel has been optimized to reduce GRF usage so that all data is processed in a single pass instead of two, using an SLM-atomic approach with duplicate binning. On large inputs the speedup scales to roughly 2x over ESIMD.
  • Thread (sub-group) bincount offsets are removed from the onesweep kernel for performance reasons and programming-model differences; a sub-group ballot is used to count within a sub-group.
  • Cooperative groups are used via the SYCL sycl_ext_oneapi_forward_progress extension, and iterative decoupled lookback is performed to guarantee hardware safety. This resolves catastrophic errors encountered during BMG stress testing and slightly improves performance for smaller multi-work-group cases by removing the work-group atomic id counter.
  • For a single work-group, oneDPL sort is used with some small sub-group size changes to avoid work-group size limitations encountered at runtime on PVC.

Other relevant details:

  • The diff is very large, partly due to the restructuring of the ESIMD kernels. Note that most of these changes are just indentation; the only "real" ESIMD changes are the conversion of functions to structs with tag dispatch and the unification of dispatchers / submitters to share code with the SYCL version.
  • Testing is unified between ESIMD / SYCL.

@mmichel11 mmichel11 added this to the 2022.12.0 milestone Feb 6, 2026
@mmichel11 mmichel11 changed the title Implement a SYCL Onesweep KT Implement a SYCL Onesweep Radix Sort KT Feb 6, 2026
@mmichel11 mmichel11 marked this pull request as ready for review February 9, 2026 14:31
Copilot AI left a comment

Pull request overview

This PR adds a new SYCL implementation of the onesweep radix sort kernel template, refactors shared ESIMD/SYCL infrastructure, and unifies the KT test harness to run against either backend.

Changes:

  • Introduces SYCL onesweep radix sort implementation and integrates it into the kernel templates public header.
  • Refactors ESIMD radix sort internals into shared dispatcher/submitter/kernel components with tag-based dispatch.
  • Updates KT tests and CMake generation to build/run both ESIMD and SYCL variants via a unified test source set.

Reviewed changes

Copilot reviewed 22 out of 22 changed files in this pull request and generated 5 comments.

Show a summary per file:

  • test/kt/single_pass_scan.cpp: Switches the test include to the unified radix sort KT test utilities header.
  • test/kt/radix_sort_utils.h: Adds backend namespace aliases and new backend-aware SLM sizing logic for test skipping.
  • test/kt/radix_sort_out_of_place.cpp: Updates tests to call the backend-selected namespace (ESIMD vs SYCL) and removes the local can_run_test.
  • test/kt/radix_sort_by_key_out_of_place.cpp: Updates by-key out-of-place tests to call the backend-selected namespace.
  • test/kt/radix_sort_by_key.cpp: Updates by-key in-place tests to call the backend-selected namespace.
  • test/kt/radix_sort.cpp: Updates key-only tests to the backend-selected namespace and adds an ESIMD-only deprecated-namespace coverage path.
  • test/kt/CMakeLists.txt: Generalizes sort test generation to support both ESIMD and SYCL variants with backend compile definitions.
  • include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort_one_wg.h: Extends subgroup radix sort to support an explicit destination output range.
  • include/oneapi/dpl/pstl/hetero/dpcpp/parallel_backend_sycl_radix_sort.h: Updates callers to pass a destination range to the one-work-group radix sort path.
  • include/oneapi/dpl/experimental/kt/sycl_radix_sort.h: Adds the public SYCL KT API surface for radix_sort and radix_sort_by_key (in-/out-of-place).
  • include/oneapi/dpl/experimental/kt/internal/sycl_radix_sort_kernels.h: Adds the SYCL onesweep kernels (global histogram + onesweep reorder/lookback) implementation.
  • include/oneapi/dpl/experimental/kt/internal/sub_group/sub_group_scan.h: Updates the sub-group scan helper to accept either lazy storage or plain types and changes a backend include.
  • include/oneapi/dpl/experimental/kt/internal/radix_sort_utils.h: Adds shared SYCL scalar utilities, tags, and parameter validation used by both backends.
  • include/oneapi/dpl/experimental/kt/internal/radix_sort_submitters.h: Adds shared submitter/launch logic for ESIMD and SYCL kernels, including the one-WG fallback.
  • include/oneapi/dpl/experimental/kt/internal/radix_sort_kernels.h: Adds shared forward declarations of kernel functors for tag-based compilation.
  • include/oneapi/dpl/experimental/kt/internal/radix_sort_dispatchers.h: Refactors dispatchers into a shared namespace with tag dispatch for ESIMD vs SYCL.
  • include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_utils.h: Moves ESIMD internals into the shared kt::gpu::__impl namespace for unified use.
  • include/oneapi/dpl/experimental/kt/internal/esimd_defs.h: Moves ESIMD constants into the shared kt::gpu::__impl namespace for unified use.
  • include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_kernels.h: Converts ESIMD free functions into tagged kernel functors and unifies namespaces/internals.
  • include/oneapi/dpl/experimental/kt/internal/esimd_radix_sort_submitters.h: Removes the now-replaced ESIMD-only submitters header (replaced by the unified submitters).
  • include/oneapi/dpl/experimental/kt/esimd_radix_sort.h: Updates the ESIMD public API to route through shared dispatchers with an explicit __esimd_tag.
  • include/oneapi/dpl/experimental/kernel_templates: Exposes the new SYCL radix sort KT header in the kernel templates umbrella include.


Contributor

@danhoeflinger danhoeflinger left a comment

For others: I suggest turning on "Hide whitespace" when reviewing; the ESIMD changes render badly otherwise. They had to be massaged a bit to reuse the shared submitter infrastructure but in reality have minimal changes.

danhoeflinger and others added 10 commits February 24, 2026 19:33
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
* Implements two component histogram: SLM -> global memory with duplicate SLM bins to reduce atomic contention

Signed-off-by: Matthew Michel <matthew.michel@intel.com>
@mmichel11 mmichel11 force-pushed the dev/mmichel11/onesweep_kt_coop branch from 558b76b to f93c19d Compare February 25, 2026 03:50
Comment on lines +196 to +197
__get_num_work_groups(const sycl::kernel& __kernel, sycl::queue& __q, std::uint32_t __tile_count,
std::uint32_t __slm_size_bytes) const
Contributor

It seems like this function is a generic workaround utility that is not specific to the onesweep kernel. Do we want to put it somewhere more accessible, where other cooperative launch kernels could use it?

Also, am I correct that once the bug is fixed we would merely need the first statement of the function?

Contributor Author

Do we want to put it somewhere more accessible, where other cooperative launch kernels could use it?

It is a generic utility that can be used for any algorithm that needs cooperative kernels / work-group forward progress. The question from my side is where to put it. I think it should stay in the kt directory, probably in some generic header called kt_utils.h. Do you think something like this is worth adding now, or later once we have a secondary use case (e.g. inclusive_scan)?

Also, am I correct that once the bug is fixed we would merely need the first statement of the function?

Yes, everything beyond that is manually calculating things the driver should be handling.

Contributor

I think I'd be in favor of adding it, but it's your call. The risk of leaving it here is that a future implementer who wants it (or doesn't know they need it) won't know it exists.

Contributor Author

I decided to go ahead and add this new header and make it a free function.

Comment on lines +224 to +225
assert(__slm_granularity_it != std::cend(__slm_granularity_table));
const std::uint32_t __true_slm_size_bytes = *__slm_granularity_it;
Contributor

Nitpick perhaps, but can we just check this at runtime rather than having it be an assertion?
If we go over 128 KiB of SLM due to the user's data-per-work-item input, this would be UB from dereferencing OOB memory.

I guess we should probably be protecting against this case at compile time using the kernel params (I don't think we currently do this).

Contributor Author

We currently rely on SYCL throwing an exception if the kt params reserve more SLM than possible. We could:

  1. Just cap __slm_size_bytes to 128 KiB in this query and let SYCL throw an exception later.
  2. Document and throw our own exception for an invalid SLM reservation.

My preference for now is 1.

A static_assert is not the best solution for the SLM check, in my opinion, because future devices may have more SLM, which would force us to check at runtime anyway.

Contributor

Yeah, I think 1 is fine. It's the responsibility of the user to ensure that their workload fits on the card. I think for ESIMD we document the memory requirements (at least roughly).

Contributor Author

Done. It caps the provided value at 128 KiB.

Comment on lines +135 to +136
sycl::atomic_ref<_GlobOffsetT, sycl::memory_order::relaxed, sycl::memory_scope::device,
sycl::access::address_space::local_space>;
Contributor

I think sycl::memory_scope::work_group fits better than device here for SLM memory.

Contributor Author

Done.

Comment on lines +606 to +621
// When we reorder into SLM there are indexing offsets between bins due to contiguous storage that should not be reflected in global output as any given bin's
// total global offset is defined in __slm_global_incoming. We account for this by subtracting each bin's incoming slm index offset
// from __slm_global_incoming so that later adding the reorderered key's slm index to the fixed global offset yields the correct output index in the final stage.
//
//
// The sequence of computations for the fixed global offset is shown below, showing how we yield a valid output index in __reorder_slm_to_glob.
// For demonstration, slm_global_fix is separated from slm_global_incoming which can actually be modified in-place.
// slm_global_fix[bin] = slm_global_incoming[bin] - slm_group_hist[bin]
// slm_idx[key] = slm_group_hist[bin] + key offset within bin
// out_idx[key] = slm_global_fix[bin] + slm_idx[key]
// = slm_global_incoming[bin] - slm_group_hist[bin] + slm_group_hist[bin] + key offset within bin
// = slm_global_incoming[bin] + key offset within bin
//
// The case where __slm_group_hist[_i] > __slm_global_incoming[__i] is valid resulting in
// the difference yielding a large number due to guaranteed wrap around behavior with unsigned integers in the C++ spec.
// When this global fix is added to the reordered offset index the wraparound is undone, yielding the valid output index shown above.
Contributor

I was a bit confused by this explanation, so I did my best to write my own; take from it what you wish. Also, the line lengths of the current comments are very long; our clang-format won't fix this for you, since it preserves the formatting of comment blocks (it is sometimes intentional).

        // To avoid fully scattered global writes, we write data first grouped by bin to SLM,
        // then write in a partially coalesced manner from SLM to global memory.
        // When writing from SLM to global memory, we don't want to have to store or recalculate
        // the global offset for each item, but instead we can obtain it from its SLMIndex and
        // a pre-calculated constant offset per bin.

        // GlobalIndex = GlobalBaseOffset[bin] + LocalOffsetWithinBin
        // SLMIndex = SLMBaseOffset[bin] + LocalOffsetWithinBin

        // By isolating LocalOffsetWithinBin, we can express the GlobalIndex as:
        // GlobalIndex = GlobalBaseOffset[bin] + (SLMIndex - SLMBaseOffset[bin])
        // GlobalIndex = (GlobalBaseOffset[bin] - SLMBaseOffset[bin]) + SLMIndex

        // To save instructions during the final global write, we pre-calculate this constant
        // offset "fix" per bin. We overwrite __slm_global_incoming (GlobalBaseOffset) by subtracting
        // __slm_group_hist (SLMBaseOffset).
        // Later, during the global scatter, threads simply calculate:
        // GlobalIndex = __slm_global_incoming[bin] + SLMIndex
        //
        // Note: C++ guarantees wrap-around for unsigned integers (arithmetic modulo 2^32),
        // so this math works even if SLMBaseOffset > GlobalBaseOffset.

Contributor Author

Your explanation is better than mine :) I switched to it with some small adjustments.

static constexpr std::uint32_t __bit_count = sizeof(_KeyT) * 8;
static constexpr std::uint32_t __stage_count =
oneapi::dpl::__internal::__dpl_ceiling_div(__bit_count, __radix_bits);
static constexpr std::uint32_t __hist_data_per_sub_group = 128;
Contributor

Are there any situations where this hardcoded value should change?

I know that we have multiple kernels, so _KernelParam::__data_per_work_item is already dedicated to the onesweep kernel. Should we consider extending _KernelParams for sort to include more of these hardcoded parameters?

Contributor Author

This was one of the open questions with ESIMD sort (histogram tuning), if I remember correctly. I suggest we hold onto it for when we consider some of the other KT design aspects (e.g. single work-group) and just provide the same interface as ESIMD for now. My only concern with doing it now is exposing too many different parameters where only a few make a big performance difference.

It may be possible to fine-tune histogram further. However, since the initial algorithmic optimizations, histogram makes up <15% of execution time so the benefits of tuning will be small.

Contributor

Works for me.

__match_bins(sycl::sub_group __sub_group, std::uint32_t __bin)
{
// start with all bits 1
sycl::ext::oneapi::sub_group_mask __matched_bins = sycl::ext::oneapi::group_ballot(__sub_group);
Contributor

Should we guard this with _ONEDPL_LIBSYCL_SUB_GROUP_MASK_PRESENT, and provide an easily readable error otherwise?

Contributor Author

I went ahead and folded this into the _ONEDPL_ENABLE_SYCL_RADIX_SORT_KT check since we do not provide an alternative if it is not available. In practice with oneAPI, the forward progress extension being present implies the sub-group mask is present, as it was added several years earlier.

I also decided to add kt_defs.h, as I was originally defining these macros in the utils file.

Comment on lines +379 to +381
// TODO: This exists in the ESIMD KT and was ported but are we not limiting max input size to
// 2^30 ~ 1 billion elements? We use 32-bit indexing / histogram which may already be too small
// but are then reserving the two upper bits for lookback flags.
Contributor

In the short term, we probably just need to document a known limitation here, and perhaps add an assert checking the size of the sequence at the public interface level.

In the long term, we could consider an alternative API / setting that enables larger sequences via a separate implementation of the lookback flags, but we are probably not able to fit that in at this point.

Contributor Author

@mmichel11 mmichel11 Mar 4, 2026

Added this runtime check into __check_sycl_sort_params.

Yep, in the long term we need some KT option or API to support larger input sizes. There are two options:

  • Just provide the option to use 64-bit histograms instead of 32. This would support inputs up to 2^62 (far larger than any device memory)
  • Add an option to support up to 2^32 with a 32-bit histogram but with separate status and lookback value flags. This could also reduce the number of atomics, so it could even be considered as a default path (depending on how the atomic cost compares with the extra traffic and larger lookback allocation).

// but are then reserving the two upper bits for lookback flags.
constexpr std::uint32_t __global_accumulated = 0x40000000;
constexpr std::uint32_t __hist_updated = 0x80000000;
constexpr std::uint32_t __global_offset_mask = 0x3fffffff;
Contributor

Maybe this is a good way to do it, but it's a little strange to me to define these constants here and then pass them to the helpers as template arguments, rather than having them at the struct level and labeling them more clearly in the name as flags / masks.

I suppose this way they are defined close to where they are used / originate, but it is a little odd to me.

If you want to keep them here, perhaps add a comment that these are flags shared with the helpers via template arguments.

Contributor Author

Making these static constexpr members is better, in my opinion, and I made this change along with renaming them *_mask. Originally, this was one large monolithic function in ESIMD, where defining them inside the function made sense, but with separate functions they should just be class members.

Comment on lines +595 to +598
_LocOffsetT __group_incoming = __slm_group_hist[__bin];
_LocOffsetT __offset_in_bin =
(__sub_group_id == 0) ? 0 : __slm_subgroup_hists[(__sub_group_id - 1) * __bin_count + __bin];
_LocOffsetT __offset_across_bins = __group_incoming;
Contributor

Suggested change
_LocOffsetT __group_incoming = __slm_group_hist[__bin];
_LocOffsetT __offset_in_bin =
(__sub_group_id == 0) ? 0 : __slm_subgroup_hists[(__sub_group_id - 1) * __bin_count + __bin];
_LocOffsetT __offset_across_bins = __group_incoming;
_LocOffsetT __offset_across_bins = __slm_group_hist[__bin];
_LocOffsetT __offset_in_bin =
(__sub_group_id == 0) ? 0 : __slm_subgroup_hists[(__sub_group_id - 1) * __bin_count + __bin];

Looks like an unnecessary extra variable.

Contributor Author

Thanks, removed

@danhoeflinger
Contributor

I'm mostly just finding cosmetic stuff. I still need to do a little more looking at the work-group chained scan and some of the intra-/inter-scan bits of onesweep, but it's looking quite good so far.
