
Commit c9c04f0

Add intrinsic for launch-sized workgroup memory on GPUs
Workgroup memory is a memory region that is shared between all threads in a workgroup on GPUs. It can be allocated statically or after compilation, when launching a gpu-kernel. The intrinsic added here returns the pointer to the memory that is allocated at launch time.

# Interface

With this change, workgroup memory can be accessed in Rust by calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T` intrinsic. It returns a pointer to workgroup memory that is guaranteed to be aligned to at least the alignment of `T`. The pointer is dereferenceable for the size specified when launching the current gpu-kernel (which may be the size of `T`, but can also be larger, smaller, or zero). All calls to this intrinsic return a pointer to the same address. See the intrinsic documentation for more details.

## Alternative Interfaces

It was also considered to expose dynamic workgroup memory as extern static variables in Rust, like they are represented in LLVM IR. However, because the pointer is not guaranteed to be dereferenceable (that depends on the size allocated at runtime), such a global would have to be zero-sized, which makes global variables a bad fit.

# Implementation Details

Workgroup memory on amdgpu and nvptx lives in address space 3. Workgroup memory from a launch is implemented by creating an external global variable in address space 3. The global is declared with size 0, as the actual size is only known at runtime. It is defined behavior in LLVM to access an external global outside its defined size.

There is no similar way to query the allocated size of launch-sized workgroup memory on amdgpu and nvptx, so users have to pass this out-of-band or rely on target-specific ways for now.
1 parent 83e49b7 commit c9c04f0
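As a usage sketch, a gpu-kernel could use the intrinsic as follows. This is a hypothetical kernel body, not code from this commit: `n` and `thread_id` are assumed to come from launch parameters, the host is assumed to have requested `n * size_of::<f32>()` bytes at launch time, and the code only compiles for the amdgpu and nvptx64 targets.

```rust
#![feature(core_intrinsics, gpu_launch_sized_workgroup_mem)]

use core::intrinsics::gpu_launch_sized_workgroup_mem;

// Hypothetical gpu-kernel body; `n` and `thread_id` would come from
// the launch configuration on the concrete target.
unsafe fn kernel_body(n: usize, thread_id: usize) {
    // Pointer to the launch-sized workgroup memory region, aligned to
    // at least align_of::<f32>(). Every thread in the workgroup gets
    // the same address.
    let scratch: *mut f32 = gpu_launch_sized_workgroup_mem::<f32>();
    if thread_id < n {
        // Safety: in bounds of the size passed at launch time. Access
        // must be synchronized with the other threads in the workgroup
        // (e.g. via a workgroup barrier) before anyone reads it back.
        unsafe { scratch.add(thread_id).write(thread_id as f32) };
    }
}
```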

File tree

11 files changed: +169 additions, −7 deletions

compiler/rustc_abi/src/lib.rs

Lines changed: 3 additions & 0 deletions
@@ -1719,6 +1719,9 @@ pub struct AddressSpace(pub u32);
 impl AddressSpace {
     /// LLVM's `0` address space.
     pub const ZERO: Self = AddressSpace(0);
+    /// The address space for workgroup memory on nvptx and amdgpu.
+    /// See e.g. the `gpu_launch_sized_workgroup_mem` intrinsic for details.
+    pub const GPU_WORKGROUP: Self = AddressSpace(3);
 }

 /// The way we represent values to the backend

compiler/rustc_codegen_llvm/src/declare.rs

Lines changed: 23 additions & 0 deletions
@@ -14,6 +14,7 @@
 use std::borrow::Borrow;

 use itertools::Itertools;
+use rustc_abi::AddressSpace;
 use rustc_codegen_ssa::traits::TypeMembershipCodegenMethods;
 use rustc_data_structures::fx::FxIndexSet;
 use rustc_middle::ty::{Instance, Ty};
@@ -97,6 +98,28 @@ impl<'ll, CX: Borrow<SCx<'ll>>> GenericCx<'ll, CX> {
             )
         }
     }
+
+    /// Declare a global value in a specific address space.
+    ///
+    /// If there's a value with the same name already declared, the function will
+    /// return its Value instead.
+    pub(crate) fn declare_global_in_addrspace(
+        &self,
+        name: &str,
+        ty: &'ll Type,
+        addr_space: AddressSpace,
+    ) -> &'ll Value {
+        debug!("declare_global(name={name:?}, addrspace={addr_space:?})");
+        unsafe {
+            llvm::LLVMRustGetOrInsertGlobalInAddrspace(
+                (**self).borrow().llmod,
+                name.as_c_char_ptr(),
+                name.len(),
+                ty,
+                addr_space.0,
+            )
+        }
+    }
 }

 impl<'ll, 'tcx> CodegenCx<'ll, 'tcx> {

compiler/rustc_codegen_llvm/src/intrinsic.rs

Lines changed: 42 additions & 2 deletions
@@ -1,7 +1,9 @@
 use std::assert_matches::assert_matches;
 use std::cmp::Ordering;

-use rustc_abi::{Align, BackendRepr, ExternAbi, Float, HasDataLayout, Primitive, Size};
+use rustc_abi::{
+    AddressSpace, Align, BackendRepr, ExternAbi, Float, HasDataLayout, Primitive, Size,
+};
 use rustc_codegen_ssa::base::{compare_simd_types, wants_msvc_seh, wants_wasm_eh};
 use rustc_codegen_ssa::codegen_attrs::autodiff_attrs;
 use rustc_codegen_ssa::common::{IntPredicate, TypeKind};
@@ -20,7 +22,7 @@ use rustc_session::config::CrateType;
 use rustc_span::{Span, Symbol, sym};
 use rustc_symbol_mangling::{mangle_internal_symbol, symbol_name_for_instance_in_crate};
 use rustc_target::callconv::PassMode;
-use rustc_target::spec::Os;
+use rustc_target::spec::{Arch, Os};
 use tracing::debug;

 use crate::abi::FnAbiLlvmExt;
@@ -553,6 +555,44 @@ impl<'ll, 'tcx> IntrinsicCallBuilderMethods<'tcx> for Builder<'_, 'll, 'tcx> {
                 return Ok(());
             }

+            sym::gpu_launch_sized_workgroup_mem => {
+                // The name of the global variable is not relevant; the important properties are:
+                // 1. The global is in the address space for workgroup memory.
+                // 2. It is an extern global.
+                // All instances of extern addrspace(gpu_workgroup) globals are merged in the
+                // LLVM backend. Generate an unnamed global per intrinsic call, so that different
+                // kernels can have different minimum alignments.
+                // See https://docs.nvidia.com/cuda/cuda-c-programming-guide/#shared
+                // FIXME: Work around an nvptx backend issue that extern globals must have a name.
+                let name = if tcx.sess.target.arch == Arch::Nvptx64 {
+                    "gpu_launch_sized_workgroup_mem"
+                } else {
+                    ""
+                };
+                let global = self.declare_global_in_addrspace(
+                    name,
+                    self.type_array(self.type_i8(), 0),
+                    AddressSpace::GPU_WORKGROUP,
+                );
+                let ty::RawPtr(inner_ty, _) = result.layout.ty.kind() else { unreachable!() };
+                // The alignment of the global is used to specify the *minimum* alignment that
+                // must be obeyed by the GPU runtime.
+                // When multiple of these global variables are used by a kernel, the maximum
+                // alignment is taken.
+                // See https://github.com/llvm/llvm-project/blob/a271d07488a85ce677674bbe8101b10efff58c95/llvm/lib/Target/AMDGPU/AMDGPULowerModuleLDSPass.cpp#L821
+                let alignment = self.align_of(*inner_ty).bytes() as u32;
+                unsafe {
+                    // FIXME: Work around the above naming issue by taking the maximum alignment
+                    // if the global already existed.
+                    if tcx.sess.target.arch == Arch::Nvptx64 {
+                        if alignment > llvm::LLVMGetAlignment(global) {
+                            llvm::LLVMSetAlignment(global, alignment);
+                        }
+                    } else {
+                        llvm::LLVMSetAlignment(global, alignment);
+                    }
+                }
+                self.cx().const_pointercast(global, self.type_ptr())
+            }
+
             _ if name.as_str().starts_with("simd_") => {
                 // Unpack non-power-of-2 #[repr(packed, simd)] arguments.
                 // This gives them the expected layout of a regular #[repr(simd)] vector.
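To illustrate the lowering, the codegen above should produce LLVM IR along these lines for an amdgpu kernel. This is a hand-written sketch, not actual compiler output; the kernel name and the `align 8` value (assuming a `T` with 8-byte alignment) are illustrative.

```llvm
; Unnamed zero-sized extern global in the workgroup address space (3).
; Its alignment records the minimum alignment the GPU runtime must provide.
@0 = external addrspace(3) global [0 x i8], align 8

define amdgpu_kernel void @kernel() {
  ; The intrinsic call lowers to the global's address, cast to the
  ; generic address space.
  %ptr = addrspacecast ptr addrspace(3) @0 to ptr
  ret void
}
```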

compiler/rustc_codegen_llvm/src/llvm/ffi.rs

Lines changed: 7 additions & 0 deletions
@@ -2017,6 +2017,13 @@ unsafe extern "C" {
         NameLen: size_t,
         T: &'a Type,
     ) -> &'a Value;
+    pub(crate) fn LLVMRustGetOrInsertGlobalInAddrspace<'a>(
+        M: &'a Module,
+        Name: *const c_char,
+        NameLen: size_t,
+        T: &'a Type,
+        AddressSpace: c_uint,
+    ) -> &'a Value;
     pub(crate) fn LLVMRustGetNamedValue(
         M: &Module,
         Name: *const c_char,

compiler/rustc_codegen_ssa/src/mir/intrinsic.rs

Lines changed: 1 addition & 0 deletions
@@ -111,6 +111,7 @@ impl<'a, 'tcx, Bx: BuilderMethods<'a, 'tcx>> FunctionCx<'a, 'tcx, Bx> {
             sym::abort
             | sym::unreachable
             | sym::cold_path
+            | sym::gpu_launch_sized_workgroup_mem
             | sym::breakpoint
             | sym::assert_zero_valid
             | sym::assert_mem_uninitialized_valid

compiler/rustc_hir_analysis/src/check/intrinsic.rs

Lines changed: 2 additions & 0 deletions
@@ -132,6 +132,7 @@ fn intrinsic_operation_unsafety(tcx: TyCtxt<'_>, intrinsic_id: LocalDefId) -> hi
         | sym::forget
         | sym::frem_algebraic
         | sym::fsub_algebraic
+        | sym::gpu_launch_sized_workgroup_mem
         | sym::is_val_statically_known
         | sym::log2f16
         | sym::log2f32
@@ -293,6 +294,7 @@ pub(crate) fn check_intrinsic_type(
         sym::offset_of => (1, 0, vec![tcx.types.u32, tcx.types.u32], tcx.types.usize),
         sym::rustc_peek => (1, 0, vec![param(0)], param(0)),
         sym::caller_location => (0, 0, vec![], tcx.caller_location_ty()),
+        sym::gpu_launch_sized_workgroup_mem => (1, 0, vec![], Ty::new_mut_ptr(tcx, param(0))),
         sym::assert_inhabited | sym::assert_zero_valid | sym::assert_mem_uninitialized_valid => {
             (1, 0, vec![], tcx.types.unit)
         }

compiler/rustc_llvm/llvm-wrapper/RustWrapper.cpp

Lines changed: 16 additions & 5 deletions
@@ -261,10 +261,10 @@ extern "C" LLVMValueRef LLVMRustGetOrInsertFunction(LLVMModuleRef M,
           .getCallee());
 }

-extern "C" LLVMValueRef LLVMRustGetOrInsertGlobal(LLVMModuleRef M,
-                                                  const char *Name,
-                                                  size_t NameLen,
-                                                  LLVMTypeRef Ty) {
+extern "C" LLVMValueRef
+LLVMRustGetOrInsertGlobalInAddrspace(LLVMModuleRef M, const char *Name,
+                                     size_t NameLen, LLVMTypeRef Ty,
+                                     unsigned AddressSpace) {
   Module *Mod = unwrap(M);
   auto NameRef = StringRef(Name, NameLen);

@@ -275,10 +275,21 @@
   GlobalVariable *GV = Mod->getGlobalVariable(NameRef, true);
   if (!GV)
     GV = new GlobalVariable(*Mod, unwrap(Ty), false,
-                            GlobalValue::ExternalLinkage, nullptr, NameRef);
+                            GlobalValue::ExternalLinkage, nullptr, NameRef,
+                            nullptr, GlobalValue::NotThreadLocal, AddressSpace);
   return wrap(GV);
 }

+extern "C" LLVMValueRef LLVMRustGetOrInsertGlobal(LLVMModuleRef M,
+                                                  const char *Name,
+                                                  size_t NameLen,
+                                                  LLVMTypeRef Ty) {
+  Module *Mod = unwrap(M);
+  unsigned AddressSpace = Mod->getDataLayout().getDefaultGlobalsAddressSpace();
+  return LLVMRustGetOrInsertGlobalInAddrspace(M, Name, NameLen, Ty,
+                                              AddressSpace);
+}
+
 // Must match the layout of `rustc_codegen_llvm::llvm::ffi::AttributeKind`.
 enum class LLVMRustAttributeKind {
   AlwaysInline = 0,

compiler/rustc_span/src/symbol.rs

Lines changed: 1 addition & 0 deletions
@@ -1152,6 +1152,7 @@ symbols! {
         global_asm,
         global_registration,
         globs,
+        gpu_launch_sized_workgroup_mem,
         gt,
         guard_patterns,
         half_open_range_patterns,

library/core/src/intrinsics/mod.rs

Lines changed: 39 additions & 0 deletions
@@ -3436,6 +3436,45 @@ pub(crate) const fn miri_promise_symbolic_alignment(ptr: *const (), align: usize
     )
 }

+/// Returns the pointer to workgroup memory allocated at launch time on GPUs.
+///
+/// Workgroup memory is a memory region that is shared between all threads in
+/// the same workgroup. It is faster to access than other memory, but pointers
+/// to it do not work outside the workgroup where they were obtained.
+/// Workgroup memory can be allocated statically or after compilation, when
+/// launching a gpu-kernel. `gpu_launch_sized_workgroup_mem` returns the pointer
+/// to the memory that is allocated at launch time.
+/// The size of this memory can differ between launches of a gpu-kernel,
+/// depending on what is specified at launch time.
+/// However, the alignment is fixed by the kernel itself, at compile time.
+///
+/// The returned pointer is the start of the workgroup memory region that is
+/// allocated at launch time.
+/// All calls to `gpu_launch_sized_workgroup_mem` in a workgroup, independent of
+/// the generic type, return the same address and therefore alias the same memory.
+/// The returned pointer is aligned to at least the alignment of `T`.
+///
+/// # Safety
+///
+/// The pointer is safe to dereference from the start (the returned pointer) up to the
+/// size of workgroup memory that was specified when launching the current gpu-kernel.
+///
+/// The user must take care of synchronizing access to workgroup memory between
+/// threads in a workgroup. The usual data race requirements apply.
+///
+/// # Other APIs
+///
+/// CUDA and HIP call this dynamic shared memory, shared between threads in a block.
+/// OpenCL and SYCL call this local memory, shared between threads in a work-group.
+/// GLSL calls this shared memory, shared between invocations in a work group.
+/// DirectX calls this groupshared memory, shared between threads in a thread-group.
+#[must_use = "returns a pointer that does nothing unless used"]
+#[rustc_intrinsic]
+#[rustc_nounwind]
+#[unstable(feature = "gpu_launch_sized_workgroup_mem", issue = "135513")]
+#[cfg(any(target_arch = "amdgpu", target_arch = "nvptx64"))]
+pub fn gpu_launch_sized_workgroup_mem<T>() -> *mut T;
+
 /// Copies the current location of arglist `src` to the arglist `dst`.
 ///
 /// # Safety
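Since all calls return the same address independent of the generic type, a kernel can view the launch-sized region through different element types. A sketch of that property, to be read as part of a hypothetical kernel body on amdgpu or nvptx64:

```rust
// Both calls return the same base address; the region's minimum
// alignment becomes the maximum over all requested alignments in the
// kernel, here align_of::<u64>().
let as_bytes: *mut u8 = gpu_launch_sized_workgroup_mem::<u8>();
let as_words: *mut u64 = gpu_launch_sized_workgroup_mem::<u64>();
debug_assert_eq!(as_bytes as usize, as_words as usize);
```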

src/tools/tidy/src/style.rs

Lines changed: 4 additions & 0 deletions
@@ -222,6 +222,10 @@ fn should_ignore(line: &str) -> bool {
         || static_regex!(
             "\\s*//@ \\!?(count|files|has|has-dir|hasraw|matches|matchesraw|snapshot)\\s.*"
         ).is_match(line)
+        // Matching for FileCheck checks
+        || static_regex!(
+            "\\s*// [a-zA-Z0-9-_]*:\\s.*"
+        ).is_match(line)
 }

 /// Returns `true` if `line` is allowed to be longer than the normal limit.
