floor function triggers a host #702

Alexander-Barth · 2024-11-27T15:16:37Z

I am trying to port CUDA code to AMDGPU. A lot of things already work, but I have a problem with the floor function, which seems to trigger a host call. I guess the warning means that floor is not implemented for AMDGPUs and that the CPU version is used instead?
The Julia code:

using AMDGPU

a = Float32[1.2,2.3,4.4]
b = zeros(Int16,length(a))

a_d  = roc(a)
b_d  = roc(b)

function foo_d!(a_d,b_d)
    index = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    stride = gridGroupDim().x * workgroupDim().x

    @inbounds for i = index:stride:length(a_d)
        b_d[i] = floor(Int16,a_d[i])
    end
end

@roc foo_d!(a_d,b_d)
@show Array(b_d)

The output:

┌ Warning: Global hostcalls detected!
│ - Source: MethodInstance for foo_d!(::AMDGPU.Device.ROCDeviceVector{Float32, 1}, ::AMDGPU.Device.ROCDeviceVector{Float32, 1})
│ - Hostcalls: [:malloc_hostcall]
│ 
│ Use `AMDGPU.synchronize(; stop_hostcalls=false)` to synchronize and stop them.
│ Otherwise, performance might degrade if they keep running in the background.
└ @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/yqCEl/src/compiler/codegen.jl:208
Array(b_d) = Float32[1.0, 2.0, 4.0]
3-element Vector{Float32}:
 1.0
 2.0
 4.0

I use AMDGPU v1.1.2 on Julia 1.11.1.
Is there some information about how to implement this function?

It seems that ROCm (HIP) defines floating-point versions (single and double precision) of the floor function.
https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/reference/kernel_language.html

And floor(Float32,x) does seem to work.

However, in my case I need an integer, as I will use it as an index into an array.

In any case, thanks a lot for this great package!

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7A53 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 128 virtual cores)
Environment:
  LD_LIBRARY_PATH = /opt/cray/pe/papi/7.1.0.1/lib64:/opt/cray/libfabric/1.15.2.0/lib64

julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                     │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /opt/rocm/llvm/bin/ld.lld                                                                │
│     +     │ Device Libraries │ -         │ /users/barthale/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode │
│     +     │ HIP              │ 6.0.32831 │ /opt/rocm-6.0.3/lib/libamdhip64.so                                                       │
│     +     │ rocBLAS          │ 4.0.0     │ /opt/rocm-6.0.3/lib/librocblas.so                                                        │
│     +     │ rocSOLVER        │ 3.24.0    │ /opt/rocm-6.0.3/lib/librocsolver.so                                                      │
│     +     │ rocSPARSE        │ -         │ /opt/rocm-6.0.3/lib/librocsparse.so                                                      │
│     +     │ rocRAND          │ 2.10.5    │ /opt/rocm-6.0.3/lib/librocrand.so                                                        │
│     +     │ rocFFT           │ 1.0.27    │ /opt/rocm-6.0.3/lib/librocfft.so                                                         │
│     +     │ MIOpen           │ 3.0.0     │ /opt/rocm-6.0.3/lib/libMIOpen.so                                                         │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │                Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘

pxl-th · 2024-11-27T15:30:53Z

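Hi. That is because the regular floor with an integer target type, e.g. floor(Int16, x), checks that the floored value can be represented exactly in that type, and throws an InexactError if the check fails. Constructing the error boxes the original value, which requires an allocation and therefore launches the malloc hostcall.

For a more GPU-friendly alternative, call floor without the conversion and then apply unsafe_trunc: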

julia> @code_llvm unsafe_trunc(Int, floor(1f0))
; Function Signature: unsafe_trunc(Type{Int64}, Float32)
;  @ float.jl:416 within `unsafe_trunc`
define i64 @julia_unsafe_trunc_836(float %"x::Float32") #0 {
top:
  %0 = fptosi float %"x::Float32" to i64
  %1 = freeze i64 %0
  ret i64 %1
}
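
A minimal sketch of how this suggestion could be wrapped for reuse. The helper name `floor_int` is hypothetical (it is not part of Base or AMDGPU.jl), and the example runs on the CPU only, to show that both variants agree for finite, in-range inputs:

```julia
# Hypothetical helper (not part of Base or AMDGPU.jl): floor in floating point,
# then convert without the check that can throw InexactError.
@inline floor_int(::Type{T}, x::AbstractFloat) where {T<:Integer} =
    unsafe_trunc(T, floor(x))

a = Float32[1.2, 2.3, 4.4, -0.5]
b = [floor_int(Int16, x) for x in a]

# For finite, in-range inputs the result matches the checked conversion.
@assert b == floor.(Int16, a)
```

Under that assumption, the loop body of the kernel in the question would become `b_d[i] = floor_int(Int16, a_d[i])`. Note that `unsafe_trunc` returns an arbitrary value, rather than throwing, when the input is NaN or does not fit in the target type.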

You can also compare the unsafe_trunc LLVM code above with that of the original floor(Int, 1f0) to see how much less it does:

julia> @code_llvm floor(Int, 1f0)
; Function Signature: floor(Type{Int64}, Float32)
;  @ rounding.jl:475 within `floor`
define i64 @julia_floor_794(float %"x::Float32") #0 {
top:
  %jlcallframe1 = alloca [3 x ptr], align 8
  %gcframe2 = alloca [4 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 32, i1 true)
  %thread_ptr = call ptr asm "movq %fs:0, $0", "=r"() #9
  %tls_ppgcstack = getelementptr i8, ptr %thread_ptr, i64 -8
  %tls_pgcstack = load ptr, ptr %tls_ppgcstack, align 8
  store i64 8, ptr %gcframe2, align 16
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  %task.gcstack = load ptr, ptr %tls_pgcstack, align 8
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %tls_pgcstack, align 8
; ┌ @ rounding.jl:479 within `round` @ float.jl:463
   %0 = call float @llvm.floor.f32(float %"x::Float32")
; │ @ rounding.jl:479 within `round`
; │┌ @ rounding.jl:480 within `_round_convert`
; ││┌ @ number.jl:7 within `convert`
; │││┌ @ float.jl:991 within `Int64`
; ││││┌ @ float.jl:619 within `<=`
       %1 = fcmp ult float %0, 0xC3E0000000000000
; ││││└
      %2 = fcmp uge float %0, 0x43E0000000000000
      %narrow.not = or i1 %1, %2
      %3 = fsub float %0, %0
      %4 = fcmp une float %3, 0.000000e+00
      %or.cond = or i1 %narrow.not, %4
      br i1 %or.cond, label %L17, label %L15

L15:                                              ; preds = %top
; ││││ @ float.jl:992 within `Int64`
; ││││┌ @ float.jl:416 within `unsafe_trunc`
       %5 = fptosi float %0 to i64
       %6 = freeze i64 %5
       %frame.prev9 = load ptr, ptr %frame.prev, align 8
       store ptr %frame.prev9, ptr %tls_pgcstack, align 8
; ││││└
      ret i64 %6

L17:                                              ; preds = %top
; ││││ @ float.jl:994 within `Int64`
      %7 = load ptr, ptr getelementptr (i8, ptr @jl_small_typeof, i64 256), align 8
      %gc_slot_addr_1 = getelementptr inbounds ptr, ptr %gcframe2, i64 3
      store ptr %7, ptr %gc_slot_addr_1, align 8
      %box_Float32 = call ptr @ijl_box_float32(float %0)
      %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
      store ptr %box_Float32, ptr %gc_slot_addr_0, align 16
      store ptr @"jl_sym#Int64#807.jit", ptr %jlcallframe1, align 8
      %8 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
      store ptr %7, ptr %8, align 8
      %9 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 2
      store ptr %box_Float32, ptr %9, align 8
      %10 = call nonnull ptr @j1_InexactError_805(ptr nonnull @"+Core.InexactError#806.jit", ptr nonnull %jlcallframe1, i32 3)
      call void @ijl_throw(ptr nonnull %10)
      unreachable
; └└└└
}
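
The checked version above guards against out-of-range and non-finite values before converting. If the result is used as an array index, a similar guard can be written without a throwing path. The following is only a sketch under that assumption: `floor_index` is a hypothetical helper, and mapping NaN and out-of-range values to the nearest valid index is a design choice that may not suit every application.

```julia
# Hypothetical helper: floor x and clamp the result to the index range 1:n,
# using only comparisons and ifelse, so no error object is ever constructed.
@inline function floor_index(x::Float32, n::Int)
    y = floor(x)
    hi = Float32(n)
    # Comparisons with NaN are false, so NaN falls through to the lower bound.
    y = ifelse(y >= hi, hi, ifelse(y >= 1f0, y, 1f0))
    # y is now finite and within [1, hi], so the conversion is well defined.
    i = unsafe_trunc(Int, y)
    # Float32(n) can round up for very large n, so clamp once more as an integer.
    return min(i, n)
end

@assert floor_index(2.7f0, 5) == 2
@assert floor_index(-3f0, 5) == 1
@assert floor_index(NaN32, 5) == 1
@assert floor_index(Inf32, 5) == 5
```

The trade-off is that invalid inputs are silently clamped instead of raising an error, so this only fits cases where that behaviour is acceptable.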

pxl-th · 2024-11-27T15:32:10Z

And the reason it works on CUDA is that CUDA has a malloc intrinsic for that.

Alexander-Barth · 2024-11-28T09:18:20Z

Thanks a lot for this precious information :-)

This solves the issue for me!
(just in case you are curious, I am porting some code to in-paint satellite images of the ocean https://github.com/gher-uliege/DINCAE.jl)

pxl-th · 2024-11-28T14:02:01Z

Sounds cool! Feel free to open new issues if you bump into them!

Alexander-Barth closed this as completed Nov 28, 2024
