floor function triggers a host #702

Closed
Alexander-Barth opened this issue Nov 27, 2024 · 4 comments

@Alexander-Barth

Alexander-Barth commented Nov 27, 2024

I am trying to port CUDA code to AMDGPU. A lot of things already work, but I have a problem with the floor function, which seems to trigger a host call. I guess the warning means that floor is not implemented for AMD GPUs and that the CPU version is used instead?
The Julia code:

using AMDGPU

a = Float32[1.2,2.3,4.4]
b = zeros(Int16,length(a))

a_d  = roc(a)
b_d  = roc(b)

function foo_d!(a_d,b_d)
    index = (workgroupIdx().x - 1) * workgroupDim().x + workitemIdx().x
    stride = gridGroupDim().x * workgroupDim().x

    @inbounds for i = index:stride:length(a_d)
        b_d[i] = floor(Int16,a_d[i])
    end
end

@roc foo_d!(a_d,b_d)
@show Array(b_d)

The output:

┌ Warning: Global hostcalls detected!
│ - Source: MethodInstance for foo_d!(::AMDGPU.Device.ROCDeviceVector{Float32, 1}, ::AMDGPU.Device.ROCDeviceVector{Float32, 1})
│ - Hostcalls: [:malloc_hostcall]
│ 
│ Use `AMDGPU.synchronize(; stop_hostcalls=false)` to synchronize and stop them.
│ Otherwise, performance might degrade if they keep running in the background.
└ @ AMDGPU.Compiler ~/.julia/packages/AMDGPU/yqCEl/src/compiler/codegen.jl:208
Array(b_d) = Float32[1.0, 2.0, 4.0]
3-element Vector{Float32}:
 1.0
 2.0
 4.0

I use AMDGPU v1.1.2 on julia 1.11.1.
Is there some information about how to implement this function?

It seems that ROCm (HIP) defines a floating-point version of the floor function, in single and double precision:
https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/reference/kernel_language.html

And floor(Float32,x) does seem to work.

However, in my case I need an integer, as I will use it as an index into an array.

In any case, thanks a lot for this great package!

julia> versioninfo()
Julia Version 1.11.1
Commit 8f5b7ca12ad (2024-10-16 10:53 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD EPYC 7A53 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 128 virtual cores)
Environment:
  LD_LIBRARY_PATH = /opt/cray/pe/papi/7.1.0.1/lib64:/opt/cray/libfabric/1.15.2.0/lib64

julia> AMDGPU.versioninfo()
[ Info: AMDGPU versioninfo
┌───────────┬──────────────────┬───────────┬──────────────────────────────────────────────────────────────────────────────────────────┐
│ Available │ Name             │ Version   │ Path                                                                                     │
├───────────┼──────────────────┼───────────┼──────────────────────────────────────────────────────────────────────────────────────────┤
│     +     │ LLD              │ -         │ /opt/rocm/llvm/bin/ld.lld                                                                │
│     +     │ Device Libraries │ -         │ /users/barthale/.julia/artifacts/5ad5ecb46e3c334821f54c1feecc6c152b7b6a45/amdgcn/bitcode │
│     +     │ HIP              │ 6.0.32831 │ /opt/rocm-6.0.3/lib/libamdhip64.so                                                       │
│     +     │ rocBLAS          │ 4.0.0     │ /opt/rocm-6.0.3/lib/librocblas.so                                                        │
│     +     │ rocSOLVER        │ 3.24.0    │ /opt/rocm-6.0.3/lib/librocsolver.so                                                      │
│     +     │ rocSPARSE        │ -         │ /opt/rocm-6.0.3/lib/librocsparse.so                                                      │
│     +     │ rocRAND          │ 2.10.5    │ /opt/rocm-6.0.3/lib/librocrand.so                                                        │
│     +     │ rocFFT           │ 1.0.27    │ /opt/rocm-6.0.3/lib/librocfft.so                                                         │
│     +     │ MIOpen           │ 3.0.0     │ /opt/rocm-6.0.3/lib/libMIOpen.so                                                         │
└───────────┴──────────────────┴───────────┴──────────────────────────────────────────────────────────────────────────────────────────┘

[ Info: AMDGPU devices
┌────┬─────────────────────┬────────────────────────┬───────────┬────────────┬───────────────┐
│ Id │                Name │               GCN arch │ Wavefront │     Memory │ Shared Memory │
├────┼─────────────────────┼────────────────────────┼───────────┼────────────┼───────────────┤
│  1 │ AMD Instinct MI250X │ gfx90a:sramecc+:xnack- │        64 │ 63.984 GiB │    64.000 KiB │
└────┴─────────────────────┴────────────────────────┴───────────┴────────────┴───────────────┘
@pxl-th
Member

pxl-th commented Nov 27, 2024

Hi. That is because the regular floor with an integer target type checks that the result can be represented exactly in that type. If the check fails, it throws an InexactError, which boxes the original value, and that allocation launches the malloc hostcall.

For a more GPU-friendly version, use floor without the type conversion, followed by unsafe_trunc:

julia> @code_llvm unsafe_trunc(Int, floor(1f0))
; Function Signature: unsafe_trunc(Type{Int64}, Float32)
;  @ float.jl:416 within `unsafe_trunc`
define i64 @julia_unsafe_trunc_836(float %"x::Float32") #0 {
top:
  %0 = fptosi float %"x::Float32" to i64
  %1 = freeze i64 %0
  ret i64 %1
}
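
As a minimal sketch of the suggestion above, the two calls can be wrapped in a small helper. The name `unchecked_floor` is hypothetical (it is not part of Base or AMDGPU.jl); the snippet runs on the CPU and only checks that, for finite in-range inputs, the result matches the checked `floor(T, x)`:

```julia
# Hypothetical helper, not part of Base or AMDGPU.jl:
# floor in floating point first, then convert with unsafe_trunc,
# which performs no range or NaN check and therefore cannot throw.
@inline unchecked_floor(::Type{T}, x::AbstractFloat) where {T<:Integer} =
    unsafe_trunc(T, floor(x))

a = Float32[1.2, 2.3, 4.4, -0.5]

# For finite values that fit in the target type, both versions agree.
@assert unchecked_floor.(Int16, a) == floor.(Int16, a)
@assert unchecked_floor(Int16, -0.5f0) == -1
```

In the kernel from the original post, the loop body would then presumably become `b_d[i] = unsafe_trunc(Int16, floor(a_d[i]))`. The trade-off is that `unsafe_trunc` returns an arbitrary value when the input does not fit in the target type, so the caller has to guarantee the range.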

You can also compare it with the original to see how much less work it does:

julia> @code_llvm floor(Int, 1f0)
; Function Signature: floor(Type{Int64}, Float32)
;  @ rounding.jl:475 within `floor`
define i64 @julia_floor_794(float %"x::Float32") #0 {
top:
  %jlcallframe1 = alloca [3 x ptr], align 8
  %gcframe2 = alloca [4 x ptr], align 16
  call void @llvm.memset.p0.i64(ptr align 16 %gcframe2, i8 0, i64 32, i1 true)
  %thread_ptr = call ptr asm "movq %fs:0, $0", "=r"() #9
  %tls_ppgcstack = getelementptr i8, ptr %thread_ptr, i64 -8
  %tls_pgcstack = load ptr, ptr %tls_ppgcstack, align 8
  store i64 8, ptr %gcframe2, align 16
  %frame.prev = getelementptr inbounds ptr, ptr %gcframe2, i64 1
  %task.gcstack = load ptr, ptr %tls_pgcstack, align 8
  store ptr %task.gcstack, ptr %frame.prev, align 8
  store ptr %gcframe2, ptr %tls_pgcstack, align 8
; ┌ @ rounding.jl:479 within `round` @ float.jl:463
   %0 = call float @llvm.floor.f32(float %"x::Float32")
; │ @ rounding.jl:479 within `round`
; │┌ @ rounding.jl:480 within `_round_convert`
; ││┌ @ number.jl:7 within `convert`
; │││┌ @ float.jl:991 within `Int64`
; ││││┌ @ float.jl:619 within `<=`
       %1 = fcmp ult float %0, 0xC3E0000000000000
; ││││└
      %2 = fcmp uge float %0, 0x43E0000000000000
      %narrow.not = or i1 %1, %2
      %3 = fsub float %0, %0
      %4 = fcmp une float %3, 0.000000e+00
      %or.cond = or i1 %narrow.not, %4
      br i1 %or.cond, label %L17, label %L15

L15:                                              ; preds = %top
; ││││ @ float.jl:992 within `Int64`
; ││││┌ @ float.jl:416 within `unsafe_trunc`
       %5 = fptosi float %0 to i64
       %6 = freeze i64 %5
       %frame.prev9 = load ptr, ptr %frame.prev, align 8
       store ptr %frame.prev9, ptr %tls_pgcstack, align 8
; ││││└
      ret i64 %6

L17:                                              ; preds = %top
; ││││ @ float.jl:994 within `Int64`
      %7 = load ptr, ptr getelementptr (i8, ptr @jl_small_typeof, i64 256), align 8
      %gc_slot_addr_1 = getelementptr inbounds ptr, ptr %gcframe2, i64 3
      store ptr %7, ptr %gc_slot_addr_1, align 8
      %box_Float32 = call ptr @ijl_box_float32(float %0)
      %gc_slot_addr_0 = getelementptr inbounds ptr, ptr %gcframe2, i64 2
      store ptr %box_Float32, ptr %gc_slot_addr_0, align 16
      store ptr @"jl_sym#Int64#807.jit", ptr %jlcallframe1, align 8
      %8 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 1
      store ptr %7, ptr %8, align 8
      %9 = getelementptr inbounds ptr, ptr %jlcallframe1, i64 2
      store ptr %box_Float32, ptr %9, align 8
      %10 = call nonnull ptr @j1_InexactError_805(ptr nonnull @"+Core.InexactError#806.jit", ptr nonnull %jlcallframe1, i32 3)
      call void @ijl_throw(ptr nonnull %10)
      unreachable
; └└└└
}
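
The guard in the IR above (the two range comparisons plus the `x - x != 0` test that catches NaN and infinities) can be restated in Julia to see which inputs reach the throwing branch. This is only an illustrative sketch for the `Int64` target shown in the listing; `takes_error_branch` is a hypothetical name, not a Base function:

```julia
# Hypothetical helper mirroring the branch condition in the IR above
# for floor(Int64, x::Float32). Not part of Base.
function takes_error_branch(x::Float32)
    y  = floor(x)
    lo = Float32(typemin(Int64))   # -2^63, exactly representable as Float32
    hi = -lo                       #  2^63
    # Out of range (or NaN, for which every comparison is false) ...
    out_of_range = !(lo <= y < hi)
    # ... or non-finite: y - y is NaN for Inf and NaN, 0 otherwise.
    non_finite = (y - y) != 0
    return out_of_range || non_finite
end

@assert !takes_error_branch(1.2f0)   # ordinary value: fast path
@assert takes_error_branch(NaN32)    # NaN: error path
@assert takes_error_branch(Inf32)    # infinity: error path
@assert takes_error_branch(1f20)     # too large for Int64: error path

# On the CPU the checked version throws for such inputs.
threw = try
    floor(Int, 1f20)
    false
catch err
    err isa InexactError
end
@assert threw
```

Even if the error path is never taken at run time, its presence in the compiled kernel is what pulls in the boxing and the hostcall described above.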

@pxl-th
Member

pxl-th commented Nov 27, 2024

And the reason it works on CUDA is that CUDA has a malloc intrinsic for that.

@Alexander-Barth
Author

Thanks a lot for this precious information :-)

This solves the issue for me!
(Just in case you are curious, I am porting some code to in-paint satellite images of the ocean: https://github.com/gher-uliege/DINCAE.jl)

@pxl-th
Member

pxl-th commented Nov 28, 2024

Sounds cool! Feel free to open new issues if you bump into them!
