Fix saxpy test for gfx1151 #55

Open

statusfailed wants to merge 6 commits into Zaneham:master from statusfailed:feat/gfx1151-loading

Conversation

statusfailed commented Mar 5, 2026

This patch was written by codex, so I'm not necessarily expecting a merge. But hopefully it helps!
Here's a summary (written by me) of the issues and fixes.

(XREF: Issue #39)

Problem 1: hsaco loading fails

First problem: unable to load the hsaco file:

BarraCUDA SAXPY launcher
  .hsaco:  test.hsaco
  N:       1024
  a:       2.5
bc_runtime: GPU ready, queue size 131072
bc_runtime: executable_load failed (0x1010)
Kernel load failed (-3)

The fix is in eb04b1b7ecebe5e53794090bf3ad9e92d8eb1ab3: it adds a couple of ELF program headers and changes the ELF header to include:

    ehdr.e_ident[8] = 1;    /* ELFABIVERSION_AMDGPU_HSA_V3 */
    ehdr.e_type = 3;        /* ET_DYN (AMDGPU code object) */

Warning: it looks like the e_ident change might break other platforms. I'm not sure if this can be merged as-is.
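For reference, here is a minimal sketch of the header changes described above, using the standard `<elf.h>` definitions. Only the field values come from this PR; the helper names and the idea of patching an in-memory header are hypothetical, not the actual diff.

```c
#include <elf.h>

/* Sketch: mark an in-memory .hsaco ELF header the way the commit
 * describes. ELFABIVERSION_AMDGPU_HSA_V3 == 1; e_ident[8] is
 * EI_ABIVERSION. AMDGPU code objects are ET_DYN (3). */
static void patch_hsaco_ehdr(Elf64_Ehdr *ehdr)
{
    ehdr->e_ident[EI_ABIVERSION] = 1; /* ELFABIVERSION_AMDGPU_HSA_V3 */
    ehdr->e_type = ET_DYN;            /* shared-object-style code object */
}

/* Give the file a valid program-header table (the commit adds PT_LOAD
 * for .text and PT_NOTE for the metadata note). The offset passed in
 * is a placeholder, not the real .hsaco layout. */
static void add_phdr_table(Elf64_Ehdr *ehdr, Elf64_Half phnum, Elf64_Off phoff)
{
    ehdr->e_phoff     = phoff;
    ehdr->e_phentsize = sizeof(Elf64_Phdr);
    ehdr->e_phnum     = phnum;
}
```

The values matter because the ROCm loader rejects code objects whose ELF identification doesn't match an AMDGPU HSA code object, which is consistent with the 0x1010 executable_load failure above.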

Problem 2: memory aperture violation

With the previous fix applied, we hit a new error:

BarraCUDA SAXPY launcher
  .hsaco:  test.hsaco
  N:       1024
  a:       2.5
bc_runtime: GPU ready, queue size 131072
bc_runtime: loaded 'saxpy' (kernarg=32, lds=0, scratch=0)
  dispatch: 4 blocks x 256 threads
Queue at 0x7ffff7b96000 inactivated due to async error:
        HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address.
./run.sh: line 15: 678661 Aborted                    (core dumped) nix develop .#default --command ./launch_saxpy test.hsaco

The fix for this is in 2119a9bc9d2981b12a1b2e76ac4af8783a7f06b0.
tl;dr: it sets some additional flag bits in the amd_kernel_descriptor_t.

Problem 3: test fails

Last problem: the saxpy test fails with mismatches:

BarraCUDA SAXPY launcher
  .hsaco:  test.hsaco
  N:       1024
  a:       2.5
bc_runtime: GPU ready, queue size 131072
bc_runtime: loaded 'saxpy' (kernarg=32, lds=0, scratch=0)
  dispatch: 4 blocks x 256 threads
  MISMATCH [1]: got 1023.0000, expected 1025.5000
  MISMATCH [2]: got 1022.0000, expected 1027.0000
  MISMATCH [3]: got 1021.0000, expected 1028.5000
  MISMATCH [4]: got 1020.0000, expected 1030.0000
  MISMATCH [5]: got 1019.0000, expected 1031.5000
  FAIL: 1023/1024 mismatches

This is fixed by 32857b901393809ffa15fcbc136ba69a9b83b81d.
tl;dr: it adds some padding to the args struct in launch_saxpy.c to (apparently) obey the AMDHSA kernarg ABI.

Result

With all these patches applied, I now see a working result:

BarraCUDA SAXPY launcher
  .hsaco:  test.hsaco
  N:       1024
  a:       2.5
bc_runtime: GPU ready, queue size 131072
bc_runtime: loaded 'saxpy' (kernarg=32, lds=0, scratch=0)
  dispatch: 4 blocks x 256 threads
  PASS: all 1024 elements correct

Zaneham (Owner) commented Mar 5, 2026

Heyya @statusfailed, thanks for the PR! I don't mind that you used Codex to help diagnose and fix the issue. I'll be happy to review this shortly!

Zaneham (Owner) commented Mar 5, 2026

Hello @statusfailed, if you have some time could you rebase against the master branch?

- Added uint32_t _pad0 after float a.
- Added uint32_t _pad1 after uint32_t n.
- Initialized the new padding fields to 0.
- Kept existing logical arguments unchanged (y, x, a, n), but fixed their byte layout so:
    - n lands at byte offset 24.
    - total copied kernarg size is 32 bytes (matching runtime-reported kernarg=32).
- Set ENABLE_SGPR_DISPATCH_PTR (bit 1).
- Set ENABLE_SGPR_KERNARG_SEGMENT_PTR (bit 3).
- Set IS_PTR64 (bit 19).
- Set ENABLE_WAVEFRONT_SIZE32 (bit 10) for non-CDNA targets.
- Switched ELF type to `ET_DYN` for code objects.
- Added program headers (`PT_LOAD` for .text, `PT_NOTE` for metadata note).
- Set valid program-header table fields `(e_phoff, e_phentsize, e_phnum)`.
- Set AMDGPU HSA ABI version byte (`e_ident[EI_ABIVERSION] = 1`, HSA V3-compatible).
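The descriptor bits listed above can be sketched as follows, using the bit names and positions from LLVM's AMDHSAKernelDescriptor.h. This is a sketch, not the diff: the helper name is hypothetical, and the IS_PTR64 bit at position 19 is omitted here because it belongs to the older amd_kernel_code_t format rather than the V3 kernel descriptor's uint16_t kernel_code_properties field.

```c
#include <stdint.h>

/* Bit positions in the kernel descriptor's kernel_code_properties
 * (a uint16_t at byte 56 of the 64-byte descriptor), per LLVM's
 * AMDHSAKernelDescriptor.h. */
#define KCP_ENABLE_SGPR_DISPATCH_PTR         (1u << 1)
#define KCP_ENABLE_SGPR_KERNARG_SEGMENT_PTR  (1u << 3)
#define KCP_ENABLE_WAVEFRONT_SIZE32          (1u << 10) /* GFX10+/RDNA only */

/* Hypothetical helper: the properties value this PR's summary implies
 * for a gfx1151 (RDNA, wave32) kernel. */
static uint16_t gfx1151_kernel_code_properties(void)
{
    return KCP_ENABLE_SGPR_DISPATCH_PTR
         | KCP_ENABLE_SGPR_KERNARG_SEGMENT_PTR
         | KCP_ENABLE_WAVEFRONT_SIZE32;
}
```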
statusfailed (Author) commented

Something in master has broken the patch; I'm not sure what yet. I'm getting errors like this:

barracuda: device function calls not yet supported (BIR_CALL func=1)

I'm not exactly sure whether they're related.

- Add hidden dispatch kernarg fields (block_count_*, group_size_*) in launch_saxpy and populate them from launch dims.
- Make isel SGPR layout conditional: use dispatch + kernarg SGPR pairs when needs_dispatch=1, otherwise keep kernarg-only layout.
statusfailed force-pushed the feat/gfx1151-loading branch from 32857b9 to a1ce1bb on March 6, 2026 at 04:39
statusfailed (Author) commented

OK, I reworked it a bit and it works again. I get some warnings, though, and I had to modify launch_saxpy.c to include the block count / group size hidden dispatch args.

+ make
make: Nothing to be done for 'all'.
+ gcc -std=c99 -O2 -I src/runtime examples/launch_saxpy.c src/runtime/bc_runtime.c -ldl -lm -o launch_saxpy
+ ./barracuda --amdgpu-bin --gfx1151 -o test.hsaco tests/canonical.cu
barracuda: device function calls not yet supported (BIR_CALL func=1)
barracuda: device function calls not yet supported (BIR_CALL func=1)
barracuda: device function calls not yet supported (BIR_CALL func=2)
barracuda: device function calls not yet supported (BIR_CALL func=2)
wrote test.hsaco (6796 bytes code, 8 kernels)
+ nix develop path:.#default --command ./launch_saxpy test.hsaco
BarraCUDA SAXPY launcher
  .hsaco:  test.hsaco
  N:       1024
  a:       2.5
bc_runtime: GPU ready, queue size 131072
bc_runtime: loaded 'saxpy' (kernarg=56, lds=0, scratch=0)
  dispatch: 4 blocks x 256 threads
  PASS: all 1024 elements correct

Zaneham (Owner) commented Mar 7, 2026

Hey @statusfailed, cheers for getting saxpy going on GFX1151 and for sticking with the rebase, that's never fun. Had a proper look through and there are a few things I want to flag before we merge anything.

The TGID bit positions are the scary one. The PR puts CDNA on bits 11/12/13 and RDNA on 7/8/9, but the LLVM spec (AMDHSAKernelDescriptor.h) says WORKGROUP_ID_X/Y/Z are at bits 7/8/9 for everything from GFX6 to GFX12, with no CDNA special case. Bits 11:12 are actually VGPR_WORKITEM_ID, so we'd be stomping on that instead of enabling workgroup IDs. We have 8/8 tests passing on MI300X right now with 7/8/9, and I'd rather not find out what happens when we stop telling the hardware where the workgroups are.
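To make the overlap concrete, here is a sketch of the relevant COMPUTE_PGM_RSRC2 bit positions as defined in LLVM's AMDHSAKernelDescriptor.h (the helper names below are illustrative, not from either codebase):

```c
#include <stdint.h>

/* COMPUTE_PGM_RSRC2 bit positions per LLVM's AMDHSAKernelDescriptor.h:
 * workgroup-ID enables are at 7/8/9 for all targets, and bits 11:12
 * are the 2-bit ENABLE_VGPR_WORKITEM_ID field. */
#define RSRC2_ENABLE_SGPR_WORKGROUP_ID_X  (1u << 7)
#define RSRC2_ENABLE_SGPR_WORKGROUP_ID_Y  (1u << 8)
#define RSRC2_ENABLE_SGPR_WORKGROUP_ID_Z  (1u << 9)
#define RSRC2_VGPR_WORKITEM_ID_SHIFT      11
#define RSRC2_VGPR_WORKITEM_ID_MASK       (3u << RSRC2_VGPR_WORKITEM_ID_SHIFT)

/* A "CDNA" workgroup-ID mask at bits 11/12/13, as the PR has it... */
static uint32_t cdna_mask_from_pr(void) { return (1u<<11)|(1u<<12)|(1u<<13); }

/* ...overlaps the workitem-ID field instead of enabling anything. */
static uint32_t overlap_with_workitem_id(void)
{
    return cdna_mask_from_pr() & RSRC2_VGPR_WORKITEM_ID_MASK;
}
```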

The kernel_code_properties widening to uint32_t is also dodgy. AMD defines it as uint16_t, and the IS_PTR64 bit at position 19 doesn't exist in the AMDHSA kernel descriptor; it's from the older amd_kernel_code_t format that predates Code Object V3. We'd be scribbling into reserved bytes, which is the kind of thing that works until it very suddenly doesn't. The ENABLE_WAVEFRONT_SIZE32 at bit 10 is legit, though, and we should have that for RDNA.

On the ABI version going from 4 down to 1, that drops us from Code Object V6 to V3, but the msgpack metadata still says amdhsa.version [1, 2] which is V5. Admittedly master already has a mismatch here (V6 ABI with V5 metadata, my fault, not yours) and we get away with it because the ROCm runtime is pretty relaxed about the whole thing. But V3 with V5 hidden args is pushing our luck.

My gut feeling is the actual GFX1151 problem was probably the missing WAVEFRONT_SIZE32 bit combined with the ABI version mismatch on master confusing the runtime's hidden arg population. Might be worth trying just the wavefront fix with a cleaned-up ABI version before we rearchitect the SGPR layout around dispatch_ptr.

The flake.nix is great, happy to take that in a contributors repo if you want to split it out.

Zaneham (Owner) commented Mar 8, 2026

Hey @statusfailed, I've added better error messaging and handling, so whenever you hit a bug or an issue it'll produce a dump of the compiler state as it happened. That should help with the dev experience: if you recreate the bug, a dump should be produced. If you need help managing the conflicts in the emitter, please feel free to sing out.
