Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

segfault in ff_atrac3p_ipqf causes infinite loop due to at3_standalone malloc wrapper not aligning + auto-vectorization believing the annotations #20155

Closed
4 of 5 tasks
aktau opened this issue Mar 24, 2025 · 19 comments · Fixed by #20162
Milestone

Comments

@aktau
Copy link

aktau commented Mar 24, 2025

Game or games this happens in

UCES-00006 - Medievil

What area of the game / PPSSPP

While starting up, after a few screens (after choosing language), it usually locks up. The situation is basically identical to #19994 (comment).

I cleared my save states to attempt to make sure I didn't end up in a bad code path. I compiled this commit:

$ gdb -p $(pidof retroarch)
0x00007fe5cdc96092 in ff_atrac3p_ipqf (dct_ctx=dct_ctx@entry=0x55858950a9b0, hist=hist@entry=0x558589442270, in=in@entry=0x5585895028b0, out=out@entry=0x5585895068b0) at /home/aktau/github/libretro-super/libretro-ppsspp/ext/at3_standalone/atrac3plusdsp.cpp:649
649                 hist->buf2[hist_pos][i] = idct_out[7 - i];

Just as in the comment I referenced, it loops forever due to hitting a segfault and hitting the segv_handler which doesn't know what to do with this.

Sadly, what I said in #19994 (comment) isn't right. It was fixed (or at least I couldn't reproduce it twice) for Wipeout Pure, but I can reproduce it in Medievil.

What should happen

Softlocking shouldn't happen.

Logs

I can upload whatever is desired. If there are special instructions for Linux/Retroarch. I can also look around in the debugger if that's helpful.

Platform

Linux / BSD

Mobile device model or graphics card (GPU)

Intel HD 630

PPSSPP version affected

ca1819f

Last working version

No response

Graphics backend (3D API)

Vulkan

Checklist

  • Test in the latest git build in case it's already fixed.
  • Search for other reports of the same issue.
  • Try resetting settings or older versions and include if the issue is related.
  • Try without any cheats and without loading any save states.
  • Include logs or screenshots of issue.
@aktau aktau changed the title segfault in ff_atrac3p_ipqf causes infinite oloop segfault in ff_atrac3p_ipqf causes infinite loop Mar 24, 2025
@hrydgard hrydgard added this to the v1.19.0 milestone Mar 24, 2025
@hrydgard
Copy link
Owner

hrydgard commented Mar 24, 2025

Really confused by what you're seeing and also what you logged in the other thread. Unexplained crashes in simd code are usually due to misalignment, and if it's some crazy simd code generated by -O3, even more likely. Does it happen when you don't compile with -O3?

vmovaps means vector move aligned packed single, so it has an alignment requirement, which might get violated somehow, in case the ffmpeg code this comes from is playing a little fast-and-loose with pointer alignment, confusing the optimizer.

@aktau
Copy link
Author

aktau commented Mar 25, 2025

I think you're right that it's due to the unaligned access. To be honest I have no idea why gcc thinks it can generate a movaps when writing to hist->buf2[hist_pos] at

const int hist_pos = hist->pos;
for (i = 0; i < 8; i++) {
hist->buf1[hist_pos][i] = idct_out[i + 8];
}
for (i = 0; i < 8; i++) {
hist->buf2[hist_pos][i] = idct_out[7 - i];
}

On one hand the code specifically declares hist->buf2 to be aligned to 32 bytes:

typedef struct Atrac3pIPQFChannelCtx {
DECLARE_ALIGNED(32, float, buf1)[ATRAC3P_PQF_FIR_LEN * 2][8];
DECLARE_ALIGNED(32, float, buf2)[ATRAC3P_PQF_FIR_LEN * 2][8];
int pos;
} Atrac3pIPQFChannelCtx;

On the other hand, hist_post itself doesn't look like it predictably is 32 byte aligned:

hist->pos = mod23_lut[hist->pos]; // hist->pos = (hist->pos - 1) % 23;

Side note: GCC actually uses unaligned loads/stores for most of the rest of the function.

On the third hand (as far as third hands are a thing). The actual hist_pos that is crashing is 0:

(gdb) p hist_pos
$29 = 0

So it is just accessing hist.buf2, which is supposed to be aligned according to the annotation found above. Address:

(gdb) p &hist->buf2
$27 = (float (*)[24][8]) 0x561d50ae39f0

However, this address is only 16-byte aligned, not 32-byte aligned:

0x561d50ae39f0 is aligned with 16  (0x561d50ae39f0 % 16 == 0)
0x561d50ae39f0 is NOT aligned with 32 (0x561d50ae39f0 % 32 == 0x10)

Is the DECLARE_ALIGNED(32, ...) not working?


There's further oddness. I'm trying to setup some automation so I can run it with different flags and see whether it crashes. I'm also trying to extract the assembly of this specific function:

$ objdump --disassemble=ff_atrac3p_ipqf ~/.config/retroarch/cores/ppsspp_libretro.so

Result: https://gist.github.com/aktau/1aed0940d264600f9e7fb01b03dc5606

I notice this is not what the debugger says I'm running. So I resort to manually passing rebased addresses from the debugger:

$ objdump -d --start-address=0x892f60 --stop-address=0x893177 ~/.config/retroarch/cores/ppsspp_libretro.so

Result: https://gist.github.com/aktau/30ddcdfc8fbacf1c17e787a761ba36f5

This is the one I was running. The "standalone" version doesn't look vectorized, but the latter/real one does. The C++ name mangling also seems odd to me. I thought this maybe had to do with being inlined into a C++ function, but inlining doesn't make sense if we have a prologue/epilogue.

EDIT: I think the first function I found is the ffmpeg copy. It's curious that it's not optimized at all. Perhaps that's the reaosn why it never crashed before. Are special optimization flags used for the ffmpeg version?

@hrydgard
Copy link
Owner

We build ffmpeg separately, with different flags yes.

Long term goal is to get rid of the ffmpeg dependency entirely, but that's a ways away.

What if you just remove that DECLARE_ALIGNED where it appears on structure members (while leaving it on globals)? I don't think it can work unless it somehow cooperates with the allocator, in case the struct gets heap allocated... Maybe the compiler can arrange for it to be true on the stack.

@aktau
Copy link
Author

aktau commented Mar 25, 2025

OK, I think this is just a bug bourne from cutting away too much code. Walking up the stack, checking alignment of super-structures:

(gdb) frame 1
#1  0x00007f25b889cc9d in reconstruct_frame (ctx=0x561d507841f0, ch_unit=0x561d50adf310, num_channels=2)
    at /home/aktau/github/libretro-super/libretro-ppsspp/ext/at3_standalone/atrac3plusdec.cpp:291
(gdb) p ch_unit
$30 = (Atrac3pChanUnitCtx *) 0x561d50adf310
# 0x561d50adf310 is NOT aligned with 32 (0x561d50adf310 % 32 == 0x10)
(gdb) up
#2  atrac3p_decode_frame (ctx=0x561d507841f0, out_data=0x7f256c002598, nb_samples=0x7ffcfec204bc, indata=<optimized out>, indata_size=376)
    at /home/aktau/github/libretro-super/libretro-ppsspp/ext/at3_standalone/atrac3plusdec.cpp:352
(gdb) p ctx->ch_units
$32 = (Atrac3pChanUnitCtx *) 0x561d50adf310
# 0x561d50adf310 is NOT aligned with 32 (0x561d50adf310 % 32 == 0x10)

ctx->ctx_units is allocated here

ctx->ch_units = (Atrac3pChanUnitCtx *)av_mallocz_array(ctx->num_channel_blocks, sizeof(*ctx->ch_units));

This is similar to the ffmpeg version. If we go down the stack however, the actual allocating function is different. The version in ext/at3_standalone/mem.cpp is just a small malloc wrapper.

void *av_malloc(size_t size)
{
void *ptr = NULL;
ptr = malloc(size);
if(!ptr && !size) {
size = 1;
ptr= av_malloc(1);
}
return ptr;
}

While the ffmpeg version is much longer and has special code for guaranteeing alignment.

@hrydgard
Copy link
Owner

hrydgard commented Mar 25, 2025

Whoops, that would do it, yes. Although I don't think that alignment here is much of a win - so can you confirm that removing it from struct members fixes the crash? In that case I'll just go ahead and do the same on my end.

@aktau aktau changed the title segfault in ff_atrac3p_ipqf causes infinite loop segfault in ff_atrac3p_ipqf causes infinite loop due to at3_standalone malloc wrapper not aligning + auto-vectorization believing the annotations Mar 25, 2025
@aktau
Copy link
Author

aktau commented Mar 25, 2025

Whoops, that would do it, yes. Although I don't think that alignment here is much of a win - so can you confirm that removing it from struct members fixes the crash? In that case I'll just go ahead and do the same on my end.

I don't know what fraction of cycles this decoding normally is, but the vectorized code looks a lot tighter than the non-vectorized version I posted earlier. Why not just fix the malloc? Can the project use C11? (meaning: aligned_alloc). For C++ it's apparently C++17.

Also I'm not sure if just removing alignment from the struct members would be enough. There is intrinsics code in certain places as well that may well rely on some alignment.

@hrydgard
Copy link
Owner

It's probably still gonna vectorize fine even without the alignment. And atrac decoding is usually not consuming more than 1% cpu anyway, on pc much less..

@anr2me
Copy link
Collaborator

anr2me commented Mar 26, 2025

According to this, you need to recompile the library with -mstackrealign

@aktau
Copy link
Author

aktau commented Mar 26, 2025

According to this, you need to recompile the library with -mstackrealign

If this structure (ch_units) were on the stack, there would be no problems as the struct fields are properly annotated, as I noted in my original post:

DECLARE_ALIGNED(32, float, buf1)[ATRAC3P_PQF_FIR_LEN * 2][8];
DECLARE_ALIGNED(32, float, buf2)[ATRAC3P_PQF_FIR_LEN * 2][8];

The problem is that it is allocated on the heap (with av_mallocz_array), where top-level alignment is guaranteed by only by the malloc implementation.

It may be possible to consider just stack-allocating this, but AFAIK the struct is fairly large, and I don't know whether PPSSPP can influence all its platforms to give sufficient stack.

I still think making av_malloc aligned is by far the easiest and most local solution. For one, mem.cpp defines a constant implying it is POSIX 2004 compatible: https://github.com/hrydgard/ppsspp/blob/16f9851bbcb2434f048bb752c52b5183d419938a/ext/at3_standalone/mem.cpp#L27C9-L27C22. This means posix_memalign() can be used, in principle:

void *xmalloc(size_t size) {
  const size_t align = 32; // Sufficient for AVX operations.
  void *p = NULL;
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
  if (posix_memalign(&p, align, size)) {
    p = NULL;
  }
#else
#error "no aligned alloc, please write an open-coded version"
#endif
  return p;
}

void xfree(void *p) {
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
  free(p);
#else
#error "no aligned alloc, please write an open-coded version"
#endif
}

If the code is meant to comply with either the C11 or the C++17 standards, _aligned_alloc (note: no m) is available and recommended.

At any rate, here's a decision (Christmas) tree of possible library functions to call out to, and even a fallback open-coded implementation that could just be used without any ifdef checks. The latter should work because av_malloc is always paired with av_free.

NOTE: I have not tested the correctness of the open-coded version. It's just a quick writeup of what I remember from when I wrote this once way back when. Surely there are many similar ones on the internet.

#include <assert.h>

void *xmalloc(size_t size) {
  const size_t align = 32; // Sufficent for AVX operations.
  static_assert(align & (align -1) == 0, "align must be a power of two");
  void *p = NULL;
#ifdef _MSC_VER
  p = _aligned_malloc(size, align); // MSVC builtin
#elif _ISOC11_SOURCE
  p = aligned_alloc(size, align); // In C11 and C++17, align must be PoT.
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
  if (posix_memalign(&p, align, size)) {
    p = NULL;
  }
#elif PPSSPP_ARCH(SSE2)
  p = _mm_malloc(size, align); // From xmmintrin.h
#else
  size_t offset = (align - 1) + sizeof(&p);
  p = malloc(size + offset);
  if (p != NULL) {
    ((void **)p)[-1] = p; // Store the original pointer as "metadata"
                          // before the returned pointer, to use in xfree.
    p = ((uintptr_t)p + offset) & ~(alignment - 1);
  }
#endif
  return p;
}

void xfree(void *p) {
#ifdef _MSC_VER
  _aligned_free(p); // MSVC builtin
#elif _ISOC11_SOURCE
  free(p);                        // Defined in C11 and C++17.
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
  free(p);
#elif PPSSPP_ARCH(SSE2)
  _mm_free(p);                 // From xmmintrin.h
#else
  if (p != NULL) {
    p = ((void **)p)[-1];
  }
  free(p);
#endif
}

@hrydgard
Copy link
Owner

Alright, I'll make the alloc function aligned. But I don't think that alignment here is yielding a lot of performance - x86 is not that sensitive to alignment these days, although it can help a little bit.

@aktau
Copy link
Author

aktau commented Mar 26, 2025

Alright, I'll make the alloc function aligned. But I don't think that alignment here is yielding a lot of performance - x86 is not that sensitive to alignment these days, although it can help a little bit.

Sure. But it's also the most localized change, and we don't necessarily know whether there are other parts of the code which rely on some type of alignment (offsetof + masking or what-have-you). It seems safer and smaller in terms of change scope to just make it aligned.

@hrydgard
Copy link
Owner

The main problem is the lack of a cross platform realloc_aligned. However, the code only uses av_realloc in one single place, which can quite safely be replaced with a regular realloc, so I'll do that.

Not sure how an aligned realloc is implemented on other platforms... You can't really implement realloc without an ability to query the size of an allocation (so you know how much data to copy to the new one).

@hrydgard
Copy link
Owner

Yuck, ffmpeg does a horrible hack in order to allow using av_free both on pointers from av_malloc (which are allocated with memalign) and av_realloc (which are not): https://www.ffmpeg.org/doxygen/0.6/mem_8c-source.html#l00139

I refuse :)

@hrydgard
Copy link
Owner

Try #20162 .

@aktau
Copy link
Author

aktau commented Mar 26, 2025

Not sure how an aligned realloc is implemented on other platforms... You can't really implement realloc without an ability to query the size of an allocation (so you know how much data to copy to the new one).

I didn't know realloc was used in that code. In that case I'd just use the open-coded version, and stuff another 8-byte in there as the size.

Side note: there is a bug in my proposal, where I did a negative index before incrementing the pointer received from malloc, leading to fun times. I also did not know PPSSPP already had a AllocateAlignedMemory, which makes things easier.

Regarding 6162158:

-vlc->table = (VLC_TYPE(*)[2])av_realloc_f(vlc->table, vlc->table_allocated, sizeof(VLC_TYPE) * 2);
+vlc->table = (VLC_TYPE(*)[2])realloc(vlc->table, vlc->table_allocated, sizeof(VLC_TYPE) * 2);

If vlc->table was allocated with AllocateAlignedMemory, wouldn't this also cause corruption? But also this is a 3-argument version of realloc, which I haven't seen before. I'll need to take a closer look later.

@hrydgard
Copy link
Owner

@aktau Turns out three argument doesn't exist, MSVC compiled it anyway though. Fixed.

But it's only aligned to 4 bytes, which all allocators are anyway, so it's safe to swap it with a regular realloc (just had to take care to only switch the corresponding free, too).

@aktau
Copy link
Author

aktau commented Mar 27, 2025

But it's only aligned to 4 bytes, which all allocators are anyway, so it's safe to swap it with a regular realloc (just had to take care to only switch the corresponding free, too).

AFAIK malloc/realloc has to (at least) align to the maximum alignment of some standard type, IIRC. Not vector types. For my env (Linux, x86-64), that appears to be 16-bytes alignment. Don't have a reference handy right now, but it's not something one would want to rely on I think.

EDIT: The correct answer is here: jemalloc/jemalloc#1533 (comment).

Thanks for fixing. I"ll test it later today.

@aktau
Copy link
Author

aktau commented Mar 28, 2025

Now I can really confirm that this was fixed. I was able to play to my hearts content. Thanks!

@hrydgard
Copy link
Owner

hrydgard commented Mar 28, 2025

Great, thanks for confirming!

Windows 64-bit aligns to 8 bytes by default (or maybe 16 now?), 32-bit too these days I think, well at least 8. I doubt anything relevant aligns to smaller than 4...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants