-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
segfault in ff_atrac3p_ipqf causes infinite loop due to at3_standalone malloc wrapper not aligning + auto-vectorization believing the annotations #20155
Comments
Really confused by what you're seeing and also what you logged in the other thread. Unexplained crashes in simd code are usually due to misalignment, and if it's some crazy simd code generated by -O3, even more likely. Does it happen when you don't compile with -O3?
|
I think you're right that it's due to the unaligned access. To be honest I have no idea why gcc thinks it can generate a ppsspp/ext/at3_standalone/atrac3plusdsp.cpp Lines 644 to 650 in 00fdace
On one hand the code specifically declares ppsspp/ext/at3_standalone/atrac3plus.h Lines 57 to 61 in 00fdace
On the other hand, ppsspp/ext/at3_standalone/atrac3plusdsp.cpp Line 712 in 00fdace
Side note: GCC actually uses unaligned loads/stores for most of the rest of the function. On the third hand (as far as third hands are a thing). The actual
So it is just accessing
However, this address is only 16-byte aligned, not 32-byte aligned:
Is the There's further oddness. I'm trying to setup some automation so I can run it with different flags and see whether it crashes. I'm also trying to extract the assembly of this specific function: $ objdump --disassemble=ff_atrac3p_ipqf ~/.config/retroarch/cores/ppsspp_libretro.so Result: https://gist.github.com/aktau/1aed0940d264600f9e7fb01b03dc5606 I notice this is not what the debugger says I'm running. So I resort to manually passing rebased addresses from the debugger:
Result: https://gist.github.com/aktau/30ddcdfc8fbacf1c17e787a761ba36f5 This is the one I was running. The "standalone" version doesn't look vectorized, but the latter/real one does. The C++ name mangling also seems odd to me. I thought this maybe had to do with being inlined into a C++ function, but inlining doesn't make sense if we have a prologue/epilogue. EDIT: I think the first function I found is the ffmpeg copy. It's curious that it's not optimized at all. Perhaps that's the reaosn why it never crashed before. Are special optimization flags used for the ffmpeg version? |
We build ffmpeg separately, with different flags yes. Long term goal is to get rid of the ffmpeg dependency entirely, but that's a ways away. What if you just remove that DECLARE_ALIGNED where it appears on structure members (while leaving it on globals)? I don't think it can work unless it somehow cooperates with the allocator, in case the struct gets heap allocated... Maybe the compiler can arrange for it to be true on the stack. |
OK, I think this is just a bug bourne from cutting away too much code. Walking up the stack, checking alignment of super-structures:
ppsspp/ext/at3_standalone/atrac3plusdec.cpp Line 155 in aa87fee
This is similar to the ffmpeg version. If we go down the stack however, the actual allocating function is different. The version in ext/at3_standalone/mem.cpp is just a small malloc wrapper. ppsspp/ext/at3_standalone/mem.cpp Lines 55 to 64 in 16f9851
While the ffmpeg version is much longer and has special code for guaranteeing alignment. |
Whoops, that would do it, yes. Although I don't think that alignment here is much of a win - so can you confirm that removing it from struct members fixes the crash? In that case I'll just go ahead and do the same on my end. |
I don't know what fraction of cycles this decoding normally is, but the vectorized code looks a lot tighter than the non-vectorized version I posted earlier. Why not just fix the malloc? Can the project use C11? (meaning: aligned_alloc). For C++ it's apparently C++17. Also I'm not sure if just removing alignment from the struct members would be enough. There is intrinsics code in certain places as well that may well rely on some alignment. |
It's probably still gonna vectorize fine even without the alignment. And atrac decoding is usually not consuming more than 1% cpu anyway, on pc much less.. |
According to this, you need to recompile the library with |
If this structure ( ppsspp/ext/at3_standalone/atrac3plus.h Lines 58 to 59 in 00fdace
The problem is that it is allocated on the heap (with It may be possible to consider just stack-allocating this, but AFAIK the struct is fairly large, and I don't know whether PPSSPP can influence all its platforms to give sufficient stack. I still think making void *xmalloc(size_t size) {
const size_t align = 32; // Sufficient for AVX operations.
void *p = NULL;
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
if (posix_memalign(&p, align, size)) {
p = NULL;
}
#else
#error "no aligned alloc, please write an open-coded version"
#endif
return p;
}
void xfree(void *p) {
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
free(p);
#else
#error "no aligned alloc, please write an open-coded version"
#endif
} If the code is meant to comply with either the C11 or the C++17 standards, At any rate, here's a decision (Christmas) tree of possible library functions to call out to, and even a fallback open-coded implementation that could just be used without any ifdef checks. The latter should work because NOTE: I have not tested the correctness of the open-coded version. It's just a quick writeup of what I remember from when I wrote this once way back when. Surely there are many similar ones on the internet. #include <assert.h>
void *xmalloc(size_t size) {
const size_t align = 32; // Sufficent for AVX operations.
static_assert(align & (align -1) == 0, "align must be a power of two");
void *p = NULL;
#ifdef _MSC_VER
p = _aligned_malloc(size, align); // MSVC builtin
#elif _ISOC11_SOURCE
p = aligned_alloc(size, align); // In C11 and C++17, align must be PoT.
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
if (posix_memalign(&p, align, size)) {
p = NULL;
}
#elif PPSSPP_ARCH(SSE2)
p = _mm_malloc(size, align); // From xmmintrin.h
#else
size_t offset = (align - 1) + sizeof(&p);
p = malloc(size + offset);
if (p != NULL) {
((void **)p)[-1] = p; // Store the original pointer as "metadata"
// before the returned pointer, to use in xfree.
p = ((uintptr_t)p + offset) & ~(alignment - 1);
}
#endif
return p;
}
void xfree(void *p) {
#ifdef _MSC_VER
_aligned_free(p); // MSVC builtin
#elif _ISOC11_SOURCE
free(p); // Defined in C11 and C++17.
#elif _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600
free(p);
#elif PPSSPP_ARCH(SSE2)
_mm_free(p); // From xmmintrin.h
#else
if (p != NULL) {
p = ((void **)p)[-1];
}
free(p);
#endif
} |
Alright, I'll make the alloc function aligned. But I don't think that alignment here is yielding a lot of performance - x86 is not that sensitive to alignment these days, although it can help a little bit. |
Sure. But it's also the most localized change, and we don't necessarily know whether there are other parts of the code which rely on some type of alignment (offsetof + masking or what-have-you). It seems safer and smaller in terms of change scope to just make it aligned. |
The main problem is the lack of a cross platform realloc_aligned. However, the code only uses av_realloc in one single place, which can quite safely be replaced with a regular realloc, so I'll do that. Not sure how an aligned realloc is implemented on other platforms... You can't really implement realloc without an ability to query the size of an allocation (so you know how much data to copy to the new one). |
Yuck, ffmpeg does a horrible hack in order to allow using av_free both on pointers from av_malloc (which are allocated with memalign) and av_realloc (which are not): https://www.ffmpeg.org/doxygen/0.6/mem_8c-source.html#l00139 I refuse :) |
Try #20162 . |
I didn't know realloc was used in that code. In that case I'd just use the open-coded version, and stuff another 8-byte in there as the size. Side note: there is a bug in my proposal, where I did a negative index before incrementing the pointer received from malloc, leading to fun times. I also did not know PPSSPP already had a Regarding 6162158: -vlc->table = (VLC_TYPE(*)[2])av_realloc_f(vlc->table, vlc->table_allocated, sizeof(VLC_TYPE) * 2);
+vlc->table = (VLC_TYPE(*)[2])realloc(vlc->table, vlc->table_allocated, sizeof(VLC_TYPE) * 2); If |
@aktau Turns out three argument doesn't exist, MSVC compiled it anyway though. Fixed. But it's only aligned to 4 bytes, which all allocators are anyway, so it's safe to swap it with a regular realloc (just had to take care to only switch the corresponding free, too). |
AFAIK malloc/realloc has to (at least) align to the maximum alignment of some standard type, IIRC. Not vector types. For my env (Linux, x86-64), that appears to be 16-bytes alignment. Don't have a reference handy right now, but it's not something one would want to rely on I think. EDIT: The correct answer is here: jemalloc/jemalloc#1533 (comment). Thanks for fixing. I"ll test it later today. |
Now I can really confirm that this was fixed. I was able to play to my hearts content. Thanks! |
Great, thanks for confirming! Windows 64-bit aligns to 8 bytes by default (or maybe 16 now?), 32-bit too these days I think, well at least 8. I doubt anything relevant aligns to smaller than 4... |
Game or games this happens in
UCES-00006 - Medievil
What area of the game / PPSSPP
While starting up, after a few screens (after choosing language), it usually locks up. The situation is basically identical to #19994 (comment).
I cleared my save states to attempt to make sure I didn't end up in a bad code path. I compiled this commit:
Just as in the comment I referenced, it loops forever due to hitting a segfault and hitting the segv_handler which doesn't know what to do with this.
Sadly, what I said in #19994 (comment) isn't right. It was fixed (or at least I couldn't reproduce it twice) for Wipeout Pure, but I can reproduce it in Medievil.
What should happen
Softlocking shouldn't happen.
Logs
I can upload whatever is desired. If there are special instructions for Linux/Retroarch. I can also look around in the debugger if that's helpful.
Platform
Linux / BSD
Mobile device model or graphics card (GPU)
Intel HD 630
PPSSPP version affected
ca1819f
Last working version
No response
Graphics backend (3D API)
Vulkan
Checklist
The text was updated successfully, but these errors were encountered: