Extremely excessive memory usage in synthetic threading stress test #1002
Updated to 2.1.9: it still has bad memory usage, but now it also randomly segfaults, which never happened with 2.1.7.
Did you test your code with ASAN and TSAN under Linux with mimalloc disabled?
Yes. ASAN has some false-positive leak complaints (a constant amount of leaked memory when each thread exits). No complaints from TSAN or UBSAN. Here's a version that still exhibits the issue but that ASAN doesn't report false positives on. Disclaimer: I had to run it with …

```cpp
#include <thread>
#include <vector>
#include <atomic>
#include <mutex>
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
using namespace std;
// mimalloc
#include <mimalloc.h>
#define malloc mi_malloc
#define free mi_free
// glibc
//__attribute__((optnone)) void * _malloc(size_t n) { return malloc(n); }
//#define malloc _malloc
std::mutex global_tofree_list_mtx;
std::vector<void *> global_tofree_list;
std::atomic_int mustexit;
// Dedicated thread: drains the shared to-free list and performs the actual frees.
void freeloop()
{
//int i = 0;
while (1)
{
//printf("%d\n", i++);
global_tofree_list_mtx.lock();
/*
if (global_tofree_list.size())
printf("%zd\n", global_tofree_list.size());
*/
for (auto & p : global_tofree_list)
free(p);
global_tofree_list.clear();
if (mustexit)
{
global_tofree_list_mtx.unlock();
return;
}
/*
while (global_tofree_list.size())
{
free(global_tofree_list.back());
global_tofree_list.pop_back();
}
*/
global_tofree_list_mtx.unlock();
//_walloc_flush_freelists();
}
}
std::atomic_int tc = 0; // hands each worker thread a unique index into ptrs
int * ptrs[512][8];
// Worker thread: allocates and frees power-of-two-sized blocks in both FILO
// and FIFO order, verifying contents and batching frees into the shared list.
void looper()
{
std::vector<void *> tofree_list;
auto do_free = [&](void * p)
{
tofree_list.push_back(p);
if (tofree_list.size() > 100)
{
global_tofree_list_mtx.lock();
for (auto & p : tofree_list)
global_tofree_list.push_back(p);
global_tofree_list_mtx.unlock();
tofree_list.clear();
}
};
int unique = tc.fetch_add(1);
for (int i = 0; i < 1000000; ++i)
{
size_t s = 1ULL << (i%20);
for (int j = 0; j < 8; j++)
{
ptrs[unique][j] = (int *)(malloc(s*sizeof(int)));
*ptrs[unique][j] = j+unique*1523;
}
for (int j = 8; j > 0; j--)
{
if (*ptrs[unique][j-1] != j-1+unique*1523)
assert(((void)"memory corruption! (FILO)", 0));
do_free(ptrs[unique][j-1]);
}
for (int j = 0; j < 8; j++)
{
ptrs[unique][j] = (int *)(malloc(s*sizeof(int)));
*ptrs[unique][j] = j+unique*1523;
}
for (int j = 0; j < 8; j++)
{
if (*ptrs[unique][j] != j+unique*1523)
assert(((void)"memory corruption! (FIFO)", 0));
do_free(ptrs[unique][j]);
}
}
global_tofree_list_mtx.lock();
for (auto & p : tofree_list)
global_tofree_list.push_back(p);
global_tofree_list_mtx.unlock();
tofree_list.clear();
//puts("!!!!!!!!!!!!!!! thread finished !!!!!!!!!!!!");
//printf("!!!! thread %zd finished !!!!\n", _thread_info->alt_id);
fflush(stdout);
}
int main()
{
int threadcount = 32;
vector<thread> threads;
for (int i = 0; i < threadcount; ++i)
threads.emplace_back(looper);
std::thread freeloop_thread(freeloop);
for (auto & thread : threads)
{
thread.join();
}
mustexit.store(1);
freeloop_thread.join();
puts("Done!");
return 0;
}
```
@wareya this is an interesting benchmark. (I am one of the maintainers of mimalloc-bench along with @daanx and @jvoisin.) We would certainly consider adding it to mimalloc-bench. With regard to memory usage: I modified the benchmark a little, to

```cpp
void freeloop()
{
size_t max_list_bytes = 0;
while (1)
{
std::lock_guard<std::mutex> guard{global_tofree_list_mtx};
size_t list_bytes = 0;
for (auto & p : global_tofree_list)
{
list_bytes += malloc_usable_size(p);
free(p);
}
global_tofree_list.clear();
if (list_bytes > max_list_bytes)
{
printf("%zd\n", list_bytes);
max_list_bytes = list_bytes;
}
if (mustexit)
return;
}
}
```

This records the high-water mark of the memory in the free list, and prints whenever that high-water mark increases. The benchmark is very sensitive to the relative speed of allocation and free, and can lead to an interesting growth in the size of the free list.
If I run this against the …
I believe this is due to the OS scheduling quantum, and to many threads being blocked while the freeing occurs, since they cannot simply skip adding to the list. The freeing thread exhausts its quantum and gets descheduled, the other threads then build up a longer free list, and the free list grows over time. This leads to some interesting behaviours, but they are not really due to the allocator's performance. This kind of producer-consumer benchmark typically needs some form of backpressure to ensure that memory usage remains bounded. Overall, this works much better as a stress test than as a benchmark whose reported numbers could be used for spotting performance regressions; perf regressions are the main use I have for mimalloc-bench. I would be keen to add a modified version to snmalloc as a stress test, assuming that is okay with you. It found a bug that I will be fixing today, by stressing a case that normally only gets minimally exercised.
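As a rough sketch of what such backpressure might look like in this benchmark (the cap and the fallback policy below are assumptions for illustration, not anything proposed in the thread), the `do_free` lambda inside `looper()` could be replaced with a variant that makes producers free their own batch whenever the shared list is already too long:

```cpp
// Drop-in variant of do_free (uses the same tofree_list, global_tofree_list,
// and global_tofree_list_mtx as the benchmark above). kGlobalListCap is an
// assumed threshold; tune as needed.
constexpr size_t kGlobalListCap = 10000;

auto do_free_with_backpressure = [&](void * p)
{
    tofree_list.push_back(p);
    if (tofree_list.size() <= 100)
        return;
    bool consumer_behind = false;
    {
        std::lock_guard<std::mutex> guard{global_tofree_list_mtx};
        consumer_behind = global_tofree_list.size() > kGlobalListCap;
        if (!consumer_behind)
            for (auto & q : tofree_list)
                global_tofree_list.push_back(q);
    }
    if (consumer_behind)
    {
        // Backpressure: the freeing thread is behind, so this producer pays
        // the freeing cost itself instead of growing the shared list further.
        for (auto & q : tofree_list)
            free(q);
    }
    tofree_list.clear();
};
```

With a cap like this the shared list can no longer grow without bound while the freeing thread is descheduled; the cost shows up as slower producers instead.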
@SchrodingerZhu I think it is committed bytes. I can't remember the details around shared pages, or pages that are committed but not wired (i.e. in the page file); that is why I normally use vmmap on Windows to understand it better.
It's committed, not reserved. My toy allocator reserves 16 TiB and it doesn't show up as that much in Task Manager; the value only grows as I commit new pages.
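For illustration of the reserve/commit distinction being discussed (a minimal standalone sketch, not code from the thread): on Windows an allocator can reserve a huge address range up front and only commit pages as they are actually needed, and only the committed portion shows up as commit charge.

```cpp
#include <windows.h>
#include <stdio.h>
#include <string.h>

int main()
{
    // Reserve 16 TiB of address space: costs almost nothing and does not
    // appear as committed memory in Task Manager.
    size_t reserve_size = 16ull * 1024 * 1024 * 1024 * 1024;
    void * base = VirtualAlloc(nullptr, reserve_size, MEM_RESERVE, PAGE_NOACCESS);
    if (!base) { printf("reserve failed\n"); return 1; }

    // Commit a single 1 MiB chunk out of the reservation: only now does the
    // commit charge (and potential page-file backing) grow.
    void * chunk = VirtualAlloc(base, 1 << 20, MEM_COMMIT, PAGE_READWRITE);
    if (!chunk) { printf("commit failed\n"); return 1; }

    // Touching the pages is what actually brings them into the working set.
    memset(chunk, 0, 1 << 20);

    VirtualFree(base, 0, MEM_RELEASE);
    return 0;
}
```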
That's fine by me!
The part that gets descheduled is outside of my mutex's locking period rather than inside it, right? If it happened inside of it, then that descheduling would function as backpressure... I think. I guess the OS might be looking at the freelist thread and going "Hmm... Eh. I don't like that one." and radically deprioritizing it once it unlocks its mutex, and because the mutex isn't "fair", the producer threads win out over it often enough to cause problems. My toy allocator has thread-local freelist caches and a shared mutex-protected global freelist, and when the caches grow too big it dumps them in their entirety into the corresponding global freelist. (Threads also bulk-pull from the global list when allocating, so they don't hit a mutex every time they need memory.) The fact that it dumps them in their entirety is, I think, why my toy allocator has no issues with this stress test.
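A minimal sketch of that kind of thread-local cache with wholesale dumping and bulk pulling (names, thresholds, and structure are assumptions for illustration, not the toy allocator's actual code):

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

// Shared, mutex-protected freelist for one size class.
struct SizeClassFreelist
{
    std::mutex mtx;
    std::vector<void *> blocks;
};

// Per-thread cache: no lock needed on the fast path.
struct ThreadCache
{
    std::vector<void *> blocks;

    void deallocate(void * p, SizeClassFreelist & global)
    {
        blocks.push_back(p);
        if (blocks.size() >= 256) // assumed threshold
        {
            // Dump the whole cache into the global list in one locked step.
            std::lock_guard<std::mutex> guard{global.mtx};
            global.blocks.insert(global.blocks.end(), blocks.begin(), blocks.end());
            blocks.clear();
        }
    }

    void * allocate(SizeClassFreelist & global)
    {
        if (blocks.empty())
        {
            // Bulk-pull a batch when the cache runs dry, so the mutex is
            // touched once per batch rather than once per allocation.
            std::lock_guard<std::mutex> guard{global.mtx};
            size_t take = std::min<size_t>(global.blocks.size(), 64);
            blocks.assign(global.blocks.end() - take, global.blocks.end());
            global.blocks.resize(global.blocks.size() - take);
        }
        if (blocks.empty())
            return nullptr; // caller would fall back to the OS / a fresh chunk
        void * p = blocks.back();
        blocks.pop_back();
        return p;
    }
};
```

The point of the batching is that the global mutex is taken once per batch in each direction, rather than once per individual allocation or free.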
Yep, with a fair mutex instead of std::mutex I get a similar peak memory usage of ~50 MB.
@wareya with your allocator, what size do the free lists reach? I think I was seeing about ~60,000 elements. If you create an in-band free list through the deallocated memory, then that is 60,000 cache misses inside the lock. From your description, my guess would be that both snmalloc and mimalloc allocate faster than your allocator, as neither takes a lock when allocating except, effectively, to get more memory from the OS. However, that then swamps the freeing thread, which gets massively behind and then sends giant batches back, which interacts badly with the scheduler and causes it to be deprioritised, and the batches get larger and larger.
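For context, an in-band free list stores the "next" pointer inside the freed block itself, so every push or pop dereferences the freed memory; a minimal sketch of the idea (illustrative only, not either allocator's code):

```cpp
// Intrusive (in-band) free list: each freed block's first word holds the
// pointer to the next free block. Every push/pop touches the block itself,
// which is one likely cache miss per node when the blocks are cold.
struct FreeNode { FreeNode * next; };

struct InBandFreeList
{
    FreeNode * head = nullptr;

    void push(void * block)
    {
        FreeNode * node = static_cast<FreeNode *>(block);
        node->next = head;   // writes into the freed memory itself
        head = node;
    }

    void * pop()
    {
        if (!head) return nullptr;
        FreeNode * node = head;
        head = node->next;   // reads from the freed memory itself
        return node;
    }
};
```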
The typical max freelist length I see with my toy allocator is ~8k allocations, ~1.5 GB. It occasionally spikes higher than that. I think that points to Task Manager underreporting how much is being committed on average, but I'd expect the average total commit size to be less than that, maybe several hundred megabytes. My allocator just being slower, and that keeping the freelist thread from getting swamped, is very believable! My times are meaningfully slower than mimalloc's, after all.
The underreporting is probably because you are not touching most of the allocation. It is committed but without physical RAM backing. Windows initially allocates in the page file, so large, barely-used allocations don't take physical RAM.
I would be interested to know which fair mutex you use.
I wrapped std::mutex in a condition variable ticketing system; there are examples on Google etc. |
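A minimal sketch of that kind of std::mutex plus condition-variable ticketing wrapper (the general shape only; the exact implementation used isn't shown in the thread):

```cpp
#include <condition_variable>
#include <mutex>

// FIFO ("fair") mutex: threads take a ticket on lock() and are admitted
// strictly in ticket order, so a thread can't be starved by others
// repeatedly winning an unfair lock.
class TicketMutex
{
    std::mutex mtx;
    std::condition_variable cv;
    unsigned long next_ticket = 0;   // next ticket to hand out
    unsigned long now_serving = 0;   // ticket currently allowed to proceed

public:
    void lock()
    {
        std::unique_lock<std::mutex> guard{mtx};
        unsigned long my_ticket = next_ticket++;
        cv.wait(guard, [&]{ return my_ticket == now_serving; });
    }

    void unlock()
    {
        std::lock_guard<std::mutex> guard{mtx};
        ++now_serving;
        cv.notify_all();
    }
};
```

Swapping something like this in for global_tofree_list_mtx hands the lock out in FIFO order, so the freeing thread gets its turn rather than being repeatedly beaten to the lock by the producer threads, which matches the ~50 MB peak reported above.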
tldr: I have a stress test where mimalloc consumes ~30x as much working set memory as a toy allocator I wrote.
I'm using mimalloc 2.1.7-1 (?) as packaged by msys2.

mimalloc:
[Task Manager screenshot: mimalloc memory usage]

toy allocator:
[Task Manager screenshot: toy allocator memory usage]
Background: I'm designing my own allocator and was stress-testing how it handles extremely GC-like workloads, and ended up with a stress test with roughly the following design (see the code quoted above):
- 32 worker threads repeatedly allocate and free power-of-two-sized blocks, in both FILO and FIFO order, verifying their contents.
- Each worker batches its frees into a thread-local list and periodically dumps that batch into a global, mutex-protected to-free list.
- A single dedicated thread drains the global list and performs the actual frees.
Running this test, my toy allocator finishes in ~30 seconds and takes up ~70 MB of RAM. mimalloc finishes in ~5 seconds and takes up ~6 GB of RAM (peak value). If I enable Address Sanitizer, the numbers change to 30 s / 200 MB and 18 s / 2 GB. I threw this same test at my system's malloc implementation (I'm on Windows 10) and it takes so long to run that I'm treating it as "doesn't terminate" (it took over 50 minutes and still hadn't finished). The system allocator's working set usage was roughly the same as my toy allocator's.
NOTE: I didn't bother deleting commented-out test/debug code before uploading. I'm still busy debugging issues with my allocator and GC and can't dedicate too much time to making this report clean.
If this is a false positive somehow (e.g. something is wrong with my environment), the stress test might still be useful for https://github.com/daanx/mimalloc-bench since it breaks my system's allocator too.
The attached zip file, mallocstress.zip, contains a .cpp file with my stress test.