Extremely excessive memory usage in synthetic threading stress test #1002

Open
wareya opened this issue Feb 3, 2025 · 13 comments

wareya commented Feb 3, 2025

tldr: I have a stress test where mimalloc consumes ~30x as much working set memory as a toy allocator I wrote.

I'm using mimalloc 2.1.7-1 (?) as packaged by msys2.

mimalloc: [image]

toy allocator: [image]

Background: I'm designing my own allocator and was stress-testing how it handles extremely GC-like workloads and ended up with a stress test with the following design:

  • 32 work threads, allocating various sizes of allocations, writing to them, checking the value they just wrote, then punting them to a free-loop to be freed
  • The single free-loop thread frees the allocations

Running this test, my toy allocator finishes in ~30 seconds and takes up ~70MB of RAM. Mimalloc finishes in ~5 seconds and takes up ~6GB of RAM (peak value). If I enable address sanitizer, the numbers change to 30s/200MB and 18s/2GB. I threw this same test at my system's malloc implementation (I'm on Windows 10) and it takes so long to run that I'm treating it as "doesn't terminate" (it took over 50 minutes and still hadn't finished). The system allocator's working set usage was roughly the same as my toy allocator's.

NOTE: I didn't bother deleting commented-out test/debug code before uploading. I'm still busy debugging issues with my allocator and GC and can't dedicate too much time to making this report clean.

If this is a false positive somehow (e.g. something is wrong with my environment), the stress test might still be useful for https://github.com/daanx/mimalloc-bench since it breaks my system's allocator too.

The attached zip file contains a .cpp file with my stress test.

mallocstress.zip

cpp file contents follow:

#include <thread>
#include <vector>
#include <atomic>
#include <mutex>
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
using namespace std;

//// custom malloc
////#define WALLOC_SYS_MALLOC
////#define WALLOC_GLOBAL_FREELIST
//#define WALLOC_PULL_OVERRIDE (256*128)
//#define WALLOC_FLUSH_OVERRIDE (4096*64)
//#define WALLOC_FLUSH_KEEP_OVERRIDE 0
//#define WALLOC_MAXIMUM_FAST
////#define WALLOC_CACHEHINT 64
//#include "wmalloc.hpp"
//#define malloc _walloc_raw_malloc
//#define calloc _walloc_raw_calloc
//#define realloc _walloc_raw_realloc
//#define free _walloc_raw_free

// mimalloc
#include <mimalloc.h>
#define malloc mi_malloc
#define free mi_free

// glibc
//__attribute__((optnone)) void * _malloc(size_t n) { return malloc(n); }
//#define malloc _malloc

std::mutex global_tofree_list_mtx;
std::vector<void *> global_tofree_list;

std::atomic_int mustexit;
void freeloop()
{
    //int i = 0;
    while (!mustexit)
    {
        //printf("%d\n", i++);
        global_tofree_list_mtx.lock();
        /*
        if (global_tofree_list.size())
            printf("%zd\n", global_tofree_list.size());
        */
        for (auto & p : global_tofree_list)
            free(p);
        global_tofree_list.clear();
        /*
        while (global_tofree_list.size())
        {
            free(global_tofree_list.back());
            global_tofree_list.pop_back();
        }
        */
        global_tofree_list_mtx.unlock();
        //_walloc_flush_freelists();
    }
}

std::atomic_int tc = 0;
int * ptrs[512][8];
void looper()
{
    std::vector<void *> tofree_list;
    auto do_free = [&](void * p)
    {
        tofree_list.push_back(p);
        if (tofree_list.size() > 100)
        {
            global_tofree_list_mtx.lock();
            for (auto & p : tofree_list)
                global_tofree_list.push_back(p);
            global_tofree_list_mtx.unlock();
            tofree_list.clear();
        }
    };
    int unique = tc.fetch_add(1);
    for (int i = 0; i < 1000000; ++i)
    {
        size_t s = 1ULL << (i%20);
        for (int j = 0; j < 8; j++)
        {
            ptrs[unique][j] = (int *)(malloc(s*sizeof(int)));
            *ptrs[unique][j] = j+unique*1523;
        }
        for (int j = 8; j > 0; j--)
        {
            if (*ptrs[unique][j-1] != j-1+unique*1523)
                assert(((void)"memory corruption! (FILO)", 0));
            do_free(ptrs[unique][j-1]);
        }
        
        for (int j = 0; j < 8; j++)
        {
            ptrs[unique][j] = (int *)(malloc(s*sizeof(int)));
            *ptrs[unique][j] = j+unique*1523;
        }
        for (int j = 0; j < 8; j++)
        {
            if (*ptrs[unique][j] != j+unique*1523)
                assert(((void)"memory corruption! (FIFO)", 0));
            do_free(ptrs[unique][j]);
        }
    }
    //puts("!!!!!!!!!!!!!!! thread finished !!!!!!!!!!!!");
    //printf("!!!! thread %zd finished !!!!\n", _thread_info->alt_id);
    fflush(stdout);
}

int main()
{
    int threadcount = 32;
    vector<thread> threads;
    
    for (int i = 0; i < threadcount; ++i)
        threads.emplace_back(looper);
    
    std::thread freeloop_thread(freeloop);
    
    for (auto & thread : threads)
    {
        thread.join();
    }
    
    mustexit.store(1);
    freeloop_thread.join();
    
    puts("Done!");
    
    return 0;
}

wareya commented Feb 3, 2025

Updated to 2.1.9; the memory usage is still bad, and now it also randomly segfaults, which never happened with 2.1.7.

@JochenBaier

Did you test your code with ASAN and TSAN under Linux with mimalloc disabled?


wareya commented Feb 3, 2025

Yes. ASAN has some false-positive leak complaints (a constant amount of leaked memory when each thread exits). No complaints from TSAN or UBSAN. Here's a version that still exhibits the issue but that ASAN doesn't report false positives on:

mallocstress.zip

Disclaimer: I had to run it with 10000 iterations instead of 1000000 because 1000000 iterations would have taken ~3 hours to run (10000 iterations with glibc malloc and ASAN took ~2 minutes). The mimalloc issues are still apparent at 10000 iterations (taking ~1.5GB instead of ~300MB like glibc malloc takes in this case). The attached file has it set to 1000000 iterations.

#include <thread>
#include <vector>
#include <atomic>
#include <mutex>
#include <stdio.h>
#include <assert.h>
#include <stdlib.h>
using namespace std;

// mimalloc
#include <mimalloc.h>
#define malloc mi_malloc
#define free mi_free

// glibc
//__attribute__((optnone)) void * _malloc(size_t n) { return malloc(n); }
//#define malloc _malloc

std::mutex global_tofree_list_mtx;
std::vector<void *> global_tofree_list;

std::atomic_int mustexit;
void freeloop()
{
    //int i = 0;
    while (1)
    {
        //printf("%d\n", i++);
        global_tofree_list_mtx.lock();
        /*
        if (global_tofree_list.size())
            printf("%zd\n", global_tofree_list.size());
        */
        for (auto & p : global_tofree_list)
            free(p);
        global_tofree_list.clear();
        
        if (mustexit)
        {
            global_tofree_list_mtx.unlock();
            return;
        }
        
        /*
        while (global_tofree_list.size())
        {
            free(global_tofree_list.back());
            global_tofree_list.pop_back();
        }
        */
        global_tofree_list_mtx.unlock();
        //_walloc_flush_freelists();
    }
}

std::atomic_int tc = 0;
int * ptrs[512][8];
void looper()
{
    std::vector<void *> tofree_list;
    auto do_free = [&](void * p)
    {
        tofree_list.push_back(p);
        if (tofree_list.size() > 100)
        {
            global_tofree_list_mtx.lock();
            for (auto & p : tofree_list)
                global_tofree_list.push_back(p);
            global_tofree_list_mtx.unlock();
            tofree_list.clear();
        }
    };
    int unique = tc.fetch_add(1);
    for (int i = 0; i < 1000000; ++i)
    {
        size_t s = 1ULL << (i%20);
        for (int j = 0; j < 8; j++)
        {
            ptrs[unique][j] = (int *)(malloc(s*sizeof(int)));
            *ptrs[unique][j] = j+unique*1523;
        }
        for (int j = 8; j > 0; j--)
        {
            if (*ptrs[unique][j-1] != j-1+unique*1523)
                assert(((void)"memory corruption! (FILO)", 0));
            do_free(ptrs[unique][j-1]);
        }
        
        for (int j = 0; j < 8; j++)
        {
            ptrs[unique][j] = (int *)(malloc(s*sizeof(int)));
            *ptrs[unique][j] = j+unique*1523;
        }
        for (int j = 0; j < 8; j++)
        {
            if (*ptrs[unique][j] != j+unique*1523)
                assert(((void)"memory corruption! (FIFO)", 0));
            do_free(ptrs[unique][j]);
        }
    }
    global_tofree_list_mtx.lock();
    for (auto & p : tofree_list)
        global_tofree_list.push_back(p);
    global_tofree_list_mtx.unlock();
    tofree_list.clear();
    //puts("!!!!!!!!!!!!!!! thread finished !!!!!!!!!!!!");
    //printf("!!!! thread %zd finished !!!!\n", _thread_info->alt_id);
    fflush(stdout);
}

int main()
{
    int threadcount = 32;
    vector<thread> threads;
    
    for (int i = 0; i < threadcount; ++i)
        threads.emplace_back(looper);
    
    std::thread freeloop_thread(freeloop);
    
    for (auto & thread : threads)
    {
        thread.join();
    }
    
    mustexit.store(1);
    freeloop_thread.join();
    
    puts("Done!");
    
    return 0;
}


mjp41 commented Feb 4, 2025

@wareya this is an interesting benchmark. (I am one of the maintainers of mimalloc-bench with @daanx and @jvoisin.) We would certainly consider adding it to mimalloc-bench.

With regard to memory usage, I modified the benchmark a little:

void freeloop()
{
    size_t max_list_bytes = 0;
    while (1)
    {
        std::lock_guard<std::mutex> guard{global_tofree_list_mtx};
        size_t list_bytes = 0;
        for (auto & p : global_tofree_list)
        {
            list_bytes += malloc_usable_size(p);
            free(p);
        }
        global_tofree_list.clear();

        if (list_bytes > max_list_bytes)
        {
            printf("%zd\n", list_bytes);
            max_list_bytes = list_bytes;
        }

        if (mustexit)
            return;
    }
}

This records the high water mark of the memory in the free list, and then prints when the high water mark is increased.

The benchmark is very sensitive to the relative speed of allocation and free, and can lead to an interesting growth in the size of the free list, global_tofree_list. This is a slightly modified version of your benchmark running against my local checkout of snmalloc:

170083456 bytes
216212864 bytes
264445696 bytes
302027888 bytes
330811712 bytes
413300032 bytes
566258816 bytes
620076976 bytes
661396992 bytes
701208656 bytes
849664576 bytes
1031691520 bytes
1193551120 bytes
1207726384 bytes
1267221264 bytes
1412386928 bytes
1476661392 bytes
1596250752 bytes
1766893744 bytes
1821219104 bytes
1846398080 bytes
2056237312 bytes
2148789120 bytes
2280470912 bytes
2317644240 bytes
2407210048 bytes
2451550336 bytes
2578845408 bytes
2711983712 bytes
2766111200 bytes
2809192768 bytes
2890089296 bytes
...

If I run this against the checked version of snmalloc, the memory usage is much lower, but so is the size of the list:

33769088 bytes
65236512 bytes
67282048 bytes
100925824 bytes
103523008 bytes
103832816 bytes
112299968 bytes
120014208 bytes
127924480 bytes
146619392 bytes
162553856 bytes
165692768 bytes
230757520 bytes
234348560 bytes
335723072 bytes
336340784 bytes
362428256 bytes
461058816 bytes
520277248 bytes
598759936 bytes
604541696 bytes
683307328 bytes
703600768 bytes
768678624 bytes
805330048 bytes

I believe this is due to the OS scheduling quantum: many threads are blocked while the freeing occurs, as they cannot simply skip adding to the list. The freeing thread exhausts its quantum and gets descheduled, the other threads then build up a longer free list, and so the free list keeps growing over time. This leads to some interesting behaviours, but they are not really due to the allocator's performance. This kind of producer-consumer benchmark typically needs some form of backpressure to keep the memory usage bounded.
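
As a minimal sketch of what such backpressure could look like (hypothetical names; this is not part of the benchmark or of any allocator discussed here), producers block on a condition variable once the shared list exceeds a cap, so the freeing thread can catch up:

// Sketch: bounded hand-off list with backpressure (illustrative only).
#include <condition_variable>
#include <mutex>
#include <vector>
#include <cstdlib>

std::mutex tofree_mtx;
std::condition_variable tofree_cv;
std::vector<void *> tofree_list;
constexpr size_t kMaxPending = 10000; // cap chosen arbitrarily for illustration

// Producer side: block while the shared list is over the cap.
void push_batch(std::vector<void *> & batch)
{
    std::unique_lock<std::mutex> lk(tofree_mtx);
    tofree_cv.wait(lk, [] { return tofree_list.size() < kMaxPending; });
    tofree_list.insert(tofree_list.end(), batch.begin(), batch.end());
    batch.clear();
}

// Consumer side: drain the list, wake any waiting producers, free outside the lock.
void drain_once()
{
    std::vector<void *> local;
    {
        std::lock_guard<std::mutex> lk(tofree_mtx);
        local.swap(tofree_list);
    }
    tofree_cv.notify_all();
    for (auto * p : local)
        free(p);
}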

Overall, this test works much better as a stress test than as something whose reported numbers are useful for spotting performance regressions, and spotting perf regressions is my main use of mimalloc-bench.

I would be keen to add a modified version to snmalloc as a stress test, assuming that is okay with you. It found a bug that I will be fixing today, by stressing a case that normally only gets minimally exercised.

@SchrodingerZhu

@mjp41 @wareya A side question: I wonder what the numbers in the Windows Task Manager mean. Do they mean reserved memory or committed memory?


mjp41 commented Feb 4, 2025

@SchrodingerZhu I think it is committed bytes. I can't remember the details around shared pages, or around memory that is committed but not wired (i.e. in the page file). That is why I normally use vmmap on Windows to get a clearer picture.


wareya commented Feb 4, 2025

A side question: I wonder what the numbers in the Windows Task Manager mean. Do they mean reserved memory or committed memory?

It's committed, not reserved. My toy allocator reserves 16 TiB and it doesn't show up as that much in Task Manager; the value only grows as I commit new pages.
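
To illustrate the distinction, a minimal sketch of reserve vs. commit on Windows (illustrative only, not the toy allocator's actual code): only the MEM_COMMIT step adds to the commit charge that Task Manager reports.

// Sketch: reserve a huge address range up front, commit pages only as needed (Windows).
#include <windows.h>
#include <stdio.h>

int main()
{
    // Reserve 16 TiB of address space; this consumes no commit charge yet.
    void * base = VirtualAlloc(nullptr, 16ULL << 40, MEM_RESERVE, PAGE_NOACCESS);

    // Commit one 64 KiB chunk inside the reservation when it is actually needed;
    // only now does the committed-bytes figure grow.
    void * chunk = VirtualAlloc(base, 64 * 1024, MEM_COMMIT, PAGE_READWRITE);

    printf("reserved at %p, committed at %p\n", base, chunk);
    return 0;
}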

I would be keen to add a modified version to snmalloc as a stress test, assuming that is okay with you. It found a bug that I will be fixing today, by stressing a case that normally only gets minimally exercised.

That's fine by me!

I believe this is due to the OS scheduling quantum: many threads are blocked while the freeing occurs, as they cannot simply skip adding to the list. The freeing thread exhausts its quantum and gets descheduled, the other threads then build up a longer free list, and so the free list keeps growing over time. This leads to some interesting behaviours, but they are not really due to the allocator's performance. This kind of producer-consumer benchmark typically needs some form of backpressure to keep the memory usage bounded.

The part that gets descheduled is outside of the mutex's locking period rather than inside of it, right? If it happened inside, that descheduling would function as backpressure... I think. I guess the OS might be looking at the freelist thread and going "Hmm... Eh. I don't like that one." and radically deprioritizing it once it unlocks its mutex, and because the mutex isn't "fair", the producer threads win out over it often enough to cause problems.

My toy allocator has thread-local freelist caches and a shared, mutex-protected global freelist, and when the caches grow too big it dumps them in their entirety into the corresponding global freelist. (Threads bulk-pull from the global one when allocating, so they don't hit a mutex every time they need to access it.) The fact that it dumps them in their entirety is, I think, why my toy allocator has no issues with this stress test.
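
Roughly, a sketch of that scheme (hypothetical names and thresholds; this is not the actual wmalloc.hpp code):

// Sketch: per-thread cache of freed blocks, dumped wholesale into a
// mutex-protected global freelist when it grows too large, and bulk-pulled
// from when a thread needs more blocks.
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <vector>

static std::mutex g_freelist_mtx;
static std::vector<void *> g_freelist;      // global freelist (one per size class in a real allocator)

thread_local std::vector<void *> t_cache;   // thread-local freelist cache
constexpr size_t kFlushAt = 4096;           // illustrative thresholds
constexpr size_t kPullCount = 256;

void cache_free(void * p)
{
    t_cache.push_back(p);
    if (t_cache.size() >= kFlushAt)
    {
        std::lock_guard<std::mutex> lk(g_freelist_mtx);
        // Dump the cache in its entirety, so the lock is taken rarely.
        g_freelist.insert(g_freelist.end(), t_cache.begin(), t_cache.end());
        t_cache.clear();
    }
}

void * cache_alloc()
{
    if (t_cache.empty())
    {
        std::lock_guard<std::mutex> lk(g_freelist_mtx);
        // Bulk-pull a batch so the mutex is not hit on every allocation.
        size_t n = std::min(kPullCount, g_freelist.size());
        t_cache.assign(g_freelist.end() - n, g_freelist.end());
        g_freelist.resize(g_freelist.size() - n);
    }
    if (t_cache.empty())
        return nullptr;                     // a real allocator would fall back to the OS here
    void * p = t_cache.back();
    t_cache.pop_back();
    return p;
}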


wareya commented Feb 4, 2025

Yep, with a fair mutex instead of std::mutex I get similar peak memory usage of ~50MB.


mjp41 commented Feb 4, 2025

@wareya with your allocator, what size do the free lists reach? I think I was seeing about ~60,000 elements. If you create an in-band free list through the deallocated memory, then that is 60,000 cache misses inside the lock.
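
For context, an in-band (intrusive) free list threads its links through the freed blocks themselves, so every push or pop touches the freed memory, which is what makes each of those elements a likely cache miss. A minimal sketch (illustrative only):

// Sketch: in-band (intrusive) free list; the "next" pointer lives in the freed block itself.
struct FreeNode { FreeNode * next; };

FreeNode * head = nullptr;

void push(void * p)
{
    FreeNode * n = static_cast<FreeNode *>(p);
    n->next = head;      // write into the freed block
    head = n;
}

void * pop()
{
    if (!head) return nullptr;
    FreeNode * n = head;
    head = n->next;      // read from the freed block: one (likely cold) cache line per element
    return n;
}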

From your description, my guess would be that both snmalloc and mimalloc allocate faster than your allocator, as neither takes a lock when allocating except, effectively, to get more memory from the OS. However, that then swamps the freeing thread, which gets massively behind and sends giant batches back; this interacts badly with the scheduler, causes the freeing thread to be deprioritised, and makes the batches grow larger and larger.


wareya commented Feb 4, 2025

The typical max freelist length I see with my toy allocator is ~8k allocations, ~1.5GB. It occasionally spikes higher than that. I think that points to Task Manager underreporting how much is being committed on average, but I'd expect the average total commit size to be less than that, maybe several hundred megabytes.

My allocator just being slower and that keeping the freelist thread from getting swamped is very believable! My times are meaningfully slower than mimalloc's, after all.


mjp41 commented Feb 4, 2025

The underreporting is probably because you are not touching most of each allocation. That memory is committed but has no physical RAM backing it. Windows initially allocates in the page file, so large, barely used allocations don't take physical RAM.

@JochenBaier

Yep, with a fair mutex instead of std::mutex I get similar peak memory usage of ~50MB.

I would be interested to know which fair mutex you use.


wareya commented Feb 6, 2025

Yep, with a fair mutex instead of std::mutex I get similar peak memory usage of ~50MB.

I would be interested to know which fair mutex you use.

I wrapped std::mutex in a condition variable ticketing system; there are examples on Google etc.
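
For reference, one common shape of such a condition-variable ticketing wrapper (a sketch of the general idea, not necessarily the exact code used here):

// Sketch: a fair (FIFO) mutex built from std::mutex + std::condition_variable
// using the classic ticket scheme; threads acquire the lock in arrival order.
#include <condition_variable>
#include <cstdint>
#include <mutex>

class FairMutex
{
    std::mutex mtx;
    std::condition_variable cv;
    uint64_t next_ticket = 0;   // next ticket to hand out
    uint64_t now_serving = 0;   // ticket currently allowed to hold the lock
public:
    void lock()
    {
        std::unique_lock<std::mutex> lk(mtx);
        uint64_t my_ticket = next_ticket++;
        cv.wait(lk, [&] { return my_ticket == now_serving; });
    }
    void unlock()
    {
        std::lock_guard<std::mutex> lk(mtx);
        ++now_serving;
        cv.notify_all();
    }
};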
