Based on the instruction from: https://github.com/lotabout/write-a-C-interpreter/blob/master/README.md
But modify some by myself:
- Change from
inttolongto fit my 64-bit Mac Machine - Use
.hfile andCMake( ✌︎( ᐛ )✌︎)
Compiler
├──no mempool version/
│ ├── CMakeLists.txt
│ ├── global.h
│ ├── main.c
│ ├── lexer.c
│ ├── parser.c
│ ├── test0.c
│ └── vm.c
├──reorganized version/
│ ├── CMakeLists.txt
│ ├── global.h
│ ├── main.c
│ ├── lexer.c
│ ├── parser.c
│ ├── test.c
│ ├── test_mem.c
│ ├── concurrency_mempool.cpp
│ ├── concurrency_mempool.hpp
│ └── vm.c
├──comparison/
├── bench.py
├── test_perf.c
-
global.h (The Registry): Stores all shared definitions (Token IDs, instruction codes, register variable declarations). Any file that needs to communicate with others must include this.
-
main.c (The Assembly Plant): Responsible for allocating memory and reading source files. It first calls the parser to work, and then triggers the VM to run.
-
lexer.c (The Cutter): Solely manages
next(). It chops the long source string into individual Tokens (e.g.,int,main,123). -
parser.c (The Translator): Solely manages
program(),expression(), etc. It observes the Tokens and writes instructions (assembly) into thetextarray. -
vm.c (The CPU): Solely manages
eval(). It reads the instructions from thetextarray and executes them one by one.
How to compile + test running:
cd "no mempool version"
mkdir build
cd build
cmake ..
make
./compiler_no_mem ../test0.cThen you can get:
xxx build % make
[100%] Built target compiler
xxx build % ./compiler ../test.c
fibonacci(0) = 1
fibonacci(1) = 1
fibonacci(2) = 2
fibonacci(3) = 3
fibonacci(4) = 5
fibonacci(5) = 8
fibonacci(6) = 13
fibonacci(7) = 21
fibonacci(8) = 34
fibonacci(9) = 55
fibonacci(10) = 89
exit(0)
The biggest architectural leap in the reorganized version is the replacement of the native OS malloc/free with a custom-built, highly optimized Concurrent Memory Pool (concurrency_mempool.cpp).
Inspired by Google's TCMalloc, this C++ memory allocator handles all memory requests originating from the VM's MALC and FREE instructions. It drastically reduces user-to-kernel context switches and provides lock-free allocation for multithreaded environments.
The Mempool is built upon a strict three-tier caching system to balance performance and memory fragmentation:
-
ThreadCache (The Front-line): * A thread-local storage (
thread_local) cache that handles all small memory requests (≤ 4096 bytes).- Completely Lock-Free: Threads can allocate and deallocate memory here without any lock contention, making it incredibly fast.
-
CentralCache (The Distributor): * Acts as the middleman. When a
ThreadCacheruns out of memory (or has too much), it communicates with theCentralCache.- Manages memory in blocks called
Spans. Uses lightweightSpinLockto handle concurrent requests from multipleThreadCacheinstances efficiently.
- Manages memory in blocks called
-
PageCache (The Wholesaler): * Directly interfaces with the Operating System (
mmapon Linux/macOS,VirtualAllocon Windows) to allocate large chunks of memory in page units.- Uses a 2-Level Radix Tree (
PageMap2) to map raw memory addresses toSpanstructures in$O(1)$ time complexity, ensuring lightning-fast memory recovery and merging during deallocation.
- Uses a 2-Level Radix Tree (
- Cross-Language Integration: The memory pool is implemented in modern C++ but exports a clean
extern "C"API (PoolAllocate,PoolDeallocate), allowing the C-based Virtual Machine to seamlessly call it. - Metadata Optimization (
SpanPool): Instead of using the nativenewoperator to createSpanmetadata objects (which would defeat the purpose of a custom allocator), it features an internalSpanPoolthat pre-allocates and manages its own metadata structures. - Space-Time Tradeoff: By pre-allocating large pages from the OS and managing fragmentation internally, this architecture sacrifices a slight amount of peak physical memory (RSS) to buy pure execution speed and predictability.
To prove the effectiveness of the Mempool, we ran a heavy memory benchmark (bench.py running test_perf.c) simulating 100,000 continuous malloc and free operations. We compared the Baseline (Native Allocator) against our Custom Mempool.
| Metric | Baseline (Native) | Mempool (Custom) | Difference |
|---|---|---|---|
| Avg Time (ms) | 28.76 | 28.34 | + 1.45% (Faster) |
| Peak RSS (MB) | 2.33 | 2.98 | - 28.18% (More RAM) |
The results perfectly demonstrate the classic Space-Time Tradeoff in computer science:
- Time Improvement: The custom Mempool successfully outperforms the highly-optimized modern OS allocator in execution speed because it eliminates frequent user-to-kernel mode context switches during allocation.
- Space Cost: The Peak RSS (Resident Set Size) is naturally higher. The Mempool pre-allocates large memory blocks (pages) from the OS to serve future requests and maintains internal metadata (Spans, Radix Trees), trading base memory usage for pure execution speed.