
Exploration of Using a Different Memory Allocator for VTR Runtime Improvement #3189

@stefanpie

Description

Note: this is an independent exploration that I conducted, and I don't expect the VTR authors to immediately consider these changes for VTR. I merely want to share some useful insights about VTR, gained from my own research, with the VTR authors to help make VTR better.

Motivation

Recently, I've been looking at ways to speed up EDA tools without the need to make changes to the underlying code of the EDA tools themselves. For example, I want to try to make Vitis HLS or Vivado run faster, but I don't have access to the Vitis HLS or Vivado source code, and I can't change it unless I go work or intern for AMD/Xilinx. I want to explore approaches that would let me speed up commercial EDA tools in a "black box" manner. My original intent with this idea was that I could speed up EDA tools that interact with LLM agents, since the bottleneck there is the EDA tools (which can take >30s to minutes) and not LLM inference (which is usually <20s). But this is also a useful pursuit in general for large-scale EDA dataset collection and EDA benchmarking.

Approach

One trick in this exploration was to use a different memory allocator than the default malloc implementation that was compiled or linked with the software. For my exploration, I have been testing mimalloc (https://github.com/microsoft/mimalloc), a custom memory allocator from Microsoft Research with many tunable and customizable parameters. Even better, for any software that dynamically links malloc, you can override the default allocator by using LD_PRELOAD to swap in mimalloc instead.
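As a concrete sketch of what this looks like in practice (the build steps and library path below are assumptions for a typical CMake build of mimalloc; adjust for your system):

```sh
# Build mimalloc as a shared library (out-of-tree CMake build; the output
# name may carry a version suffix such as libmimalloc.so.2 on some systems).
git clone https://github.com/microsoft/mimalloc.git
cmake -S mimalloc -B mimalloc/build -DCMAKE_BUILD_TYPE=Release
cmake --build mimalloc/build

# Preload mimalloc so that malloc/free calls in the dynamically linked VPR
# binary resolve to mimalloc instead of the default libc allocator.
LD_PRELOAD=./mimalloc/build/libmimalloc.so \
    $VPR_BIN --pack --place --route --analysis -j1 ./EArch.xml ./sqrt8.blif
```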

The general idea is that (assuming the system has plenty of memory) we can tune the memory allocator to use those memory resources, trading a small amount of memory overhead for faster program execution. This can hopefully yield non-trivial runtime savings over many program executions. For example, if I have to place and route 1000 benchmarks and a better memory allocator brings runtime down to 0.8x-0.9x of the baseline, my benchmark evaluation takes 8-9 hours instead of 10. However, this model of the expected speedup is not that realistic: program execution and memory allocation in a tool can be complex. Running the tool on small designs might yield a 0.99x runtime ratio (virtually nothing), while larger designs might see 0.8x. Different programs, or different phases of execution (synth vs. pack vs. place vs. route), might also show different behaviors and different runtime reductions.
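As an illustration of the kind of memory-for-speed tuning I have in mind, mimalloc reads its options from environment variables; a minimal sketch (the option names should be double-checked against the mimalloc README for the version in use, since they change between releases):

```sh
# Preload mimalloc and trade some memory for speed: back allocations with
# large/huge OS pages and print allocator statistics at exit.
LD_PRELOAD=./mimalloc/build/libmimalloc.so \
MIMALLOC_LARGE_OS_PAGES=1 \
MIMALLOC_RESERVE_HUGE_OS_PAGES=4 \
MIMALLOC_SHOW_STATS=1 \
    $VPR_BIN --pack --place --route --analysis -j1 ./EArch.xml ./clma.blif
```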

I also decided to evaluate this approach on open-source EDA software such as yosys (still testing) and VTR, and I was able to capture some simple initial results for VTR just by swapping in mimalloc with minimal tweaking of the allocator configuration.

Evaluation

Note that these initial results are not valid since I made a mistake, but it is a mistake that, once corrected, works in favor of good results. See my later post in this issue thread for corrected and improved results. The experimental setup described here is unchanged and still worth reading.

I benchmarked VTR execution on two MCNC designs (sqrt8 and clma) across three cases; mcnc_big fixes the routing channel width at 128, while mcnc_big_search leaves it unspecified so VPR searches for a minimum routable channel width:

  • mcnc_simple: $VPR_BIN --pack --place --route --analysis --write_rr_graph rr_graph.xml -j1 ./EArch.xml ./sqrt8.blif
  • mcnc_big: $VPR_BIN --pack --place --route --analysis --route_chan_width 128 --write_rr_graph rr_graph.xml -j1 ./EArch.xml ./clma.blif
  • mcnc_big_search: $VPR_BIN --pack --place --route --analysis --write_rr_graph rr_graph.xml -j1 ./EArch.xml ./clma.blif

Each case was executed 48 times, with all samples running in parallel on 48 cores of a 64-core machine and with no parallelism inside VTR itself (-j1). Throughout the runs, total memory usage stayed well below the total system memory, so there should have been no memory throttling or swapping behavior.
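For reference, here is a sketch of how such a sampling run can be driven from the shell (run_one_sample.sh is a hypothetical wrapper around one of the VPR command lines above, not the actual harness I used):

```sh
# Launch 48 independent samples with up to 48 running at once, each in its
# own working directory so output files (rr_graph.xml, logs) do not collide.
seq 1 48 | xargs -P 48 -I{} sh -c \
    'mkdir -p run_{} && cd run_{} && ../run_one_sample.sh > vpr.log 2>&1'
```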

The figures below show the results. For each case, there is a base version where the design files are read from and written to disk, a version where the design files are read from and written to an in-memory filesystem (using the /dev/shm mount), and a version where files are accessed from the in-memory filesystem and mimalloc is also dropped in with minimal tweaking of its configuration. The runtimes for each of these configurations are plotted as a "histogram" (really a KDE plot).

[Figures: runtime distributions (KDE plots) for vtr__mcnc_simple, vtr__mcnc_big, and vtr__mcnc_big_search]
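Concretely, the shm and shm+mimalloc configurations amount to staging the files on the in-memory filesystem and, in the latter case, preloading the allocator; a sketch, with paths as assumptions:

```sh
# "shm" variant: copy the inputs to the tmpfs mount and run VPR from there,
# so all file reads and writes stay in memory.
WORK=/dev/shm/vtr_run_$$
mkdir -p "$WORK" && cp ./EArch.xml ./sqrt8.blif "$WORK" && cd "$WORK"
$VPR_BIN --pack --place --route --analysis --write_rr_graph rr_graph.xml \
    -j1 ./EArch.xml ./sqrt8.blif

# "shm+mimalloc" variant: the same, with the allocator swapped in as well.
LD_PRELOAD=/path/to/libmimalloc.so \
$VPR_BIN --pack --place --route --analysis --write_rr_graph rr_graph.xml \
    -j1 ./EArch.xml ./sqrt8.blif
```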

I also include tables showing the normalized runtime for each method at the mean, median, 5th percentile, and 95th percentile of the runtimes. All values are normalized to the "base" approach, meaning files on disk and no mimalloc.

Results for vtr__mcnc_simple

| Method       | Mean | Median | 5% Quantile | 95% Quantile |
|--------------|------|--------|-------------|--------------|
| disk         | 1    | 1      | 1           | 1            |
| shm          | 0.85 | 0.91   | 0.64        | 0.89         |
| shm+mimalloc | 0.92 | 0.93   | 0.78        | 0.95         |

Results for vtr__mcnc_big

| Method       | Mean | Median | 5% Quantile | 95% Quantile |
|--------------|------|--------|-------------|--------------|
| disk         | 1    | 1      | 1           | 1            |
| shm          | 0.97 | 0.92   | 1.02        | 0.96         |
| shm+mimalloc | 0.98 | 0.94   | 1.03        | 0.97         |

Results for vtr__mcnc_big_search

| Method       | Mean | Median | 5% Quantile | 95% Quantile |
|--------------|------|--------|-------------|--------------|
| disk         | 1    | 1      | 1           | 1            |
| shm          | 0.97 | 0.92   | 0.92        | 1.06         |
| shm+mimalloc | 0.96 | 0.89   | 0.94        | 1.04         |

I also conducted some (non-rigorous) non-parametric statistical tests (since I don't know whether I can assume normality) to check for a statistically significant difference in runtime. I would not draw any conclusions from these numbers without many more samples, but the setup is a nice thing to have in the code and will be useful once I can test many more samples across more designs.

# Statistical Analysis Report: vtr__mcnc_simple
## Mann-Whitney U Test Results
- SHM < Disk: U-statistic = 89.00, p-value = 0.0000
- SHM + Mimalloc < Disk: U-statistic = 197.00, p-value = 0.0000

# Statistical Analysis Report: vtr__mcnc_big
## Mann-Whitney U Test Results
- SHM < Disk: U-statistic = 467.00, p-value = 0.2751
- SHM + Mimalloc < Disk: U-statistic = 485.00, p-value = 0.3610

# Statistical Analysis Report: vtr__mcnc_big_search
## Mann-Whitney U Test Results
- SHM < Disk: U-statistic = 447.00, p-value = 0.1932
- SHM + Mimalloc < Disk: U-statistic = 405.00, p-value = 0.0764

Conclusion

Looking at these simple results so far, there seems to be a compelling case for exploring further the use of an alternative memory allocator, especially mimalloc. The initial results on three VTR execution cases with different designs and minimal tweaking of mimalloc suggest that 1) with minimal changes, runtimes around 0.9x of the baseline (roughly a 10% reduction) are achievable, 2) more benchmarking across many more designs and execution modes of VTR is needed, and 3) more exploration of how to configure mimalloc for VTR is also needed. Since VTR is open-source, unlike the commercial EDA tools that motivated this exploration, I can also gain some insight from the VTR code itself to help guide the configuration of mimalloc.

I also plan to run further experiments to disentangle the effect of the other technique explored here (the in-memory file system) from the technique proposed here (the different memory allocator); the current results do not include a configuration where only mimalloc is added while files are still read from and written to disk. But hopefully, as I explore other black-box EDA tool speedup techniques, I can update my VTR results alongside results from other tools.

Hopefully this provides some useful insights for the VTR authors, and I am definitely open to discussion on this idea!
