Commit 161cfa4

Updated README
1 parent 9346949 commit 161cfa4

File tree

1 file changed: +45 −31 lines changed


README.md

Lines changed: 45 additions & 31 deletions
@@ -5,10 +5,10 @@ jemalloc.NET is a .NET API over the [jemalloc](http://jemalloc.net/) native memo
 
 The jemalloc.NET project provides:
 * A low-level .NET API over the native jemalloc API functions like je_malloc, je_calloc, je_free, je_mallctl...
-* A safety-focused high-level .NET API providing data structures like arrays backed by native memory allocated using jemalloc.
+* A safety-focused high-level .NET API providing data structures like arrays backed by native memory allocated using jemalloc, together with management features like reference counting.
 * A benchmark CLI program: `jembench` which uses the excellent [BenchmarkDotNet](http://benchmarkdotnet.org/index.htm) library for easy and accurate benchmarking of operations on native data structures vs. managed objects using different parameters.
 
-Data structures provided by the high-level API are more efficient than managed .NET arrays and objects at the scale of millions of elements, and memory allocation is much more resistant to fragmentation. Large .NET arrays must be allocated on the Large Object Heap which leads to fragmentation and lower performance. For example in the following `jembench` benchmark on my laptop, filling a managed array of type UInt64[] of size 100 million is 2.6x slower than using an equivalent native array provided by jemalloc.NET:
+Data structures provided by the high-level API are more efficient than managed .NET arrays and objects at the scale of millions of elements, and memory allocation is much more resistant to fragmentation, while still providing necessary safety features like array bounds checking. Large .NET arrays must be allocated on the Large Object Heap and are not relocatable, which leads to fragmentation and lower performance. For example, in the following `jembench` benchmark on my laptop, filling a `UInt64[]` managed array of size 100 million is 2.6x slower than using an equivalent native array provided by jemalloc.NET:
 
 ``` ini
 
@@ -27,11 +27,10 @@ Runtime=Core AllowVeryLargeObjects=True Toolchain=InProcessToolchain
 | 'Fill a managed array with a single value.' | 100000000 | 327.4 ms | 3.102 ms | 2.902 ms | 937.5000 | 937.5000 | 937.5000 | 800000192 B |
 | 'Fill a SafeArray on the system unmanaged heap with a single value.' | 100000000 | 126.1 ms | 1.220 ms | 1.081 ms | - | - | - | 264 B |
 
-You can run this benchmark with the command `jembench array --fill -l -u 100000000`. In this case we see that using the managed array allocated 800 MB on the managed heap while using the native array did not cause any allocations on the managed heap for the array data. Avoiding the managed heap for very large but simple data structures like arrays is a key optimizarion for apps that do large-scale in-memory computations.
+You can run this benchmark with the command `jembench array --fill -l -u 100000000`. In this case we see that the managed array allocated 800 MB on the managed heap, while the native array did not cause any allocations on the managed heap for the array data. Avoiding the managed heap for very large but simple data structures like arrays is a key optimization for apps that do large-scale in-memory computation.
 
-Perhaps the killer feature of the recently introduced `Span<T>` class in .NET is its ability to efficently re-interpret numeric data structures (`Int32, Int64` and their siblings) into other strucutres like the `Vector<T>` SIMD-enabled data types introduced in 2016. `Vector<T>` types are special in that the .NET RyuJIT JIT compiler can compile operations on Vectors to use SIMD instructions like SSE, SSE2, and AVX for parallelizing operations on data on a single CPU core.
 
-Using the SIMD-enabled `SafeBuffer<T>.VectoryMultiply(n)` method provided by the jemalloc.NET API yields a 4.5x speedup for a simple in-place multiplication of a `Uint16[]` array of 1 million elements compared to the unoptimized linear approach, allowing the operation to complete in 3.3 ms:
+Managed .NET arrays are also limited to `Int32` indexing and a maximum size of about 2.15 billion elements. jemalloc.NET provides huge arrays through the `HugeArray<T>` class, which allows you to access all available memory as a flat contiguous buffer using array semantics. In the next benchmark `jembench hugearray --fill -i 4200000000`:
 
 ``` ini
 
@@ -43,16 +42,20 @@ Frequency=2531251 Hz, Resolution=395.0616 ns, Timer=TSC
 
 Job=JemBenchmark Jit=RyuJit Platform=X64
 Runtime=Core AllowVeryLargeObjects=True Toolchain=InProcessToolchain
-RunStrategy=Throughput
+RunStrategy=ColdStart TargetCount=7 WarmupCount=-1
 
 ```
-| Method | Parameter | Mean | Error | StdDev | Gen 0 | Allocated |
-|-------------------------------------------------------------------- |---------- |----------:|----------:|----------:|----------:|-----------:|
-| 'Multiply all values of a managed array with a single value.' | 1024000 | 15.861 ms | 0.3169 ms | 0.4231 ms | 7781.2500 | 24576000 B |
-| 'Vector multiply all values of a native array with a single value.' | 1024000 | 3.299 ms | 0.0344 ms | 0.0287 ms | - | 56 B |
+| Method | Parameter | Mean | Error | StdDev | Allocated |
+|------------------------------------------------------------------------------- |----------- |--------:|---------:|---------:|-------------:|
+| 'Fill a managed array with the maximum size [2146435071] with a single value.' | 4200000000 | 3.177 s | 0.1390 s | 0.0617 s | 8585740456 B |
+| 'Fill a HugeArray on the system unmanaged heap with a single value.' | 4200000000 | 4.029 s | 3.2233 s | 1.4312 s | 0 B |
+
+
+An `Int32[]` of maximum size can be allocated and filled in 3.2 s. This array consumes 8.6 GB on the managed heap. But a jemalloc.NET `HugeArray<Int32>` of nearly double the size, at 4.2 billion elements, can be allocated in only 4 s and again consumes no memory on the managed heap. The only limit on the size of a `HugeArray<T>` is the available system memory.
 
+Perhaps the killer feature of the [recently introduced](https://blogs.msdn.microsoft.com/dotnet/2017/11/15/welcome-to-c-7-2-and-span/) `Span<T>` class in .NET is its ability to efficiently zero-copy re-interpret numeric data structures (`Int32`, `Int64` and their siblings) into other structures like the `Vector<T>` SIMD-enabled data types introduced in 2016. `Vector<T>` types are special in that the .NET RyuJIT compiler can compile operations on Vectors to use SIMD instructions like SSE, SSE2, and AVX for parallelizing operations on data on a single CPU core.
 
-Managed .NET arays are also limited to Int32 indexing and a maximum size of about 2.15 billion elements. jemalloc.NET provides huge arrays through the `HugeArray<T>` class which allows you to access all available memory as a flat contiguous buffer using array semantics. In the next benchmark `jembench hugearray --fill -i 4200000000`:
+Using the SIMD-enabled `SafeBuffer<T>.VectoryMultiply(n)` method provided by the jemalloc.NET API yields a 4.5x speedup for a simple in-place multiplication of a `UInt16[]` array of 1 million elements, compared to the unoptimized linear approach, allowing the operation to complete in 3.3 ms:
 
 ``` ini
 
@@ -64,18 +67,16 @@ Frequency=2531251 Hz, Resolution=395.0616 ns, Timer=TSC
 
 Job=JemBenchmark Jit=RyuJit Platform=X64
 Runtime=Core AllowVeryLargeObjects=True Toolchain=InProcessToolchain
-RunStrategy=ColdStart TargetCount=7 WarmupCount=-1
+RunStrategy=Throughput
 
 ```
-| Method | Parameter | Mean | Error | StdDev | Allocated |
-|------------------------------------------------------------------------------- |----------- |--------:|---------:|---------:|-------------:|
-| 'Fill a managed array with the maximum size [2146435071] with a single value.' | 4200000000 | 3.177 s | 0.1390 s | 0.0617 s | 8585740456 B |
-| 'Fill a HugeArray on the system unmanaged heap with a single value.' | 4200000000 | 4.029 s | 3.2233 s | 1.4312 s | 0 B |
-
+| Method | Parameter | Mean | Error | StdDev | Gen 0 | Allocated |
+|-------------------------------------------------------------------- |---------- |----------:|----------:|----------:|----------:|-----------:|
+| 'Multiply all values of a managed array with a single value.' | 1024000 | 15.861 ms | 0.3169 ms | 0.4231 ms | 7781.2500 | 24576000 B |
+| 'Vector multiply all values of a native array with a single value.' | 1024000 | 3.299 ms | 0.0344 ms | 0.0287 ms | - | 56 B |
 
-an Int32[] array of maximum size can be allocated and filled in 3.2s. This array consumes 8.6GB on the managed heap. But a jemalloc.NET `HugeArray<Int32>` of nearly double the size at 4.2 billion elements can be allocated in only 4 s and again consumes no memory on the managed heap. The only limit on the size of a `HugeArray<T>` is the available system memory.
 
-For huge arrays of `Int16[]` we see similar speedups:
+For huge arrays of `UInt16[]` we see similar speedups:
 ``` ini
 
 BenchmarkDotNet=v0.10.11, OS=Windows 10 Redstone 2 [1703, Creators Update] (10.0.15063.726)
@@ -95,33 +96,35 @@ RunStrategy=ColdStart TargetCount=1
 | 'Vector multiply all values of a native array with a single value.' | 4096000000 | 12.06 s | NA | - | - | 0 B |
 
 
-For a huge array with 4.1 billion `UInt16` values it takes 12 seconds to do a SIMD-enabled multiplication operation on all the elements of the array. This is still 3x the performance of doing the same non-vectorized operation on a managed array of hald the size
-In a .NET application jemalloc.NET native arrays and data structures can be straightforwardly accessed by native libraries without the need to make additional copies. Buffer operations can be SIMD-vectorized which can make a significant performance difference for huge buffers with 10s of billions of values.
+For a huge array with 4.1 billion `UInt16` values it takes 12 seconds to do a SIMD-enabled multiplication operation on all the elements of the array. This is still 3x the performance of doing the same non-vectorized operation on a managed array of half the size.
 
-The goal of the jemalloc.NET project is to make accessible to .NET the kind of big-data in-memory numeric, scientific and other computing that typically would require coding in a low=level language like C/C++ or assembler.
+Inside a .NET application, jemalloc.NET native arrays and data structures can be straightforwardly accessed by native libraries without the need to make additional copies or allocations. The goal of the jemalloc.NET project is to make accessible to .NET the kind of big-data in-memory numeric, scientific and other computing that would typically require coding in a low-level language like C/C++ or assembler.
 
 
 
 ## Installation
+### Requirements
+Currently jemalloc.NET only runs on 64-bit Windows; support for 64-bit Linux and other platforms supported by .NET Core will be added
+soon.
 
-
-
-## Usage
-
+#### Windows
+* The latest [.NET Core 2.0 x64 runtime](https://www.microsoft.com/net/download/thank-you/dotnet-runtime-2.0.3-windows-x64-installer)
+* The latest version of the [Microsoft Visual C++ Redistributable for Visual Studio 2017](https://go.microsoft.com/fwlink/?LinkId=746572)
 
 
 ## Building from source
-Currently build instuctions are only provided for Visual Studio 2017 on Windows x64.
+Currently build instructions are only provided for Visual Studio 2017 on Windows, but instructions for building on Linux will also be provided. jemalloc.NET is a 64-bit library only.
 ### Requirements
 [Visual Studio 2017 15.5](https://www.visualstudio.com/en-us/news/releasenotes/vs2017-relnotes#15.5.1) with at least the following components:
 * C# 7.2 compiler
-* .NET Core 2.0 SDK
+* .NET Core 2.0 SDK x64
 * MSVC 2017 compiler toolset v141 or higher
-* Windows 10 SDK for Desktop C++ version 10.0.10.15603 or higher
+* Windows 10 SDK for Desktop C++ version 10.0.10.15603 or higher. Note that if you only have higher versions installed you will need to retarget the jemalloc MSVC project to your SDK version from Visual Studio.
 
 Per the instructions for building the native jemalloc library for Windows, you will also need Cygwin (32- or 64-bit) with the following packages:
 * autoconf
 * autogen
+* gcc
 * gawk
 * grep
 * sed
@@ -130,8 +133,19 @@ Cygwin tools aren't actually used for compiling jemalloc but for generating the
 
 ### Steps
 0. You must add the [.NET Core](https://dotnet.myget.org/gallery/dotnet-core) NuGet [feed](https://dotnet.myget.org/F/dotnet-core/api/v3/index.json) on MyGet and also the [CoreFxLab](https://dotnet.myget.org/gallery/dotnet-corefxlab) [feed](https://dotnet.myget.org/F/dotnet-core/api/v3/index.json) to your NuGet package sources. You can do this in Visual Studio 2017 from the Tools->Options->NuGet Package Manager menu item.
-1. Clone the project: `git clone https://github.com/alllisterb/jemalloc.NET`
+1. Clone the project: `git clone https://github.com/alllisterb/jemalloc.NET` and init the submodules: `git submodule update --init --recursive`
 2. Open an x64 Native Tools Command Prompt for VS 2017 and temporarily add `Cygwin\bin` to the PATH, e.g. `set PATH=%PATH%;C:\cygwin\bin`. Switch to the `jemalloc` subdirectory in your jemalloc.NET solution dir and run `sh -c "CC=cl ./autogen.sh"`. This will generate some files in the `jemalloc` subdirectory and only needs to be done once.
-4. From a Visual Studio 2017 Developer Command prompt run `build.cmd`.
+4. From a Visual Studio 2017 Developer Command prompt run `build.cmd`. Alternatively, you can load the solution in Visual Studio and build the entire solution using the "Benchmark" solution configuration.
 5. The solution should build without errors.
 6. Run `jembench` from the solution folder to see the project version and help.
+
+## Usage
+
+### jembench CLI
+Examples:
+* `jembench hugearray -l -u --math --cold-start -t 3 4096000000` Benchmark math operations on `HugeArray<UInt64>` arrays of size 4096000000 without benchmark warmup and only using 3 iterations of the target methods. Benchmarks on huge arrays can be lengthy so you should carefully control
+
+
+
+
+## Using the command-line program jembench
