The following benchmarks were taken:

- Scalability of PartialCsvParser.
- Comparison with another CSV parser:
    - PartialCsvParser v0.1.1
    - csv-parser-cplusplus v1.0.0
Each benchmark parses the following CSV file and simply counts the total number of columns (a minimal sketch of the counting loop appears after the sample rows):

- Size: 412 MB
- Number of rows: 8,192,000
- Number of columns: 5
- First 5 rows:
```
1,Douglas,Sanchez,[email protected],2.221.14.151
2,Robert,Hudson,[email protected],211.131.138.32
3,Victor,Taylor,[email protected],41.251.57.238
4,Kelly,Harvey,[email protected],69.31.229.204
5,John,Adams,[email protected],206.51.26.246
```
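The per-row work in every benchmark is the same: read each row and add up its number of columns. A minimal single-threaded sketch of that loop is shown below; it assumes the `PCP::CsvConfig` / `PCP::PartialCsvParser` / `get_row()` API from PartialCsvParser's usage documentation and is not the actual benchmark driver, which additionally splits the file across threads (see the parallel sketch further down).

```cpp
// Minimal sketch: parse a CSV file and count the total number of columns.
// Assumes the PCP::CsvConfig / PCP::PartialCsvParser / get_row() API from
// PartialCsvParser's usage documentation.
#include <iostream>
#include <string>
#include <vector>

#include <partial_csv_parser.hpp>

int main() {
  PCP::CsvConfig csv_config("csv/20480000col.csv");  // the library memory-maps the file
  PCP::PartialCsvParser parser(csv_config);          // no range given: parse the whole body

  size_t n_columns = 0;
  std::vector<std::string> row;
  while (!(row = parser.get_row()).empty())
    n_columns += row.size();

  std::cout << "OK. Parsed " << n_columns << " columns." << std::endl;
  return 0;
}
```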
Two environments were used:

- MBA
    - A MacBook Air, a commodity laptop.
- clokoap100
    - A mid-range server with both an SSD and an HDD.

More detailed specifications follow.
| Item | MBA |
|------|-----|
| CPU clock | 1.3 GHz |
| # CPU sockets | 1 |
| # Cores per socket | 2 |
| # Logical cores per physical core | 2 |
| Memory size | 8 GB |
| Memory sequential read | 2.142 GB/sec |
| Memory random read | 1.467 GB/sec |
| Memory file system | hfs |
| SSD sequential read | 492.289 MB/sec |
| SSD random read | 11.374 MB/sec |
| SSD file system | hfs |
| Item | clokoap100 |
|------|------------|
| CPU clock | 2.393 GHz |
| # CPU sockets | 2 |
| # Cores per socket | 4 |
| # Logical cores per physical core | 2 |
| Memory size | 24 GB |
| Memory sequential read | 2.194 GB/sec |
| Memory random read | 1.759 GB/sec |
| Memory file system | tmpfs |
| SSD sequential read | 470.114 MB/sec |
| SSD random read | 30.350 MB/sec |
| SSD file system | ext3 |
| HDD sequential read | 261.251 MB/sec |
| HDD random read | 4.191 MB/sec |
| HDD file system | ext3 |
fio was used to check the random/sequential read speeds:
```
$ vim random-read-mem.fio
[random-read]
rw=randread
size=512m
directory=/dev/shm

$ vim sequential-read-mem.fio
[sequential-read]
rw=read
size=512m
directory=/dev/shm

$ fio random-read-mem.fio
```
Change the `directory` parameter to read from different devices. Note that, unlike Linux, Mac OS X does not have tmpfs mounted at /dev/shm.
On Mac OS X you can create and destroy a memory file system as follows:
```
$ hdid -nomount ram://2097152   # 512 bytes * 2097152 sectors = 1024 MB
/dev/disk2
$ newfs_hfs /dev/disk2
Initialized /dev/rdisk2 as a 1024 MB case-insensitive HFS Plus volume
$ mkdir /tmp/mnt
$ mount -t hfs /dev/disk2 /tmp/mnt
$ hdiutil detach /dev/disk2   # destroys the RAM disk when you are done
```
The raw data is available in a Google Spreadsheet.
You can also run the experiments in your own environment by following the instructions below. First, generate the benchmark data:
```
$ ../script/generate-benchmark-data.sh 12   # you can use bigger or smaller data
$ ls csv/
20480000col.csv  5000col.csv
```
Then build the benchmark programs. An internet connection is necessary because the Makefile internally invokes wget to fetch the other parser libraries used for the comparison.
```
$ cmake . && make
```
Finally, run the PartialCsvParser benchmark, for example with 4 parsing threads:
```
$ time ./PartialCsvParser_bench -p 4 -c 20480000 -f csv/20480000col.csv
/Users/nakatani.sho/git/PartialCsvParser/benchmark/PartialCsvParser_bench.cpp:50 - 0.0186529 seconds - mmap(2) file
/Users/nakatani.sho/git/PartialCsvParser/benchmark/PartialCsvParser_bench.cpp:73 - 3.97924 seconds - join parsing threads
OK. Parsed 20480000 columns.

real    0m4.010s
user    0m13.796s
sys     0m0.172s
```
Check the wall-clock time: 4.010 seconds in this run.
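For context on where the parallelism comes from: the `-p` option tells the benchmark how many threads to parse with, and each thread gets its own byte range of the memory-mapped file. The sketch below shows roughly how such a split can be written with PartialCsvParser. It is an illustration only: the three-argument range constructor and the `filesize()` / `body_offset()` accessors are assumed from the library's documented multi-thread usage (check `partial_csv_parser.hpp` for the exact names), and this is not the actual `PartialCsvParser_bench.cpp` code.

```cpp
// Sketch of parallel parsing with 4 threads, one byte range per thread.
// API names here (filesize(), body_offset(), the three-argument parser
// constructor) are assumptions based on PartialCsvParser's documented
// multi-thread usage -- verify against partial_csv_parser.hpp.
#include <iostream>
#include <string>
#include <thread>
#include <vector>

#include <partial_csv_parser.hpp>

int main() {
  const size_t n_threads = 4;
  PCP::CsvConfig csv_config("csv/20480000col.csv");  // memory-maps the file once

  // Split the CSV body into roughly equal byte ranges.
  const size_t body_size = csv_config.filesize() - csv_config.body_offset();
  const size_t size_per_thread = body_size / n_threads;

  std::vector<size_t> counts(n_threads, 0);
  std::vector<std::thread> threads;

  for (size_t i = 0; i < n_threads; ++i) {
    const size_t from = csv_config.body_offset() + i * size_per_thread;
    const size_t to = (i == n_threads - 1) ? csv_config.filesize() - 1
                                           : from + size_per_thread - 1;
    threads.emplace_back([&csv_config, &counts, i, from, to]() {
      // Each parser only handles rows that start inside its byte range,
      // so the threads need no locking.
      PCP::PartialCsvParser parser(csv_config, from, to);
      std::vector<std::string> row;
      while (!(row = parser.get_row()).empty()) counts[i] += row.size();
    });
  }

  size_t total_columns = 0;
  for (size_t i = 0; i < n_threads; ++i) {
    threads[i].join();
    total_columns += counts[i];
  }
  std::cout << "OK. Parsed " << total_columns << " columns." << std::endl;
  return 0;
}
```

Note that the user time above (13.796s) is larger than the 4.010s wall-clock time, which is what you expect when four threads parse in parallel. For comparison, run the csv-parser-cplusplus benchmark on the same file: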
```
$ time ./csv_parser_cplusplus_bench -c 20480000 -f csv/20480000col.csv
/Users/nakatani.sho/git/PartialCsvParser/benchmark/csv_parser_cplusplus_bench.cpp:42 - 34.0444 seconds - parse
OK. Parsed 20480000 columns.

real    0m34.049s
user    0m33.022s
sys     0m0.498s
```