This benchmark suite contains 10 typical micro workloads. It also offers options to enable input/output compression for most workloads, with zlib as the default compression codec. For some initial work based on this benchmark suite, please refer to the included ICDE workshop paper (i.e., WISS10_conf_full_011.pdf).

Note:

1. Since HiBench-2.2, the input data of each benchmark is automatically generated by its corresponding prepare script.
2. Since HiBench-3.0, Yarn support is introduced.
3. Since HiBench-4.0, it consists of more workload implementations on both Hadoop MR and Spark. For Spark, three different language APIs are supported: Scala, Java, and Python.

2. WordCount (wordcount)

This workload counts the occurrence of each word in the input data, which is generated using RandomTextWriter. It is representative of another typical class of real-world MapReduce jobs: extracting a small amount of interesting data from a large data set.
3. TeraSort (terasort)
TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by the Hadoop TeraGen example program.

6. PageRank (pagerank)

This workload benchmarks the PageRank algorithm implemented in the Spark-MLLib/Hadoop examples (a search-engine ranking benchmark included in Pegasus 2.0). The data source is generated from Web data whose hyperlinks follow a Zipfian distribution.

7. Nutch indexing (nutchindexing)

Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system of Nutch, a popular open-source (Apache project) search engine. It uses automatically generated Web data whose hyperlinks and words both follow Zipfian distributions with corresponding parameters. The dictionary used to generate the Web page texts is the default Linux dictionary file /usr/share/dict/linux.words.

10. enhanced DFSIO (dfsioe)

Enhanced DFSIO tests the HDFS throughput of a Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of the HDFS cluster. Note: this benchmark does not have a corresponding Spark implementation.

**Supported Hadoop/Spark releases:**

- Apache release of Hadoop 1.x and Hadoop 2.x
- CDH4/CDH5 release of MR1 and MR2
- HDP2.3
- Spark1.2
- Spark1.3

Note: No version of CDH supports SparkSQL. Please download SparkSQL from the official Apache Spark release page if you are using it.
---
### Getting Started ###

Download/checkout the HiBench benchmark suite.

Run `<HiBench_Root>/bin/build-all.sh` to build HiBench.

Note: Beginning from HiBench 4.0, HiBench requires Python 2.x (>= 2.6).
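As a sketch, the download-and-build step might look like the following. The repository URL is an assumption; substitute your actual checkout location:

```shell
# Check out HiBench and build all workloads (repository URL assumed).
git clone https://github.com/intel-hadoop/HiBench.git
cd HiBench
# build-all.sh compiles the Hadoop MR and Spark workload binaries.
bin/build-all.sh
```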

2. HiBench Configurations.

For minimum requirements, create & edit `conf/99-user_defined_properties.conf`:

    hibench.hadoop.home      The Hadoop installation location
    hibench.spark.home       The Spark installation location
    hibench.hdfs.master      HDFS master
    hibench.spark.master     Spark master

Note: For YARN mode, set `hibench.spark.master` to `yarn-client`. (`yarn-cluster` is not supported yet)
To run HiBench on HDP, please set `hibench.hadoop.mapreduce.home` to the MapReduce home, normally "/usr/hdp/current/hadoop-mapreduce-client". Also set `hibench.hadoop.release` to "hdp".
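For illustration, a minimal `conf/99-user_defined_properties.conf` might look like the following; all paths and master URLs below are placeholders for your own cluster:

```
hibench.hadoop.home     /opt/hadoop
hibench.spark.home      /opt/spark
hibench.hdfs.master     hdfs://namenode:8020
hibench.spark.master    spark://sparkmaster:7077
```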

3. Run

Execute `<HiBench_Root>/bin/run-all.sh` to run all workloads with all language APIs at the `large` data scale.

4. View the report:

Go to `<HiBench_Root>/report` to check the final report:

- `report/hibench.report`: Overall report about all workloads.
- `report/<workload>/<language APIs>/bench.log`: Raw logs on the client side.
- `report/<workload>/<language APIs>/monitor.html`: System utilization monitoring results.
- `report/<workload>/<language APIs>/conf/<workload>.conf`: Generated environment-variable configuration for this workload.
- `report/<workload>/<language APIs>/conf/sparkbench/<workload>/sparkbench.conf`: Generated configuration for this workload, used for mapping to environment variables.
- `report/<workload>/<language APIs>/conf/sparkbench/<workload>/spark.conf`: Generated configuration for Spark.

[Optional] Execute `<HiBench root>/bin/report_gen_plot.py report/hibench.report` to generate report figures.
Note: `report_gen_plot.py` requires `python2.x` and `python-matplotlib`.
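If you want to post-process the summary report yourself rather than use `report_gen_plot.py`, a small parser sketch follows. The column layout in `SAMPLE` is an assumption about `report/hibench.report` (a whitespace-separated table with a header row); check your own report file before relying on it.

```python
# Sketch: parse a HiBench-style summary report into a list of dicts.
# ASSUMPTION: whitespace-separated columns with a single header row
# (hypothetical layout; verify against your report/hibench.report).

SAMPLE = """\
Type          Date       Time     Input_data_size  Duration(s)  Throughput(bytes/s)  Throughput/node
JavaWordcount 2015-06-01 10:00:00 328491           33.360       9846                 3282
"""

def parse_report(text):
    """Return one dict per data row, keyed by the header tokens."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]

rows = parse_report(SAMPLE)
print(rows[0]["Type"])         # JavaWordcount
print(rows[0]["Duration(s)"])  # 33.360
```

Since every field is kept as a string, convert `Duration(s)` and the throughput columns with `float()`/`int()` before plotting.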
1. Parallelism, memory, executor number tuning:

    hibench.default.map.parallelism      Mapper numbers in MR, partition numbers in Spark
    hibench.default.shuffle.parallelism  Reducer numbers in MR, shuffle partition numbers in Spark
    hibench.yarn.executors.num           Number of executors in YARN mode
    hibench.yarn.executors.cores         Number of executor cores in YARN mode
    spark.executors.memory               Executor memory, standalone or YARN mode
    spark.driver.memory                  Driver memory, standalone or YARN mode
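As a sketch, a tuning override placed in `conf/99-user_defined_properties.conf` might look like this; the numbers are illustrative placeholders, not recommendations, so size them to your cluster:

```
hibench.default.map.parallelism      48
hibench.default.shuffle.parallelism  48
hibench.yarn.executors.num           4
hibench.yarn.executors.cores         4
spark.executors.memory               4G
spark.driver.memory                  2G
```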

    hibench.compress.profile         Compression option: `enable` or `disable`
    hibench.compress.codec.profile   Compression codec: `snappy`, `lzo` or `default`

3. Data scale profile selection:

    hibench.scale.profile            Data scale profile: `tiny`, `small`, `large`, `huge`, `gigantic` or `bigdata`

You can add more data scale profiles in `conf/10-data-scale-profile.conf`. Please don't change `conf/00-default-properties.conf` unless you are confident about what you are doing.
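For illustration only, an extra data scale profile in `conf/10-data-scale-profile.conf` could look roughly like the following. The property-name pattern `hibench.<workload>.<profile>.datasize` is an assumption; copy the exact names from the profiles already in that file rather than from this sketch:

```
# Hypothetical "mylarge" profile for the wordcount workload (names assumed).
hibench.wordcount.mylarge.datasize    320000000
```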
4. Configure for each workload or each language API:

    workloads/<workload>/<language APIs>/.../*.conf    Configuration for the various language APIs

2. For configurations in the same folder, the loading sequence is sorted according to the configuration file name.

3. Values in later configuration files override earlier ones.

    hibench.spark.version            spark1.3

6. Configuration for running workloads and language APIs:

The `conf/benchmarks.lst` file under the package folder defines the workloads to run when you execute the `bin/run-all.sh` script under the package folder. Each line in the list file specifies one workload.
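As a sketch, a trimmed `conf/benchmarks.lst` selecting only a few of the workloads described above might look like this; whether `#` comment lines are honored is an assumption, so treat the stock file shipped with HiBench as the authoritative template:

```
# One workload per line; remove what you don't want to run.
wordcount
terasort
dfsioe
```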
You'll need to install numpy (version > 1.4) on the master and all slave nodes.
For CentOS (6.2+):
`yum install numpy`
For Ubuntu/Debian:
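On Ubuntu/Debian, the equivalent packages can typically be installed via apt; the Python 2 package names below are assumptions and may vary by release:

```shell
# Install numpy (and matplotlib, needed for report_gen_plot.py) for Python 2.
sudo apt-get install python-numpy python-matplotlib
```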

You'll need to install python-matplotlib (version > 0.9).
0 commit comments