This repository was archived by the owner on Dec 15, 2025. It is now read-only.

Commit eb667b6 (merge commit, 2 parents: e31cf86 + 11229eb)

Merge branch 'master' of github.com:intel-hadoop/HiBench

Conflicts:
	bin/build-all.sh
	bin/functions/load-config.py
	bin/functions/workload-functions.sh
	src/autogen/src/main/java/org/apache/hadoop/fs/dfsioe/IOMapperBase.java

File tree: 9 files changed (+80 −54 lines)


README.md

Lines changed: 24 additions & 20 deletions

Most removed/added pairs below are whitespace-only changes (the two sides are otherwise identical); the substantive additions are the HDP2.3 entry, the SparkSQL note, and the HDP configuration paragraph.

@@ -18,7 +18,7 @@
 
 This benchmark suite contains 10 typical micro workloads. This benchmark suite also has options for users to enable input/output compression for most workloads with default compression codec (zlib). Some initial work based on this benchmark suite please refer to the included ICDE workshop paper (i.e., WISS10_conf_full_011.pdf).
 
-Note:
+Note:
 1. Since HiBench-2.2, the input data of benchmarks are all automatically generated by their corresponding prepare scripts.
 2. Since HiBench-3.0, it introduces Yarn support
 3. Since HiBench-4.0, it consists of more workload implementations on both Hadoop MR and Spark. For Spark, three different APIs including Scala, Java, Python are supportive.

@@ -32,7 +32,7 @@ Note:
 2. WordCount (wordcount)
 
 This workload counts the occurrence of each word in the input data, which are generated using RandomTextWriter. It is representative of another typical class of real world MapReduce jobs - extracting a small amount of interesting data from large data set.
-
+
 3. TeraSort (terasort)
 
 TeraSort is a standard benchmark created by Jim Gray. Its input data is generated by Hadoop TeraGen example program.

@@ -52,7 +52,7 @@ Note:
 6. PageRank (pagerank)
 
 This workload benchmarks PageRank algorithm implemented in Spark-MLLib/Hadoop (a search engine ranking benchmark included in pegasus 2.0) examples. The data source is generated from Web data whose hyperlinks follow the Zipfian distribution.
-
+
 7. Nutch indexing (nutchindexing)
 
 Large-scale search indexing is one of the most significant uses of MapReduce. This workload tests the indexing sub-system in Nutch, a popular open source (Apache project) search engine. The workload uses the automatically generated Web data whose hyperlinks and words both follow the Zipfian distribution with corresponding parameters. The dict used to generate the Web page texts is the default linux dict file /usr/share/dict/linux.words.

@@ -74,13 +74,15 @@ Note:
 10. enhanced DFSIO (dfsioe)
 
 Enhanced DFSIO tests the HDFS throughput of the Hadoop cluster by generating a large number of tasks performing writes and reads simultaneously. It measures the average I/O rate of each map task, the average throughput of each map task, and the aggregated throughput of HDFS cluster. Note: this benchmark doesn't have Spark corresponding implementation.
-
+
 **Supported hadoop/spark release:**
 
 - Apache release of Hadoop 1.x and Hadoop 2.x
 - CDH4/CDH5 release of MR1 and MR2.
+- HDP2.3
 - Spark1.2
 - Spark1.3
+Note : No version of CDH supports SparkSQL. Please download SparkSQL from Apache-spark official release page if you are using it.
 
 ---
 ### Getting Started ###

@@ -92,39 +94,41 @@ Note:
 Download/checkout HiBench benchmark suite
 
 Run `<HiBench_Root>/bin/build-all.sh` to build HiBench.
-
+
 Note: Begin from HiBench V4.0, HiBench will need python 2.x(>=2.6) .
 
 2. HiBench Configurations.
 
 For minimum requirements: create & edit `conf/99-user_defined_properties.conf`
-
-    cd conf
+
+    cd conf
     cp 99-user_defined_properties.conf.template 99-user_defined_properties.conf
-
+
 And Make sure below properties has been set:
 
     hibench.hadoop.home      The Hadoop installation location
     hibench.spark.home       The Spark installation location
     hibench.hdfs.master      HDFS master
     hibench.spark.master     SPARK master
-
+
 Note: For YARN mode, set `hibench.spark.master` to `yarn-client`. (`yarn-cluster` is not supported yet)
 
+To run HiBench on HDP, please specify `hibench.hadoop.mapreduce.home` to the mapreduce home, normally it should be "/usr/hdp/current/hadoop-mapreduce-client". Also please specify `hibench.hadoop.release` to "hdp".
+
 3. Run
 
 Execute the `<HiBench_Root>/bin/run-all.sh` to run all workloads with all language APIs with `large` data scale.
 
 4. View the report:
-
+
 Goto `<HiBench_Root>/report` to check for the final report:
 - `report/hibench.report`: Overall report about all workloads.
 - `report/<workload>/<language APIs>/bench.log`: Raw logs on client side.
 - `report/<workload>/<language APIs>/monitor.html`: System utilization monitor results.
 - `report/<workload>/<language APIs>/conf/<workload>.conf`: Generated environment variable configurations for this workload.
 - `report/<workload>/<language APIs>/conf/sparkbench/<workload>/sparkbench.conf`: Generated configuration for this workloads, which is used for mapping to environment variable.
 - `report/<workload>/<language APIs>/conf/sparkbench/<workload>/spark.conf`: Generated configuration for spark.
-
+
 [Optional] Execute `<HiBench root>/bin/report_gen_plot.py report/hibench.report` to generate report figures.
 
 Note: `report_gen_plot.py` requires `python2.x` and `python-matplotlib`.

@@ -134,12 +138,12 @@ Note:
 
 1. Parallelism, memory, executor number tuning:
 
-    hibench.default.map.parallelism     Mapper numbers in MR,
+    hibench.default.map.parallelism     Mapper numbers in MR,
                                         partition numbers in Spark
-    hibench.default.shuffle.parallelism Reducer numbers in MR, shuffle
+    hibench.default.shuffle.parallelism Reducer numbers in MR, shuffle
                                         partition numbers in Spark
     hibench.yarn.executors.num          Number executors in YARN mode
-    hibench.yarn.executors.cores        Number executor cores in YARN mode
+    hibench.yarn.executors.cores        Number executor cores in YARN mode
     spark.executors.memory              Executor memory, standalone or YARN mode
     spark.driver.memory                 Driver memory, standalone or YARN mode
 

@@ -149,11 +153,11 @@ Note:
 
     hibench.compress.profile         Compression option `enable` or `disable`
     hibench.compress.codec.profile   Compression codec, `snappy`, `lzo` or `default`
-
+
 3. Data scale profile selection:
 
     hibench.scale.profile            Data scale profile, `tiny`, `small`, `large`, `huge`, `gigantic`, `bigdata`
-
+
 You can add more data scale profiles in `conf/10-data-scale-profile.conf`. And please don't change `conf/00-default-properties.conf` if you have no confidence.
 
 4. Configure for each workload or each language API:

@@ -165,7 +169,7 @@ Note:
     workloads/<workload>/<language APIs>/.../*.conf   Configure for various languages
 
 2. For configurations in same folder, the loading sequence will be
-   sorted according to configure file name.
+   sorted according to configure file name.
 
 3. Values in latter configure will override former.
 

@@ -188,7 +192,7 @@ Note:
     hibench.spark.version   spark1.3
 
 6. Configures for running workloads and language APIs:
-
+
 The `conf/benchmarks.lst` file under the package folder defines the
 workloads to run when you execute the `bin/run-all.sh` script under
 the package folder. Each line in the list file specifies one

@@ -226,7 +230,7 @@ Note:
 You'll need to install numpy (version > 1.4) in master & all slave nodes.
 
 For CentOS(6.2+):
-
+
 `yum install numpy`
 
 For Ubuntu/Debian:

@@ -238,7 +242,7 @@ Note:
 You'll need to install python-matplotlib(version > 0.9).
 
 For CentOS(6.2+):
-
+
 `yum install python-matplotlib`
 
 For Ubuntu/Debian:
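Collecting the configuration keys from the hunks above into one place, a minimal sketch of `conf/99-user_defined_properties.conf` might look like the following. Every path and host here is an illustrative placeholder, not a value taken from the commit:

```
# Minimal sketch of conf/99-user_defined_properties.conf.
# All paths and hosts below are illustrative placeholders.
hibench.hadoop.home       /usr/lib/hadoop
hibench.spark.home        /usr/lib/spark
hibench.hdfs.master       hdfs://namenode:8020
# yarn-client for YARN mode; yarn-cluster is not supported yet.
hibench.spark.master      yarn-client

# For HDP clusters only (support added by this merge):
hibench.hadoop.mapreduce.home  /usr/hdp/current/hadoop-mapreduce-client
hibench.hadoop.release         hdp
```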

bin/build-all.sh

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ cd $DIR/src/sparkbench && \
 for mr in MR1 MR2; do
     for spark_version in 1.2 1.3 1.4 1.5; do
         cp target/*-jar-with-dependencies.jar jars
-        mvn clean package -D spark$spark_version -D$mr
+        mvn clean package -D spark$spark_version -D $mr
         if [ $? -ne 0 ]; then
             echo "Build failed for spark$spark_version and $mr, please check!"
             exit 1

bin/functions/execute_with_log.py

Lines changed: 4 additions & 3 deletions

@@ -77,22 +77,23 @@ def execute(workload_result_file, command_lines):
     count = 100
     last_time=0
     log_file = open(workload_result_file, 'w')
-    while True:
+    # see http://stackoverflow.com/a/4417735/1442961
+    lines_iterator = iter(proc.stdout.readline, b"")
+    for line in lines_iterator:
         count += 1
         if count > 100 or time()-last_time>1: # refresh terminal size for 100 lines or each seconds
             count, last_time = 0, time()
             width, height = get_terminal_size()
             width -= 1
 
         try:
-            line = proc.stdout.readline().rstrip()
+            line = line.rstrip()
             log_file.write(line+"\n")
             log_file.flush()
         except KeyboardInterrupt:
             proc.terminate()
             break
         line = line.decode('utf-8')
-        if not line: break
         line = replace_tab_to_space(line)
         #print "{Red}log=>{Color_Off}".format(**Color), line
         lline = line.lower()
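The hunk above replaces a manual `while True: readline(); if not line: break` loop with the sentinel-iterator idiom (the stackoverflow.com/a/4417735 link in the added comment). A minimal standalone sketch of that idiom, with a trivial child process standing in for HiBench's workload command:

```python
import subprocess
import sys

# iter() with a b"" sentinel turns the blocking readline() into an
# iterator that stops cleanly at EOF, which is what let the commit
# drop the explicit "if not line: break" check.
proc = subprocess.Popen(
    [sys.executable, "-c", "print('line one'); print('line two')"],
    stdout=subprocess.PIPE,
)

collected = []
for raw in iter(proc.stdout.readline, b""):  # b"" is the EOF sentinel
    line = raw.rstrip().decode("utf-8")      # mirrors the rstrip()/decode in the patch
    collected.append(line)

proc.wait()
print(collected)  # -> ['line one', 'line two']
```

One detail worth noting: the sentinel must match the stream type, `b""` for a bytes pipe (as here and in the patched script) or `""` for a text-mode stream.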

0 commit comments