How to Benchmark a Hadoop Cluster

2015-03-01 23:36:21 Source: 中存储网

    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 \
      -fileSize 1000

    At the end of the run, the results are written to the console and also recorded in a local file (which is appended to, so you can rerun the benchmark and not lose old results):

    % cat TestDFSIO_results.log
    ----- TestDFSIO ----- : write
               Date & time: Sun Apr 12 07:14:09 EDT 2009
           Number of files: 10
    Total MBytes processed: 10000
         Throughput mb/sec: 7.796340865378244
    Average IO rate mb/sec: 7.8862199783325195
     IO rate std deviation: 0.9101254683525547
        Test exec time sec: 163.387
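    Because the log is appended to on every run, it can be handy to pull one metric out across all recorded runs. The following is a minimal sketch and not part of Hadoop: the function names dfsio_metric and dfsio_aggregate are hypothetical, and both assume the log keeps the "label: value" layout shown above. The second function simply divides the total megabytes by the test execution time; that figure includes job startup overhead, so treat it as a rough derived number rather than something TestDFSIO itself reports.

```shell
# dfsio_metric LABEL FILE
# Print the value of LABEL (e.g. "Throughput mb/sec") for every run in FILE.
dfsio_metric() {
  awk -F': ' -v label="$1" '
    {
      key = $1
      gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim padding around the label
      if (key == label) print $2
    }' "$2"
}

# dfsio_aggregate FILE
# For every run in FILE, print total MB divided by execution time (MB/sec).
dfsio_aggregate() {
  awk -F': ' '
    /Total MBytes processed/          { mb = $2 }
    /Test exec time sec/ && $2 > 0    { printf "%.2f\n", mb / $2 }
  ' "$1"
}
```

    For example:

    % dfsio_metric 'Throughput mb/sec' TestDFSIO_results.log
    % dfsio_aggregate TestDFSIO_results.log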

    The files are written under the /benchmarks/TestDFSIO directory by default (this can be changed by setting the test.build.data system property), in a directory called io_data.

    To run a read benchmark, use the -read argument. Note that these files must already exist (having been written by TestDFSIO -write):

    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -read -nrFiles 10 \
      -fileSize 1000

    Here are the results for a real run:

    ----- TestDFSIO ----- : read
               Date & time: Sun Apr 12 07:24:28 EDT 2009
           Number of files: 10
    Total MBytes processed: 10000
         Throughput mb/sec: 80.25553361904304
    Average IO rate mb/sec: 98.6801528930664
     IO rate std deviation: 36.63507598174921
        Test exec time sec: 47.624

    When you’ve finished benchmarking, you can delete all the generated files from HDFS using the -clean argument:

    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar TestDFSIO -clean
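    The write, read, and clean steps above are usually run back to back, so they can be wrapped in one function. This is only a sketch: dfsio_cycle and the HADOOP_CMD variable are hypothetical names, not part of Hadoop. HADOOP_CMD defaults to hadoop and exists so that echo can be substituted for a dry run.

```shell
# dfsio_cycle JAR NR_FILES FILE_SIZE_MB
# Run the TestDFSIO write, read, and clean steps in order,
# stopping at the first step that fails.
dfsio_cycle() {
  jar=$1
  nr_files=$2
  size_mb=$3
  hadoop_cmd=${HADOOP_CMD:-hadoop}   # hypothetical override for dry runs

  "$hadoop_cmd" jar "$jar" TestDFSIO -write -nrFiles "$nr_files" -fileSize "$size_mb" &&
  "$hadoop_cmd" jar "$jar" TestDFSIO -read -nrFiles "$nr_files" -fileSize "$size_mb" &&
  "$hadoop_cmd" jar "$jar" TestDFSIO -clean
}
```

    A dry run that prints the three commands, followed by a real run:

    % HADOOP_CMD=echo dfsio_cycle $HADOOP_INSTALL/hadoop-*-test.jar 10 1000
    % dfsio_cycle $HADOOP_INSTALL/hadoop-*-test.jar 10 1000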

    Benchmarking MapReduce with Sort

    Hadoop comes with a MapReduce program that does a partial sort of its input. It is very useful for benchmarking the whole MapReduce system, as the full input dataset is transferred through the shuffle. The three steps are: generate some random data, perform the sort, then validate the results.

    First we generate some random data using RandomWriter. It runs a MapReduce job with 10 maps per node, and each map generates (approximately) 10 GB of random binary data, with keys and values of various sizes. You can change these values if you like by setting the properties test.randomwriter.maps_per_host and test.randomwrite.bytes_per_map. There are also settings for the size ranges of the keys and values; see RandomWriter for details.

    Here’s how to invoke RandomWriter (found in the example JAR file, not the test one) to write its output to a directory called random-data:

    % hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter random-data
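    For a quick trial on a small cluster, the two properties mentioned above can be lowered. The helper below is a hypothetical sketch (randomwriter_opts is not a Hadoop command) that converts a per-map size in megabytes into bytes and prints the matching -D options. It assumes RandomWriter is launched through Hadoop's generic option handling, so that -D property=value is honored when placed before the output directory; check this against your release.

```shell
# randomwriter_opts MAPS_PER_HOST MB_PER_MAP
# Print -D options for a RandomWriter run; the per-map size is converted
# from megabytes to bytes.
randomwriter_opts() {
  maps_per_host=$1
  bytes_per_map=$(( $2 * 1024 * 1024 ))
  printf '%s %s\n' \
    "-Dtest.randomwriter.maps_per_host=$maps_per_host" \
    "-Dtest.randomwrite.bytes_per_map=$bytes_per_map"
}
```

    For example, two maps per node writing about 100 MB each:

    % hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar randomwriter \
      $(randomwriter_opts 2 100) random-data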

    Next we can run the Sort program:

    % hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data

    The overall execution time of the sort is the metric we are interested in, but it’s instructive to watch the job’s progress via the web UI (http://jobtracker-host:50030/), where you can get a feel for how long each phase of the job takes.
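    Since the overall execution time is the number of interest, it helps to record it the same way on every run. A minimal sketch follows; timed is a hypothetical helper, and it relies on date +%s, which is widely available but not guaranteed on every system. It measures whole seconds of wall-clock time, which is coarse but adequate for a job that runs for minutes.

```shell
# timed CMD [ARGS...]
# Run CMD, report elapsed wall-clock seconds on stderr,
# and return CMD's own exit status.
timed() {
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "elapsed_sec=$(( end - start ))" >&2
  return "$status"
}
```

    For example:

    % timed hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar sort random-data sorted-data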

    As a final sanity check, we validate that the data in sorted-data is, in fact, correctly sorted:

    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
      -sortOutput sorted-data

    This command runs the SortValidator program, which performs a series of checks on the unsorted and sorted data to confirm that the sort is accurate. It reports the outcome to the console at the end of its run:

    SUCCESS! Validated the MapReduce framework's 'sort' successfully.
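    When the benchmark is scripted, the success line above can be checked automatically. The sketch below makes no assumption about the validator's exit status; it only looks for that exact message in output that has been captured to a file. The function name sort_validated and the file name validate.log are hypothetical.

```shell
# sort_validated FILE
# Succeed only if FILE (captured SortValidator output) contains the
# success message; -F matches it as a fixed string, -q suppresses output.
sort_validated() {
  grep -qF "SUCCESS! Validated the MapReduce framework's 'sort' successfully." "$1"
}
```

    For example:

    % hadoop jar $HADOOP_INSTALL/hadoop-*-test.jar testmapredsort -sortInput random-data \
      -sortOutput sorted-data 2>&1 | tee validate.log
    % sort_validated validate.log && echo "sort validated"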

    Other benchmarks

    There are many more Hadoop benchmarks, but the following are widely used:

    • MRBench (invoked with mrbench) runs a small job a number of times. It acts as a good counterpoint to sort, as it checks whether small job runs are responsive.

    • NNBench (invoked with nnbench) is useful for load testing namenode hardware.

    • Gridmix is a suite of benchmarks designed to model a realistic cluster workload, by mimicking a variety of data-access patterns seen in practice. See src/benchmarks/gridmix2 in the distribution for further details.[63]
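    As an illustration of the first item, MRBench is run from the test JAR like the other benchmarks. The wrapper below is a hypothetical sketch: mrbench_runs and HADOOP_CMD are made-up names, and it assumes the -numRuns option is available in your release (running mrbench with no arguments should show what it accepts).

```shell
# mrbench_runs JAR NUM_RUNS
# Run the small MRBench job NUM_RUNS times.
# HADOOP_CMD is a hypothetical override (defaults to "hadoop") for dry runs.
mrbench_runs() {
  "${HADOOP_CMD:-hadoop}" jar "$1" mrbench -numRuns "$2"
}
```

    For example:

    % mrbench_runs $HADOOP_INSTALL/hadoop-*-test.jar 50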