After a MapReduce job finishes, the console prints a set of statistics (counters) gathered during execution. Analyzing these numbers helps verify that the job ran as expected.
Here is a concrete example. A text file has 100 million lines, each containing one integer in the range 0 to 100 million; the file is about 837 MB. The task is to use MapReduce to extract the 100 smallest numbers in this file.
The strategy: in the map phase, maintain a bounded heap of size 100 that holds the smallest 100 numbers seen so far in the split (in practice a max-heap, so the largest of the current candidates can be evicted whenever a smaller number arrives), and output the heap's 100 numbers in order when the split is exhausted, in the format <number, occurrence count>. In the reduce phase, since the MapReduce framework has already sorted the map output by key (i.e., by the number itself), it suffices to write the records out in order and take the first 100 numbers (each number is emitted as many times as its map-phase count).
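Below is a minimal Java sketch of this map/reduce pair. It is not the original code: the class names, the use of java.util.PriorityQueue as the bounded heap, and the choice to emit each heap entry as a separate <number, 1> record are illustrative assumptions.

import java.io.IOException;
import java.util.Arrays;
import java.util.Collections;
import java.util.PriorityQueue;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Min100 {

    public static class Min100Mapper
            extends Mapper<LongWritable, Text, IntWritable, LongWritable> {

        private static final int TOP = 100;
        // Max-heap over the candidate set: the head is the LARGEST of the
        // 100 smallest numbers seen so far, so it can be evicted cheaply
        // whenever a smaller number arrives.
        private final PriorityQueue<Integer> heap =
                new PriorityQueue<>(TOP, Collections.reverseOrder());

        @Override
        protected void map(LongWritable offset, Text line, Context context) {
            int n = Integer.parseInt(line.toString().trim());
            if (heap.size() < TOP) {
                heap.offer(n);
            } else if (n < heap.peek()) {
                heap.poll();
                heap.offer(n);
            }
        }

        @Override
        protected void cleanup(Context context)
                throws IOException, InterruptedException {
            // When the split is exhausted, emit the surviving 100 numbers
            // in ascending order, one <number, count> record per entry.
            Integer[] smallest = heap.toArray(new Integer[0]);
            Arrays.sort(smallest);
            for (int n : smallest) {
                context.write(new IntWritable(n), new LongWritable(1L));
            }
        }
    }

    public static class Min100Reducer
            extends Reducer<IntWritable, LongWritable, IntWritable, LongWritable> {

        @Override
        protected void reduce(IntWritable key, Iterable<LongWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            // The framework delivers keys already sorted, so writing every
            // record straight through leaves the smallest numbers at the top
            // of the output file; its first 100 lines are the answer.
            for (LongWritable count : counts) {
                context.write(key, count);
            }
        }
    }
}

Because the reducer only copies records through, Reduce input records and Reduce output records should match in the counter dumps below, and each map should contribute exactly 100 output records.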
Since the HDFS block size in the environment used here is 128 MB, we can expect this 837 MB file to be divided into 7 splits (about 120 MB each), each processed by its own map task.
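The arithmetic behind that expectation: 877,801,882 bytes ÷ 134,217,728 bytes (128 MB) ≈ 6.54, rounded up to 7 splits, each about 837 MB ÷ 7 ≈ 120 MB. The exact byte count is the Bytes Read value reported in the counters below.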
After submitting and running this MapReduce program, the console ends with the following output; I have added a few comments (the lines starting with //):
File System Counters
    FILE: Number of bytes read=50429
    FILE: Number of bytes written=2373560
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=4574262068
    HDFS: Number of bytes written=2584
    HDFS: Number of read operations=97
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=10
Map-Reduce Framework
    // Total number of input records, i.e. the 100 million numbers in the file
    Map input records=100000000
    // The job was divided into 7 splits and each map emitted 100 records,
    // so the map phase emitted 700 records in total
    Map output records=700
    // Bytes emitted by the map phase (12 bytes per record on average: <IntWritable, LongWritable>)
    Map output bytes=8400
    Map output materialized bytes=9842
    Input split bytes=784
    Combine input records=0
    Combine output records=0
    // Number of unique keys in the reducer's input
    Reduce input groups=346
    Reduce shuffle bytes=9842
    // Total number of records in the reducer's input
    Reduce input records=700
    // Total number of records emitted by the reducer
    Reduce output records=700
    Spilled Records=1400
    Shuffled Maps =7
    Failed Shuffles=0
    Merged Map outputs=7
    GC time elapsed (ms)=1564
    Total committed heap usage (bytes)=3536846848
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=877801882
File Output Format Counters
    Bytes Written=2584
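A few of these numbers can be cross-checked against one another: Shuffled Maps =7 confirms the split count predicted above; 7 maps × 100 records = 700 (Map output records); and 700 records × 12 bytes (a 4-byte IntWritable key plus an 8-byte LongWritable value) = 8,400 (Map output bytes).

The same counters can also be read programmatically in the driver once the job completes. A minimal sketch, assuming the standard org.apache.hadoop.mapreduce Job API (CounterCheck and printCounters are illustrative names; job is the driver's Job handle):

import java.io.IOException;
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class CounterCheck {
    // Print a few of the "Map-Reduce Framework" counters after job completion.
    static void printCounters(Job job) throws IOException {
        Counters counters = job.getCounters();
        long mapIn  = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
        long mapOut = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        long groups = counters.findCounter(TaskCounter.REDUCE_INPUT_GROUPS).getValue();
        System.out.printf("map in=%d, map out=%d, reduce groups=%d%n",
                          mapIn, mapOut, groups);
    }
}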
Three runs of the same program took 3 minutes 48 seconds, 3 minutes 12 seconds, and 5 minutes 25 seconds respectively.
When the file is grown to 2 billion lines (the job then takes 53 minutes), the reported statistics change accordingly:
File System Counters
    FILE: Number of bytes read=48724250
    FILE: Number of bytes written=63469529
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=1499621751504
    HDFS: Number of bytes written=71836
    HDFS: Number of read operations=22798
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=151
Map-Reduce Framework
    Map input records=2000000000
    Map output records=14800
    Map output bytes=177600
    Map output materialized bytes=208088
    Input split bytes=16280
    Combine input records=0
    Combine output records=0
    Reduce input groups=6809
    Reduce shuffle bytes=208088
    Reduce input records=14800
    Reduce output records=14800
    Spilled Records=29600
    Shuffled Maps =148
    Failed Shuffles=0
    Merged Map outputs=148
    GC time elapsed (ms)=19059
    Total committed heap usage (bytes)=73469526016
Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
File Input Format Counters
    Bytes Read=19778375016
File Output Format Counters
    Bytes Written=71836
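The same sanity checks hold at this scale: 19,778,375,016 bytes ÷ 134,217,728 bytes ≈ 147.4, rounded up to 148 splits (Shuffled Maps =148); 148 maps × 100 records = 14,800 (Map output records); and 14,800 records × 12 bytes = 177,600 (Map output bytes).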
Reposting is welcome; please keep a link to the original: https://bjzhanghao.com/p/539