Output types of the job and the mapper

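Default values for the MapReduce output data types:

mapred.output.key.class: org.apache.hadoop.io.LongWritable
mapred.output.value.class: org.apache.hadoop.io.Text

If you do not want these defaults, the MapReduce Job provides the following methods for setting the output data types: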

job.setOutputKeyClass()
job.setOutputValueClass()
job.setMapOutputKeyClass()
job.setMapOutputValueClass()

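The difference between Key and Value is obvious: they apply to the output key and the output value respectively. But what is the difference between output (setOutputKeyClass, setOutputValueClass) and mapoutput (setMapOutputKeyClass, setMapOutputValueClass)?

Hadoop's javadoc describes output as follows: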

Set the key class for the job output data.

and mapoutput as follows:

Set the key class for the map output data. This allows the user to specify the map output key class to be different than the final output value class.

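Taken literally, output controls the reducer's output types and mapoutput controls the mapper's output types. In practice, however, if you set only the output types, the job may still complain that the mapper's output type does not match the configured type.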

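It turns out that output sets not only the reducer's output types but also the mapper's, in other words the output types of the whole job. If the mapoutput types are set as well, the mapper's output types follow the mapoutput settings.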
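As a rough illustration of the rule above, here is a minimal driver sketch for a job whose mapper and reducer emit different value types. The class name OutputTypesDriver, the job name, and the choice of Text/IntWritable/DoubleWritable are assumptions made up for this example; it also assumes the Hadoop 2+ client library is on the classpath. The job is only configured, not submitted.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

// Hypothetical driver: the mapper is assumed to emit (Text, IntWritable)
// and the reducer is assumed to emit (Text, DoubleWritable).
public class OutputTypesDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output-types-demo");

        // Types of the final job output. Without the two mapoutput calls
        // below, these would also be used as the mapper's output types.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(DoubleWritable.class);

        // The mapper's value type differs from the reducer's, so it is
        // set explicitly; these settings take precedence for the mapper.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Show which types the framework will expect from each stage.
        System.out.println("map value:   " + job.getMapOutputValueClass().getName());
        System.out.println("final value: " + job.getOutputValueClass().getName());
    }
}
```

When the mapper and reducer emit the same key and value types, the two setMapOutput calls can be left out and the output settings cover both stages.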

The JobConf code confirms this logic:

public Class<?> getMapOutputKeyClass() {
  Class<?> retv = getClass("mapred.mapoutput.key.class", null, Object.class);
  if (retv == null) {
    retv = getOutputKeyClass();
  }
  return retv;
}
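The fallback in this getter can be tried without a Hadoop installation. The sketch below is a simplified model, not the real JobConf: MiniConf is a hypothetical stand-in that stores classes in a plain map, and Long, String and Integer stand in for Writable types such as LongWritable, Text and IntWritable.

```java
import java.util.HashMap;
import java.util.Map;

public class MapOutputFallbackDemo {

    /** Hypothetical stand-in for JobConf; models only the key-class lookup. */
    static class MiniConf {
        private final Map<String, Class<?>> props = new HashMap<>();

        void setOutputKeyClass(Class<?> c) {
            props.put("mapred.output.key.class", c);
        }

        void setMapOutputKeyClass(Class<?> c) {
            props.put("mapred.mapoutput.key.class", c);
        }

        Class<?> getOutputKeyClass() {
            // Long stands in for the LongWritable default.
            return props.getOrDefault("mapred.output.key.class", Long.class);
        }

        Class<?> getMapOutputKeyClass() {
            // Same shape as the JobConf getter: use the mapoutput setting
            // if present, otherwise fall back to the job output setting.
            Class<?> retv = props.get("mapred.mapoutput.key.class");
            if (retv == null) {
                retv = getOutputKeyClass();
            }
            return retv;
        }
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();

        // Nothing set: the mapper type falls back to the job default.
        System.out.println(conf.getMapOutputKeyClass().getSimpleName()); // Long

        // Only output set: the mapper type follows the output type.
        conf.setOutputKeyClass(String.class);
        System.out.println(conf.getMapOutputKeyClass().getSimpleName()); // String

        // mapoutput set too: it wins for the mapper, output is unchanged.
        conf.setMapOutputKeyClass(Integer.class);
        System.out.println(conf.getMapOutputKeyClass().getSimpleName()); // Integer
        System.out.println(conf.getOutputKeyClass().getSimpleName());    // String
    }
}
```

The three cases printed by main correspond to the three situations described above: defaults only, output only, and output plus mapoutput.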

References:

Where does job.setOutputKeyClass and job.setOutputReduceClass refers to?

Reposting is welcome.
Please keep the original link: https://bjzhanghao.com/p/545
