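I'm trying to generate Mahout vectors from an HBase table. Mahout requires sequence files of vectors as its input. I get the impression that I can't write to a sequence file from a map/reduce job that uses HBase as its source. Here goes nothing: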

public void vectorize() throws IOException, ClassNotFoundException, InterruptedException {
    JobConf jobConf = new JobConf();
    jobConf.setMapOutputKeyClass(LongWritable.class);
    jobConf.setMapOutputValueClass(VectorWritable.class);

    // we want the vectors written straight to HDFS,
    // the order does not matter.
    jobConf.setNumReduceTasks(0);

    jobConf.setOutputFormat(SequenceFileOutputFormat.class);

    Path outputDir = new Path("/home/cloudera/house_vectors");
    FileSystem fs = FileSystem.get(configuration);
    if (fs.exists(outputDir)) {
        fs.delete(outputDir, true);
    }

    FileOutputFormat.setOutputPath(jobConf, outputDir);

    // I want the mappers to know the max and min value
    // so they can normalize the data.
    // I will add them as properties in the configuration,
    // by serializing them with avro.
    String minmax = HouseAvroUtil.toString(Arrays.asList(minimumHouse,
            maximumHouse));
    jobConf.set("minmax", minmax);

    Job job = Job.getInstance(jobConf);
    Scan scan = new Scan();
    scan.addFamily(Bytes.toBytes("data"));
    TableMapReduceUtil.initTableMapperJob("homes", scan,
            HouseVectorizingMapper.class, LongWritable.class,
            VectorWritable.class, job);

    job.waitForCompletion(true);
}

I have some test code to run it, but I get this:

java.io.IOException: mapred.output.format.class is incompatible with new map API mode.
    at org.apache.hadoop.mapreduce.Job.ensureNotSet(Job.java:1173)
    at org.apache.hadoop.mapreduce.Job.setUseNewAPI(Job.java:1204)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:1262)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1287)
    at jinvestor.jhouse.mr.HouseVectorizer.vectorize(HouseVectorizer.java:90)
    at jinvestor.jhouse.mr.HouseVectorizerMT.vectorize(HouseVectorizerMT.java:23)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
    at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

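So I think my problem is that I'm using import org.apache.hadoop.mapreduce.Job, and its setOutputFormat method expects an instance of org.apache.hadoop.mapreduce.OutputFormat, which is a class. That class has only four implementations, and none of them is for sequence files. Here are its javadocs: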
http://hadoop.apache.org/docs/r2.2.0/api/index.html?org/apache/hadoop/mapreduce/outputformat.html
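If I could, I'd use the old-API version of the Job class, but HBase's TableMapReduceUtil only accepts a new-API Job.
I suppose I could write my results out as text first and then run a second map/reduce job to convert the output into sequence files, but that sounds really inefficient.
There is also the old org.apache.hadoop.hbase.mapred.TableMapReduceUtil, but it shows up as deprecated for me.
My Mahout jar is version 0.7-cdh4.5.0, my HBase jar is version 0.94.6-cdh4.5.0, and all my Hadoop jars are 2.0.0-cdh4.5.0.
Can someone tell me how to write a sequence file from M/R in my situation?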

2 Answers

#1 wfsdck30

Actually, SequenceFileOutputFormat is a descendant of the new OutputFormat as well. You have to look further than the direct subclasses in the javadoc to find it.
http://hadoop.apache.org/docs/r2.2.0/api/org/apache/hadoop/mapreduce/lib/output/sequencefileoutputformat.html
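You may have imported the wrong (old) one, org.apache.hadoop.mapred.SequenceFileOutputFormat, in your driver class. It is impossible to tell for sure from your question, because your code sample does not include the imports.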
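To illustrate why the job is rejected at submit time, here is a minimal, self-contained sketch of the kind of check that the stack trace points at (Job.ensureNotSet called from Job.setUseNewAPI). It is a simplified model that uses a plain Map in place of the Hadoop Configuration, not the Hadoop source. The class name NewApiCheckSketch, the helper names, and the new-API key name are assumptions made for the example; the real new-API key differs between Hadoop versions.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the submit-time check; not Hadoop code.
class NewApiCheckSketch {

    // Fails if an old-API key is present while the job runs in the given mode.
    static void ensureNotSet(Map<String, String> conf, String attr, String mode)
            throws IOException {
        if (conf.get(attr) != null) {
            throw new IOException(attr + " is incompatible with " + mode + " mode.");
        }
    }

    // Models the validation of a map-only job (zero reducers) that uses a
    // new-API mapper: the old-API output format key must not be set.
    static void validateMapOnlyNewApiJob(Map<String, String> conf) throws IOException {
        ensureNotSet(conf, "mapred.output.format.class", "new map API");
    }

    // Returns "ok" when validation passes, otherwise the error message.
    static String tryValidate(Map<String, String> conf) {
        try {
            validateMapOnlyNewApiJob(conf);
            return "ok";
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // What JobConf.setOutputFormat(...) effectively leaves behind: the old-API key.
        Map<String, String> oldStyle = new HashMap<>();
        oldStyle.put("mapred.output.format.class",
                "org.apache.hadoop.mapred.SequenceFileOutputFormat");
        // prints "mapred.output.format.class is incompatible with new map API mode."
        System.out.println(tryValidate(oldStyle));

        // Configuring the output format through the new API uses a different key.
        // The key name below is illustrative; it varies by Hadoop version.
        Map<String, String> newStyle = new HashMap<>();
        newStyle.put("mapreduce.outputformat.class",
                "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat");
        // prints "ok"
        System.out.println(tryValidate(newStyle));
    }
}
```

In the driver from the question, the corresponding change would presumably be to drop jobConf.setOutputFormat(...) and instead call job.setOutputFormatClass(SequenceFileOutputFormat.class) on the new-API Job, with SequenceFileOutputFormat imported from org.apache.hadoop.mapreduce.lib.output.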

2021-06-04

#2 jyztefdp

This was what I was missing in a similar problem I ran into while using Oozie. From the braindump:

<!-- New API for map -->
<property>
    <name>mapred.mapper.new-api</name>
    <value>true</value>
</property>

<!-- New API for reducer -->
<property>
    <name>mapred.reducer.new-api</name>
    <value>true</value>
</property>
