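Goal:

I want to be able to specify the number of mappers used on an input file.
Equivalently, I want to specify how many lines of the file each mapper takes.

Simple example:

For a 10-line input file (lines of unequal length; example below), I want 2 mappers, each processing 5 lines.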

This is
an arbitrary example file
of 10 lines.
Each line does
not have to be
of
the same
length or contain
the same
number of words

This is what I have:

(I have each mapper emit one "<map,1>" key-value pair, so that the pairs get summed in the reducer.)

package org.myorg;
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.InputFormat;

public class Test {

  // produce one "<map,1>" pair per mapper
  public static class Map extends Mapper<Object, Text, Text, IntWritable>{
    private final static IntWritable one = new IntWritable(1);
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      context.write(new Text("map"), one);
    }
  }

  // reduce by taking a sum
  public static class Red extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {      
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job1 = Job.getInstance(conf, "pass01");

    job1.setJarByClass(Test.class);
    job1.setMapperClass(Map.class);
    job1.setCombinerClass(Red.class);
    job1.setReducerClass(Red.class);

    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job1, new Path(args[0]));
    FileOutputFormat.setOutputPath(job1, new Path(args[1]));

    // // Attempt#1
    // conf.setInt("mapreduce.input.lineinputformat.linespermap", 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#2
    // NLineInputFormat.setNumLinesPerSplit(job1, 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#3
    // conf.setInt(NLineInputFormat.LINES_PER_MAP, 5);
    // job1.setInputFormatClass(NLineInputFormat.class);

    // // Attempt#4
    // conf.setInt("mapreduce.input.fileinputformat.split.minsize", 234);
    // conf.setInt("mapreduce.input.fileinputformat.split.maxsize", 234);

    System.exit(job1.waitForCompletion(true) ? 0 : 1);
  }
}

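With the example data above, the code above produces

map 10

I want the output to be

map 2

where the first mapper does something with the first 5 lines and the second mapper does something with the last 5 lines.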
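
One likely reason the count stays at 10 even when the input is split into two 5-line chunks: Hadoop calls `map()` once per input record (here, once per line), so a mapper that writes `<map,1>` inside `map()` emits ten pairs in total no matter how the lines are distributed. Emitting once per mapper task would mean writing from `cleanup()` (or `setup()`), which run once per task. The sketch below simulates that difference in plain Java with no Hadoop dependency; `EmitCount` and its methods are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative simulation only: EmitCount is not a Hadoop class.
final class EmitCount {

    // Mimics writing <map,1> inside map(): one pair per record in every split.
    static int emitPerRecord(List<List<String>> splits) {
        int pairs = 0;
        for (List<String> split : splits) {
            for (String line : split) {
                pairs++; // map() runs once per line
            }
        }
        return pairs;
    }

    // Mimics writing <map,1> inside cleanup(): one pair per mapper task.
    static int emitPerTask(List<List<String>> splits) {
        int pairs = 0;
        for (List<String> split : splits) {
            pairs++; // cleanup() runs once, after the task's last record
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Two splits of 5 lines each, as in the question's example file.
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("This is", "an arbitrary example file", "of 10 lines.",
                          "Each line does", "not have to be"),
            Arrays.asList("of", "the same", "length or contain", "the same",
                          "number of words"));

        System.out.println("map " + emitPerRecord(splits)); // map 10
        System.out.println("map " + emitPerTask(splits));   // map 2
    }
}
```

Separately, `Job.getInstance(conf)` copies the `Configuration` it is given, so `conf.setInt(...)` calls made afterwards (as in Attempts #1, #3 and #4) do not reach the job; setting the value through `job1.getConfiguration()` or `NLineInputFormat.setNumLinesPerSplit(job1, 5)` does.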

// MapperNLine.java: identity mapper for the answer's NLineInputFormat example
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MapperNLine extends Mapper<LongWritable, Text, LongWritable, Text> {

  // Pass each (byte offset, line) record through unchanged.
  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(key, value);
  }
}

// Driver.java: configures a map-only job that feeds each mapper 5 input lines
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Driver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    if (args.length != 2) {
      System.out.printf("Two parameters are required for Driver: <input dir> <output dir>\n");
      return -1;
    }

    Job job = Job.getInstance(getConf());
    job.setJobName("NLineInputFormat example");
    job.setJarByClass(Driver.class);

    // Each split, and therefore each mapper, receives 5 lines.
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.addInputPath(job, new Path(args[0]));
    job.getConfiguration().setInt("mapreduce.input.lineinputformat.linespermap", 5);

    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MapperNLine.class);
    job.setNumReduceTasks(0); // map-only job: mapper output is written directly

    boolean success = job.waitForCompletion(true);
    return success ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new Configuration(), new Driver(), args);
    System.exit(exitCode);
  }
}

1 Answer

disho6za1#

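You can use NLineInputFormat.
With NLineInputFormat you can specify exactly how many lines each mapper should receive. For example, if your file has 500 lines and you set the number of lines per mapper to 10, you get 50 mappers (instead of one, assuming the file is smaller than the HDFS block size).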
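
The arithmetic in this example can be checked with a short sketch. It assumes one mapper per split and N lines per split, with a final shorter split when the line count is not a multiple of N; `SplitCount` and `numSplits` are names made up for illustration and are not part of Hadoop.

```java
// Illustrative only: SplitCount and numSplits are hypothetical names, not Hadoop APIs.
final class SplitCount {

    // Number of N-line splits needed to cover totalLines (ceiling division).
    static long numSplits(long totalLines, int linesPerSplit) {
        if (totalLines < 0 || linesPerSplit <= 0) {
            throw new IllegalArgumentException(
                "totalLines must be >= 0 and linesPerSplit must be > 0");
        }
        return (totalLines + linesPerSplit - 1) / linesPerSplit;
    }

    public static void main(String[] args) {
        System.out.println(numSplits(500, 10)); // 50: the 500-line example
        System.out.println(numSplits(10, 5));   // 2: the question's 10-line file
        System.out.println(numSplits(12, 5));   // 3: the last split holds 2 leftover lines
    }
}
```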
Edit:
Here is an example that uses NLineInputFormat:
Mapper class (the MapperNLine listing above):

Driver class (the Driver listing above):

With the input you provided, the example mapper above writes its output to two files, since two mappers are initialized:
part-m-00001

0   This is
8   an arbitrary example file
34  of 10 lines.
47  Each line does
62  not have to be

part-m-00002

77  of
80  the same
89  length or contain
107 the same
116 number of words
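
The keys in these two files are the byte offsets at which each line starts in the input file, which is why they keep increasing across the split boundary. A plain-Java sketch (no Hadoop dependency) can reproduce the grouping; `NLineSimulation` and `split` are hypothetical names, and the sketch assumes UTF-8 text with single-byte `\n` line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative simulation only: NLineSimulation is not a Hadoop class.
final class NLineSimulation {

    // Groups lines into chunks of linesPerSplit and prefixes each line with
    // the byte offset at which it starts in the original file.
    static List<List<String>> split(List<String> lines, int linesPerSplit) {
        List<List<String>> splits = new ArrayList<>();
        List<String> current = new ArrayList<>();
        long offset = 0;
        for (String line : lines) {
            current.add(offset + "\t" + line);
            // Advance past the line's bytes plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            if (current.size() == linesPerSplit) {
                splits.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            splits.add(current); // leftover lines form a final, shorter split
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "This is", "an arbitrary example file", "of 10 lines.", "Each line does",
            "not have to be", "of", "the same", "length or contain", "the same",
            "number of words");

        List<List<String>> splits = split(lines, 5);
        for (int i = 0; i < splits.size(); i++) {
            System.out.println("split " + i + ":");
            for (String record : splits.get(i)) {
                System.out.println(record);
            }
        }
    }
}
```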

2021-06-03

MapReduce: how to make a mapper process multiple lines?
