How do I get the maximum word count in Hadoop?

Posted by xzabzqsa on 2021-05-27 in Hadoop

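I have managed to get my word count program working, and now I want to find which word occurs most often in each file.
My word count output looks like this: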

File1:Word1: x
File1:Word2: x

Here File is the file, Word is the word being searched for, and x is the count.
I want to get the maximum of these word counts. For example, given:

File1:Word1: 4
File1:Word2: 10
File2:Word1: 4
File2:Word2: 1

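I would want Word2 of File1 and Word1 of File2 each to be incremented by 1, because each of them has the maximum word count for its file.
Unfortunately, I am having a hard time getting the output I want.
My map function looks like this: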

public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
        throws IOException { 

    String parsedLine = value.toString();
    String[] pieces = parsedLine.split(":");
    StringTokenizer tokenizer = new StringTokenizer(pieces[1]);

    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
        outputCollector.collect(new Text(token), ONE);
    }
}
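
A note on the mapper above: splitting a line such as `File1:Word1: 4` on `:` gives three pieces (file, word, count). The code emits `pieces[1]` together with the constant `ONE`, so the count produced by the first job never reaches the reducer. The sketch below, in plain Java with no Hadoop dependencies, shows one way the line could instead be turned into a (file, "word:count") pair, so that all words of one file end up in the same reduce call. The names `RecordParser` and `toFileKeyedPair` are hypothetical, and the line format is assumed to be exactly the one shown in the question.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: parses one line of the first job's output.
final class RecordParser {

    // Assumes lines look like "File1:Word1: 4" (file, word, count separated by ':').
    static Map.Entry<String, String> toFileKeyedPair(String line) {
        String[] pieces = line.split(":");
        if (pieces.length != 3) {
            throw new IllegalArgumentException("Unexpected line: " + line);
        }
        String file = pieces[0].trim();
        String word = pieces[1].trim();
        // trim() drops the space (or tab) that precedes the count.
        int count = Integer.parseInt(pieces[2].trim());
        // Key by file so that every word of that file is grouped together.
        return new SimpleEntry<>(file, word + ":" + count);
    }
}
```

In an actual mapper the pair would presumably be emitted with something like `outputCollector.collect(new Text(file), new Text(word + ":" + count))`, which means the map output value type becomes `Text` rather than `IntWritable`.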

My reducer looks like this:

private int maximum = 0;

@Override
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> outputCollector, Reporter reporter)
        throws IOException {

    Text occuredKey = new Text();

    int total = 0;
    while (values.hasNext()) {
        total += values.next().get();
    }

    if (total > maximum) {
        maximum = total;
        occuredKey.set(key);
    }
    outputCollector.collect(occuredKey, new IntWritable(total));
}
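
Two things stand out in the reducer above. The field `maximum` lives on the reducer instance, so it carries over from one key to the next. And `occuredKey` is created fresh on every call, so whenever `total` does not exceed `maximum` an empty key is collected. If the map output is keyed by file, the per-file maximum can be found with a local variable inside a single reduce call. A minimal plain-Java sketch of that step follows. `MaxPicker` and `wordWithMaxCount` are hypothetical names, the values are assumed to be "word:count" strings, and ties are assumed to go to the first word seen.

```java
import java.util.Iterator;

// Hypothetical helper, not part of Hadoop: the per-file "pick the maximum" step.
final class MaxPicker {

    // values holds "word:count" strings that all belong to the same file.
    // Returns the word with the largest count; on a tie the first one seen wins.
    static String wordWithMaxCount(Iterator<String> values) {
        String bestWord = null;
        int bestCount = Integer.MIN_VALUE; // local, so nothing leaks between files
        while (values.hasNext()) {
            String[] parts = values.next().split(":");
            int count = Integer.parseInt(parts[1].trim());
            if (count > bestCount) {
                bestCount = count;
                bestWord = parts[0];
            }
        }
        return bestWord;
    }
}
```

Emitting (winning word, 1) once per file and then summing those ones per word, for instance in a follow-up job, would give the per-word totals.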

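I have tried several approaches:
Putting the keys (here Word1, Word2) in a map, which did not work.
Iterating in my map function and, if the word was found, putting it in a list, then comparing the sizes of the lists.
My understanding is that the output of the first job is the input of the second job, but that does not seem right, because I cannot access the counts from the first job.
Thanks for any help; I have been stuck on this for a while.
To clarify the output:
I have 60 files, and each file contains the same 5 words that I search for in my word count. So the output file of the first job has 60 x 5 records in total. The second job should take those 5 words and count, for each word, in how many files it has the highest count of the 5. So my output should be 5 records, and the totals of those 5 records should add up to 60.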
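
The expected result described above can be checked on small inputs with a plain-Java simulation of the two steps: first find the winning word per file, then count wins per word. This is only a sketch under stated assumptions and not a Hadoop job. `MaxWordTally` and `tallyWinners` are hypothetical names, the input lines are assumed to follow the `File:Word: count` format, and ties are assumed to go to the word that appears first.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java simulation of the two steps; no Hadoop classes are used.
final class MaxWordTally {

    // Input: lines such as "File1:Word1: 4". Output: word -> number of files it "won".
    static Map<String, Integer> tallyWinners(List<String> lines) {
        // Step 1: per file, remember the word with the highest count seen so far.
        Map<String, String> bestWord = new LinkedHashMap<>();
        Map<String, Integer> bestCount = new LinkedHashMap<>();
        for (String line : lines) {
            String[] p = line.split(":");
            String file = p[0].trim();
            String word = p[1].trim();
            int count = Integer.parseInt(p[2].trim());
            Integer current = bestCount.get(file);
            if (current == null || count > current) { // ties keep the earlier word
                bestCount.put(file, count);
                bestWord.put(file, word);
            }
        }
        // Step 2: count how many files each word won.
        Map<String, Integer> wins = new TreeMap<>();
        for (String word : bestWord.values()) {
            wins.merge(word, 1, Integer::sum);
        }
        return wins;
    }
}
```

For the four sample lines in the question, `tallyWinners` returns a map with Word1 = 1 and Word2 = 1, and the values add up to the number of files. A word that never has the highest count in any file does not appear in the result, so there can be fewer than five records unless zero entries are added explicitly.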

Java hadoop yarn apache

Source: https://stackoverflow.com/questions/54798963/how-to-get-the-maximum-word-count-in-hadoop