MapReduce canopy clustering centers

rks48beu · posted 2021-06-03 in Hadoop

I am trying to understand this code. The purpose of the two classes (one map, one reduce) is to find canopy centers. My problem is that I don't see the difference between the map and the reduce function; they are almost identical.
Is there a difference, or am I just repeating the same process in the reducer?
My guess is that the map and reduce functions handle the code differently: even though the code is similar, they perform different operations on the data.
So, could someone explain what the map step does, and what the reduce step does, when we are trying to find the canopy centers?
For example, I know the map might look like this: (Joe, 1) (Dave, 1) (Joe, 1) (Joe, 1)
And then the reduce would look like this: (Joe, 3) (Dave, 1)
Is the same kind of thing happening here?
Or am I performing the same task twice?
Thanks very much.
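For reference, the (Joe, 1) / (Joe, 3) example above is the classic word-count pattern, where map and reduce really do perform different work. A minimal sketch (hypothetical class names, standard Hadoop mapreduce API, each class in its own file) would be:

import java.io.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

//Emits (word, 1) for every word it sees, e.g. (Joe, 1) (Dave, 1) (Joe, 1) (Joe, 1)
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

//Sums the 1s per word, e.g. (Joe, 3) (Dave, 1) - genuinely different work from the mapper
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}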
The map function:

package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;

public class CanopyCentersMapper extends Mapper<LongWritable, Text, Text, Text> {
    //A list with the centers of the canopy
    private ArrayList<ArrayList<String>> canopyCenters;

@Override
public void setup(Context context) {
        this.canopyCenters = new ArrayList<ArrayList<String>>();
}

@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
    //Separate the stock name from the values, giving a stock key plus a list of values - what is the list of values?
    //What exactly are we splitting here?
    ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(","))); 

    //Remove the stock name so the list holds only the numeric values; this point may become a new canopy center
    String stockKey = stockData.remove(0);

    //Rejoin the remaining values into a comma-separated string for the output value
    String stockValue = StringUtils.join(",", stockData);

    //Check whether the stock is available for use as a new canopy center
    boolean isClose = false;    

    for (ArrayList<String> center : canopyCenters) {    //Run over the centers

    //I think...let's say at this point we have a few centers. Then we have our next point to check.
    //We have to compare that point with EVERY center already created. If the distance is larger than T1
    //for every one of them, then that point becomes a new center! But the more canopies we have, the better
    //the chance it falls within the radius of one of the existing canopies...

            //Measure the distance between this existing center and the current point
            if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                    //Center is too close
                    isClose = true;
                    break;
            }
    }

    //The point is not within T1 of any existing center, so it becomes a new canopy center
    if (!isClose) {
        //Not too close to anything: add the current point as a new center
        canopyCenters.add(stockData);

        //Prepare hadoop data for output
        Text outputKey = new Text();
        Text outputValue = new Text();

        outputKey.set(stockKey);
        outputValue.set(stockValue);

        //Output the stock key and values to reducer
        context.write(outputKey, outputValue);
    }
}

}
The reduce function:

package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

public class CanopyCentersReducer extends Reducer<Text, Text, Text, Text> {
    //The canopy centers list
    private ArrayList<ArrayList<String>> canopyCenters;

@Override
public void setup(Context context) {
        //Create a new list for the canopy centers
        this.canopyCenters = new ArrayList<ArrayList<String>>();
}

public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    for (Text value : values) {
        //Recover the stock key and the list of values from the incoming key/value pair
        String stockValue = value.toString();
        ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(stockValue.split(",")));
        String stockKey = key.toString();

        //Check whether the stock is available for use as a new canopy center
        boolean isClose = false;    
        for (ArrayList<String> center : canopyCenters) {    //Run over the centers
                //Measure the distance between this existing center and the current point
                if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                        //Center is too close
                        isClose = true;
                        break;
                }
        }

        //The point is not within T1 of any existing center, so it becomes a new canopy center
        if (!isClose) {
            //Not too close to anything: add the current point as a new center
            canopyCenters.add(stockData);

            //Prepare hadoop data for output
            Text outputKey = new Text();
            Text outputValue = new Text();

            outputKey.set(stockKey);
            outputValue.set(stockValue);

            //Output the stock key and values as the job's final output
            context.write(outputKey, outputValue);
        }

    }
}

}

**Edit: more code and explanation**

stockKey is the key value representing the stock (NASDAQ, etc.)
ClusterJob.measureDistance():

public static double measureDistance(ArrayList<String> origin, ArrayList<String> destination)
{
    double deltaSum = 0.0;
    //Run over all points in the origin vector and calculate the sum of the squared deltas
    for (int i = 0; i < origin.size(); i++) {
        if (destination.size() > i) //Only add to sum if there is a destination to compare to
        {
            deltaSum = deltaSum + Math.pow(Math.abs(Double.valueOf(origin.get(i)) - Double.valueOf(destination.get(i))),2);
        }
    }
    //Return the square root of the sum
    return Math.sqrt(deltaSum);
}

zzoitvuj 1#

OK, a direct reading of the code is: the mappers run over some (presumably random) subset of the data and generate canopy centers that are all at least T1 apart from one another. Those centers are emitted. The reducer then runs over all of the canopy centers belonging to each particular stock key (e.g. MSFT, GOOG, etc.) from all of the mappers, and makes sure that for each stock key value no two canopy centers are within T1 of one another (e.g. no two centers in GOOG are within T1 of one another, although a center in MSFT and a center in GOOG may be close together).
The goal of the code is unclear, and personally I think there must be a bug here. The reducer essentially solves the problem as if you were trying to generate centers for each stock key independently (i.e., computing canopy centers for all of GOOG's data points), while the mapper appears to solve the problem of generating centers for all stocks together. Put together like that, you get a contradiction, so neither problem is actually being solved.
If you want centers across all stock keys: then the map output must send everything to a single reducer. Set the map output key to a trivial value such as a NullWritable; the reducer will then perform the correct operation without modification.
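A minimal sketch of that first option, reusing the ClusterJob helpers from the question (the class name here is hypothetical):

package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;

//Same canopy logic as the original mapper, but every surviving point is
//emitted under a single NullWritable key, so one reducer sees them all
public class GlobalCanopyCentersMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    private ArrayList<ArrayList<String>> canopyCenters;

    @Override
    public void setup(Context context) {
        this.canopyCenters = new ArrayList<ArrayList<String>>();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(",")));
        stockData.remove(0); //Drop the stock name; only the numeric values are compared

        //Keep the point only if it is at least T1 away from every center seen so far
        for (ArrayList<String> center : canopyCenters) {
            if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                return; //Too close to an existing center
            }
        }
        canopyCenters.add(stockData);

        //One trivial key for everything, so a single reducer receives all candidate centers
        context.write(NullWritable.get(), value);
    }
}

The driver would then also call job.setMapOutputKeyClass(NullWritable.class), and the reducer would split the stock key back out of the value instead of reading it from the key.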
If you want centers for each stock key: then you need to change the mapper so that, effectively, there is a separate canopy list per stock key. Either keep a separate ArrayList for each stock key (preferred, since it will be faster), or simply change the distance metric so that points belonging to different stock keys are infinitely far apart (so they never interact).
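A sketch of that second option, keeping one canopy list per stock key inside the mapper (the class name and the canopiesByStock field are assumptions layered onto the question's code):

package nasdaq.hadoop;

import java.io.*;
import java.util.*;

import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.util.*;

//Variant of CanopyCentersMapper with one canopy list per stock key,
//so centers belonging to different stocks never interact
public class PerStockCanopyCentersMapper extends Mapper<LongWritable, Text, Text, Text> {
    private HashMap<String, ArrayList<ArrayList<String>>> canopiesByStock;

    @Override
    public void setup(Context context) {
        this.canopiesByStock = new HashMap<String, ArrayList<ArrayList<String>>>();
    }

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        ArrayList<String> stockData = new ArrayList<String>(Arrays.asList(value.toString().split(",")));
        String stockKey = stockData.remove(0);

        //Fetch (or create) the canopy list for this stock only
        ArrayList<ArrayList<String>> canopyCenters = canopiesByStock.get(stockKey);
        if (canopyCenters == null) {
            canopyCenters = new ArrayList<ArrayList<String>>();
            canopiesByStock.put(stockKey, canopyCenters);
        }

        for (ArrayList<String> center : canopyCenters) {
            if (ClusterJob.measureDistance(center, stockData) <= ClusterJob.T1) {
                return; //Within T1 of an existing center for this stock
            }
        }
        canopyCenters.add(stockData);
        context.write(new Text(stockKey), new Text(StringUtils.join(",", stockData)));
    }
}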
P.S. By the way, there are also a couple of unrelated problems with your distance metric. First, you convert the fields with Double.valueOf (effectively Double.parseDouble) but never catch NumberFormatException. Since you feed it stockData, which contains non-numeric strings like "GOOG" in the first field, the job will crash the moment you run it. Second, the distance metric ignores any fields with missing values, which is an incorrect implementation of an L2 (Pythagorean) distance metric. To see why, consider the string ",": it is at distance 0 from every other point, and if it is chosen as a canopy center, no other centers can be chosen. Instead of setting the delta of a missing dimension to zero, consider setting it to something reasonable, such as the population mean for that attribute, or (to play it safe) simply discarding that row from the data set for clustering purposes.
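Following that advice, a sketch of a stricter measureDistance for ClusterJob: it refuses to compare malformed or unequal-length rows instead of silently dropping dimensions. Returning Double.NaN is one possible convention (an assumption, not part of the original code); since NaN is never <= T1, the calling code must explicitly check Double.isNaN(...) and discard such rows rather than letting them become centers.

public static double measureDistance(ArrayList<String> origin, ArrayList<String> destination)
{
    //Vectors of different lengths have missing values; refuse to compare them
    //rather than silently dropping dimensions (which breaks the L2 metric)
    if (origin.size() != destination.size()) {
        return Double.NaN;
    }
    double deltaSum = 0.0;
    //Sum the squared deltas over all dimensions
    for (int i = 0; i < origin.size(); i++) {
        try {
            double delta = Double.parseDouble(origin.get(i)) - Double.parseDouble(destination.get(i));
            deltaSum += delta * delta;
        } catch (NumberFormatException e) {
            return Double.NaN; //Non-numeric field (e.g. a stray stock name): unusable row
        }
    }
    //Return the square root of the sum
    return Math.sqrt(deltaSum);
}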
