How to handle skewed data while grouping in Pig

Posted by 3zwjbxry on 2021-05-29 in Hadoop


I am running a GROUP operation in which one reduce task takes a very long time. Below is a sample code snippet and a description of the problem:

inp = LOAD 'input' USING PigStorage('|') AS (f1, f2, f3, f4, f5);

grp_inp = GROUP inp BY (f1, f2) PARALLEL 300;

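Because the data is skewed, i.e. one key has far too many values, one reducer runs for 4 hours. All the remaining reduce tasks finish in about 1 minute.
What can I do to fix this? Are there any alternative approaches? Any help would be greatly appreciated. Thanks!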
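
To make the symptom concrete, here is a minimal Java sketch (standard library only, not Pig code) that simulates hash partitioning on the group key. The class name `HotKeyDemo`, the key strings, and the record counts are invented for illustration. It only shows that every record with the same key maps to the same reducer, so raising PARALLEL alone cannot split a single hot key.

```java
import java.util.ArrayList;
import java.util.List;

/** Simulates hash partitioning to show why one hot key overloads one reducer. */
class HotKeyDemo {

    /** Maps a key to a reducer index in [0, numReducers). */
    static int reducerFor(String key, int numReducers) {
        // Mask the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Counts how many records each reducer would receive. */
    static long[] loadPerReducer(List<String> keys, int numReducers) {
        long[] load = new long[numReducers];
        for (String key : keys) {
            load[reducerFor(key, numReducers)]++;
        }
        return load;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        // Hypothetical data: one hot key with 1,000,000 records ...
        for (int i = 0; i < 1_000_000; i++) {
            keys.add("hot|key");
        }
        // ... plus 30,000 distinct keys with one record each.
        for (int i = 0; i < 30_000; i++) {
            keys.add("k" + i + "|v");
        }

        long[] load = loadPerReducer(keys, 300);
        long max = 0;
        for (long n : load) {
            max = Math.max(max, n);
        }
        System.out.println("busiest reducer: " + max + " of " + keys.size() + " records");
    }
}
```

Whatever the reducer count, the busiest reducer receives at least all of the hot key's records, which matches the one slow reduce task described above.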

hadoop apache-pig

Source: https://stackoverflow.com/questions/38567471/how-to-handle-skewed-data-while-grouping-in-pig

1 Answer


7dl7o3gd1#

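You may want to check a few things:
1> Filter out records in which both f1 and f2 are null (if any).
2> If possible, try to make use of the Hadoop combiner by implementing the Algebraic interface:
https://www.safaribooksonline.com/library/view/programming-pig/9781449317881/ch10s02.html
3> Use a custom partitioner so that another key is used to distribute the data across reducers.
Here is the sample code I used to partition my skewed data after a join (the same can be used after a group as well):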

import java.util.logging.Level;
import java.util.logging.Logger;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.impl.io.NullableTuple;
import org.apache.pig.impl.io.PigNullableWritable;

public class KeyPartitioner extends Partitioner<PigNullableWritable, Writable> {

    /**
     * Here key contains the value of the current key used for partitioning, and the
     * Writable value contains all fields from your tuple. I partitioned on the field
     * at index 5 of my tuple because I knew its values are evenly distributed.
     */
    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        Tuple valueTuple = (Tuple) ((NullableTuple) value).getValueAsPigType();
        try {
            if (valueTuple != null && valueTuple.size() > 5 && valueTuple.get(5) != null) {
                int keyHash = Integer.parseInt(valueTuple.get(5).toString());
                return toPartition(keyHash, numPartitions);
            }
            // Fall back to the field at index 1, which exists only when size() > 1.
            if (valueTuple != null && valueTuple.size() > 1 && valueTuple.get(1) != null) {
                return toPartition(valueTuple.get(1).hashCode(), numPartitions);
            }
        } catch (NumberFormatException | ExecException ex) {
            Logger.getLogger(KeyPartitioner.class.getName()).log(Level.SEVERE, null, ex);
        }
        // Last resort: partition on the key itself.
        return toPartition(key.hashCode(), numPartitions);
    }

    // Masks the sign bit instead of calling Math.abs, which stays negative for Integer.MIN_VALUE.
    private static int toPartition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }
}
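
One consequence of partitioning on a field other than the group key is worth spelling out: records that share a group key can land on different reducers, so each reducer sees only part of that group and the partial results have to be merged afterwards. This is the same partial-then-final shape that a combiner relies on. The sketch below is plain Java with no Pig or Hadoop classes. `TwoStageCountDemo`, the two-column record layout, and the use of COUNT as the aggregate are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Two-stage count: spread a hot key by a secondary field, then merge the partial results. */
class TwoStageCountDemo {

    /** Non-negative partition index; the bit mask also handles Integer.MIN_VALUE. */
    static int partition(int hash, int numPartitions) {
        return (hash & Integer.MAX_VALUE) % numPartitions;
    }

    /**
     * Stage 1: route each record by its secondary field and count per group key
     * inside each partition. Each record is {groupKey, secondaryField}.
     */
    static List<Map<String, Long>> partialCounts(List<String[]> records, int numPartitions) {
        List<Map<String, Long>> partials = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partials.add(new HashMap<>());
        }
        for (String[] rec : records) {
            int p = partition(rec[1].hashCode(), numPartitions);
            partials.get(p).merge(rec[0], 1L, Long::sum);
        }
        return partials;
    }

    /** Stage 2: merge the partial counts so that each group key appears exactly once. */
    static Map<String, Long> mergeCounts(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        // Hypothetical data: one hot group key and many cold ones.
        for (int i = 0; i < 100_000; i++) {
            records.add(new String[] {"hot", String.valueOf(i)});
        }
        for (int i = 0; i < 1_000; i++) {
            records.add(new String[] {"cold" + i, String.valueOf(i)});
        }

        List<Map<String, Long>> partials = partialCounts(records, 8);
        for (int p = 0; p < partials.size(); p++) {
            long hot = partials.get(p).getOrDefault("hot", 0L);
            System.out.println("partition " + p + " holds " + hot + " hot records");
        }
        System.out.println("merged hot count: " + mergeCounts(partials).get("hot"));
    }
}
```

In Pig itself, a custom partitioner is attached with the PARTITION BY clause, for example `GROUP inp BY (f1, f2) PARTITION BY KeyPartitioner PARALLEL 300;` once the jar containing the class has been registered. Whether the partial groups then need a second GROUP or aggregation step depends on what is computed downstream, so treat the merge stage here as a pattern to adapt rather than a fixed recipe.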

Answered 2021-05-29
