Sorting after grouping

z5btuh9x  posted on 2021-06-25  in  Pig

I have a list like the following.

from    to  duration 
5       10  1
10      30  15
10      30  25
5       10  10
10      40  15
5       20  5

I need to find the most frequent from-to pairs, as shown below.

from    to  count 
10      30      2
5       10      2

I grouped the rows by (from, to), which gives me the following counts.

10  30  2
10  40  1
5   20  1
5   10  2

How do I extract only the pairs with the highest count?

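a = load 'x' using PigStorage() as (from, to, duration);
b = group a by (from, to);
-- count the rows in each (from, to) group
c = foreach b generate group, COUNT(a) as d;
e = group c all;
f = foreach e {
    g = order c by d desc;
    -- limit 1 keeps a single pair, even when several pairs tie for the maximum
    h = limit g 1;
    generate group, h; };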
a7qyws3x  #1

The approach above will certainly work. I thought of writing the logic this way instead, although the code here is longer.

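A = LOAD 'input.txt' AS (from,to,duration);
B = GROUP A BY (from,to);
-- one row per (from, to) pair with its number of occurrences
C = FOREACH B GENERATE FLATTEN(group) AS(from,to),COUNT(A.from) as count;
-- E holds a single row carrying the highest count
D = ORDER C BY count DESC;
E = LIMIT D 1;
-- joining on the count brings back every pair that ties for the maximum
F = JOIN C by count,E BY $2;
G = FOREACH F GENERATE $0,$1,$2;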

Please try it and see whether it is useful to you.
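A shorter variant of the same idea, offered only as a sketch: instead of ORDER, LIMIT and JOIN, compute the highest count once with GROUP ... ALL and MAX, then filter against it. This assumes your Pig version supports casting a single-row relation to a scalar (the `max_cnt.maxval` expression below). The aliases `pairs`, `all_pairs`, `max_cnt`, `maxval` and `result` are illustrative names, not taken from the answers here.

```pig
-- Sketch under the assumptions stated above; alias names are illustrative.
A = LOAD 'input.txt' AS (from, to, duration);
B = GROUP A BY (from, to);
pairs = FOREACH B GENERATE FLATTEN(group) AS (from, to), COUNT(A) AS cnt;
-- Put every pair into one group so MAX sees all the counts at once.
all_pairs = GROUP pairs ALL;
max_cnt = FOREACH all_pairs GENERATE MAX(pairs.cnt) AS maxval;
-- max_cnt has exactly one row, so max_cnt.maxval can be used as a scalar.
result = FILTER pairs BY cnt == max_cnt.maxval;
DUMP result;
```

Because the filter compares against the maximum rather than taking the first sorted row, every pair that ties for the highest count should be kept.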

oipij1gg  #2

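Can you try this and let me know whether it works for you?
Update:
If you don't have the RANK operator, download piggybank.jar, put it on the classpath, and then try the approach below.
Input file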

5       10      1
10      30      15
10      30      25
5       10      10
10      40      15
5       20      5

Pig script: Pig version < 0.11

REGISTER /tmp/piggybank.jar;

DEFINE MyOver org.apache.pig.piggybank.evaluation.Over('myrank:int');
DEFINE MyStitch org.apache.pig.piggybank.evaluation.Stitch;

A = LOAD 'input.txt' AS (from,to,duration);
B = GROUP A BY (from,to);
-- count the rows in each (from, to) group
C = FOREACH B {
        mycount = COUNT($1);
        GENERATE group, mycount AS cnt;
    };
D = GROUP C ALL;
-- sort by count, then attach a dense rank to every row
E = FOREACH D {
        mysort = ORDER C BY cnt DESC;
        GENERATE FLATTEN(MyStitch(mysort,MyOver(mysort,'dense_rank',0,1,1)));
    };
-- rank 1 covers every pair that ties for the highest count
F = FILTER E BY stitched::myrank==1;
G = FOREACH F GENERATE FLATTEN(stitched::group),stitched::cnt;
DUMP G;

Output:

(5,10,2)
(10,30,2)

Pig script: Pig version >= 0.11 (supports the RANK operator)

A = LOAD 'input.txt' AS (from,to,duration);
B = GROUP A BY (from,to);
-- count the rows in each (from, to) group
C = FOREACH B {
        mycount = COUNT($1);
        GENERATE group, mycount AS cnt;
    };
-- RANK gives tied counts the same rank, so every top pair gets rank 1
D = RANK C BY cnt DESC;
E = FILTER D BY rank_C==1;
F = FOREACH E GENERATE FLATTEN(group),cnt;
DUMP F;

Output:

(5,10,2)
(10,30,2)
