Merging rows in Pig

5ssjco0h  posted on 2021-06-25  in  Pig

I want to write a Pig script for the following query.
The input is:

AAA,,,
,BBB,,
,,,DDD
AAA,,,
,BBB,,
,,CCC,
,,,DDD
AAA,,,
,BBB,,
,,,DDD

The output should be:

AAA,BBB,,DDD
AAA,BBB,CCC,DDD
AAA,BBB,,DDD

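I tried the approach for merging two rows in Pig, but splitting the bag with split(3,$1) gives the wrong output, because my output has to merge the first three rows, then the next four rows, and then the next three rows.
The input may grow, but the key point is that the last row is always ,,,DDD.
Can anyone help me?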

erhoui1w  #1

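Your input data has to be split into groups of different lengths (3, 4, 3), so the BagSplit function will not work in this case. Can you try the approach below? The repeated part of relation E (TOTUPLE) could be factored out with macros, but that would cause more confusion, so I have not optimized it for now.
Input file: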

AAA,,,
,BBB,,
,,,DDD
AAA,,,
,BBB,,
,,CCC,
,,,DDD
AAA,,,
,BBB,,
,,,DDD

Pig script:

A = LOAD 'input.txt' USING PigStorage(',') AS (f1,f2,f3,f4);
B = RANK A;          /* prepends a row number column named rank_A */
C = GROUP B ALL;     /* put every row into a single bag */
D = FOREACH C  {
                 firstRecord  = FILTER B BY rank_A<=3;                /* first 3 rows */
                 secondRecord = FILTER B BY rank_A>3 AND rank_A<=7;   /* next 4 rows  */
                 thirdRecord  = FILTER B BY rank_A>7;                 /* last 3 rows  */
                 GENERATE firstRecord,secondRecord,thirdRecord;
                }

/* Convert each column of each split bag (firstRecord, secondRecord, thirdRecord) into a string,
   then strip the literal 'null' values and the '_' delimiters that BagToString inserts.
   The pattern is the alternation 'null|_'; a character class such as '[null|_]' would also
   delete any single n, u, l or | character from the data. */
E = FOREACH D GENERATE FLATTEN(TOBAG(
                                        TOTUPLE(REPLACE(BagToString(firstRecord.f1),'null|_',''),
                                                REPLACE(BagToString(firstRecord.f2),'null|_',''),
                                                REPLACE(BagToString(firstRecord.f3),'null|_',''),
                                                REPLACE(BagToString(firstRecord.f4),'null|_','')),
                                        TOTUPLE(REPLACE(BagToString(secondRecord.f1),'null|_',''),
                                                REPLACE(BagToString(secondRecord.f2),'null|_',''),
                                                REPLACE(BagToString(secondRecord.f3),'null|_',''),
                                                REPLACE(BagToString(secondRecord.f4),'null|_','')),
                                        TOTUPLE(REPLACE(BagToString(thirdRecord.f1),'null|_',''),
                                                REPLACE(BagToString(thirdRecord.f2),'null|_',''),
                                                REPLACE(BagToString(thirdRecord.f3),'null|_',''),
                                                REPLACE(BagToString(thirdRecord.f4),'null|_',''))
                                        )
                                 );
DUMP E;

Output:

(AAA,BBB,,DDD)
(AAA,BBB,CCC,DDD)
(AAA,BBB,,DDD)
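
The script above hard-codes the group boundaries (ranks 3 and 7), so it only fits this exact input. Since the question says the input may grow, the boundaries could instead be derived from the data. The following is a minimal sketch of that grouping logic, written in plain Python rather than Pig Latin purely for illustration. It assumes that every group ends with a row whose fourth field is non-empty (the ,,,DDD row) and that at most one row per group fills any given column. The function name merge_rows and its width parameter are hypothetical and not part of the original answer.

```python
def merge_rows(lines, width=4):
    """Merge consecutive sparse CSV rows into one row per group.

    Assumption: a row whose last field is non-empty closes the current group.
    """
    merged = []
    current = [""] * width
    pending = False  # True while the current group holds unflushed data

    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        fields = line.split(",")
        fields += [""] * (width - len(fields))  # pad short rows
        for i in range(width):
            if fields[i]:
                current[i] = fields[i]  # copy each non-empty cell into the group
        pending = True
        if fields[width - 1]:
            # Terminator row: emit the merged group and start a new one.
            merged.append(",".join(current))
            current = [""] * width
            pending = False

    if pending:
        # Input ended without a terminator row; keep the partial group.
        merged.append(",".join(current))
    return merged


sample = [
    "AAA,,,", ",BBB,,", ",,,DDD",
    "AAA,,,", ",BBB,,", ",,CCC,", ",,,DDD",
    "AAA,,,", ",BBB,,", ",,,DDD",
]

for row in merge_rows(sample):
    print(row)
```

The same idea (assign each row to the group closed by the next terminator row, then collapse each group column by column) would have to be expressed in Pig with a UDF or a streaming step, since the grouping depends on row order.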
