Pig UDF for loading a file with the multi-character ^^ (double caret) delimiter

93ze6v8z  posted on 2021-06-02 in Hadoop

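I'm new to Pig. Can someone help me load a file that uses multiple characters (in my case, '^^') as the column delimiter?
For example, I have a file with the following columns:
aisforapple^^bisforball^^cisforcat^^disfordoll^^andeisforelephant
fisforfish^^gisforgreen^^hisforhat^^iisforicecreem^^andjisforjar
kisforking^^lisforlion^^misformango^^nisfornose^^andoisfororange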

Regards

uqdfh47h #1

A regular expression is a good fit for this kind of multi-character delimiter.

input.txt
aisforapple^^bisforball^^cisforcat^^disfordoll^^andeisforelephant
fisforfish^^gisforgreen^^hisforhat^^iisforicecreem^^andjisforjar
kisforking^^lisforlion^^misformango^^nisfornose^^andoisfororange

PigScript
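-- Load each line of the file as a single chararray field
A = LOAD 'input.txt' AS (line:chararray);
-- Capture the five fields separated by ^^ and flatten the resulting tuple
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)\\^\\^(.*)\\^\\^(.*)\\^\\^(.*)\\^\\^(.*)')) AS (f1,f2,f3,f4,f5);
DUMP B;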

Output:
(aisforapple,bisforball,cisforcat,disfordoll,andeisforelephant)
(fisforfish,gisforgreen,hisforhat,iisforicecreem,andjisforjar)
(kisforking,lisforlion,misformango,nisfornose,andoisfororange)
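
The rows above can be cross-checked outside Pig by splitting each line on the literal delimiter. The following is only a minimal Python sketch, not part of the Pig script; the names DELIMITER and split_fields are made up for illustration.

```python
import re

DELIMITER = "^^"  # the two-character separator from the question


def split_fields(line):
    """Split one input line on the literal ^^ delimiter."""
    # re.escape turns "^^" into "\^\^" so each caret is matched literally
    # instead of being read as the start-of-line anchor.
    return re.split(re.escape(DELIMITER), line)


lines = [
    "aisforapple^^bisforball^^cisforcat^^disfordoll^^andeisforelephant",
    "fisforfish^^gisforgreen^^hisforhat^^iisforicecreem^^andjisforjar",
    "kisforking^^lisforlion^^misformango^^nisfornose^^andoisfororange",
]

for line in lines:
    print(tuple(split_fields(line)))
```

Unlike the fixed five-group pattern, splitting works for any number of fields per line, which may or may not be what you want.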

Explanation:

To make it easier to follow, here is the regex broken into its parts:
(.*)\\^\\^ -> matches any characters up to ^^ and stores them in f1 (^ is a regex metacharacter, so it is escaped with a backslash, and the backslash is doubled inside the Pig string literal)
(.*)\\^\\^ -> matches any characters up to the next ^^ and stores them in f2
(.*)\\^\\^ -> matches any characters up to the next ^^ and stores them in f3
(.*)\\^\\^ -> matches any characters up to the next ^^ and stores them in f4
(.*)       -> matches any remaining characters up to the end of the line and stores them in f5
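
To experiment with the pattern outside Pig, here is a rough Python sketch that assembles the same five-group regex from its parts. It assumes that Python's re module treats this particular pattern the same way as the Java regex engine Pig uses, and that REGEX_EXTRACT_ALL only yields a result when the whole line matches (mimicked here with fullmatch). The helper name extract_all is hypothetical.

```python
import re

FIELD = r"(.*)"   # one capture group: any run of characters
SEP = r"\^\^"     # literal ^^ (a single backslash here; Pig needs \\ in its string literal)

# Five groups joined by four separators, as in the Pig script
PATTERN = re.compile(SEP.join([FIELD] * 5))


def extract_all(line):
    """Return the five captured fields, or None if the line does not match."""
    match = PATTERN.fullmatch(line)
    return match.groups() if match else None


print(extract_all("aisforapple^^bisforball^^cisforcat^^disfordoll^^andeisforelephant"))
print(extract_all("a^^b"))                    # too few fields: no match
print(extract_all("a^^b^^c^^d^^e^^f"))        # too many fields: see note below
```

Because (.*) is greedy, a line with more than four delimiters still matches, with the extra delimiter absorbed into the first group. If the number of fields can vary, the fixed five-group pattern is probably not the right tool.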
