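In my script, I read data from multiple files and use a regex and its complement to divide the records into two groups/classes. I expected two mutually exclusive classes that together cover the input, but that is not what I found when I counted the records. So I added a SPLIT to look for the "rest" of the records, the ones covered by neither my constraint nor its complement. The result is (again) not what I expected. What is wrong with my script? Thanks for your help!
Expected "math":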
input: 1464 records
outputs: 264 + 870 + ???_330__??
Script:
A = load 'input/*' using PigStorage('\t','-tagPath') as (src:chararray, content:chararray);
Ac = foreach (GROUP A all) generate COUNT(A);
B = filter A by content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)';
Bc = foreach (GROUP B all) generate COUNT(B);
Bnot = filter A by NOT content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)';
Bcnot = foreach (GROUP Bnot all) generate COUNT(Bnot);
SPLIT A INTO SET1 IF (content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)')
, SET2 IF (NOT content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)')
, SETn OTHERWISE;
STORE SET1 into 'output/set1';
STORE SET2 into 'output/set2';
STORE SETn into 'output/setn';
The result:
Input(s):
Successfully read 1464 records (49024 bytes) from: "hdfs://localhost:9000/user/dag/input/*"
Output(s):
Successfully stored 264 records (25276 bytes) in: "hdfs://localhost:9000/user/dag/output/set1"
Successfully stored 870 records (84190 bytes) in: "hdfs://localhost:9000/user/dag/output/set2"
Successfully stored 0 records in: "hdfs://localhost:9000/user/dag/output/setn"
1 Answer
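I assume that in the 330 missing cases content is null. If you replace the boolean expression for the complement with
content is null OR NOT content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)'
it should work. That said, I don't find this behavior very intuitive; I think Pig should throw a NullPointerException or at least log a warning.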
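A minimal sketch of how this could look in the script from the question, assuming the 330 missing records really do have a null content field (not run against the original data). The aliases Anull and Anullc are made up for illustration. The first part only checks the assumption by counting null records; the second part is the SPLIT with the explicit null test added to the complement.

```pig
A = load 'input/*' using PigStorage('\t','-tagPath') as (src:chararray, content:chararray);

-- Check the assumption first: count the records whose content is null.
-- Anull and Anullc are illustrative alias names, not part of the original script.
-- If there are no such records, the group is empty and DUMP prints nothing.
Anull = filter A by content is null;
Anullc = foreach (GROUP Anull all) generate COUNT_STAR(Anull);
DUMP Anullc;

-- MATCHES evaluates to null when content is null, and NOT null is still null,
-- so a null record satisfies neither condition unless null is tested explicitly.
SPLIT A INTO SET1 IF (content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)')
, SET2 IF ((content is null) OR (NOT (content MATCHES '(^\\b[BCDFMSTX].*\\b\\:\\s{1}.*)')));

STORE SET1 into 'output/set1';
STORE SET2 into 'output/set2';
```

If the assumption holds, the count should come out as 330 and set2 should grow from 870 to 1200 records, so that 264 + 1200 = 1464. If the nulls should stay separate from the real non-matches, a third branch such as SETnull IF (content is null) would probably be the cleaner choice than folding them into SET2.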