生成子字符串的计数(apache pig)

92vpleto  于 2021-06-25  发布在  Pig
关注(0)|答案(1)|浏览(242)

我试图计算子串的特定部分。我的a是正确的,但是我和b在工作上有困难。我加入了实验室的评论来帮助解释某些代码。

data = LOAD '/dualcore/orders' AS (order_id:int,
         cust_id:int,
         order_dtm:chararray);

 /*
  * Include only records where the 'order_dtm' field matches
  * the regular expression pattern:
  *
  *   ^       = beginning of string
  *   2013    = literal value '2013'
  *   0[2345] = 0 followed by 2, 3, 4, or 5
  *   -       = a literal character '-'
  *   \\d{2}  = exactly two digits
  *   \\s     = a single whitespace character
  *   .*      = any number of any characters
  *   $       = end of string
  *
  * If you are not familiar with regular expressions and would
  * like to know more about them, see the Regular Expression 
  * Reference at the end of the Exercise Manual.
  */
 recent = FILTER data by order_dtm matches '^2013-0[2345]-\\d{2}\\s.*$';

 -- TODO (A): Create a new relation with just the order's year and month
 A = FOREACH data GENERATE SUBSTRING(order_dtm,0,7);

 -- TODO (B): Count the number of orders in each month
 B = FOREACH data GENERATE COUNT_STAR(A);

 -- TODO (C): Display the count by month to the screen.
 DUMP C;'
yks3o0rb

yks3o0rb1#

你可以用两种方法解决这个问题。
选项1:使用你提到的子字符串
输入

1       100     2013-02-15 test
2       100     2013-04-20 test1
1       101     2013-02-14 test2
1       101     2014-02-27 test3

Pig手稿:

data = LOAD 'input' AS (order_id:int,cust_id:int,order_dtm:chararray);
recent = FILTER data by order_dtm matches '^2013-0[2345]-\\d{2}\\s.*$';
A = FOREACH recent GENERATE order_id,cust_id,SUBSTRING(order_dtm,0,4) AS year,SUBSTRING(order_dtm,5,7) AS month;
B = GROUP A BY month;
C = FOREACH B GENERATE group AS month,FLATTEN(A.year) AS year,COUNT(A) AS cnt;
DUMP C;

输出:

(02,2013,2)
(02,2013,2)
(04,2013,1)

选项2:使用regex函数

data = LOAD 'input' AS(order_id:int,cust_id:int,order_dtm:chararray);
A = FOREACH data GENERATE order_id,cust_id,FLATTEN(REGEX_EXTRACT_ALL(order_dtm,'^(2013)-(0[2345])-\\d{2}\\s.*$')) AS (year,month);
B = FILTER A BY month IS NOT NULL;
C = GROUP B BY month;
D = FOREACH C GENERATE group AS month,FLATTEN(B.year) AS year,COUNT(B) AS cnt;
DUMP D;

输出:

(02,2013,2)
(02,2013,2)
(04,2013,1)

在这两种情况下,我都在最终输出中包含了年份,如果您不想,请删除 FLATTEN(year) 从剧本里。

相关问题