建议使用hive或pig的最佳方式

p4rjhz4m  于 2021-06-26  发布在  Hive
关注(0)|答案(2)|浏览(284)

问题陈述假设有一个日志文本文件。下面是文件中的字段。
日志文件

userID
productID
action

其中一项行动是——

Browse, Click, AddToCart, Purchase, LogOut

选择执行addtocart操作但未执行购买操作的用户。

('1001','101','201','Browse'),
('1002','102','202','Click'),
('1001','101','201','AddToCart'),
('1001','101','201','Purchase'),
('1002','102','202','AddToCart')

有人建议用Hive或Pig来获得这些信息吗

5jdjgkvh

5jdjgkvh1#

这可以使用sum()或analytic sum()实现,具体取决于单个表扫描中的确切要求。如果用户向购物车添加了两个产品,但只购买了一个呢?
对于用户+产品:

select userID, productID 
  from
(
select 
       userID,
       productID,
       sum(case when action='AddToCart' then 1 else 0 end) addToCart_cnt,
       sum(case when action='Purchase' then 1 else 0 end)  Purchase_cnt
  from table
  group by userID, productID
)s
where addToCart_cnt>0 and Purchase_cnt=0
olhwl3o2

olhwl3o22#

配置单元:使用 not in ```
select * from table
where action='AddtoCart' and
userID not in (select distinct userID from table where action='Purchase')

pig:使用action过滤id并执行左连接,检查id是否为null

A = LOAD '\path\file.txt' USING PigStorage(',') AS (userID:int,b:int,c:int,action:chararray) -- Note I am assuming the first 3 columns are int.You will have to figure out the loading without the quotes.
B = FILTER A BY (action='AddToCart');
C = FILTER A BY (action='Purchase');
D = JOIN B BY userID LEFT OUTER,C BY userID;
E = FILTER D BY C.userID is null;
DUMP E;

相关问题