Pig join and average

vsaztqbk, posted on 2021-06-25 in Pig

I'm trying to teach myself Pig, and I have the following script:

customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int, customer_id:int, rating:int); 
item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);
item_join = join item_data by item_id, customer_ratings by i_id;
item_group = GROUP item_join ALL;
item_foreach = foreach item_group generate item_id, item_name, item_url,  AVG(item_join.rating);
PRINT = limit item_foreach 40;
dump PRINT;

The foreach fails with the following error:

Invalid field projection. Projected field [item_id] does not exist in schema: group:chararray,item_join:bag{:tuple(item_data::item_id:int,item_data::item_name:chararray,item_data::dummy:int,item_data::item_url:chararray,customer_ratings::i_id:int,customer_ratings::customer_id:int,customer_ratings::rating:int)}.

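I know there is something I'm not understanding from the tutorials that would make this work... Any idea how to get my foreach to print its output?
I also tried generate item_data::item_id, item_data::item_name, etc., as described in "Pig - how to reference columns in a foreach after a join?", but that didn't work either...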

aurhwmvo #1

customer_ratings = LOAD 'customer_ratings.txt' as (i_id:int,customer_id:int, rating:int); 

item_data = LOAD 'item_data.txt' USING PigStorage(',') as (item_id:int,item_name:chararray, dummy:int,item_url:chararray);

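-- Join, then keep only the needed fields under short names.
item_join = foreach (
             join item_data by item_id, 
             customer_ratings by i_id
             )
            generate 
             item_data::item_id as item_id, 
             item_data::item_name as item_name,
             item_data::item_url as item_url,
             customer_ratings::rating as rating
            ;

-- Group by every field that should appear next to the average.
item_group = GROUP item_join by (item_id, item_name, item_url);

item_foreach = foreach item_group generate 
                FLATTEN(group) as (item_id, item_name, item_url), 
                AVG(item_join.rating) as avg_rating
               ;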

PRINT = limit item_foreach 40;

dump PRINT;

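Something like this should work, I think, though I haven't tested it. I did two things. First, after the join I renamed the fields to simple names, so we don't have to carry around a bunch of fields named relation::fieldname.
Second, grouping by the key fields and flattening the group is an easier way to get the keys back out of the group. In your original example, I think you would need to use: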

generate item_join.item_data::item_id
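
To see why the original foreach could not find item_id, it may help to look at what GROUP ... ALL produces. The following is an untested sketch that reuses the load and join statements from the question; the aliases overall, item_ids and avg_rating are made up for illustration. After GROUP ... ALL the relation has only two top-level fields, group and the bag item_join, so every column has to be reached through the bag, and each such projection is itself a bag.

```pig
-- Sketch only: same load and join statements as in the question.
customer_ratings = LOAD 'customer_ratings.txt' AS (i_id:int, customer_id:int, rating:int);
item_data = LOAD 'item_data.txt' USING PigStorage(',') AS (item_id:int, item_name:chararray, dummy:int, item_url:chararray);
item_join = JOIN item_data BY item_id, customer_ratings BY i_id;

-- GROUP ... ALL puts every joined row into a single group.
item_group = GROUP item_join ALL;

-- Print the schema: only 'group' and the bag 'item_join' exist at the top level.
DESCRIBE item_group;

-- Columns are reached through the bag. Each projection is itself a bag,
-- so item_ids holds every item_id and AVG yields one overall average.
-- 'overall', 'item_ids' and 'avg_rating' are hypothetical names.
overall = FOREACH item_group GENERATE
            item_join.item_data::item_id AS item_ids,
            AVG(item_join.customer_ratings::rating) AS avg_rating;

DUMP overall;
```

Because everything lands in one group, this gives a single average over all ratings rather than one per item, which is why grouping by the item fields, as in the script above, is probably closer to what you want.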
