How to join maps in Apache Pig (stored in HBase)

Posted by rdlzhqv9 on 2021-06-24 in Pig


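I have a problem with Apache Pig and don't know how to solve it, or whether it is even possible. I am using HBase as the "storage layer". The table looks like this: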

row key/column  (b1, c1)        (b2, c2)    ...     (bn, cn)
a1              empty           empty               empty   
a2              ...
an              ...

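The row keys run from a1 to an, and each row has a different set of columns whose names have the form (bn, cn). The value of every cell is empty.
My Pig program looks like this: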

/* Loading the data */
mydata = load 'hbase://mytable' ... as (a:chararray, b_c:map[]);

/* finding the right elements */ 
sub1 = FILTER mydata BY a == 'a1';
sub2 = FILTER mydata BY a == 'a2';

Now I want to join sub1 and sub2, meaning I want to find the columns that exist in both sub1 and sub2. How can I do that?

hbase apache-pig

Source: https://stackoverflow.com/questions/18062614/how-to-join-maps-in-apache-pig-stored-in-hbase

1 answer


3bygqnnd1#

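Maps can't be used for this kind of operation in pure Pig, so you will need a UDF. I'm not sure what you want as the output of the join, but it should be fairly easy to adapt this Python UDF to your needs.
myudf.py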

@outputSchema('cols: {(col:chararray)}')
def join_maps(M1, M2):
    # This literally returns all column names that exist in both maps.
    out = []
    for k,v in M1.iteritems():
        if k in M2 and v is not None and M2[k] is not None:
            out.append(k)
    return out

You can use it like this:

register 'myudf.py' using jython as myudf ;

-- We can call sub2 from in sub1 since it only has one row

D = FOREACH sub1 GENERATE myudf.join_maps(b_c, sub2.b_c) ;
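
Since the table in the question stores empty values, it may be enough to compare only the map keys. Below is a minimal sketch of such a key-only variant that can also be run outside Pig for a quick check. The name common_columns, the sample rows, and the local outputSchema stand-in are assumptions made for illustration; how empty HBase cells actually arrive in the UDF (empty string versus null) is not verified here.

```python
# Stand-in for the decorator that Pig provides to Jython UDF scripts,
# defined here only so the sketch can run outside Pig (assumption:
# drop this definition when registering the file in Pig).
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap


@outputSchema('cols: {(col:chararray)}')
def common_columns(m1, m2):
    # Hypothetical key-only variant: values are ignored because the
    # cells in the question's table are empty.
    if m1 is None or m2 is None:
        return []
    # Each element is a one-field tuple to match the declared bag schema.
    return [(k,) for k in sorted(m1) if k in m2]


# Made-up sample rows standing in for the b_c maps of a1 and a2.
row_a1 = {'b1,c1': '', 'b2,c2': '', 'b3,c3': ''}
row_a2 = {'b2,c2': '', 'b3,c3': '', 'b4,c4': ''}
print(common_columns(row_a1, row_a2))  # [('b2,c2',), ('b3,c3',)]
```

Iterating over the dict directly instead of calling iteritems() keeps the function usable under both Jython 2 and a local Python 3 interpreter.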

Answered 2021-06-24
