Word count using Hive

b91juud3  posted on 2021-06-28  in Hive

Suppose I have a table with the columns id and content:

id | content
________________________
1  | abc abr abc as abs
2  | abc arc cre arc
3  | agr ann agd agd agd

The output I want is a map of word counts per row, like this:

{"abc":2,"abr":1,"as":1, "abs":1}  # for id 1
{"abc":1,"arc":2,"cre":1}          # for id 2
{"agr":1,"agd":3,"ann":1}          # for id 3

How can I do this with Hive?

ohtdti5x  #1

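You will need the Brickhouse library, which is very easy to build. Its COLLECT UDAF, called with two arguments, aggregates key/value pairs into a map.
Query: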

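-- Register the Brickhouse JAR and its collect UDAF for this session
ADD JAR /path/to/jar/brickhouse-0.7.1.jar;
CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';

-- Outer query: fold the (word, count) pairs of each id into one map
SELECT id
  , COLLECT(words, c) AS count_map
FROM (
  -- Middle query: count occurrences of each word within each id
  SELECT id
    , words
    , COUNT(*) AS c
  FROM (
    -- Inner query: split content on spaces and emit one row per word
    SELECT id, words
    FROM db.tbl
    LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words ) x
  GROUP BY id, words ) y
GROUP BY id;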

Output:

+----+---------------------------------+
|id  |count_map                        |
+----+---------------------------------+
|1   |{"as":1,"abs":1,"abc":2,"abr":1} |
+----+---------------------------------+
|2   |{"cre":1,"arc":2,"abc":1}        |
+----+---------------------------------+
|3   |{"ann":1,"agr":1,"agd":3}        |
+----+---------------------------------+
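
If adding an external JAR is not an option, a similar result can probably be reached with Hive built-ins only. The query below is a minimal sketch, not part of the original answer. It assumes the same db.tbl(id, content) table, a Hive version that provides COLLECT_LIST (0.13 or later), and words that contain neither ',' nor ':' because those characters serve as delimiters. Unlike the Brickhouse version, STR_TO_MAP returns a map<string,string>, so the counts come back as strings rather than integers.

```sql
-- Sketch under the assumptions above: built-in functions only, no extra JAR.
SELECT id
  , STR_TO_MAP(
      -- Build "word:count" items, join them with ',', then parse into a map
      CONCAT_WS(',', COLLECT_LIST(CONCAT(words, ':', CAST(c AS STRING)))),
      ',', ':') AS count_map   -- map<string,string>: counts are strings here
FROM (
  -- One row per (id, word) with its number of occurrences
  SELECT id
    , words
    , COUNT(*) AS c
  FROM db.tbl
  LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words
  GROUP BY id, words ) y
GROUP BY id;
```

If integer counts are required downstream, the Brickhouse COLLECT approach above is the simpler choice, since it keeps the value type of the COUNT(*) column.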
