hive-通过跨组聚合值来创建Map列类型

hxzsmxv2  于 2021-05-29  发布在  Hadoop
关注(0)|答案(1)|浏览(361)

我有一张这样的table:

|customer|category|room|date|
-----------------------------
|1       |   A    | aa | d1 |
|1       |   A    | bb | d2 |
|1       |   B    | cc | d3 |
|1       |   C    | aa | d1 |
|1       |   C    | bb | d2 |
|2       |   A    | aa | d3 |
|2       |   A    | bb | d4 |
|2       |   C    | bb | d4 |
|2       |   C    | ee | d5 |
|3       |   D    | ee | d6 |

我想从表中创建两个Map:
1.Map\客户\房间\日期:将按客户分组,收集所有不同的房间(钥匙)和日期(价值)。
我用的是 collect() 自定义函数brickhouse函数。
可以使用类似的方式存档:

select customer, collect(room,date) as map_customer_room_date
from table
group by customer

第二,Map种类房间日期有点复杂,也由相同的Map类型组成 collect(room, date) 它将包含所有类别的所有房间作为钥匙,其中客户x是类别。这意味着,对于客户1来说,它需要空间 ee 即使它属于客户2。这是因为customer1具有类别 C 这一类也出现在客户2中。
最后一个表按客户分组,如下所示:

|customer| map_customer_room_date  |     map_category_room_date    |
-------------------------------------------------------------------|
|   1    |{aa: d1, bb: d2, cc: d3} |{aa: d1, bb: d2, cc: d3,ee: d6}|
|   2    |{aa: d3, bb: d4, ee: d6} |{aa: d3, bb: d4, ee: d6}       |
|   3    |{ee: d6}                 |{ee: d6}                       |

我在构建第二个Map和呈现所描述的最终表时遇到问题。你知道怎么做到吗?

f8rj6qna

f8rj6qna1#

这可以通过使用一系列自连接来完成,在将结果合并到2个Map之前,先查找同一类别中的其他房间。

代码

CREATE TABLE `table` AS
SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
;

SELECT
    customer_rooms.customer,
    collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
    collect(
        COALESCE(customer_category_rooms.room, category_rooms.room),
        COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
FROM `table` AS customer_rooms
JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room
WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)
OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)
GROUP BY
    customer_rooms.customer
;

结果集

1   {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
2   {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
3   {"ee":"d6"} {"ee":"d6"}

解释

FROM `table` AS customer_rooms

首先,从初始数据中得出结果 table . 我们把这种关系命名为 customer_rooms . 正如你在问题中已经指出的那样,这已经足够了 map_customer_room_date .

JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category

第一个自联接标识与中明确提到的房间具有相同类别的所有房间 customer_rooms 排。我们把这种关系命名为 category_rooms .

LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
AND category_rooms.category = customer_category_rooms.category
AND category_rooms.room = customer_category_rooms.room

第二个self-join使用我们在中标识的房间 category_rooms 并试图找出这个房间是否已经被 customer_rooms . 我们把这种关系命名为 customer_category_rooms . 这是一个 LEFT OUTER JOIN ,因为我们希望保留先前联接中的所有行。结果要么是1)来自 customer_rooms 以及 customer_category_rooms 是相同的,因为客户已经拥有这个房间,或者2)来自 customer_category_rooms 就这样了 NULL ,因为客户没有持有这个房间,但它是属于同一类别的房间。这一区别将变得重要,以便我们能够保护 date 如果他们已经占用了房间。
接下来,我们需要过滤。

WHERE (
    customer_rooms.customer = customer_category_rooms.customer AND
    customer_rooms.category = customer_category_rooms.category AND
    customer_rooms.room = customer_category_rooms.room AND
    customer_rooms.date = customer_category_rooms.date
)

这包括客户在原始文件中明确持有的房间 table .

OR (
    customer_category_rooms.customer IS NULL AND
    customer_category_rooms.category IS NULL AND
    customer_category_rooms.room IS NULL AND
    customer_category_rooms.date IS NULL
)

这包括不由客户持有但与客户持有的房间属于同一类别的房间。

collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
``` `map_customer_room_date` 可以通过从表中收集原始数据来构建,我们将其别名为 `customer_rooms` .

collect(
COALESCE(customer_category_rooms.room, category_rooms.room),
COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date

建筑物 `map_category_room_date` 更复杂。如果客户明确持有该房间,那么我们希望保留该房间 `date` . 但是,如果客户没有明确持有房间,那么我们希望能够使用 `room` 以及 `date` 从具有重叠类别的另一行。为此,我们使用hive coalesce函数来选择第一个不存在的值 `NULL` . 如果客户已经持有房间(如非- `NULL` 中的值 `customer_category_rooms` ),那么我们就用这个。如果不是,那么我们将使用 `category_rooms` 相反。
请注意,如果同一类别/房间组合可以Map到多个类别/房间组合,则仍可能存在一些模糊性 `date` 价值观。如果这是重要的,那么你可能需要投入更多的工作来选择正确的 `date` 基于某些业务规则(例如,使用最快的 `date` )或Map到多个 `date` 值而不是单个值。这应该给你一个很好的起点,但如果有这样的额外要求。

相关问题