sql(配置单元)：在使用group by进行聚合时使用窗口函数

izkcnapc 于 2021-06-24 发布在 Hive

关注(0)|答案(2)|浏览(1082)

我在雅典娜（hive/presto）有下表：

CREATE EXTERNAL TABLE tmp (
    id STRING,
    updated_at TIMESTAMP,
    location STRING,
    direction STRING
)
LOCATION 's3://path';

我需要把这些数据汇总起来 id 字段，同时选择 location 以及 direction 与最新的 timestamp 在组内（分区再次打开 id ).
到目前为止，我提出了以下查询，首先使用窗口函数，然后使用分组方式：

SELECT
    b.id,
    MAX(b.latest_location) AS "latest_location",  -- It seems it is not possible to use first_value() on GROUP BY
    MAX(b.latest_direction) AS "latest_direction",
    COUNT(*) AS "total"
FROM (
    SELECT
        a.id,
        first_value(a.location) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_location",
        first_value(a.direction) OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS "latest_direction"
    FROM tmp a
) b
GROUP BY b.id;

我首先尝试了同时进行groupby聚合和window聚合，但是引擎似乎不允许这样做。有没有可能写一个更有效的查询（也许没有子查询）？

sql Hive presto hiveql amazon-athena

来源：https://stackoverflow.com/questions/63619460/sql-hive-using-window-functions-while-aggregating-with-group-by

2条答案

按热度按时间

iyr7buue1#

您可以混合使用窗口函数和聚合函数。但另一个方向是：先聚合，然后是窗口函数。
也就是说，如果消除聚合，查询速度应该会更快。只是使用 row_number() 和过滤：

SELECT a.id, a.location, a.updated_at
FROM (SELECT a.*,
             ROW_NUMBER() OVER (PARTITION BY a.id ORDER BY a.updated_at DESC) AS seqnum
      FROM tmp a
     ) a
WHERE seqnum = 1;

赞(0）回复(0）举报 2021-06-24

1wnzp6jl2#

SELECT DISTINCT
    id,
    first_value(a.location)  OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_location,
    first_value(a.direction) OVER (PARTITION BY id ORDER BY updated_at DESC) AS latest_direction,
    count(*) OVER (PARTITION BY id) as total
FROM tmp

在你最初的询问中， max 基本上是一个虚拟聚合，因为所有行都有相同的值。以及 group by 基本上是在做 distinct 在这里有。

赞(0）回复(0）举报 2021-06-24

我来回答

sql(配置单元)：在使用group by进行聚合时使用窗口函数

2条答案

相关问题

热门标签

最新问答