从分组查询中删除联接操作

6rqinv9w 于 2021-05-29 发布在 Hadoop

关注(0)|答案(1)|浏览(263)

我有一张table，看起来像：

usr_id  query_ts
12345   2019/05/13 02:06
123444  2019/05/15 04:06
123444  2019/05/16 05:06
12345   2019/05/16 02:06
12345   2019/05/15 02:06

它包含用户运行查询时的用户id。表中的每个条目表示以给定的时间戳运行1个查询的id。
我正在尝试制作：

usr_id  day_1   day_2   …   day_30
12345   31       13           15
123444  23       41           14

我想显示过去30天内每天为每个id运行的查询数，如果当天没有运行任何查询，则为0。
这是我提出的问题的一部分，

SELECT
t1.usr_id,
case when t1.count_day_1 is null then 0 else t1.count_day_1 end as day_1,
case when t2.count_day_2 is null then 0 else t2.count_day_2 end as day_2
FROM

(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_1,
        COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_1
        FROM db.table
        WHERE
            DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 1
        AND
            from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
                BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
                AND from_unixtime(unix_timestamp())
        GROUP BY usr_id, day_1) t1

LEFT JOIN
(SELECT usr_id, DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) as day_2,
        COUNT( DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd"))) as count_day_2
        FROM db.table
        WHERE
            DAY(from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")) = 2
        AND
            from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
                BETWEEN date_sub(from_unixtime(unix_timestamp()), 30)
                AND from_unixtime(unix_timestamp())
        GROUP BY usr_id, day_2) t2
ON (t1.usr_id = t2.usr_id)
ORDER BY t1.usr_id;

这非常有效，它显示了前两天每天运行的查询数，并用0替换空值。
问题是要让它在30天内正常工作，我必须使用30个左连接，它在集群上占用了~400gb+的内存。
有没有更简单的方法？

sql hadoop Hive query-optimization hiveql

来源：https://stackoverflow.com/questions/56450300/removing-join-operations-from-a-grouping-query

1条答案

按热度按时间

7xzttuei1#

尝试在没有连接的情况下执行此操作，并使用当前的\u日期或当前的\u时间戳常量，而不是unix的\u timestamp（），在其中，此函数是不确定的，并且它的值对于查询执行的范围不是固定的，因此会妨碍查询的正确优化-从2.0开始，人们就不赞成使用此函数，而是使用当前的\u时间戳常量：

select usr_id,
nvl(count(case when from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "dd") = 1 then 1 end),0) as day_1,
nvl(count(case when from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "dd") = 2 then 1 end),0) as day_2
...
from db.table
        WHERE
            from_unixtime(unix_timestamp(query_ts ,"yyyy/MM/dd"), "yyyy-MM-dd")
                BETWEEN date_sub(current_date, 30) AND current_date)
group by usr_id

赞(0）回复(0）举报 2021-05-29

我来回答

从分组查询中删除联接操作

1条答案

相关问题

热门标签

最新问答