Counting the number of transactions for part of a day in SQL (Hive)

Posted by odopli94 on 2022-11-23 in Hive


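I have two tables that I want to join, and I want to count the number of transactions that occur before certain points in time. For example, given the following two tables: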

CREATE TABLE table1 (ID INT, time TIME);

INSERT INTO table1 
VALUES
(1, '11:30:00'),
(1, '14:30:00'),
(1, '18:00:00')
;

CREATE TABLE table2 (ID INT, txn_time TIME, txn_val INT);

INSERT INTO table2
VALUES
(1, '10:45:13', 1),
(1, '10:50:52', 2),
(1, '11:01:20', 4),
(1, '14:32:12', 2),
(1, '16:43:20', 5),
(1, '19:22:02', 3)
;

I want to count the transactions in table2 that occur before each time in table1, cumulatively. So for the data above, I need the following table:

┌─────────────┬──────────────┬──────────────┐
│ ID          │        time  │    txn count │
├─────────────┼──────────────┼──────────────┤
│ 1           │     11:30:00 │            3 │
│ 1           │     14:30:00 │            3 │
│ 1           │     18:00:00 │            5 │
└─────────────┴──────────────┴──────────────┘

My current code is:

SELECT t1.ID,
t1.time,
sum(CASE WHEN t2.txn_time < t1.time THEN 1 END) over(PARTITION BY t1.time)
FROM table1 AS t1
LEFT JOIN table2 AS t2 on t1.ID = t2.ID
GROUP BY t1.ID, t1.time
ORDER BY t1.time

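But this gives the wrong numbers, so I am missing something. Do you know how to achieve this with PARTITION BY, or is there a simpler way? Eventually I will want to do this across several days, resetting the count at the start of each day. It is also doing a many-to-many join, which gets increasingly expensive for larger datasets, but I don't know how to prevent that.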

Hive

Source: https://stackoverflow.com/questions/74367519/counting-the-number-of-transactions-for-part-of-a-day-in-sql-hive

1 Answer


z9zf31ra1#

You can join and aggregate:

select t1.id, t1.time, count(t2.id) as txn_count
from table1 t1
left join table2 t2 
    on  t2.id = t1.id 
    and t2.txn_time <= t1.time
group by t1.id, t1.time
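
As a quick sanity check, the query above can be run against the sample data from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption made only so the check runs anywhere). The columns are declared as TEXT, which works here because 'HH:MM:SS' strings sort in chronological order. An ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# SQLite stands in for Hive here (an assumption, chosen so the check is self-contained).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INT, time TEXT);
INSERT INTO table1 VALUES (1, '11:30:00'), (1, '14:30:00'), (1, '18:00:00');
CREATE TABLE table2 (id INT, txn_time TEXT, txn_val INT);
INSERT INTO table2 VALUES
    (1, '10:45:13', 1), (1, '10:50:52', 2), (1, '11:01:20', 4),
    (1, '14:32:12', 2), (1, '16:43:20', 5), (1, '19:22:02', 3);
""")

# Same join-and-aggregate query as in the answer, plus an ORDER BY.
query = """
SELECT t1.id, t1.time, COUNT(t2.id) AS txn_count
FROM table1 t1
LEFT JOIN table2 t2
    ON  t2.id = t1.id
    AND t2.txn_time <= t1.time
GROUP BY t1.id, t1.time
ORDER BY t1.id, t1.time
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (1, '11:30:00', 3)
# (1, '14:30:00', 3)
# (1, '18:00:00', 5)
```

COUNT(t2.id) rather than COUNT(*) matters with the LEFT JOIN: a checkpoint with no earlier transactions still produces one row of NULLs, and counting t2.id reports it as 0 instead of 1.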

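This is not optimal, because the same rows of table2 are joined multiple times as time increases in table1. It may be more efficient to pre-compute the ranges in table1 and then use a window function: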

select t1.id, t1.time, 
    -- running total of the per-range counts gives the cumulative count
    sum(count(t2.id)) over(partition by t1.id order by t1.time) as txn_count 
from (
    -- lag_time is the previous time for the same id (null for the first row)
    select t1.*, lag(t1.time) over(partition by t1.id order by t1.time) lag_time
    from table1 t1
) t1
left join table2 t2 
    on  t2.id = t1.id
    and t2.txn_time <= t1.time 
    and (t2.txn_time > t1.lag_time or t1.lag_time is null)
group by t1.id, t1.time
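
The question also mentions running this across several days with the count reset each day. One possible extension of the range-based query is sketched below. It is not part of the original answer. It assumes a hypothetical `day` column on both tables, and the second day's rows are made-up illustration data. The per-range count is computed in a subquery instead of nesting `sum(count(...))`, which should be equivalent but is easier to port between engines. As before, Python's sqlite3 stands in for Hive.

```python
import sqlite3

# SQLite stands in for Hive; the `day` column and the 2022-11-02 rows are
# hypothetical additions used only to illustrate the daily reset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (id INT, day TEXT, time TEXT);
INSERT INTO table1 VALUES
    (1, '2022-11-01', '11:30:00'), (1, '2022-11-01', '14:30:00'),
    (1, '2022-11-01', '18:00:00'),
    (1, '2022-11-02', '12:00:00'), (1, '2022-11-02', '20:00:00');
CREATE TABLE table2 (id INT, day TEXT, txn_time TEXT);
INSERT INTO table2 VALUES
    (1, '2022-11-01', '10:45:13'), (1, '2022-11-01', '10:50:52'),
    (1, '2022-11-01', '11:01:20'), (1, '2022-11-01', '14:32:12'),
    (1, '2022-11-01', '16:43:20'), (1, '2022-11-01', '19:22:02'),
    (1, '2022-11-02', '09:00:00'), (1, '2022-11-02', '13:00:00');
""")

query = """
SELECT id, day, time,
       -- running total within each (id, day), so the count restarts daily
       SUM(bucket_count) OVER (PARTITION BY id, day ORDER BY time) AS txn_count
FROM (
    -- each transaction matches at most one range: (lag_time, time]
    SELECT t1.id AS id, t1.day AS day, t1.time AS time,
           COUNT(t2.id) AS bucket_count
    FROM (
        SELECT id, day, time,
               LAG(time) OVER (PARTITION BY id, day ORDER BY time) AS lag_time
        FROM table1
    ) t1
    LEFT JOIN table2 t2
        ON  t2.id = t1.id
        AND t2.day = t1.day
        AND t2.txn_time <= t1.time
        AND (t1.lag_time IS NULL OR t2.txn_time > t1.lag_time)
    GROUP BY t1.id, t1.day, t1.time
) buckets
ORDER BY id, day, time
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
# (1, '2022-11-01', '11:30:00', 3)
# (1, '2022-11-01', '14:30:00', 3)
# (1, '2022-11-01', '18:00:00', 5)
# (1, '2022-11-02', '12:00:00', 1)
# (1, '2022-11-02', '20:00:00', 2)
```

Adding `day` to both the window partitions and the join condition keeps each day's ranges separate, so the first range of every day starts with a null lag_time and the running total starts again from zero.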

Answered 2022-11-23
