从事件中提取会话

hwazgwia  于 2021-07-15  发布在  ClickHouse
关注(0)|答案(2)|浏览(384)

在clickhouse中,我有一个带有用户id和标记的事件列表。我的目标是从这些数据中提取会话。
会话是一组时间相近的事件。举个例子,如果一个事件比前一个事件晚了半个多小时,那么它就在另一个会话中。但是,如果每15分钟发生一次事件,则会话可以长达12小时。
我查看了timeslot函数的文档,它描述了一个与我类似的用例,但我不知道如何编写查询(https://clickhouse.yandex/docs/en/query_language/functions/date_time_functions/#timeslot)
例如:
活动:

date                | user  | tag
2018-12-21 00:00:00 │ user1 │ tag1
2018-12-21 00:00:00 │ user2 │ tag1
2018-12-21 00:15:00 │ user1 │ tag1
2018-12-21 00:15:00 │ user2 │ tag2
2018-12-21 00:30:00 │ user1 │ tag1
2018-12-21 00:45:00 │ user1 │ tag1
2018-12-21 01:45:00 │ user1 │ tag1

结果会话:

date                | date_end            | user  | tag  | count
2018-12-21 00:00:00 | 2018-12-21 00:45:00 | user1 | tag1 | 4
2018-12-21 00:00:00 | 2018-12-21 00:00:00 | user2 | tag1 | 1
2018-12-21 00:15:00 | 2018-12-21 00:15:00 | user2 | tag2 | 1
2018-12-21 01:45:00 | 2018-12-21 01:45:00 | user1 | tag1 | 1
ttygqcqt

ttygqcqt1#

此查询依赖于将日期四舍五入到半小时的默认时间段函数:

SELECT user, tag, eventCount, length(sessionStartDateArray) sessionCount, sessionStartDateArray
FROM
(
    SELECT
        user,
        tag,
        -- a count of events with rounded date (remove DISTINCT-clause from nested query to get a real count of events).
        count() as eventCount,
        -- an array of rounded dates
        groupArray(roundedDate) AS roundedDateArray,
        -- an array of rounded dates shifted to 30 minutes (where 30 min taken from timeSlot-function)
        arrayMap(i -> (i + 1800), roundedDateArray) AS shiftedRoundedDateArray,
        -- to intersect two arrays to find the dates when sessions start
        arrayFilter(x -> (has(shiftedRoundedDateArray, x) = 0), roundedDateArray) AS sessionStartDateArray
    FROM
    (
        SELECT DISTINCT
            user,
            tag,
            -- rounds down the time to the half hour.
            timeSlot(date) AS roundedDate
        FROM test01
    )
    GROUP BY user, tag
)
ORDER BY user, tag;
sigwle7e

sigwle7e2#

create table C (D DateTime',' user String',' tag String) Engine = Memory;

insert into C values
('2018-12-21 00:00:00','user1','tag1'),
('2018-12-21 00:00:00','user2','tag1'),
('2018-12-21 00:15:00','user1','tag1'),
('2018-12-21 00:15:00','user2','tag2'),
('2018-12-21 00:30:00','user1','tag1'),
('2018-12-21 00:45:00','user1','tag1'),
('2018-12-21 01:45:00','user1','tag1'),

SELECT user, tag,
       toDateTime(((arrayJoin(arraySplit((k, j) -> j, Arr, 
              arrayMap(i -> i > 1800, 
              arrayDifference(arraySort(groupArray(toUnixTimestamp(D))) as Arr))))) 
              as R)[1]) b, 
       toDateTime(R[-1]) e,
       length(R) c
from C    
group by user, tag

┌─user──┬─tag──┬───────────────────b─┬───────────────────e─┬─c─┐
│ user2 │ tag2 │ 2018-12-21 00:15:00 │ 2018-12-21 00:15:00 │ 1 │
│ user1 │ tag1 │ 2018-12-21 00:00:00 │ 2018-12-21 00:45:00 │ 4 │
│ user1 │ tag1 │ 2018-12-21 01:45:00 │ 2018-12-21 01:45:00 │ 1 │
│ user2 │ tag1 │ 2018-12-21 00:00:00 │ 2018-12-21 00:00:00 │ 1 │
└───────┴──────┴─────────────────────┴─────────────────────┴───┘

相关问题