sql—提高查询性能

kt06eoxx  于 2021-05-29  发布在  Hadoop
关注(0)|答案(1)|浏览(351)

我的背景是oracle,但我们已经转到aws上的hadoop,我正在使用hivesql访问我们的日志。我被要求返回一个报告,其中任何给定类型的系统上的高严重性错误数在正常运行时间之前的30天滚动期间超过9(9,但我在示例中使用2来降低示例数据量)。我已经编写了代码来实现这一点,但我不太了解hive中的性能调优。我在甲骨文学到的很多东西似乎都不适用。
这能改进吗?
数据大致是

CREATE TABLE LOG_TABLE
(SYSTEM_ID  VARCHAR(1),
 EVENT_TYPE VARCHAR(2),
 EVENT_ID   VARCHAR(3),
 EVENT_DATE DATE,
 UPTIME INT);

INSERT INOT LOG_TABLE
VALUES
('1','A1','138','2018-10-29',34),
('1','A2','146','2018-11-13',49),
('1','A3','140','2018-11-02',38),
('1','B1','130','2018-10-13',18),
('1','B1','150','2018-11-19',55),
('1','B2','137','2018-10-27',32),
('2','A1','128','2018-10-11',59),
('2','A1','131','2018-10-16',64),
('2','A1','136','2018-10-25',73),
('2','A2','139','2018-10-31',79),
('2','A2','145','2018-11-11',90),
('2','A2','147','2018-11-14',93),
('2','A3','135','2018-10-24',72),
('2','B1','124','2018-10-03',51),
('2','B1','133','2018-10-19',67),
('2','B2','134','2018-10-22',70),
('2','B2','142','2018-11-06',85),
('2','B2','148','2018-11-15',94),
('2','B2','149','2018-11-17',96),
('3','A2','127','2018-10-10',122),
('3','A3','123','2018-10-01',113),
('3','A3','125','2018-10-06',118),
('3','A3','126','2018-10-07',119),
('3','A3','141','2018-11-05',148),
('3','A3','144','2018-11-10',153),
('3','B1','132','2018-10-18',130),
('3','B1','143','2018-11-08',151),
('3','B2','129','2018-10-12',124);

代码如下。我在日志表上执行一个自连接,以返回所有记录之间的间隔,并包括那些间隔为30天或更短的记录。然后,我选择那些有2个以上事件的事件进入第二个cte,并从中按系统和正常运行时间范围计算不同的事件类型和事件id

WITH EVENTGAP AS  
(SELECT T1.EVENT_TYPE,
       T1.SYSTEM_ID,
       T1.EVENT_ID,
       T2.EVENT_ID AS EVENT_ID2,
       T1.EVENT_DATE,
       T2.EVENT_DATE AS EVENT_DATE2,
       T1.UPTIME,
       DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) AS EVENT_GAP
FROM LOG_TABLE T1
  INNER JOIN LOG_TABLE T2
  ON (T1.EVENT_TYPE=T2.EVENT_TYPE
  AND T1.SYSTEM_ID=T2.SYSTEM_ID)
WHERE DATEDIFF(T2.EVENT_DATE,T1.EVENT_DATE) BETWEEN 0 AND 30
  AND T1.UPTIME BETWEEN 0 AND 299
  AND T2.UPTIME BETWEEN 0 AND 330),

 EVENTCOUNT
AS (SELECT EVENT_TYPE,
       SYSTEM_ID,
       EVENT_ID,
       EVENT_DATE,
       COUNT(1)
FROM EVENTGAP
GROUP BY EVENT_TYPE,
       SYSTEM_ID,
       EVENT_ID,
       EVENT_DATE
HAVING COUNT(1)>2)

SELECT EVENTGAP.SYSTEM_ID, 
       CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
        WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
        WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
        WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
        WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
        WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END AS UPTIME_BAND,
       COUNT(DISTINCT EVENTGAP.EVENT_ID2) AS EVENT_COUNT, 
       COUNT(DISTINCT EVENTGAP.EVENT_TYPE) AS TYPE_COUNT 
FROM EVENTGAP
WHERE EVENTGAP.EVENT_ID IN (SELECT DISTINCT EVENTCOUNT.EVENT_ID FROM EVENTCOUNT)
GROUP BY EVENTGAP.SYSTEM_ID,
      CASE WHEN FLOOR(UPTIME/50) = 0 THEN '0-49'
        WHEN FLOOR(UPTIME/50) = 1 THEN '50-99'
        WHEN FLOOR(UPTIME/50) = 2 THEN '100-149'
        WHEN FLOOR(UPTIME/50) = 3 THEN '150-199'
        WHEN FLOOR(UPTIME/50) = 4 THEN '200-249'
        WHEN FLOOR(UPTIME/50) = 5 THEN '250-299' END

这将产生以下结果,即在任何30天的滚动期间内发生3个或更多事件的事件ID和事件类型的唯一计数。有些事件可能在多个期间内发生,但只统计一次。

EVENTGAP.SYSTEM_ID  UPTIME_BAND EVENT_COUNT TYPE_COUNT
2   50-99   10  3
3   100-149 4   1
jei2mxaa

jei2mxaa1#

在hive和oracle中,您都希望使用window函数和windowframe子句来实现这一点。两个数据库的确切逻辑不同。
在Hive里你可以用 range between 如果你皈依 event_date 一个数字。一个典型的方法是从中减去一个固定值。另一种方法是使用unix时间戳:

select lt.*
from (select lt.*,
             count(*) over (partition by event_type
                            order by unix_timestamp(event_date)
                            range between 60*24*24*30 preceding and current row
                           ) as rolling_count
      from log_table lt
     ) lt
where rolling_count >= 2  -- or 9

相关问题