Production Hadoop SQL query that takes a lot of time

Posted by mnowg1ta on 2021-05-29 in Hadoop

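Current state
We have a query that runs for more than 2 hours. When checking its progress, we see that it spends a lot of time during the join with table T5 and in the final stage of the query. Is there any way to simplify this query? I cannot use an aggregate function instead of rank(), because the ORDER BY used is somewhat complex.
What we have already tried
We converted the subqueries into CASE statements in the SELECT clause, which helped reduce the execution time, but not significantly. We also simplified the correlated queries for T3, T4 and T6.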

SELECT * FROM 
        (SELECT T2.f1, T2.f2 .... T5.f19, T5.f20, 
                   case when T1.trxn_id is null then T2.crt_ts
                        when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
                        when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts
                    end as crt_ts , 
                    row_number() over ( partition by T2.w_trxn_id,
                                            if(T1.trxn_id is null, 'NULL', T1.trxn_id)
                                            order by T2.business_effective_ts desc,
                                            case when T1.trxn_id is null then T2.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts
                                            when T1.trxn_id is not null and T5.acct_trxn_id is null then T2.crt_ts end desc
                                        ) as rnk
                FROM(SELECT * FROM T3 WHERE title_name = 'CAPTURE' and tr_dt IN (SELECT tr_dt FROM DT_LKP))
                T2
                LEFT JOIN (SELECT * FROM T6 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) 
                T1 ON T2.w_trxn_id = T1.w_trxn_id AND T2.business_effective_ts = T1.business_effective_ts
                LEFT JOIN (SELECT f1, f3. ... f20 FROM T4 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) 
                T5 ON T1.trxn_id = T5.acct_trxn_id
                WHERE if(T1.trxn_id is null, 'NULL', T1.trxn_id) = if(T5.acct_trxn_id is null, 'NULL', T5.acct_trxn_id)
        ) FNL WHERE rnk = 1

sql hadoop Hive query-optimization

Source: https://stackoverflow.com/questions/55595417/production-hadoop-query-that-takes-lot-of-time

1 answer

i7uaboj41#

I'm not sure whether this will help you. There is a rather strange WHERE clause:

WHERE if(T1.trxn_id is null, 'NULL', T1.trxn_id) = if(T5.acct_trxn_id is null, 'NULL', T5.acct_trxn_id)

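This is probably intended to join NULLs as well as normal values. If so, it does not work: the join condition is T5 ON T1.trxn_id = T5.acct_trxn_id, so NULLs are never joined, and the WHERE clause then acts as a filter after the join. If T5 is not joined, T5.acct_trxn_id is converted to the string 'NULL' in the WHERE clause and compared with a non-NULL T1.trxn_id value; the row is most likely filtered out, so the LEFT JOIN behaves like an INNER JOIN in this case. If T1.trxn_id is NULL (driving table), it is converted to the string 'NULL' and compared with what is always the string 'NULL' (because T5 is not joined anyway under the ON clause), so such rows pass the filter (I have not tested this, though). The logic looks strange; I think it either does not work as intended or effectively turns the join into an INNER JOIN. If you want to join everything including NULLs, move this WHERE condition into the JOIN ON clause.
Joining on NULLs replaced with the string 'NULL' will multiply rows if many rows have NULLs, and will cause duplicates.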
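As a rough, untested sketch of that last suggestion, the NULL-tolerant comparison would sit in the ON clause of the T5 join instead of in WHERE. It assumes trxn_id and acct_trxn_id are string columns, so that the 'NULL' placeholder is type-compatible, and the column list is shortened purely for illustration:

```sql
-- Sketch: compare with a 'NULL' placeholder in ON rather than filtering in WHERE.
-- Assumption: trxn_id / acct_trxn_id are STRING columns.
SELECT T2.w_trxn_id,
       T1.trxn_id,
       T5.acct_trxn_id
FROM (SELECT * FROM T3 WHERE title_name = 'CAPTURE' AND tr_dt IN (SELECT tr_dt FROM DT_LKP)) T2
LEFT JOIN (SELECT * FROM T6 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) T1
       ON T2.w_trxn_id = T1.w_trxn_id
      AND T2.business_effective_ts = T1.business_effective_ts
LEFT JOIN (SELECT * FROM T4 WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)) T5
       ON NVL(T1.trxn_id, 'NULL') = NVL(T5.acct_trxn_id, 'NULL');
```

With the condition in ON, every T2 row is kept, but each row whose T1.trxn_id is NULL is matched to every T5 row whose acct_trxn_id is NULL, which is where the row multiplication comes from.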
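In fact, when investigating poor JOIN performance, check two things:
The join keys are not duplicated, or the duplicates are expected
The join keys (and the PARTITION BY columns in row_number()) are not skewed; see https://stackoverflow.com/a/53333652/2700344 and also https://stackoverflow.com/a/51061613/2700344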
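Both checks can be approximated with simple counts per key. The queries below are a sketch using the table and column names from the question; the LIMIT of 20 is arbitrary, and the second query only looks at w_trxn_id before the joins, not at the full PARTITION BY expression:

```sql
-- Rows per join key on the T4 side of the T5 join.
-- Any cnt > 1 means duplicated keys; a few very large counts (often for NULL) indicate skew.
SELECT acct_trxn_id, COUNT(*) AS cnt
FROM T4
WHERE tr_dt IN (SELECT tr_dt FROM DT_LKP)
GROUP BY acct_trxn_id
HAVING COUNT(*) > 1
ORDER BY cnt DESC
LIMIT 20;

-- Rows per w_trxn_id, the first PARTITION BY column used in row_number().
SELECT w_trxn_id, COUNT(*) AS cnt
FROM T3
WHERE title_name = 'CAPTURE'
  AND tr_dt IN (SELECT tr_dt FROM DT_LKP)
GROUP BY w_trxn_id
ORDER BY cnt DESC
LIMIT 20;
```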
If everything looks fine, tune reducer parallelism: reduce hive.exec.reducers.bytes.per.reducer so that more reducers run
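For example, the setting can be changed per session before running the query. The values below are illustrative assumptions, not recommendations; a suitable number depends on the data volume and the cluster:

```sql
-- Illustrative value only (64 MB per reducer); a smaller value yields more reducers.
SET hive.exec.reducers.bytes.per.reducer=67108864;
-- Optional cap on the total number of reducers (value also illustrative).
SET hive.exec.reducers.max=1009;
```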
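Also reduce DT_LKP as much as possible, even if the dates you remove are ones you know are not (or should not be) in the fact table anyway; if possible, use a CTE with a filter for it.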
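A minimal sketch of such a CTE, shown for the T3 subquery only. The date bounds are hypothetical, and the sketch assumes tr_dt is a date or a 'yyyy-MM-dd' string so that the range comparison is meaningful:

```sql
-- Sketch: restrict the date lookup once, then reuse it.
-- The date bounds are hypothetical; use the range the fact tables can actually contain.
WITH dt AS (
    SELECT DISTINCT tr_dt
    FROM DT_LKP
    WHERE tr_dt >= '2019-01-01'   -- hypothetical lower bound
      AND tr_dt <  '2019-02-01'   -- hypothetical upper bound
)
SELECT T3.*
FROM T3
LEFT SEMI JOIN dt ON T3.tr_dt = dt.tr_dt
WHERE T3.title_name = 'CAPTURE';
```

The same dt CTE could then replace the repeated `SELECT tr_dt FROM DT_LKP` subqueries for T6 and T4.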
Also simplify the logic a bit (this will not improve performance, but it will simplify the code). The CASE in the SELECT:

when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts >= T5.crt_ts then T2.crt_ts
when T1.trxn_id is not null and T5.acct_trxn_id is not null and T2.crt_ts < T5.crt_ts then T5.crt_ts

<=>

else greatest(T2.crt_ts,T5.crt_ts)

If T5.crt_ts is NULL, the CASE statement returns NULL, and greatest() also returns NULL.
The CASE statement in row_number(), simplified:

case when (T1.trxn_id is null) or (T5.acct_trxn_id is null) then T2.crt_ts
     else greatest(T2.crt_ts,T5.crt_ts)
 end

Also: if(T1.trxn_id is null, 'NULL', T1.trxn_id) <=> NVL(T1.trxn_id,'NULL'). Of course, these are just suggestions; I have not tested them.
