Re-writing a join query

Posted by chhkpiq4 on 2021-06-01 in Hadoop

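I have a question about Hive. Let me explain the situation:
I am running Hive from Oozie; I have a query that performs successive LEFT JOINs on different tables;
The total number of rows to be inserted is about 35 million;
At first the job was crashing because it ran out of memory, so I set "set hive.auto.convert.join=false". The query now runs fine, but it takes 4 hours;
I tried rewriting the order of the LEFT JOINs, putting the big table last, but the result is the same: about 4 hours to execute;
Here is the query: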

INSERT OVERWRITE TABLE final_table
SELECT 
T1.Id,
T1.some_field_name,
T1.another_filed_name,

T2.also_another_filed_name,

FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id ) -- T2 is the smallest table
LEFT JOIN table3 T3 ON ( T3.Id = T1.Id )
LEFT JOIN table4 T4 ON ( T4.Id = T1.Id ) -- T4 is the biggest table

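So, knowing the structure of the query, is there a way to rewrite it so that I can avoid so many joins?
Thanks in advance
PS: Even vectorization gave me the same execution time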

hadoop Hive left-join query-optimization

Source: https://stackoverflow.com/questions/44197839/re-writing-a-join-query

1 Answer

insrf1ej1#

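Too long for a comment; will be deleted later.
(1) Your current query does not compile.
(2) You are not selecting anything from T3 and T4, which makes no sense.
(3) Changing the order of the tables is unlikely to have any effect with a cost-based optimizer.
(4) Basically, I would recommend collecting statistics on the tables, specifically on the id columns, but in your case I have a feeling that id is not unique in more than one table.
Add the result of the following query to your post: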
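As a rough sketch of what point (4) could look like in HiveQL: the statements below gather table-level and column-level statistics for the join key and then ask the optimizer to use them. The table names and the id column are taken from the question; whether these settings are already enabled, and whether the tables are partitioned (older Hive versions may require a PARTITION clause in ANALYZE), are assumptions to check against your own cluster.

```sql
-- Assumption: table1..table4 and the column id are the names used in the question.

-- Table-level statistics (row counts, data size).
ANALYZE TABLE table1 COMPUTE STATISTICS;
ANALYZE TABLE table2 COMPUTE STATISTICS;
ANALYZE TABLE table3 COMPUTE STATISTICS;
ANALYZE TABLE table4 COMPUTE STATISTICS;

-- Column-level statistics for the join key (distinct values, nulls, ...).
ANALYZE TABLE table1 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table2 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table3 COMPUTE STATISTICS FOR COLUMNS id;
ANALYZE TABLE table4 COMPUTE STATISTICS FOR COLUMNS id;

-- Let the cost-based optimizer read those statistics when planning the joins.
SET hive.cbo.enable=true;
SET hive.stats.fetch.column.stats=true;
```

Statistics only help the planner choose a join strategy; they will not fix a join that multiplies rows because id is duplicated.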

select      *
           ,    case when cnt_1 = 0 then 1 else cnt_1 end
            *   case when cnt_2 = 0 then 1 else cnt_2 end
            *   case when cnt_3 = 0 then 1 else cnt_3 end
            *   case when cnt_4 = 0 then 1 else cnt_4 end   as product

from       (select      id
                       ,count(case when tab = 1 then 1 end) as cnt_1
                       ,count(case when tab = 2 then 1 end) as cnt_2
                       ,count(case when tab = 3 then 1 end) as cnt_3
                       ,count(case when tab = 4 then 1 end) as cnt_4

            from       (            select 1 as tab,id from table1
                        union all   select 2 as tab,id from table2  
                        union all   select 3 as tab,id from table3
                        union all   select 4 as tab,id from table4 
                        ) t

            group by    id

            having      greatest (cnt_1,cnt_2,cnt_3,cnt_4) >= 10
            ) t 

order by    product desc

limit       10
;
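
For illustration only, here is what points (1) and (2) would imply for the original statement if the diagnostic query above shows that id is unique in table3 and table4. The trailing comma before FROM is removed so the statement parses, and the joins whose columns are never selected are dropped. This is a sketch under that uniqueness assumption, not a drop-in replacement: if id repeats in table3 or table4, the original LEFT JOINs multiply the rows of table1, and removing them changes the result. It also assumes final_table has exactly these four columns.

```sql
-- Assumption: id matches at most one row in table3 and in table4,
-- so the two dropped LEFT JOINs neither added nor removed rows.
INSERT OVERWRITE TABLE final_table
SELECT
    T1.Id,
    T1.some_field_name,
    T1.another_filed_name,
    T2.also_another_filed_name      -- no trailing comma before FROM
FROM table1 T1
LEFT JOIN table2 T2 ON ( T2.Id = T1.Id );  -- T2 is the smallest table
```

If columns from T3 or T4 are actually needed, keep those joins and add the columns to the SELECT list instead.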

Posted 2021-06-01