Execute a Hive query with IN clause parameters in parallel

Posted by ne5o7dgx on 2021-05-29 in Hadoop


I have a Hive query that looks like this:

select a.x as column from table1 a where a.y in (<long comma-separated list of parameters>)
union all
select b.x as column from table2 b where b.y in (<long comma-separated list of parameters>)

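I have set hive.exec.parallel to true, which gives me parallelism between the two queries joined by union all.
However, my IN clause has many comma-separated values, and each value appears to be fetched in one job before the next value is fetched. In effect, this runs sequentially.
Is there a Hive parameter that, when enabled, lets me fetch the data for the parameters in the IN clause in parallel?
Currently, my workaround is to use = multiple times instead of a single IN clause.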

hadoop Hive performance query-optimization hiveql

Source: https://stackoverflow.com/questions/48484391/execute-hive-query-with-in-clause-parameters-in-parallel

1 Answer


vfh0ocws1#

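For better parallelism, there is no need to read the same data many times in separate queries. Tune the parallelism of the mappers and reducers instead.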
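Before tuning anything, it may help to check how the IN list is actually planned. The sketch below prefixes the query with EXPLAIN. The literals 'v1', 'v2', 'v3' and the alias x_value are placeholders I made up for illustration; they stand in for the long parameter list and the output column name.

```sql
-- Hypothetical values ('v1', 'v2', 'v3') replace the long parameter list.
-- EXPLAIN prints the execution plan without running the query.
EXPLAIN
SELECT a.x AS x_value FROM table1 a WHERE a.y IN ('v1', 'v2', 'v3')
UNION ALL
SELECT b.x AS x_value FROM table2 b WHERE b.y IN ('v1', 'v2', 'v3');
```

In the plan, the IN list would normally show up as a single filter predicate applied during one scan of each table, rather than as one job per value. If so, the parallelism that matters is the number of map tasks scanning the data, which is what the settings in this answer adjust.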
First, enable PPD with vectorization, and use CBO and Tez:

SET hive.optimize.ppd=true;
SET hive.optimize.ppd.storage=true;
SET hive.vectorized.execution.enabled=true;
SET hive.vectorized.execution.reduce.enabled = true;
SET hive.cbo.enable=true;
set hive.stats.autogather=true;
set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.execution.engine=tez;
SET hive.stats.fetch.column.stats=true;
SET hive.tez.auto.reducer.parallelism=true;

Example settings for mappers on Tez:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set tez.grouping.max-size=32000000;
set tez.grouping.min-size=32000;

If you decide to run on MR instead of Tez, example settings for mappers:

set mapreduce.input.fileinputformat.split.minsize=32000; 
set mapreduce.input.fileinputformat.split.maxsize=32000000;

-- Example settings for reducers:

set hive.exec.reducers.bytes.per.reducer=32000000; --decrease this to increase the number of reducers, increase to reduce parallelism
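As a rough worked example of how this setting translates into a reducer count: Hive's estimate is approximately the input size divided by bytes per reducer, capped by hive.exec.reducers.max. The 3.2 GB input size and the cap of 200 below are assumptions chosen only to make the arithmetic visible, and with hive.tez.auto.reducer.parallelism enabled Tez may still adjust the count at run time.

```sql
-- Rough estimate (a sketch, not an exact formula for every engine/version):
--   reducers ~= min(hive.exec.reducers.max,
--                   ceil(input_bytes / hive.exec.reducers.bytes.per.reducer))
-- Assumed input reaching the reduce stage: 3,200,000,000 bytes
--   3,200,000,000 / 32,000,000 = 100 reducers, if the cap allows it
SET hive.exec.reducers.bytes.per.reducer=32000000;
-- Assumed cap for illustration only; choose a value that fits the cluster.
SET hive.exec.reducers.max=200;
```

Note that a plain filter plus union all, like the query in the question, is usually map-only, so reducer settings start to matter once an aggregation, join, or sort is added.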

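Play with these settings. The success criterion is more mappers/reducers, with your map and reduce stages running faster.
Read this article for a better understanding of how to tune Tez: https://community.hortonworks.com/articles/14309/demystify-tez-tuning-step-by-step.html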

Answered 2021-05-29
