hadoop: Performing effective joins on many tables in Hive

Posted by sulc1iza on 2021-06-02 in Hadoop


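I am joining about 14 tables in Hive 1.2 to create a base table. Each table has millions of records, and these are the parameters I use when running the query: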

hive.exec.dynamic.partition=true;  
hive.exec.max.dynamic.partitions.pernode=200000;  
hive.exec.max.dynamic.partitions=200000;  
hive.exec.max.created.files=250000;  
hive.enforce.bucketing=true;  
hive.auto.convert.join=false;  
mapreduce.map.memory.mb=8192;  
mapreduce.reduce.memory.mb=8192;  
mapred.reduce.child.java.opts=-Xmx8096m;  
mapred.map.child.java.opts=-Xmx8096m;  
hive.exec.dynamic.partition.mode=nonstrict;

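I am using the ORC file format, bucketing the tables on id, and partitioning them by year, quarter, and month. The query obviously does a lot of computation for the joins. Please let me know of any other parameters, or a different strategy, that I could use to make the joins more efficient.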

hadoop Hive Join optimization query-optimization

Source: https://stackoverflow.com/questions/37870487/performing-effective-joins-on-many-tables-in-hive

1 Answer


xxhby3vn1#

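You can also look at the size of the files and file blocks in your source tables. Each join is essentially performed once per file block, so increasing the file/block size means fewer joins need to be done. On the other hand, larger files/blocks mean less parallelization, so it takes some testing to find the right balance. You can adjust the block size by merging small files with the settings below. These settings will also produce one block per file, which is ideal for performance in most cases.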

-- config settings to be added to the DML that loads your source tables
-- these will merge the files into 500MB files with only one block per file
-- as long as the block size is set higher than the file size then only one block will be produced
set hive.merge.smallfiles.avgsize = 524288000;
set dfs.block.size = 1073741824;
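
A slightly fuller sketch of the same idea follows, under stated assumptions. As far as I understand the Hive configuration properties, `hive.merge.smallfiles.avgsize` is the average-file-size threshold below which Hive launches the extra merge job, `hive.merge.size.per.task` is the target size of the merged files, and `hive.merge.mapfiles` / `hive.merge.mapredfiles` switch merging on for map-only and map-reduce jobs respectively. The table name `src_table` and the partition values are hypothetical placeholders, not taken from the question. The tables in the question are bucketed, so whether the merge stage runs for them is worth confirming on your Hive version.

```sql
-- Sketch only: src_table and the partition values below are hypothetical.
-- Enable the post-job merge for both map-only and map-reduce jobs.
set hive.merge.mapfiles = true;
set hive.merge.mapredfiles = true;
-- Trigger a merge when the average output file is smaller than 500MB ...
set hive.merge.smallfiles.avgsize = 524288000;
-- ... and aim for merged files of roughly 500MB each.
set hive.merge.size.per.task = 524288000;

-- After the load has run, refresh the basic statistics and inspect
-- numFiles and totalSize to see whether the merge had the intended effect.
ANALYZE TABLE src_table PARTITION (year, quarter, month) COMPUTE STATISTICS;
DESCRIBE FORMATTED src_table PARTITION (year = 2016, quarter = 2, month = 6);
```

Dividing `totalSize` by `numFiles` in the `DESCRIBE FORMATTED` output gives a rough average file size to compare against the 500MB target before and after changing the settings.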
