How to tune Hive to query metadata?

Posted by yruzcnhs on 2021-06-02 in Hadoop

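When I run a Hive query on a table's partition column, I want to make sure Hive does not perform a full table scan and instead derives the result from the metadata alone. Is there a way to do this?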

SELECT max(partitioned_col) FROM hive_table;

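Currently, when I run this query, it launches map-reduce tasks, so I am sure it is scanning the data, even though it could just as well find the value from the metadata itself.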

hadoop Hive hdfs performance tez

Source: https://stackoverflow.com/questions/41947751/how-to-tune-hive-to-query-metadata

1 Answer

xzv2uavs1#

Compute table statistics every time the data changes:

ANALYZE TABLE hive_table PARTITION(partitioned_col) COMPUTE STATISTICS FOR COLUMNS;

Enable the cost-based optimizer (CBO) and automatic statistics gathering:

set hive.cbo.enable=true;
set hive.stats.autogather=true;

Use these settings so that the CBO and query planning make use of the statistics:

set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=true;
set hive.stats.fetch.column.stats=true;

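If none of this helps, I can suggest the following approach for quickly finding the last partition: parse the maximum partition key from the table location with a shell script. The command below lists all paths under the table folder, sorts them in reverse order, takes the first (latest) one, cuts out the partition subfolder name, and extracts the value after the "=" sign. All you need to do is initialize the TABLE_DIR variable and fill in the position of the partition subfolder within the path: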

last_partition=$(hadoop fs -ls $TABLE_DIR/* | awk '{ print $8 }' | sort -r | head -n1 | cut -d / -f [number of partition subfolder in the path here] | cut -d = -f 2)
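If counting the path field by hand is inconvenient, the same idea can be sketched by matching the partition folder by name instead. This is only an illustration, not part of the original answer: the function name max_partition_value, the sample paths, and the assumption that partition values sort correctly as plain strings (for example ISO dates) are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads paths on stdin and prints the largest value
# found for the partition key given as $1 (e.g. ".../key=value/...").
# Assumes the key contains no regex metacharacters and that values
# compare correctly as plain strings.
max_partition_value() {
  key="$1"
  grep -o "/${key}=[^/]*" |   # keep only the "/key=value" path component
    cut -d = -f 2- |          # drop everything up to the first "="
    sort -r |                 # largest value first (string order)
    head -n 1
}

# Demo with made-up paths standing in for the output of
# hadoop fs -ls ... | awk '{ print $8 }'
last_partition=$(printf '%s\n' \
  '/warehouse/hive_table/partitioned_col=2021-05-30/000000_0' \
  '/warehouse/hive_table/partitioned_col=2021-06-01/000000_0' \
  '/warehouse/hive_table/partitioned_col=2021-05-31/000000_0' |
  max_partition_value partitioned_col)

echo "$last_partition"
```

Against a real table, the printf demo would be replaced by the hadoop fs -ls listing piped through awk '{ print $8 }', as in the command above. Numeric partition values of differing lengths would need sort -rn rather than string order.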

Then pass the $last_partition variable to your script like this:

hive -hiveconf last_partition="$last_partition" -f your_script.hql
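Inside the script, a value passed with -hiveconf can be referenced through Hive's variable substitution as ${hiveconf:last_partition}. The sketch below writes a minimal script to show this; the file name example_script.hql and the query itself are hypothetical, and it assumes partitioned_col holds string values.

```shell
#!/bin/sh
# Write a hypothetical HQL script that filters on the partition value.
# The quoted 'EOF' keeps the shell from expanding ${hiveconf:...},
# so Hive performs the substitution when the script runs.
cat > example_script.hql <<'EOF'
SELECT *
FROM hive_table
WHERE partitioned_col = '${hiveconf:last_partition}';
EOF
```

Because the filter is on the partition column with a literal value, Hive can prune to that single partition rather than reading the whole table.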

2021-06-03
