spark作业优化spark.sql.hive.filesourcepartitionfilecachesize

nzkunb0c 于 2021-07-14 发布在 Spark

关注(0)|答案(0)|浏览(963)

需要帮助优化Spark计数查询Parquet文件。
我有Parquet文件，比如在旧Parquet文件的位置路径上。在旧Parquet文件的路径上，我做了一些转换来添加一个新列（account id），并将整个新数据集保存在新位置（比如新Parquet文件的路径）。

spark version 2.3.0.cloudera3
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_281)

问题：当在旧位置上运行低于计数的查询时，我最多在7分钟内得到结果。下面是查询、它正在创建的分区和计数。

spark.read.parquet("path_of_old_parquet_file").count
[Stage 5:==========================>                          (108 + 106) / 214]
[Stage 7:=================================================>  (5327 + 96) / 5604]

count:3450809070

现在，在新数据集上运行相同的查询（仅添加一个新列）时，所花费的时间超过40分钟。它不再创建5604分区（在旧位置的情况下），而是创建89355个分区，这次我还收到了关于表缓存的警告消息（spark.sql.hive.filesourcepartitionfilecachesize=262144000 bytes）

spark.read.parquet("<path_of_new_parquet file>").count
21/04/15 09:45:59 WARN datasources.SharedInMemoryCache: Evicting cached table partition metadata from memory due to size constraints (spark.sql.hive.filesourcePartitionFileCacheSize = 262144000 bytes). This may impact query planning performance.
21/04/15 09:52:40 WARN util.Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
[Stage 0:==========================>                          (108 + 109) / 217]
[Stage 3:==================================>              (63690 + 200) / 89355]

Count: 3426213132

对于这两种情况，我使用以下资源。我也试过调整。

master yarn
num-executors 50 
executor-cores 5 
driver-memory 50G 
executor-memory 14G

我也尝试过添加下面的配置，但没有运气。 --conf spark.sql.shuffle.partitions=2000 附加信息：新Parquet文件的路径和旧Parquet文件的路径按日期（每天）和源（共3个源）划分总共约300个日期分区

hdfs dfs -du -h path_of_new_parquet_file
684.2 G  1.3 T  path_of_new_parquet_file

hdfs dfs -du -h path_of_old_parquet_file
562.6 G  1.1 T  path_of_old_parquet_file

能不能请大家指点一下做什么（方法），这样我可以把计数查询时间拉回接近老时间。
请注意：现在我没有访问spark webui服务器来检查统计数据。我正在努力。

Hive apache-spark apache-spark-sql parquet query-optimization

来源：https://stackoverflow.com/questions/67107113/spark-job-optimization-spark-sql-hive-filesourcepartitionfilecachesize