spark分区-使用distribute by选项

cld4siwp 于 2021-06-26 发布在 Hive

关注(0)|答案(1)|浏览(485)

我们有一个Spark的环境，应该处理50毫米行。这些行包含一个键列。唯一的钥匙数接近2000把。我想并行处理这2000把钥匙。因此，我们使用如下sparksql

hiveContext.sql("select * from BigTbl DISTRIBUTE by KEY")

随后，我们有了一个mappartitions，它可以很好地并行处理所有分区。但是问题是，默认情况下它只创建200个分区。使用如下命令，我可以增加分区

hiveContext.sql("set spark.sql.shuffle.partitions=500");

然而，在实际生产运行中，我不知道什么是唯一键的数量。我希望这是自动管理的。请问有什么办法吗。
谢谢
巴拉

Hive apache-spark hadoop-partitioning

来源：https://stackoverflow.com/questions/43271837/spark-partitions-using-distribute-by-option

1条答案

按热度按时间

fdx2calv1#

我建议您使用“重新分区”函数，然后将重新分区的表注册为一个新的临时表，并进一步缓存它以加快处理速度。

val distinctValues = hiveContext.sql("select KEY from BigTbl").distinct().count() // find count distinct values 

hiveContext.sql("select * from BigTbl DISTRIBUTE by KEY")
       .repartition(distinctValues.toInt) // repartition to number of distinct values
       .registerTempTable("NewBigTbl") // register the repartitioned table as another temp table

hiveContext.cacheTable("NewBigTbl") // cache the repartitioned table for improving query performance

如需进一步查询，请使用“newbigtbl”

赞(0）回复(0）举报 2021-06-26

我来回答

spark分区-使用distribute by选项

1条答案

相关问题

热门标签

最新问答