Tuning Spark for "excessive" parallelism on EMR

vx6bjr1n  posted 2021-05-27  in  Spark

I have a Spark job that reads in several TB of data and runs two window functions over it. The job runs fine on smaller chunks (50k shuffle partitions over 4 TB), but when I scale the input up to 15 TB with 150k-200k shuffle partitions it starts to fail.

It fails for two reasons:
Executors running out of memory
Timeouts during shuffle

Executors:

20/07/01 15:58:14 ERROR YarnClusterScheduler: Lost executor 92 on ip-10-102-125-133.ec2.internal: Container killed by YARN for exceeding memory limits.  22.0 GB of 22 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead or disabling yarn.nodemanager.vmem-check-enabled because of YARN-4714.

I have already increased the driver size to accommodate the large shuffle:
spark.driver.memory = 16g
spark.driver.maxResultSize = 8g

The executors are r5.xlarge with the following configuration:

spark.executor.cores = 4
spark.executor.memory = 18971M
spark.yarn.executor.memoryOverheadFactor = 0.1875

This is well below the maximum AWS specifies for this instance type (https://docs.aws.amazon.com/emr/latest/releaseguide/emr-hadoop-task-config.html#emr-hadoop-task-config-r5):

yarn.nodemanager.resource.memory-mb = 24576

I know I will need to tune spark.yarn.executor.memoryOverheadFactor here to leave headroom for the overhead that this many partitions brings; hopefully that will be the last change needed on that front.
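For context, a minimal sketch of how these driver and executor values could be supplied through an EMR spark-defaults configuration classification; the property values simply mirror the ones quoted above, and the classification mechanism is the standard EMR one rather than anything from the original post:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "16g",
      "spark.driver.maxResultSize": "8g",
      "spark.executor.cores": "4",
      "spark.executor.memory": "18971M",
      "spark.yarn.executor.memoryOverheadFactor": "0.1875"
    }
  }
]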
Shuffle timeouts

20/07/01 15:59:39 ERROR TransportChannelHandler: Connection to ip-10-102-116-184.ec2.internal/10.102.116.184:7337 has been quiet for 600000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.network.timeout if this is wrong.
20/07/01 15:59:39 ERROR TransportResponseHandler: Still have 8 requests outstanding when connection from ip-10-102-116-184.ec2.internal/10.102.116.184:7337 is closed
20/07/01 15:59:39 ERROR OneForOneBlockFetcher: Failed while starting block fetches

I have adjusted this timeout as follows:

spark.network.timeout = 600

I could raise spark.network.timeout further in the conf and simply let it wait longer, but I would rather bring down the Shuffle Read Blocked Time, which currently ranges anywhere from 1 minute to 30 minutes. Is there a way to increase the rate of communication between the nodes?

I have tried adjusting the following settings, but they do not seem to improve this speed:

spark.reducer.maxSizeInFlight = 512m
spark.shuffle.io.numConnectionsPerPeer = 5
spark.shuffle.io.backLog = 128

What do I need to tune to bring down the Shuffle Read Blocked Time on AWS EMR?
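Along the same lines, a sketch of the network and shuffle I/O settings above expressed as an EMR spark-defaults classification; spark.network.timeout is written with an explicit "s" suffix here, assuming the 600 quoted above is meant as seconds:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.network.timeout": "600s",
      "spark.reducer.maxSizeInFlight": "512m",
      "spark.shuffle.io.numConnectionsPerPeer": "5",
      "spark.shuffle.io.backLog": "128"
    }
  }
]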


vi4fp9gy1#

For the executors, do the following; it fixed the problem for us. From: https://aws.amazon.com/blogs/big-data/best-practices-for-successfully-managing-memory-for-apache-spark-applications-on-amazon-emr/

Even if all the Spark configuration properties are calculated and set correctly, virtual out-of-memory errors can still occur rarely as virtual memory is bumped up aggressively by the OS. To prevent these application failures, set the following flags in the YARN site settings.

Best practice 5: Always set the virtual and physical memory check flag to false.

"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"

The rationale is the familiar error "Container killed by YARN for exceeding memory limits. 10.4 GB of 10.4 GB physical memory used" occurring even on an EMR cluster with 75 GB of memory.
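As a sketch, assuming the flags are applied at cluster creation time, they can go into an EMR yarn-site configuration classification, mirroring the blog's recommendation above:

[
  {
    "Classification": "yarn-site",
    "Properties": {
      "yarn.nodemanager.vmem-check-enabled": "false",
      "yarn.nodemanager.pmem-check-enabled": "false"
    }
  }
]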
For the shuffle timeouts, try increasing storage (larger or additional EBS volumes).
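For illustration only, a minimal sketch of what bigger shuffle storage could look like in the --instance-groups payload of aws emr create-cluster; the instance count, volume type, size, and volumes per instance are placeholder values, not recommendations from the answer:

[
  {
    "InstanceGroupType": "CORE",
    "InstanceType": "r5.xlarge",
    "InstanceCount": 10,
    "EbsConfiguration": {
      "EbsBlockDeviceConfigs": [
        {
          "VolumeSpecification": { "VolumeType": "gp2", "SizeInGB": 500 },
          "VolumesPerInstance": 2
        }
      ]
    }
  }
]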
