I tried to load a database with 1 TB of data into Spark on AWS using the latest EMR release. The job ran for a very long time and had not finished even after 6 hours; after about 6h30m of running I got errors announcing that containers had been released on *lost* nodes, and then the job failed. The errors looked like this:
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
I am quite sure my network setup works, because I have already tried running this script on a much smaller table in the same environment.
Also, I know someone posted a question six months ago asking about the same issue: spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released, but I am asking again anyway, since nobody answered it.
8 Answers
pinkon5k1#
It looks like others are hitting the same problem, so I am posting an answer instead of writing a comment. I am not sure this will solve your problem, but it should give you an idea.
If you use spot instances, you should know that a spot instance will be shut down whenever the spot price rises above your bid, and then you will hit this problem, even if you only use spot instances as slaves. So my solution was to not use any spot instances for long-running jobs.
Another idea is to split the job into many independent steps, so that you can save each step's result as a file on S3. If any error occurs, you just rerun that step starting from the cached files.
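A minimal sketch of that step-wise pattern, using a local path as a stand-in for S3 (the `run_step` helper and the paths are hypothetical, not from the answer):

```python
import os

def run_step(name, output_path, compute):
    """Run `compute` only if this step's checkpoint is missing; otherwise resume from it."""
    if os.path.exists(output_path):
        with open(output_path) as f:
            return f.read()              # resume: reuse the saved result
    result = compute()                   # do the expensive work for this step
    with open(output_path, "w") as f:
        f.write(result)                  # checkpoint the step's output
    return result
```

With S3 the same idea applies: write each step's output to an `s3://` path and check for its existence before recomputing, so a node loss mid-job only costs you the current step.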
bn31dyow2#
Is memory being allocated dynamically? I ran into a similar problem and solved it by switching to static allocation, calculating the executor memory, executor cores, and number of executors myself. Try static allocation in Spark for huge workloads.
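A back-of-the-envelope sketch of one common way to do that sizing (the formula, the reserved-core and overhead figures, and the instance sizes are illustrative assumptions, not taken from the answer):

```python
def static_allocation(nodes, cores_per_node, mem_per_node_gb,
                      cores_per_executor=5, overhead_fraction=0.10):
    """Derive a static --num-executors / --executor-cores / --executor-memory."""
    usable_cores = cores_per_node - 1                 # leave 1 core per node for OS/daemons
    executors_per_node = usable_cores // cores_per_executor
    num_executors = executors_per_node * nodes - 1    # reserve 1 slot for the YARN AM/driver
    mem_per_executor = mem_per_node_gb // executors_per_node
    heap_gb = int(mem_per_executor * (1 - overhead_fraction))  # leave room for YARN overhead
    return num_executors, cores_per_executor, heap_gb

# e.g. a cluster of 10 nodes with 16 cores / 64 GB each:
print(static_allocation(10, 16, 64))  # (29, 5, 18)
```

The resulting numbers would then be passed to spark-submit as `--num-executors`, `--executor-cores`, and `--executor-memory`, with dynamic allocation turned off.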
kmb7vmvb3#
This means your YARN container was shut down. To debug what happened, you have to read the YARN logs, using the official CLI:
yarn logs -applicationId
Or feel free to use (and contribute to) my project https://github.com/ebuildy/yoga, a YARN viewer as a web application. You should see lots of worker errors there.
6yoyoihd4#
I ran into the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr
The problem could be solved by increasing the number of DataFrame partitions (in that case, from 1,024 to 2,048), which reduces the memory needed per partition.
So I tried increasing the number of DataFrame partitions, and it solved my problem.
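As a rough illustration of why this helps (the 1 TB figure matches the question; the rest is assumed): with a fixed dataset size, per-partition memory shrinks linearly with the partition count.

```python
def mb_per_partition(dataset_gb, partitions):
    """Average data volume each partition must hold, in MB."""
    return dataset_gb * 1024 / partitions

print(mb_per_partition(1024, 1024))  # 1024.0 MB per partition for 1 TB in 1,024 partitions
print(mb_per_partition(1024, 2048))  # 512.0 MB per partition after doubling to 2,048
```

In Spark itself this corresponds to something like `df.repartition(2048)` before the memory-heavy stage.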
ddhy6vgd5#
AWS has published this as a Knowledge Center article:
For EMR: https://aws.amazon.com/premiumsupport/knowledge-center/emr-exit-status-100-lost-node/
For Glue jobs: https://aws.amazon.com/premiumsupport/knowledge-center/container-released-lost-node-100-glue/
tjrkku2a6#
Amazon has already provided their solution, which handles this through resource allocation; there is no way to work around it from the user's side.
xesrikrc7#
In my case, we used a GCP Dataproc cluster with 2 preemptible secondary workers (the default).
This is not a problem for short-running jobs, since both the primary and secondary workers finish quite quickly.
For long-running jobs, however, we observed that all the primary workers finished their assigned tasks fairly quickly relative to the secondary workers.
Because of their preemptible nature, tasks assigned to the secondary workers lost their containers after about 3 hours of running, producing "Container lost" errors. I would suggest not using secondary workers for any long-running job.
a2mppw5e8#
Check the CloudWatch metrics and the instance state logs for the node that hosted the container: either the node was marked unhealthy because of high disk utilization, or it had a hardware problem.
In the former case you should see a non-zero value for the "MR unhealthy nodes" metric in the AWS EMR UI; in the latter case, for the "MR lost nodes" metric. Note that the disk-utilization threshold is configured in YARN with the
yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage
setting, which defaults to 90%.
As with the container logs, AWS EMR exports logs with instance state snapshots to S3; they contain plenty of useful information such as disk utilization, CPU utilization, memory utilization, and stack traces, so have a look at them. To find a node's EC2 instance ID, match the IP address from the container logs against the IDs in the AWS EMR UI. See the resources below for more information.
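For reference, a sketch of how that threshold appears in yarn-site.xml (the property name and the 90% default come from the answer above; treat the fragment itself as illustrative):

```xml
<!-- yarn-site.xml: YARN marks a node unhealthy once any local disk
     exceeds this utilization percentage (default 90.0). -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
```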