Spark on YARN fails with "Exit status: -100. Diagnostics: Container released on a *lost* node"

bnl4lu3b  posted 8 months ago  in Apache

I'm trying to load a database with 1 TB of data into Spark on AWS, using the latest EMR. The job runs for a very long time and hasn't finished even after 6 hours; after about 6h30m of running I get errors saying the container was released on a *lost* node, and then the job fails. The errors look like this:

16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node

I'm pretty sure my network setup is fine, because I've already run this script on a much smaller table in the same environment.
Also, I know someone posted a question asking the same thing 6 months ago: spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released, but I'm asking anyway because nobody answered it.

pinkon5k 1#

It looks like other people are having the same problem, so I'm posting an answer rather than writing a comment. I'm not sure this will solve the problem, but it should give you an idea.
If you are using spot instances, you should know that a spot instance gets shut down whenever the price goes above your bid, and then you will run into this problem. This is true even if you are only using spot instances as slaves. So my solution is not to use any spot instances for long-running jobs.
Another idea is to split the job into many independent steps, so that you can save the result of each step as a file on S3. If any error occurs, just restart from that step using the cached files.
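A minimal sketch of the second idea on EMR, assuming an existing cluster and a hypothetical etl.py that takes an input and an output S3 path (the cluster ID, bucket and script names are placeholders); each step persists its result to S3, so a failed run can be resumed from the last completed step:

# step 1: raw -> stage1
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=Spark,Name=stage1,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/jobs/etl.py,s3://my-bucket/raw/,s3://my-bucket/stage1/]'

# step 2: stage1 -> stage2; if this one fails, re-run only this step
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=Spark,Name=stage2,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/jobs/etl.py,s3://my-bucket/stage1/,s3://my-bucket/stage2/]'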

bn31dyow 2#

Are you using dynamic memory allocation? I ran into a similar problem and solved it by switching to static allocation, calculating the executor memory, executor cores and number of executors myself. Try static allocation in Spark for huge workloads.
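As a rough sketch only (the numbers and the job file your_job.py are placeholders, not tuned values; spark.yarn.executor.memoryOverhead is the property name used by Spark 1.x/2.x on YARN), static allocation with spark-submit looks like this:

# disable dynamic allocation and size executors by hand
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 40 \
  --executor-cores 4 \
  --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  your_job.py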

kmb7vmvb 3#

This means your YARN container went down. To debug what happened, you have to read the YARN logs, either with the official CLI yarn logs -applicationId or, if you like, use (and contribute to) my project https://github.com/ebuildy/yoga, a YARN viewer as a web application.
You should see a lot of worker errors there.
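For example, with the application ID that appears in the container names above (dumping to a file and grepping is just one way to scan it):

# fetch the aggregated YARN logs for the application
yarn logs -applicationId application_1467389397754_0001 > app.log
# look for why node managers or executors went away
grep -iE -B 5 'killed|lost|exit' app.log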

6yoyoihd 4#

I ran into the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr
The problem there was solved by increasing the number of DataFrame partitions (in that case, from 1,024 to 2,048), which reduces the memory needed per partition.
So I tried increasing the number of DataFrame partitions, and that solved my problem.
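In the article the change is a repartition of the DataFrame itself; as a hedged sketch, the same idea can also be applied at submit time by raising the default partition counts (2048 is just the value from the article, and your_job.py is a placeholder):

# raise the default shuffle / RDD parallelism at submit time
spark-submit \
  --conf spark.sql.shuffle.partitions=2048 \
  --conf spark.default.parallelism=2048 \
  your_job.py
# or, inside the job itself (PySpark): df = df.repartition(2048)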

tjrkku2a 6#

Amazon has provided their own solution for this; it is handled through resource allocation, and there is nothing to handle from the user's point of view.

xesrikrc 7#

In my case we were using a GCP Dataproc cluster with 2 pre-emptible (the default) secondary workers.
This was not a problem for short-running jobs, since both the primary and the secondary workers finished quite quickly.
For long-running jobs, however, we observed that all the primary workers finished their assigned tasks fairly quickly compared to the secondary workers.
Because of their pre-emptible nature, the tasks assigned to the secondary workers kept losing their containers after about 3 hours of running, which led to the container-lost errors.
I would suggest not using secondary workers for any long-running job.
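A hedged sketch with the gcloud CLI (cluster name, region and worker counts are placeholders): either create the cluster without secondary workers, or keep them but make them non-pre-emptible:

# no secondary workers at all
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=0

# or keep secondary workers but make them non-pre-emptible
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=2 \
  --secondary-worker-type=non-preemptible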

a2mppw5e 8#

Check the CloudWatch metrics and the instance-state logs for the node that hosted the container: either the node was marked unhealthy because of high disk utilization, or there was a hardware problem.
In the former case you should see a non-zero value in the "MR unhealthy nodes" metric in the AWS EMR UI, in the latter case in the "MR lost nodes" metric. Note that the disk-utilization threshold is configured in YARN with the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage setting and defaults to 90%. Similar to the container logs, AWS EMR exports logs with instance-state snapshots to S3; they contain a lot of useful information such as disk utilization, CPU utilization, memory utilization and stack traces, so take a look at them. To find a node's EC2 instance ID, match the IP address from the container logs with the ID in the AWS EMR UI.

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/                                                 
                           PRE containers/
                           PRE node/
                           PRE steps/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/ 
                           PRE applications/                                                                                                                                                          
                           PRE daemons/                                                                                                                                                               
                           PRE provision-node/                                                                                                                                                        
                           PRE setup-devices/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/daemons/instance-state/
2023-09-24 13:13:33        748 console.log-2023-09-24-12-08.gz
2023-09-24 13:18:34      55742 instance-state.log-2023-09-24-12-15.gz
...
2023-09-24 17:33:58      60087 instance-state.log-2023-09-24-16-30.gz
2023-09-24 17:54:00      66614 instance-state.log-2023-09-24-16-45.gz
2023-09-24 18:09:01      60932 instance-state.log-2023-09-24-17-00.gz

zcat /tmp/instance-state.log-2023-09-24-16-30.gz
...
# amount of disk free
df -h
Filesystem        Size  Used Avail Use% Mounted on
...
/dev/nvme0n1p1     10G  5.7G  4.4G  57% /
/dev/nvme0n1p128   10M  3.8M  6.2M  38% /boot/efi
/dev/nvme1n1p1    5.0G   83M  5.0G   2% /emr
/dev/nvme1n1p2    1.8T  1.7T  121G  94% /mnt
/dev/nvme2n1      1.8T  1.7T  120G  94% /mnt1
...

For more information, see the resources below.
