Spark on YARN fails with "Exit status: -100. Diagnostics: Container released on a *lost* node"

bnl4lu3b  posted 8 months ago  in Apache

I'm trying to load a database with 1 TB of data into Spark on AWS, using the latest EMR. The job runs for a very long time and hasn't finished even after 6 hours; after about 6h30m of running I get errors saying the container was released on a *lost* node, and then the job fails. The errors look like this:

16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node

I'm pretty sure my network setup is fine, because I've already run this script on a much smaller table in the same environment.
Also, I know someone posted a question asking the same thing 6 months ago: spark-job-error-yarnallocator-exit-status-100-diagnostics-container-released, but I'm asking anyway because nobody answered it.

pinkon5k 1#

It looks like other people are having the same problem, so I'm posting an answer rather than writing a comment. I'm not sure this will solve the problem, but it should give you an idea.
If you are using spot instances, you should know that a spot instance gets shut down whenever the price goes above your bid, and then you will run into this problem. This is true even if you are only using spot instances as slaves. So my solution is not to use any spot instances for long-running jobs.
Another idea is to split the job into many independent steps, so that you can save the result of each step as a file on S3. If any error occurs, just restart from that step using the cached files.
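A minimal sketch of the second idea on EMR, assuming an existing cluster and a hypothetical etl.py that takes an input and an output S3 path (the cluster ID, bucket and script names are placeholders); each step persists its result to S3, so a failed run can be resumed from the last completed step:

# step 1: raw -> stage1
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=Spark,Name=stage1,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/jobs/etl.py,s3://my-bucket/raw/,s3://my-bucket/stage1/]'

# step 2: stage1 -> stage2; if this one fails, re-run only this step
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  'Type=Spark,Name=stage2,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/jobs/etl.py,s3://my-bucket/stage1/,s3://my-bucket/stage2/]'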

bn31dyow 2#

Are you using dynamic memory allocation? I ran into a similar problem and solved it by switching to static allocation, calculating the executor memory, executor cores and number of executors myself. Try static allocation in Spark for huge workloads.
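As a rough sketch only (the numbers and the job file your_job.py are placeholders, not tuned values; spark.yarn.executor.memoryOverhead is the property name used by Spark 1.x/2.x on YARN), static allocation with spark-submit looks like this:

# disable dynamic allocation and size executors by hand
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.dynamicAllocation.enabled=false \
  --num-executors 40 \
  --executor-cores 4 \
  --executor-memory 20g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  your_job.py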

kmb7vmvb 3#

This means your YARN container went down. To debug what happened, you have to read the YARN logs, either with the official CLI yarn logs -applicationId or, if you like, use (and contribute to) my project https://github.com/ebuildy/yoga, a YARN viewer as a web application.
You should see a lot of worker errors there.
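For example, with the application ID that appears in the container names above (dumping to a file and grepping is just one way to scan it):

# fetch the aggregated YARN logs for the application
yarn logs -applicationId application_1467389397754_0001 > app.log
# look for why node managers or executors went away
grep -iE -B 5 'killed|lost|exit' app.log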

6yoyoihd 4#

I ran into the same problem. I found some clues in this article on DZone:
https://dzone.com/articles/some-lessons-of-spark-and-memory-issues-on-emr
The problem there was solved by increasing the number of DataFrame partitions (in that case, from 1,024 to 2,048), which reduces the memory needed per partition.
So I tried increasing the number of DataFrame partitions, and that solved my problem.
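In the article the change is a repartition of the DataFrame itself; as a hedged sketch, the same idea can also be applied at submit time by raising the default partition counts (2048 is just the value from the article, and your_job.py is a placeholder):

# raise the default shuffle / RDD parallelism at submit time
spark-submit \
  --conf spark.sql.shuffle.partitions=2048 \
  --conf spark.default.parallelism=2048 \
  your_job.py
# or, inside the job itself (PySpark): df = df.repartition(2048)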

tjrkku2a 6#

Amazon has provided their own solution for this; it is handled through resource allocation, and there is nothing to handle from the user's point of view.

xesrikrc 7#

In my case we were using a GCP Dataproc cluster with 2 pre-emptible (the default) secondary workers.
This was not a problem for short-running jobs, since both the primary and the secondary workers finished quite quickly.
For long-running jobs, however, we observed that all the primary workers finished their assigned tasks fairly quickly compared to the secondary workers.
Because of their pre-emptible nature, the tasks assigned to the secondary workers kept losing their containers after about 3 hours of running, which led to the container-lost errors.
I would suggest not using secondary workers for any long-running job.
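A hedged sketch with the gcloud CLI (cluster name, region and worker counts are placeholders): either create the cluster without secondary workers, or keep them but make them non-pre-emptible:

# no secondary workers at all
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=0

# or keep secondary workers but make them non-pre-emptible
gcloud dataproc clusters create my-cluster \
  --region=us-central1 \
  --num-workers=2 \
  --num-secondary-workers=2 \
  --secondary-worker-type=non-preemptible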

a2mppw5e 8#

Check the CloudWatch metrics and the instance-state logs for the node that hosted the container: either the node was marked unhealthy because of high disk utilization, or there was a hardware problem.
In the former case you should see a non-zero value in the "MR unhealthy nodes" metric in the AWS EMR UI, in the latter case in the "MR lost nodes" metric. Note that the disk-utilization threshold is configured in YARN with the yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage setting and defaults to 90%. Similar to the container logs, AWS EMR exports logs with instance-state snapshots to S3; they contain a lot of useful information such as disk utilization, CPU utilization, memory utilization and stack traces, so take a look at them. To find a node's EC2 instance ID, match the IP address from the container logs with the ID in the AWS EMR UI.

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/                                                 
                           PRE containers/
                           PRE node/
                           PRE steps/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/ 
                           PRE applications/                                                                                                                                                          
                           PRE daemons/                                                                                                                                                               
                           PRE provision-node/                                                                                                                                                        
                           PRE setup-devices/

aws s3 ls s3://LOGS_LOCATION/CLUSTER_ID/node/EC2_INSTANCE_ID/daemons/instance-state/
2023-09-24 13:13:33        748 console.log-2023-09-24-12-08.gz
2023-09-24 13:18:34      55742 instance-state.log-2023-09-24-12-15.gz
...
2023-09-24 17:33:58      60087 instance-state.log-2023-09-24-16-30.gz
2023-09-24 17:54:00      66614 instance-state.log-2023-09-24-16-45.gz
2023-09-24 18:09:01      60932 instance-state.log-2023-09-24-17-00.gz

zcat /tmp/instance-state.log-2023-09-24-16-30.gz
...
# amount of disk free
df -h
Filesystem        Size  Used Avail Use% Mounted on
...
/dev/nvme0n1p1     10G  5.7G  4.4G  57% /
/dev/nvme0n1p128   10M  3.8M  6.2M  38% /boot/efi
/dev/nvme1n1p1    5.0G   83M  5.0G   2% /emr
/dev/nvme1n1p2    1.8T  1.7T  121G  94% /mnt
/dev/nvme2n1      1.8T  1.7T  120G  94% /mnt1
...

For more information, see the resources below.
