Hadoop Spark cluster hitting heartbeat timeouts and YARN timeouts

wlsrxk51 · asked 12 months ago · in Hadoop

I have checked other related posts on Stack Overflow, but none of them seem directly relevant, since they reference Spark versions from five years ago, whereas I am using **Spark 3.3.3**.
I am running an Apache Spark cluster with YARN as the master, using JupyterLab as the IDE. I create the Spark session with:

from pyspark.sql import SparkSession

import getpass
username = getpass.getuser()

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '0'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Kafka and Spark Integration'). \
    master('yarn'). \
    getOrCreate()

The plan is to read a stream from 24 JSON files in HDFS (each roughly 80 MB), then write the stream back out, partitioning the data by three columns and storing it as Parquet in another HDFS folder.
This is the command I used:
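The question does not show how `file_df` was built; presumably it came from something like the sketch below. The column list and landing path here are assumptions for illustration; note that file sources in Structured Streaming require an explicit schema.

```python
# Hypothetical column list -- the real JSON layout is not shown in the question.
schema_ddl = ("id STRING, payload STRING, "
              "created_year INT, created_month INT, created_dayofmonth INT")

def build_stream(spark, username):
    # Structured Streaming file sources need an explicit schema up front;
    # the .json() path below is a hypothetical landing directory.
    return (spark.readStream
                 .schema(schema_ddl)   # a DDL string is accepted here
                 .json(f'/user/{username}/file_df/landing'))
```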

file_df. \
    writeStream. \
    partitionBy('created_year', 'created_month', 'created_dayofmonth'). \
    format('parquet'). \
    option("checkpointLocation", f"/user/{username}/file_df/streaming/checkpoint/file_df"). \
    option("path", f"/user/{username}/file_df/streaming/data/files_parq"). \
    trigger(once=True). \
    start()
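One thing worth noting about this command: `start()` returns immediately with a `StreamingQuery` handle (visible in the output below), so any executor failures surface asynchronously as log noise rather than as an exception in the notebook. A sketch of the same sink, kept blocking so failures raise in the cell:

```python
def run_once_and_wait(file_df, username):
    """Same sink as above, but hold the query handle so failures raise here."""
    query = (file_df.writeStream
        .partitionBy('created_year', 'created_month', 'created_dayofmonth')
        .format('parquet')
        .option('checkpointLocation',
                f'/user/{username}/file_df/streaming/checkpoint/file_df')
        .option('path', f'/user/{username}/file_df/streaming/data/files_parq')
        .trigger(once=True)
        .start())
    query.awaitTermination()   # blocks until the one-shot batch finishes or fails
    return query.lastProgress  # per-batch metrics once the batch completes
```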

This is the output I get when I run it in the notebook:

23/09/13 21:34:55 WARN ResolveWriteToStream: spark.sql.adaptive.enabled is not supported in streaming DataFrames/Datasets and will be disabled.

<pyspark.sql.streaming.StreamingQuery at 0x7fd22851adc0>

[Stage 3:===============>                                          (5 + 2) / 19]

23/09/13 21:42:56 WARN HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 120168 ms exceeds timeout 120000 ms
23/09/13 21:42:56 ERROR YarnScheduler: Lost executor 2 on sparkde.camp.300123.internal: Executor heartbeat timed out after 120168 ms
23/09/13 21:42:56 WARN TaskSetManager: Lost task 1.0 in stage 3.0 (TID 41) (sparkde.camp.300123.internal executor 2): ExecutorLostFailure (executor 2 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 120168 ms

[Stage 3:===============>                                          (5 + 1) / 19]

23/09/13 21:42:58 WARN TransportChannelHandler: Exception in connection from /10.172.0.3:41888
java.io.IOException: Connection reset by peer
    at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
    at sun.nio.ch.IOUtil.read(IOUtil.java:192)
    at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
    at io.netty.buffer.PooledByteBuf.setBytes(PooledByteBuf.java:258)
    at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1132)
    at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:350)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:151)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:750)

[Stage 3:========================>                                 (8 + 2) / 19]

23/09/13 21:45:55 ERROR YarnScheduler: Lost executor 3 on sparkde.camp.300123.internal: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143. 
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN TaskSetManager: Lost task 1.1 in stage 3.0 (TID 47) (sparkde.camp.300123.internal executor 3): ExecutorLostFailure (executor 3 exited caused by one of the running tasks) Reason: Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143. 
[2023-09-13 21:45:55.151]Killed by external signal
.
23/09/13 21:45:55 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 3 for reason Container from a bad node: container_1694638458719_0001_01_000004 on host: sparkde.camp.300123.internal. Exit status: 143. Diagnostics: [2023-09-13 21:45:55.150]Container killed on request. Exit code is 143
[2023-09-13 21:45:55.151]Container exited with a non-zero exit code 143. 
[2023-09-13 21:45:55.151]Killed by external signal

I noticed the stage kept progressing for a while, but eventually it stalled and produced the errors shown:

[Stage 3:=======================================>                 (13 + 1) / 19]

23/09/13 21:54:46 WARN TaskSetManager: Lost task 14.0 in stage 3.0 (TID 58) (sparkde.camp.300123.internal executor 6): TaskKilled (Stage cancelled)
23/09/13 22:03:05 WARN SparkContext: Executor 1 might already have stopped and can not request thread dump from it.

Here is a screenshot of the Spark job (not reproduced here).

Please help.

**Answer by j1dl9f46:**

So I needed to increase the memory allocated to YARN, which meant updating these property values in yarn-site.xml:

<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>21000</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>15048</value>
</property>

So I just updated the values; before this I had been running with only 1 GB :)

Then after a while I still got the same error, but this time the limit was my VM itself: I needed to allocate more RAM, since my machine only had 1 vCPU and 8 GB of RAM.
So my Spark session now looks like this:

spark = SparkSession. \
    builder. \
    config('spark.ui.port', '4040'). \
    config('spark.sql.warehouse.dir', f'/user/{username}/warehouse'). \
    config("spark.executor.memory", "10g"). \
    config("spark.driver.memory", "10g"). \
    enableHiveSupport(). \
    appName(f'{username} | Python - Incremental Loads using Spark Structured Streaming'). \
    master('yarn'). \
    getOrCreate()
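A rough sanity check on these numbers (a sketch, not YARN's exact accounting): Spark asks YARN for the executor memory plus an overhead of `max(384 MiB, spark.executor.memoryOverheadFactor * executor memory)`, where the factor defaults to 0.10. The request must fit under `yarn.scheduler.maximum-allocation-mb`, or the container is never granted.

```python
def container_request_mb(executor_memory_mb, overhead_factor=0.10):
    # Overhead is at least 384 MiB, otherwise a fraction of executor memory.
    overhead_mb = max(384, int(executor_memory_mb * overhead_factor))
    return executor_memory_mb + overhead_mb

# The 10g executors above request roughly 10240 + 1024 = 11264 MiB per
# container, which fits under the 15048 MiB maximum-allocation set earlier.
assert container_request_mb(10 * 1024) == 11264
assert container_request_mb(10 * 1024) <= 15048
```

Note also that with the driver running inside the notebook (YARN client mode), `spark.driver.memory` comes out of the VM's own RAM, not YARN's pool, which is why the VM upgrade mattered as well.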
