Hortonworks Hadoop NN and RM heap stuck at high utilization with no applications running (java.io.IOException: No space left on device)

zd287kbt  posted 2021-07-13 in Hadoop

Recently launched some Spark jobs from a Hadoop (HDP-3.1.0.0) client node, which threw some
Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
errors, and now I see that the NN and RM heaps appear to be stuck at a high utilization level (e.g. 80-95%), even though there appear to be no pending or running jobs in the RM UI.
On the Ambari dashboard I see:

But in the RM UI, nothing appears to be running:


The failure I'm seeing in the recent Spark jobs is...

[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO YarnScheduler: Removed TaskSet 10.0, whose tasks have all completed, from pool
[2021-02-11 22:05:20,981] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: ResultStage 10 (csv at NativeMethodAccessorImpl.java:0) finished in 8.558 s
[2021-02-11 22:05:20,982] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO DAGScheduler: Job 7 finished: csv at NativeMethodAccessorImpl.java:0, took 8.561029 s
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Job null committed.
[2021-02-11 22:05:20,992] {bash_operator.py:128} INFO - 21/02/11 22:05:20 INFO FileFormatWriter: Finished processing stats for job null.
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - writing to local FS staging area
[2021-02-11 22:05:20,994] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO - Exception in thread "main" org.apache.hadoop.fs.FSError: java.io.IOException: No space left on device
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:262)
[2021-02-11 22:05:23,455] {bash_operator.py:128} INFO -     at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:57)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at java.io.DataOutputStream.write(DataOutputStream.java:107)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:96)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:129)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination$TargetFileSystem.writeStreamToFile(CommandWithDestination.java:485)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyStreamToTarget(CommandWithDestination.java:407)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.copyFileToTarget(CommandWithDestination.java:342)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:277)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPath(CommandWithDestination.java:262)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:367)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:352)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:441)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.recursePath(CommandWithDestination.java:305)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathInternal(Command.java:369)
[2021-02-11 22:05:23,456] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPaths(Command.java:331)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processPathArgument(Command.java:304)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processPathArgument(CommandWithDestination.java:257)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:286)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:270)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.CommandWithDestination.processArguments(CommandWithDestination.java:228)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.FsCommand.processRawArguments(FsCommand.java:120)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.shell.Command.run(Command.java:177)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.run(FsShell.java:328)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.FsShell.main(FsShell.java:391)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO - Caused by: java.io.IOException: No space left on device
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.writeBytes(Native Method)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at java.io.FileOutputStream.write(FileOutputStream.java:326)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.write(RawLocalFileSystem.java:260)
[2021-02-11 22:05:23,457] {bash_operator.py:128} INFO -     ... 29 more
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - 
[2021-02-11 22:05:23,946] {bash_operator.py:128} INFO - Traceback (most recent call last):
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -   File "/home/airflow/projects/hph_etl_airflow/common_prep.py", line 112, in <module>
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO -     assert get.returncode == 0, "ERROR: failed to copy to local dir"
[2021-02-11 22:05:23,947] {bash_operator.py:128} INFO - AssertionError: ERROR: failed to copy to local dir
[2021-02-11 22:05:24,034] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Invoking stop() from shutdown hook
[2021-02-11 22:05:24,040] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO AbstractConnector: Stopped Spark@599cff94{HTTP/1.1,[http/1.1]}{0.0.0.0:4041}
[2021-02-11 22:05:24,048] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkUI: Stopped Spark web UI at http://airflowetl.ucera.local:4041
[2021-02-11 22:05:24,092] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Interrupting monitor thread
[2021-02-11 22:05:24,106] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Shutting down all executors
[2021-02-11 22:05:24,107] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO - (serviceOption=None,
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  services=List(),
[2021-02-11 22:05:24,114] {bash_operator.py:128} INFO -  started=false)
[2021-02-11 22:05:24,115] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO YarnClientSchedulerBackend: Stopped
[2021-02-11 22:05:24,123] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
[2021-02-11 22:05:24,154] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO MemoryStore: MemoryStore cleared
[2021-02-11 22:05:24,155] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManager: BlockManager stopped
[2021-02-11 22:05:24,157] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO BlockManagerMaster: BlockManagerMaster stopped
[2021-02-11 22:05:24,162] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
[2021-02-11 22:05:24,173] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO SparkContext: Successfully stopped SparkContext
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Shutdown hook called
[2021-02-11 22:05:24,174] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-57e1dfa3-26e8-490b-b7ca-94bce93e36d7
[2021-02-11 22:05:24,176] {bash_operator.py:128} INFO - 21/02/11 22:05:24 INFO ShutdownHookManager: Deleting directory /tmp/spark-f8837f34-d781-4631-b302-06fcf74d5506/pyspark-225760d8-f365-49fe-8333-6d0df3cb99bd
[2021-02-11 22:05:24,646] {bash_operator.py:132} INFO - Command exited with return code 1
[2021-02-11 22:05:24,663] {taskinstance.py:1088} ERROR - Bash command failed
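For context, judging from the "writing to local FS staging area" message, the FsShell/CommandWithDestination stack trace, and the get.returncode assertion in common_prep.py, the step that fails here is an HDFS-to-local copy run on the client node itself. A minimal sketch of that kind of step (the paths and exact command are assumptions; the actual common_prep.py is not shown):

    # copy job output from HDFS down to a local staging directory on the client node
    # (hypothetical paths; roughly what an "hdfs dfs -get" / "-copyToLocal" staging step does)
    hdfs dfs -get /path/on/hdfs/to/job/output /local/staging/dir

The write in the stack trace goes through RawLocalFileSystem, i.e. the local disk of the machine running the command, which is why that node's disk filling up can fail the job.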

Note: I can't do any more debugging, because the cluster was restarted via Ambari (it is needed for some daily tasks, so it could not stay down), and that brought the NN and RM heap utilization back down to roughly 10% and 25% respectively.
Does anyone know what could be going on here? Is there anywhere else I can (still) check for further debugging info?


rseugnpd1#

Running df -h and du -h -d1 /some/paths/of/interest on the machine the Spark calls were being made from (just guessing from the "writing to local FS" and "No space left on device" messages in the error; running clush -ab df -h / across all the Hadoop nodes, I could see that the client node the Spark jobs were launched from was the only one with high disk utilization), I found that the machine invoking the Spark jobs had only about 1 GB of disk space left (due to other issues), which eventually caused this error for some of those jobs. I have since fixed that, but I'm not sure whether it is related (as far as I understand, Spark does the actual processing on other nodes in the cluster).
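A minimal sketch of the checks described above (the paths are placeholders, and clush assumes the clustershell package is installed and its node groups are configured for this cluster):

    # disk usage of every mounted filesystem on the client node
    df -h

    # per-directory usage, one level deep, for suspected local staging/tmp paths
    du -h -d1 /some/paths/of/interest

    # run the same df check on all Hadoop nodes and gather the output
    clush -ab df -h /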
I suspect that was the problem, but it would be very helpful for future debugging (and would make for a better answer to this post) if someone with more experience could explain what is actually going on here, e.g.:
Why would a lack of free disk space on one of the cluster nodes (the client node in this case) cause the RM heap to stay at such high utilization, even when the RM UI reports no jobs running?
Why would low disk space on the local machine affect the Spark jobs at all (as far as I understand, Spark does the actual processing on other nodes in the cluster)?
If disk space on the local machine calling the Spark jobs really was the problem, then this question can probably be marked as a duplicate of the one answered here: https://stackoverflow.com/a/18365738/8236733.
