Spark EMR job fails: Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0

Asked by gblwokeq on 2021-07-14 in Spark

I'm running a Spark Scala job on EMR 5.31.0 with Spark 2.4.6. It has 12 executors on the m5.4xlarge instance type, with a 51 GB heap. My EMR cluster also has the following configuration:

[{"classification":"spark", "properties":{"maximizeResourceAllocation":"true"}, "configurations":[]}]

I don't understand where my OOM error is coming from, or what I can do to fix it. Before the write, I perform a join, a groupBy, another groupBy, and then another join. How can I find out what is causing the OOM error?
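In outline, the pipeline before the write looks like this (the dataset and column names below are placeholders, not my real ones):

import org.apache.spark.sql.functions.{count, sum}

// Placeholder shape of the pipeline, not the actual code.
val joined = eventsDF.join(lookupDF, Seq("id"))                   // first join
val agg1   = joined.groupBy("id", "hour").agg(count("*").as("n")) // first groupBy
val agg2   = agg1.groupBy("id").agg(sum("n").as("total"))         // second groupBy
val DF     = agg2.join(otherDF, Seq("id"))                        // second join

followed by the write: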

DF.withColumn("hour", hour(col("start")))
  .withColumn("day", dayofmonth(col("start")))
  .withColumn("month", month(col("start")))
  .withColumn("year", year(col("start")))
  .write
  .partitionBy("year", "month", "day", "hour")
  .mode(SaveMode.Overwrite)
  .parquet(outputPath)
The task fails with the following stack trace:

Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 16384 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:157)
    at org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:98)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.<init>(UnsafeInMemorySorter.java:128)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.<init>(UnsafeExternalSorter.java:161)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.create(UnsafeExternalSorter.java:128)
    at org.apache.spark.sql.execution.ExternalAppendOnlyUnsafeRowArray.add(ExternalAppendOnlyUnsafeRowArray.scala:115)
    at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.fetchNextPartition(WindowExec.scala:343)
    at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.next(WindowExec.scala:369)
    at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.next(WindowExec.scala:303)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage22.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:188)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.run(Task.scala:123)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
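One thing I notice: the trace runs through WindowExec.fetchNextPartition and ExternalAppendOnlyUnsafeRowArray.add, which is where Spark buffers all rows of one window partition in memory. So a window function of roughly this shape (hypothetical names) seems to be on the failing path:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Hypothetical example: all rows sharing "someKey" are buffered per task
// before the window function is evaluated.
val w = Window.partitionBy("someKey").orderBy("start")
val ranked = DF.withColumn("rn", row_number().over(w))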

Attached: the DAG visualization of the failed stage, the aggregated metrics by executor for the failed stage, and the Executors tab (screenshots).
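To check whether a few output partitions dominate (skew that could blow up a single task), I could count rows per partition key before the write; this is just a diagnostic sketch:

import org.apache.spark.sql.functions.{col, desc, dayofmonth, hour, month, year}

// Row counts per (year, month, day, hour), largest first, to spot skewed keys.
DF.groupBy(year(col("start")), month(col("start")),
           dayofmonth(col("start")), hour(col("start")))
  .count()
  .orderBy(desc("count"))
  .show(20, truncate = false)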
