java.lang.OutOfMemoryError, possibly in the driver: cannot save Word2Vec model

Posted by 7d7tgy0s on 2021-06-02 in Hadoop

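Using Spark v1.6.0.
I am trying to train word vectors on our cluster. However, I get a java.lang.OutOfMemoryError exception (see the log output below). The job has 40 GB of RAM available, and the input is a 5.5 GB JSON file.
Since all stages complete their tasks, I think the problem occurs at this call: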

model.save(sc, outputFile)

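But I am not 100% sure, and even if I were, I would not know how to avoid the problem. See the program skeleton below. I can provide the full source code if needed; I removed the preprocessing part to simplify the code.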

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecOnCluster {

  def main(args: Array[String]): Unit = {

    val inputFile = new File(args(0))
    val outputDirectory = new File(args(1))
    val conf = new SparkConf().setAppName("Word2VecOnCluster")

    // Optional third argument: the Spark master URL
    if (args.length > 2) {
      conf.setMaster(args(2))
    }

    outputDirectory.mkdirs()

    val sc = new SparkContext(conf)
    val file = sc.textFile(inputFile.getAbsolutePath)

    val wordSequence = file
      .repartition(500)
      .mapPartitions { lineIterator =>
        // Preprocessing omitted; lines are passed through unchanged here
        lineIterator
      }
      .map(line => line.split("\\s+").toSeq)

    val word2vec = new Word2Vec()
    val model = word2vec
      .setNumPartitions(500)
      .fit(wordSequence)

    val outputFile = outputDirectory.getAbsolutePath + File.separator + inputFile.getName
    model.save(sc, outputFile)

    System.exit(0)
  }
}
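
The stack trace further down shows the error being raised inside Word2Vec.fit while a closure is serialized on the driver, so it may help to estimate how large the model's weights can get. The sketch below is a rough back-of-the-envelope estimate, assuming the trainer keeps two Float arrays of vocabSize * vectorSize entries each (input and output vectors). The vocabulary sizes are hypothetical, since the real one does not appear in the log, and the object and method names are made up for illustration.

```scala
object Word2VecMemoryEstimate {

  // Assumption: two weight arrays (input and output vectors),
  // each holding vocabSize * vectorSize Floats of 4 bytes.
  def weightBytes(vocabSize: Long, vectorSize: Int): Long =
    2L * vocabSize * vectorSize * 4L

  // Converts a byte count to MiB for easier reading.
  def toMiB(bytes: Long): Double = bytes.toDouble / (1024.0 * 1024.0)

  def main(args: Array[String]): Unit = {
    val vectorSize = 100 // MLlib's default vector size
    // Hypothetical vocabulary sizes, for illustration only
    for (vocabSize <- Seq(100000L, 1000000L, 5000000L)) {
      val mib = toMiB(weightBytes(vocabSize, vectorSize))
      println(f"vocab=$vocabSize%d vectorSize=$vectorSize%d -> $mib%.1f MiB")
    }
  }
}
```

Serializing such arrays through a ByteArrayOutputStream, as in the stack trace, can need noticeably more heap than the arrays themselves, because the buffer is copied each time it grows.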

This is the log output after the last task of stage 2.0 finished:

16/04/06 12:06:33 INFO TaskSetManager: Finished task 428.0 in stage 2.0 (TID 970) in 701 ms on node12.hadoop.company.at (499/500)
16/04/06 12:06:33 INFO TaskSetManager: Finished task 439.0 in stage 2.0 (TID 981) in 680 ms on node12.hadoop.company.at (500/500)
16/04/06 12:06:33 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool 
16/04/06 12:06:33 INFO DAGScheduler: ResultStage 2 (collect at Word2Vec.scala:170) finished in 2.356 s
16/04/06 12:06:33 INFO DAGScheduler: Job 0 finished: collect at Word2Vec.scala:170, took 439.713694 s
16/04/06 12:06:33 INFO Word2Vec: trainWordsCount = 685345162
16/04/06 12:06:33 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 3.9 KB, free 357.7 KB)
16/04/06 12:06:33 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.0 KB, free 361.7 KB)
16/04/06 12:06:33 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.0.95:42051 (size: 4.0 KB, free: 511.1 MB)
16/04/06 12:06:33 INFO SparkContext: Created broadcast 4 from broadcast at Word2Vec.scala:290
16/04/06 12:06:33 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 150.6 MB, free 150.9 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.0 MB, free 154.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 507.1 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece1 stored as bytes in memory (estimated size 4.0 MB, free 158.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece1 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 503.1 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece2 stored as bytes in memory (estimated size 4.0 MB, free 162.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece2 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 499.1 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece3 stored as bytes in memory (estimated size 2.9 MB, free 165.8 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece3 in memory on 192.168.0.95:42051 (size: 2.9 MB, free: 496.2 MB)
16/04/06 12:06:35 INFO SparkContext: Created broadcast 5 from broadcast at Word2Vec.scala:291
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 33.2 MB, free 199.0 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 4.0 MB, free 203.0 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 492.2 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_6_piece1 stored as bytes in memory (estimated size 870.5 KB, free 203.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_6_piece1 in memory on 192.168.0.95:42051 (size: 870.5 KB, free: 491.3 MB)
16/04/06 12:06:35 INFO SparkContext: Created broadcast 6 from broadcast at Word2Vec.scala:292
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:742)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:741)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:741)
    at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:329)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:328)
    at masterthesis.code.wordvectors.Word2VecOnCluster$.main(Word2VecOnCluster.scala:112)
    at masterthesis.code.wordvectors.Word2VecOnCluster.main(Word2VecOnCluster.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.0.95:42051 in memory (size: 2.2 KB, free: 491.3 MB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node10.hadoop.company.at:36555 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node04.hadoop.company.at:33602 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node13.hadoop.company.at:53455 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node11.hadoop.company.at:50336 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node02.hadoop.company.at:52435 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node06.hadoop.company.at:49865 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node09.hadoop.company.at:44672 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node01.hadoop.company.at:33026 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node14.hadoop.company.at:38802 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node03.hadoop.company.at:48959 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node12.hadoop.company.at:60505 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node05.hadoop.company.at:43832 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node07.hadoop.company.at:56636 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node08.hadoop.company.at:51583 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node15.hadoop.company.at:41850 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO ContextCleaner: Cleaned accumulator 3
16/04/06 12:06:38 INFO ContextCleaner: Cleaned accumulator 2
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.0.95:42051 in memory (size: 2.2 KB, free: 491.3 MB)
16/04/06 12:06:38 INFO SparkContext: Invoking stop() from shutdown hook
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node14.hadoop.company.at:38802 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node05.hadoop.company.at:43832 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node12.hadoop.company.at:60505 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node08.hadoop.company.at:51583 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node03.hadoop.company.at:48959 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node15.hadoop.company.at:41850 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node07.hadoop.company.at:56636 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node13.hadoop.company.at:53455 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node06.hadoop.company.at:49865 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node10.hadoop.company.at:36555 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node04.hadoop.company.at:33602 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node02.hadoop.company.at:52435 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node09.hadoop.company.at:44672 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node11.hadoop.company.at:50336 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node01.hadoop.company.at:33026 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO ContextCleaner: Cleaned accumulator 1
16/04/06 12:06:38 INFO ContextCleaner: Cleaned shuffle 1
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/04/06 12:06:38 INFO SparkUI: Stopped Spark web UI at http://192.168.0.95:4040
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Shutting down all executors
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Interrupting monitor thread
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/04/06 12:06:38 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Stopped
16/04/06 12:06:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/06 12:06:38 INFO MemoryStore: MemoryStore cleared
16/04/06 12:06:38 INFO BlockManager: BlockManager stopped
16/04/06 12:06:38 INFO BlockManagerMaster: BlockManagerMaster stopped
16/04/06 12:06:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/06 12:06:38 INFO SparkContext: Successfully stopped SparkContext
16/04/06 12:06:38 INFO ShutdownHookManager: Shutdown hook called
16/04/06 12:06:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-3f6c9bf3-be08-4eac-bc12-f6110beedb60
16/04/06 12:06:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-3f6c9bf3-be08-4eac-bc12-f6110beedb60/httpd-cc65e6a2-705b-4da0-8845-54992f85ddb8
16/04/06 12:06:38 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.

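As you can see, the stages finish, which points to the driver. Please ignore the "(1 failed)"; that is just one particular node that sometimes refuses to work ^^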
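
One way to check that reading against the log: the BlockManagerInfo lines report "free: 511.1 MB" for 192.168.0.95:42051 (the address the Spark web UI also runs on, so presumably the driver) but "free: 28.5 GB" for the executors on the nodeNN hosts. The sketch below back-calculates the JVM heap from those figures, assuming Spark 1.6's default unified memory settings (300 MB reserved, spark.memory.fraction = 0.75) were not overridden. The object and method names are illustrative.

```scala
object DriverHeapFromLog {

  // Assumed Spark 1.6 defaults; a job may override the fraction.
  val ReservedMiB = 300.0
  val MemoryFraction = 0.75

  // unified region = (heap - reserved) * fraction
  // => heap = region / fraction + reserved
  def heapMiB(unifiedRegionMiB: Double): Double =
    unifiedRegionMiB / MemoryFraction + ReservedMiB

  def main(args: Array[String]): Unit = {
    val driverRegionMiB = 511.1            // "free: 511.1 MB" on 192.168.0.95:42051
    val executorRegionMiB = 28.5 * 1024.0  // "free: 28.5 GB" on the nodeNN hosts
    println(f"implied driver heap:   ${heapMiB(driverRegionMiB)}%.0f MiB")
    println(f"implied executor heap: ${heapMiB(executorRegionMiB)}%.0f MiB")
    // For comparison, the max heap of the JVM running this snippet
    println(f"this JVM's max heap:   ${Runtime.getRuntime.maxMemory / (1024.0 * 1024.0)}%.0f MiB")
  }
}
```

Under those assumptions the driver figure corresponds to a heap of a bit under 1 GB, which would match the default spark.driver.memory of 1g, while the executors come out near the 40 GB mentioned above. Since the log shows YarnClientSchedulerBackend (client mode), the driver heap would have to be set when the JVM is launched, for example with spark-submit's --driver-memory option, rather than on the SparkConf inside main.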

What can I do to solve this problem?
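
Two knobs that are sometimes tried in this situation are a higher minCount (fewer vocabulary entries) and a smaller vectorSize, both of which shrink the arrays that fit has to hold and serialize on the driver. A minimal sketch, assuming spark-mllib 1.6 is on the classpath; the values 20 and 50 are placeholders rather than recommendations, and fitSmaller is a hypothetical helper name.

```scala
import org.apache.spark.mllib.feature.{Word2Vec, Word2VecModel}
import org.apache.spark.rdd.RDD

object SmallerWord2Vec {

  // Trains with a reduced vocabulary and vector size.
  // Both values are illustrative assumptions, not tuned settings.
  def fitSmaller(wordSequence: RDD[Seq[String]]): Word2VecModel =
    new Word2Vec()
      .setMinCount(20)       // MLlib default is 5; rarer words are dropped
      .setVectorSize(50)     // MLlib default is 100; smaller vectors shrink the weights
      .setNumPartitions(500) // same value as in the program above
      .fit(wordSequence)
}
```

Whether this is enough depends on the actual vocabulary size; it reduces what the driver has to hold, but it does not raise the driver's heap limit.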
