Running spark-submit on AWS EMR fails when accessing S3

Asked by 5tmbdcev on 2021-05-16, in Spark

I wrote a Spark application and compiled it into a .jar file, which I can run from spark-shell --jars myApplication.jar on the master node of my EMR cluster:

scala> // pass in the existing spark context to the doSomething function, run with a particular argument.
scala> com.MyCompany.MyMainClass.doSomething(spark, "dataset1234")
...

This all works great. I have also set up a main function so that I can submit the job with spark-submit:

package com.MyCompany
import org.apache.spark.sql.SparkSession
object MyMainClass {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("myApp")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    doSomething(spark, args(0))
  }

  // implementation of doSomething(...) omitted
}

With a very simple main method that did nothing more than print args, I confirmed that I could invoke main via spark-submit. However, when I try to submit my actual production job on the cluster, it fails. I submit it like this:

spark-submit --deploy-mode cluster --class com.MyCompany.MyMainClass s3://my-bucket/myApplication.jar dataset1234

In the console I see a stream of messages, including some warnings, but nothing especially useful:

20/11/28 19:28:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/11/28 19:28:47 WARN DependencyUtils: Skip remote jar s3://my-bucket/myApplication.jar.
20/11/28 19:28:47 INFO RMProxy: Connecting to ResourceManager at ip-xxx-xxx-xxx-xxx.region.compute.internal/172.31.31.156:8032
20/11/28 19:28:47 INFO Client: Requesting a new application from cluster with 20 NodeManagers
20/11/28 19:28:48 INFO Configuration: resource-types.xml not found
20/11/28 19:28:48 INFO ResourceUtils: Unable to find 'resource-types.xml'.
20/11/28 19:28:48 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (24576 MB per container)
20/11/28 19:28:48 INFO Client: Will allocate AM container, with 2432 MB memory including 384 MB overhead
20/11/28 19:28:48 INFO Client: Setting up container launch context for our AM
20/11/28 19:28:48 INFO Client: Setting up the launch environment for our AM container
20/11/28 19:28:48 INFO Client: Preparing resources for our AM container
20/11/28 19:28:48 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
20/11/28 19:28:51 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_libs__8971082428743972083.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_libs__8971082428743972083.zip
20/11/28 19:28:53 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
20/11/28 19:28:53 INFO Client: Uploading resource s3://my-bucket/myApplication.jar -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/myApplication.jar
20/11/28 19:28:54 INFO S3NativeFileSystem: Opening 's3://my-bucket/myApplication.jar' for reading
20/11/28 19:28:54 INFO Client: Uploading resource file:/mnt/tmp/spark-e34d573d-8f23-403c-ac41-aa5154db8ecd/__spark_conf__5385616689365996012.zip -> hdfs://ip-xxx-xxx-xxx-xxx.region.compute.internal:8020/user/hadoop/.sparkStaging/application_1606587406989_0005/__spark_conf__.zip
20/11/28 19:28:54 INFO SecurityManager: Changing view acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls to: hadoop
20/11/28 19:28:54 INFO SecurityManager: Changing view acls groups to:
20/11/28 19:28:54 INFO SecurityManager: Changing modify acls groups to:
20/11/28 19:28:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
20/11/28 19:28:54 INFO Client: Submitting application application_1606587406989_0005 to ResourceManager
20/11/28 19:28:54 INFO YarnClientImpl: Submitted application application_1606587406989_0005
20/11/28 19:28:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:28:55 INFO Client:
         client token: N/A

Then, once per second for several minutes (six in this case), I get application reports with state: ACCEPTED, until the job fails with no useful information:

20/11/28 19:28:56 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
...
... (lots of these messages)
...
20/11/28 19:31:55 INFO Client: Application report for application_1606587406989_0005 (state: ACCEPTED)
20/11/28 19:34:52 INFO Client: Application report for application_1606587406989_0005 (state: FAILED)
20/11/28 19:34:52 INFO Client:
         client token: N/A
         diagnostics: Application application_1606587406989_0005 failed 2 times due to AM Container for appattempt_1606587406989_0005_000002 exited with  exitCode: 13
Failing this attempt.Diagnostics: [2020-11-28 19:32:24.087]Exception from container-launch.
Container id: container_1606587406989_0005_02_000001
Exit code: 13

[2020-11-28 19:32:24.117]Container exited with a non-zero exit code 13. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
Last 4096 bytes of stderr :
elled)
20/11/28 19:32:22 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
20/11/28 19:32:22 WARN TaskSetManager: Lost task 15.0 in stage 1.0 (TID 135, ip-xxx-xxx-xxx-xxx.region.compute.internal, executor driver): TaskKilled (Stage cancelled)

Eventually the logs show:

org.apache.spark.sql.AnalysisException: Path does not exist: s3://my-bucket/dataset1234.parquet;
        at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:759)

My application creates this file first; if the creation fails, it just silently ignores the failure and continues (to cover the case where the job has already run successfully and runs again, attempting to overwrite an existing file). The second part then reads that file back and does some additional work. From this error message I therefore know that my application runs and gets past the first part, but apparently Spark was unable to write the file to S3. And from the second console message above (WARN DependencyUtils: Skip remote jar ...), it also looks as though Spark may have failed to download the remote jar from S3 (I happened to have copied the file into ~hadoop/ before running spark-submit, so I don't know whether it failed the S3 download and simply found the local copy).
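
To make that failure mode concrete, here is a hypothetical sketch of the write-then-read pattern described above; the names, the stand-in DataFrame, and the swallowed exception are illustrative guesses, not the actual implementation:

import org.apache.spark.sql.SparkSession
import scala.util.Try

def doSomething(spark: SparkSession, dataset: String): Unit = {
  val path = s"s3://my-bucket/$dataset.parquet"
  val df = spark.range(10).toDF("id")     // stand-in for the real first-stage result
  Try(df.write.parquet(path))             // part 1: write; any failure is silently swallowed
  val readBack = spark.read.parquet(path) // part 2: read it back; this is where
  readBack.count()                        // "Path does not exist" surfaces if the write failed
}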
I got my spark-submit command by inspecting what the EMR AWS CLI export showed for the step I had created in the web interface. Is this a case of EMR lacking S3 permissions? That seems unlikely, but what else could it be? It does run my job, and it apparently succeeds in discovering that the file does not exist (i.e., it has read access to my bucket), yet it cannot create the file.
How can I get better debugging information? And is there a way to verify that the EMR <-> S3 permissions are set up correctly?
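
(One way to sanity-check write access directly, as a minimal sketch using the Hadoop FileSystem API from spark-shell on the master node; the test key name below is made up:)

import org.apache.hadoop.fs.{FileSystem, Path}

val testPath = new Path("s3://my-bucket/_permission_test")
val fs = testPath.getFileSystem(spark.sparkContext.hadoopConfiguration)
val out = fs.create(testPath)   // fails immediately with an access-denied error if the role cannot write
out.close()
fs.delete(testPath, false)      // clean up the test object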

Answer 1 (nsc4cvqm):

Get rid of the .master("local[*]") call. When running on a cluster and accessing files in S3, the master should not be local: with --deploy-mode cluster, YARN supplies the master, and hardcoding a local master inside the application is a common cause of the AM container exiting with code 13, as seen above.
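
A minimal sketch of what the fixed entry point could look like, assuming the rest of MyMainClass stays as in the question (the doSomething stub below is only a placeholder):

import org.apache.spark.sql.SparkSession

object MyMainClass {
  def main(args: Array[String]): Unit = {
    // No .master(...) here: on EMR, spark-submit/YARN provides the master;
    // for local testing you can still pass --master local[*] on the command line.
    val spark = SparkSession.builder()
      .appName("myApp")
      .getOrCreate()
    doSomething(spark, args(0))
  }

  def doSomething(spark: SparkSession, dataset: String): Unit = {
    // implementation as in the question (omitted)
  }
}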
