How do I save Parquet to S3 from AWS SageMaker?

bt1cpqcv · posted 2021-06-01 in Hadoop

I want to save a Spark DataFrame from AWS SageMaker to S3. In the notebook I ran

    myDF.write.mode('overwrite').parquet("s3a://my-bucket/dir/dir2/")

and I got:
    Py4JJavaError: An error occurred while calling o326.parquet.
    : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
How should I do this correctly from the notebook? Thanks a lot!

uhry853o 1#

A SageMaker notebook instance is not running Spark code, and it does not have Hadoop or the other Java classes you are trying to call.
Normally, in a Jupyter notebook on SageMaker you would write Parquet files with a Python library such as pandas (see, for example, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_parquet.html), as sketched below.
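A minimal sketch of that approach; it assumes pyarrow and s3fs are installed in the notebook kernel, that the notebook's IAM role can write to the bucket, and that the bucket path is a placeholder:

    import pandas as pd

    # Build (or load) a plain pandas DataFrame in the notebook -- no Spark needed.
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # to_parquet can write directly to S3: pandas hands s3:// paths to s3fs.
    df.to_parquet("s3://my-bucket/dir/dir2/data.parquet", engine="pyarrow")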
Another option is to connect from the Jupyter notebook to an existing (or new) Spark cluster and execute the commands there remotely. See here for documentation on how to set up that connection: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
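Once the notebook is connected that way (the linked post uses the SparkMagic PySpark kernel talking to EMR over Apache Livy), the original write runs on the cluster, where Hadoop's S3 filesystem classes do exist. A hedged sketch, assuming that setup and a hypothetical input path:

    # Runs remotely on the EMR cluster via the SparkMagic (PySpark) kernel;
    # `spark` is the SparkSession the kernel provides.
    myDF = spark.read.json("s3://my-bucket/input/")  # hypothetical input path

    # On EMR, s3:// is served by EMRFS, so the Parquet write succeeds there.
    myDF.write.mode("overwrite").parquet("s3://my-bucket/dir/dir2/")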
