How do I save Parquet to S3 from AWS SageMaker?

bt1cpqcv · posted 2021-06-01 in Hadoop

I want to save a Spark DataFrame from AWS SageMaker to S3. In the notebook I ran

    myDF.write.mode('overwrite').parquet("s3a://my-bucket/dir/dir2/")

and I got:
    Py4JJavaError: An error occurred while calling o326.parquet.
    : java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
        at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
        at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
        at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:394)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:471)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:50)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:609)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:508)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3native.NativeS3FileSystem not found
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
How should I do this correctly from the notebook? Thanks a lot!

uhry853o 1#

A SageMaker notebook instance is not running Spark code, and it does not have Hadoop or the other Java classes you are trying to call.
Normally, in a Jupyter notebook on SageMaker you would write Parquet files with a Python library such as pandas (see, for example, https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_parquet.html), as sketched below.
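A minimal sketch of that approach; it assumes pyarrow and s3fs are installed in the notebook kernel, that the notebook's IAM role can write to the bucket, and that the bucket path is a placeholder:

    import pandas as pd

    # Build (or load) a plain pandas DataFrame in the notebook -- no Spark needed.
    df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

    # to_parquet can write directly to S3: pandas hands s3:// paths to s3fs.
    df.to_parquet("s3://my-bucket/dir/dir2/data.parquet", engine="pyarrow")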
Another option is to connect from the Jupyter notebook to an existing (or new) Spark cluster and execute the commands there remotely. See here for documentation on how to set up that connection: https://aws.amazon.com/blogs/machine-learning/build-amazon-sagemaker-notebooks-backed-by-spark-in-amazon-emr/
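Once the notebook is connected that way (the linked post uses the SparkMagic PySpark kernel talking to EMR over Apache Livy), the original write runs on the cluster, where Hadoop's S3 filesystem classes do exist. A hedged sketch, assuming that setup and a hypothetical input path:

    # Runs remotely on the EMR cluster via the SparkMagic (PySpark) kernel;
    # `spark` is the SparkSession the kernel provides.
    myDF = spark.read.json("s3://my-bucket/input/")  # hypothetical input path

    # On EMR, s3:// is served by EMRFS, so the Parquet write succeeds there.
    myDF.write.mode("overwrite").parquet("s3://my-bucket/dir/dir2/")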
