Spark SQL cannot see HDFS file

yhuiod9q · posted 2021-05-29 in Hadoop
Follow (0) | Answers (2) | Views (512)

I have a Spark application that runs on an AWS EMR cluster.
I have added a file to HDFS with:

javaSparkContext.addFile(filePath, recursive);
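
For context: addFile distributes the file into each node's local scratch directory (it does not, by itself, write to HDFS), and the distributed copy is usually resolved with the SparkFiles API. A minimal sketch, assuming the file name from the question:

import org.apache.spark.SparkFiles;

// Local path of the copy that addFile placed on the current node.
String localAvroPath = SparkFiles.get("test.avro");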

The file exists (per the logs it is readable/executable/writable), but I cannot read anything from it with the Spark SQL API:

LOGGER.info("Spark working directory: " + path);
 File file = new File(path + "/test.avro");
 LOGGER.info("SPARK PATH:" + file);
 LOGGER.info("read:" + file.canRead());
 LOGGER.info("execute:" + file.canExecute());
 LOGGER.info("write:" + file.canWrite());
 Dataset<Row> load = getSparkSession()
                      .read()
                      .format(AVRO_DATA_BRICKS_LIBRARY)
                      .load(file.getAbsolutePath());

The logs:

17/08/07 15:03:25 INFO SparkContext: Added file /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/container_1502118042722_0001_01_000001/test.avro at spark://HOST:PORT/files/test.avro with timestamp 1502118205059
17/08/07 15:03:25 INFO Utils: Copying /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/container_1502118042722_0001_01_000001/test.avro to /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec/test.avro
17/08/07 15:03:25 INFO AbstractS3Calculator: Spark working directory: /mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec
17/08/07 15:03:25 INFO AbstractS3Calculator: SPARK PATH:/mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec/test.avro
17/08/07 15:03:25 INFO AbstractS3Calculator: read:true
17/08/07 15:03:25 INFO AbstractS3Calculator: execute:true
17/08/07 15:03:25 INFO AbstractS3Calculator: write:true

org.apache.spark.sql.AnalysisException: Path does not exist: hdfs://HOST:PORT/mnt/yarn/usercache/hadoop/appcache/application_1502118042722_0001/spark-d5b494fc-2613-426f-80fc-ca66279c2194/userFiles-44aad2e8-04f4-420b-9b5e-a1ccde5db9ec/test.avro;
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:382)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$14.apply(DataSource.scala:370)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:344)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:152)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:135)
    at odh.spark.services.algorithms.calculators.RiskEngineS3Calculator.getInputMembers(RiskEngineS3Calculator.java:76)
    at odh.spark.services.algorithms.calculators.RiskEngineS3Calculator.getMembersDataSets(RiskEngineS3Calculator.java:124)
    at odh.spark.services.algorithms.calculators.AbstractS3Calculator.calculate(AbstractS3Calculator.java:50)
    at odh.spark.services.ProgressSupport.start(ProgressSupport.java:47)
    at odh.spark.services.Engine.startCalculations(Engine.java:102)
    at odh.spark.services.Engine.startCalculations(Engine.java:135)
    at odh.spark.SparkApplication.main(SparkApplication.java:19)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)

2guxujil1#

Check whether the file actually exists on HDFS: hadoop fs -ls /home/spark/ # or your working directory instead of /home/spark
If the file is on HDFS, then it looks like a problem on the Spark side: follow the guidance described above, or update to the latest Spark version.
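
A minimal sketch of the same check done from code, assuming only the standard Hadoop FileSystem API (the directory is the same illustrative one as above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Programmatic equivalent of `hadoop fs -ls /home/spark/`:
// list the directory on the cluster's default filesystem.
FileSystem fs = FileSystem.get(new Configuration());
for (FileStatus status : fs.listStatus(new Path("/home/spark/"))) {
    System.out.println(status.getPath());
}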


ev7lccsx2#

By default, files end up under the /user/hadoop/ folder in HDFS (you could rely on that and load from this constant location, but it is better to use an absolute path).
To upload the file to HDFS and then use it, I used an absolute path:

new Configuration().get("fs.defaultFS"); // get the HDFS root
....
FileSystem hdfs = getHdfsFileSystem();
hdfs.copyFromLocalFile(true, true, new Path(srcLocalPath), new Path(destHdfsPath));
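
getHdfsFileSystem() and destHdfsPath are the answerer's own names; a minimal sketch of how they might be obtained, assuming only the standard Hadoop FileSystem API (the destination file name is hypothetical):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

// Hypothetical helper: the FileSystem behind fs.defaultFS (HDFS on EMR).
FileSystem getHdfsFileSystem() throws IOException {
    return FileSystem.get(new Configuration());
}

// Hypothetical destination: an absolute path with an explicit scheme.
// A path without a scheme is resolved against fs.defaultFS, which is
// exactly why the local path in the question was looked up on HDFS.
String hdfsRoot = new Configuration().get("fs.defaultFS"); // e.g. hdfs://HOST:PORT
String destHdfsPath = hdfsRoot + "/user/hadoop/test.avro";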

where destHdfsPath is an absolute path (e.g. 'hdfs://...../test.avro').
Then you can load this data from HDFS:

return getSparkSession()
                .read()
                .format(AVRO_DATA_BRICKS_LIBRARY)
                .load(absoluteFilePath);

Note: I needed to add some permissions: FileUtil.chmod(hdfsDest, "u+rw,g+rw,o+rw");
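
Note that FileUtil.chmod applies the change on the local filesystem; for a file that lives on HDFS itself, the FileSystem API's setPermission is the usual route. A minimal sketch, assuming the hdfs handle and destHdfsPath from the snippet above:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Equivalent permission change through the HDFS API:
// read/write for user, group and others on the uploaded file.
hdfs.setPermission(new Path(destHdfsPath), FsPermission.valueOf("-rw-rw-rw-"));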
