引用“unit”不明确，可能是：unit，unit

sy5wg1nm 于 2021-05-19 发布在 Spark

关注(0)|答案(1)|浏览(365)

我正在尝试从一个s3 bucket加载所有传入的Parquet文件，并用delta-lake处理它们。我有个例外。

val df = spark.readStream().parquet("s3a://$bucketName/")

df.select("unit") //filter data!
        .writeStream()
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpointFolder)
        .start(bucketProcessed) //output goes in another bucket
        .awaitTermination()

它抛出一个异常，因为“unit”是不明确的。

我试过调试它。出于某种原因，它会两次找到“unit”。

这是怎么回事？可能是编码问题吗？
编辑：这是我创建spark会话的方式：

val spark = SparkSession.builder()
    .appName("streaming")
    .master("local")
    .config("spark.hadoop.fs.s3a.endpoint", endpoint)
    .config("spark.hadoop.fs.s3a.access.key", accessKey)
    .config("spark.hadoop.fs.s3a.secret.key", secretKey)
    .config("spark.hadoop.fs.s3a.path.style.access", true)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
    .config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
    .config("spark.sql.caseSensitive", true)
    .config("spark.sql.streaming.schemaInference", true)
    .config("spark.sql.parquet.mergeSchema", true)
    .orCreate

edit2:df.printschema（）的输出

2020-10-21 13:15:33,962 [main] WARN  org.apache.spark.sql.execution.datasources.DataSource -  Found duplicate column(s) in the data schema and the partition schema: `unit`;
root
 |-- unit: string (nullable = true)
 |-- unit: string (nullable = true)

Java apache-spark delta-lake kotlin amazon-s3

来源：https://stackoverflow.com/questions/64440298/reference-unit-is-ambiguous-could-be-unit-unit

1条答案

按热度按时间

9njqaruj1#

像这样读取相同的数据。。。

val df = spark.readStream().parquet("s3a://$bucketName/*")

…解决了问题。不管什么原因。我很想知道为什么…：(

赞(0）回复(0）举报 2021-05-20

我来回答

引用“unit”不明确，可能是：unit，unit

1条答案

相关问题

热门标签

最新问答