Duplicates while writing data in Spark managed tables if the DataFrame has more than one partition

Posted by gwbalxhn on 2021-05-26 in Spark


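I am using the code below to write data into a Spark managed table.
The DataFrame df has three partitions. After running this code, I see a total of 15 files in the flightinfo folder. Partition 1 created 5 files, partition 2 created another 5 files, and so on. These files contain the same data. I know I can use coalesce(1) to work around this. However, I would like to know whether there is a better way to handle it.
Thanks in advance for looking into this question.
Below is the actual code of the Spark application: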
from pyspark.sql import SparkSession
from pyspark.sql.functions import spark_partition_id
from main import init

if __name__ == "__main__":
    obj = init()

    spark = SparkSession.builder \
        .master("local[3]") \
        .appName("sparkFlightInfo") \
        .enableHiveSupport() \
        .getOrCreate()

    df = obj.create_data_frame(spark, "data/flight-time.txt")

    print(df.rdd.getNumPartitions())  # output is 3
    spark.catalog.setCurrentDatabase("NDW")

    # This write creates 15 files, 5 files per partition, and they are duplicates of each other
    df.write.format("csv") \
        .mode("overwrite") \
        .bucketBy(5, "OP_CARRIER", "ORIGIN") \
        .saveAsTable("newflightinfo")
    print(spark.catalog.listTables("NDW"))
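One possible reading of the 15 files, offered as an assumption rather than a confirmed diagnosis: with bucketBy, each write task emits its own file for every bucket it holds rows for, so the file count can reach (number of partitions) x (number of buckets), here 3 x 5 = 15. The toy model below uses no Spark at all. The names bucket_id and count_output_files, the crc32 stand-in hash, the round-robin split, and the sample rows are all hypothetical and only illustrate the arithmetic.

```python
import zlib


def bucket_id(key, num_buckets):
    # Stand-in for Spark's bucket hash (Spark uses Murmur3; crc32 is only illustrative).
    return zlib.crc32(repr(key).encode("utf-8")) % num_buckets


def count_output_files(rows, num_partitions, num_buckets):
    """Count files if every write task emits one file per bucket it holds rows for."""
    files = set()
    for i, row in enumerate(rows):
        partition = i % num_partitions  # toy round-robin split, not Spark's real partitioning
        files.add((partition, bucket_id(row, num_buckets)))
    return len(files)


# Hypothetical (OP_CARRIER, ORIGIN) pairs, made up for illustration.
rows = [(carrier, origin)
        for carrier in ("AA", "DL", "UA", "WN")
        for origin in ("ATL", "BOS", "DEN", "JFK", "LAX", "ORD", "SFO")]

print(count_output_files(rows, num_partitions=3, num_buckets=5))  # at most 3 * 5 = 15
print(count_output_files(rows, num_partitions=1, num_buckets=5))  # at most 5
```

In this model the files written by different tasks for the same bucket hold different rows, so comparing row contents across the 15 real files would help confirm whether they are true duplicates.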

Note: I tested the code with df.coalesce(1). I got 5 files, and the data in those 5 files is not duplicated.
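A commonly suggested alternative to coalesce(1), sketched here under assumptions: repartition the DataFrame on the bucketing columns, with the partition count equal to the bucket count, before calling bucketBy. The idea is that rows belonging to one bucket then sit in a single task, so each bucket should produce one file while the write still runs in parallel. The helper name write_bucketed is hypothetical, and the sketch assumes that repartition and bucketBy hash these columns consistently, which has not been verified against the Spark version used in the question.

```python
def write_bucketed(df, table_name, num_buckets=5,
                   bucket_cols=("OP_CARRIER", "ORIGIN")):
    """Write df as a bucketed managed table, aiming for one file per bucket.

    df is expected to be a pyspark.sql.DataFrame; the defaults mirror the
    bucketBy(5, "OP_CARRIER", "ORIGIN") call from the question.
    """
    (
        df.repartition(num_buckets, *bucket_cols)   # co-locate each bucket's rows in one task
        .write.format("csv")
        .mode("overwrite")
        .bucketBy(num_buckets, *bucket_cols)        # same count and columns as the repartition
        .saveAsTable(table_name)
    )


# Usage inside the application from the question (requires an active SparkSession):
# write_bucketed(df, "newflightinfo")
```

Compared with coalesce(1), this keeps up to num_buckets tasks writing concurrently instead of funnelling all rows through a single task.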

apache-spark pyspark databricks azure-databricks

Source: https://stackoverflow.com/questions/65193218/duplicates-while-writing-data-in-spark-managed-tables-if-data-frame-has-more-tha