Apache Spark: reading data with Structured Streaming in PySpark and writing output files of 100 MB each

Posted by jvlzgdj9 on 2021-05-27 in Spark


Hope you are all doing well. I am using Structured Streaming to read files from a directory:

from pyspark.sql.types import StructField, StructType, StringType

schema = StructType([
    StructField("RowNo", StringType()),
    StructField("InvoiceNo", StringType()),
    StructField("StockCode", StringType()),
    StructField("Description", StringType()),
    StructField("Quantity", StringType()),
    StructField("InvoiceDate", StringType()),
    StructField("UnitPrice", StringType()),
    StructField("CustomerId", StringType()),
    StructField("Country", StringType()),
    StructField("InvoiceTimestamp", StringType())
])

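# Assumes an existing SparkSession named `spark`.
# The "header" option only applies to CSV sources, so it is omitted for ORC.
data = (
    spark.readStream.format("orc")
    .schema(schema)
    .option("path", "<path_here>")
    .load()
)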

After applying some transformations, I would like to save the output as files of about 100 MB each.

apache-spark pyspark apache-spark-sql pyspark-dataframes spark-structured-streaming

Source: https://stackoverflow.com/questions/62869710/reading-data-using-structured-streaming-in-pyspark-and-wants-to-write-data-with

1 Answer


mlnl4t2r1#

You should override the default HDFS block size.

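sc = spark.sparkContext  # assumes an existing SparkSession named `spark`
block_size = str(1024 * 1024 * 100)  # 100 MB expressed in bytes

sc._jsc.hadoopConfiguration().set("dfs.block.size", block_size)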
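
The block-size setting can also be wrapped in a small helper. This is only a sketch, not the answerer's code: `block_size_bytes` and `apply_block_size` are hypothetical names, and writing the value under `dfs.blocksize` (the current Hadoop name for the deprecated `dfs.block.size` key) as well as the original key is an assumption about which key a given Hadoop version reads.

```python
def block_size_bytes(megabytes: int) -> str:
    """Return a size in megabytes as the byte-count string Hadoop expects."""
    return str(megabytes * 1024 * 1024)


def apply_block_size(hadoop_conf, megabytes: int) -> str:
    """Set the block size on any object exposing set(key, value).

    Hypothetical helper: both the current and the deprecated key are set,
    on the assumption that the cluster honours at least one of them.
    """
    value = block_size_bytes(megabytes)
    for key in ("dfs.blocksize", "dfs.block.size"):
        hadoop_conf.set(key, value)
    return value
```

With a live session it would presumably be called as `apply_block_size(spark.sparkContext._jsc.hadoopConfiguration(), 100)` before the streaming query is started.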

Reference: How to change the HDFS block size in PySpark?
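
Separately from the block-size setting, the arithmetic behind a 100 MB target can be sketched as follows. This is an illustrative assumption, not part of the answer: `files_needed` and `write_batch` are hypothetical names, the per-batch byte estimate has to be supplied by the caller, and the on-disk ORC size will usually differ from any in-memory estimate because of compression.

```python
import math

TARGET_FILE_BYTES = 100 * 1024 * 1024  # the 100 MB target from the question


def files_needed(batch_bytes: int, target_bytes: int = TARGET_FILE_BYTES) -> int:
    """Number of output files so that each holds at most about target_bytes."""
    return max(1, math.ceil(batch_bytes / target_bytes))


def write_batch(batch_df, batch_id, estimated_bytes, out_path):
    """Hypothetical foreachBatch callback: one partition per output file.

    estimated_bytes is an assumption supplied by the caller; Spark does not
    pass it in.
    """
    n = files_needed(estimated_bytes)
    batch_df.repartition(n).write.mode("append").format("orc").save(out_path)
```

Such a callback could be bound with `functools.partial(write_batch, estimated_bytes=..., out_path=...)` and handed to `writeStream.foreachBatch`, which invokes it with the micro-batch DataFrame and the batch id.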

Answered on 2021-05-27
