Unable to set stripe size for the ORC file using Python Spark

Posted by 3wabscal on 2021-06-25 in Hive


I have configured the SparkSession to set the ORC stripe size to 128 MB, but the Spark DataFrame is still being written out as small files (~5 MB each).
Code:


# Creating Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local")\
.appName("test-optimal-orc")\
.config("spark.sql.orc.stripe.size", "134217728")\
.config("spark.sql.orc.impl", "native")\
.config("spark.sql.hive.convertMetastoreOrc", "true")\
.config("orc.stripe.size","134217728")\
.getOrCreate()

# df is an existing DataFrame; the S3 path is a placeholder
df.write.mode("overwrite").option("orc.stripe.size", "134217728").orc("<S3://location>")

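When I inspect the dump of one of the output ORC files, it contains only a single stripe of 300,000 rows, so the settings in my code are apparently not being honored.
A snippet of my ORC file dump: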

File Version: 0.12 with ORC_135
Rows: 300000
Compression: SNAPPY
Compression size: 262144
.
.
.

Stripe Statistics:
Stripe 1:
Column 0: count: 300000 hasNull: false
Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
.
.
.

File Statistics:
Column 0: count: 300000 hasNull: false
Column 1: count: 300000 hasNull: false min: max: 84151168997 sum: 2000000
Column 2: count: 300000 hasNull: false min: max: 800016871046 sum: 1730000
Column 3: count: 300000 hasNull: false min: 8059509582 max: 8065279467 sum: 3000000
.
.
.

Stripes:
Stripe: offset: 3 data: 5446727 rows: 300000 tail: 808 index: 13970
Stream: column 0 section ROW_INDEX start: 3 length 29
Stream: column 1 section ROW_INDEX start: 32 length 537
Stream: column 2 section ROW_INDEX start: 569 length 368
Stream: column 3 section ROW_INDEX start: 937 length 550

File length: 5463339 bytes
Padding length: 0 bytes
Padding ratio: 0%
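
As a rough sanity check on the dump above, the sketch below adds up the index, data, and tail sizes of the single stripe and compares the total with the configured `orc.stripe.size`. It assumes the stripe size acts as an upper threshold at which the writer flushes a stripe, not as a minimum, so a file holding only about 5 MB would end up with one stripe whatever the setting. The on-disk figures are compressed, so the comparison is only approximate. `parse_stripe` is a hypothetical helper written for this illustration.

```python
import re

# Stripe line copied verbatim from the dump above.
STRIPE_LINE = "Stripe: offset: 3 data: 5446727 rows: 300000 tail: 808 index: 13970"
# Value passed as orc.stripe.size in the question (128 MiB).
CONFIGURED_STRIPE_BYTES = 134217728


def parse_stripe(line):
    """Return the numeric fields of one 'Stripe:' line as a dict of ints."""
    return {key: int(value) for key, value in re.findall(r"(\w+): (\d+)", line)}


stripe = parse_stripe(STRIPE_LINE)
# On-disk size of the stripe: row index + column data + stripe footer.
stripe_bytes = stripe["index"] + stripe["data"] + stripe["tail"]
fill_ratio = stripe_bytes / CONFIGURED_STRIPE_BYTES

print(f"stripe occupies {stripe_bytes} bytes on disk")
print(f"that is {fill_ratio:.1%} of the configured stripe size")
```

If that reading is right, the single small stripe says more about how little data each output file received than about whether the option was ignored.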

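I am trying to save the ORC files to S3 with optimal stripe and stride sizes for my Hive external table, so that reads are efficient.
Please suggest how to set the stripe and stride sizes when writing ORC files from PySpark code. I have already searched Stack Overflow and the Cloudera community without making any progress.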
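
For illustration, here is a minimal sketch of one way the write could be set up. It assumes that `orc.*` keys passed as write options are forwarded to the ORC writer (the Spark documentation shows this pattern for keys such as `orc.bloom.filter.columns`, but behavior may differ between versions and between the `native` and `hive` implementations), and that fewer output files are wanted so each one holds enough data to fill a stripe. `orc_write_options`, `write_orc`, and the `num_files` parameter are hypothetical names introduced for this example.

```python
def orc_write_options(stripe_mb=128, stride_rows=10000):
    """Build ORC writer options as strings (hypothetical helper).

    orc.stripe.size is in bytes; orc.row.index.stride is in rows.
    """
    return {
        "orc.stripe.size": str(stripe_mb * 1024 * 1024),
        "orc.row.index.stride": str(stride_rows),
    }


def write_orc(df, path, num_files=1, **option_kwargs):
    """Write df as ORC with the options above (hypothetical helper).

    repartition(num_files) controls how many output files are produced,
    which in turn determines how much data each file (and stripe) gets.
    """
    (
        df.repartition(num_files)
        .write.mode("overwrite")
        .options(**orc_write_options(**option_kwargs))
        .orc(path)
    )
```

A call might look like `write_orc(df, "<S3://location>", num_files=4)`, with the path placeholder kept from the question. Whether the resulting stripes approach 128 MB would still need to be confirmed with an ORC file dump.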

Hive python apache-spark pyspark orc

Source: https://stackoverflow.com/questions/58270707/unable-to-set-stripe-size-for-the-orc-file-using-python-spark