Is there a way to convert a few thousand columns from string to integer while saving as a Parquet file?

Posted by g6baxovj on 2021-06-24 in Hive


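Using PySpark, I extract 1500 fields from a JSON file, save them as Parquet, and create a Hive external table on top of the files. Every field extracted from the JSON is a string, but in the Hive DDL all columns are declared as integer. When I save as Parquet and query the Hive table, I see the following error:
java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException
Is there a way to handle this error?
Casting the columns to int before saving as Parquet would help, but explicitly casting 1500 columns one by one is not practical.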

Hive apache-spark pyspark csv parquet

Source: https://stackoverflow.com/questions/56958949/is-there-a-way-to-convert-few-1000-columns-from-string-to-integer-while-saving

1 Answer


aoyhnmkz1#

One general approach is to cast every column in a single select, as shown below:

>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import col

# Consider df to be the dataframe from reading the JSON file.

>>> df.show()
+-------+------+
|details|header|
+-------+------+
|    def|   2.0|
+-------+------+

>>> df.printSchema()
root
 |-- details: string (nullable = true)
 |-- header: string (nullable = true)

# Convert all columns to integer type.

>>> df_parq=df.select(*(col(c).cast(IntegerType()).alias(c) for c in df.columns))
>>> df_parq.printSchema()
root
 |-- details: integer (nullable = true)
 |-- header: integer (nullable = true)

# Write file with modified column types to Parquet.

>>> df_parq.write.parquet(r'F:\Parquet\sample_out3')
>>> df_read_parq=spark.read.parquet(r'F:\Parquet\sample_out3')
>>> df_read_parq.printSchema()
root
 |-- details: integer (nullable = true)
 |-- header: integer (nullable = true)
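If only some of the columns should be cast, or the column names contain dots or other special characters, the same idea can be expressed as SQL expression strings passed to `DataFrame.selectExpr`. The following is a minimal sketch, not part of the original answer; the helper name `build_cast_exprs` and its `target_type` parameter are made up for illustration, and the column list is a stand-in for `df.columns`.

```python
def build_cast_exprs(columns, target_type="int"):
    """Return one Spark SQL expression string per column that casts it to target_type.

    Hypothetical helper for illustration; it only builds strings and does not need Spark.
    """
    exprs = []
    for name in columns:
        # Backticks quote the identifier; a literal backtick is escaped by doubling it.
        quoted = "`" + name.replace("`", "``") + "`"
        exprs.append(f"CAST({quoted} AS {target_type}) AS {quoted}")
    return exprs


# Stand-in for df.columns from the example above.
columns = ["details", "header"]
exprs = build_cast_exprs(columns)
print(exprs)

# With a live DataFrame this would be used as (not run here):
# df_parq = df.selectExpr(*build_cast_exprs(df.columns))
```

Note that in Spark's default (non-ANSI) mode a string that cannot be parsed as an integer, such as `def` in the sample row, is cast to null rather than raising an error, so it may be worth checking for unexpected nulls before writing the Parquet files.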

2021-06-24
