Spark: nested JSON data and duplicate column names (PySpark)

Posted by doinxwow on 2021-05-27 in Spark

I'm working with JSON data, and my goal is to flatten it. I know that when I want a nested column I can select it with dot notation such as attributes.id, where id is nested inside the attributes column:

df = df.select('attributes.id')

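The problem is that df already has a column called id. Since Spark keeps only the last part after the . as the column name, I now have duplicate column names. What is the best way to handle this? Ideally the new column would be called attributes_id to distinguish it from the id column.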

JSON python apache-spark pyspark python-3.x

Source: https://stackoverflow.com/questions/63454417/spark-nested-json-data-and-duplicate-column-names-pyspark

2 answers

8wtpewkr1#

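Either add the nested field as a renamed column with .withColumn, or flatten the DataFrame and then rename all of its columns with .toDF(). Example:

```
# Sample JSON file data:
# {"id":1,"attributes":{"id":10}}

from pyspark.sql.functions import col

# `spark` is assumed to be an existing SparkSession
spark.read.json("<json_file_path>").printSchema()
# root
#  |-- attributes: struct (nullable = true)
#  |    |-- id: long (nullable = true)
#  |-- id: long (nullable = true)

# Option 1: copy the nested field into a new top-level column, then drop the struct
(spark.read.json("<json_file_path>")
    .withColumn("attributes_id", col("attributes.id"))
    .drop("attributes")
    .show())
# +---+-------------+
# | id|attributes_id|
# +---+-------------+
# |  1|           10|
# +---+-------------+

# Option 2: expand the struct, then rename every column positionally with toDF
columns = ['id', 'attributes_id']

(spark.read.json("<json_file_path>")
    .select("id", "attributes.*")
    .toDF(*columns)
    .show())
# +---+-------------+
# | id|attributes_id|
# +---+-------------+
# |  1|           10|
# +---+-------------+
```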

If you want to flatten the DataFrame dynamically, see this link.
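As a rough illustration of what dynamic flattening can look like (this is not the linked solution, only a minimal sketch), the helper below walks the dict form of a schema, as returned by df.schema.jsonValue(), and pairs each leaf field's dotted path with an underscore-joined alias. The names flat_columns, flatten and sample_schema are hypothetical and introduced only for this example. The sketch assumes that field names contain no dots and that arrays of structs should be left unexpanded.

```python
def flat_columns(schema_json, prefix=""):
    """Return (dotted_path, flat_alias) pairs for every leaf field.

    schema_json is the dict form of a StructType, e.g. df.schema.jsonValue().
    """
    pairs = []
    for field in schema_json["fields"]:
        path = f"{prefix}{field['name']}"
        ftype = field["type"]
        if isinstance(ftype, dict) and ftype.get("type") == "struct":
            # Nested struct: recurse, extending the dotted prefix
            pairs.extend(flat_columns(ftype, prefix=path + "."))
        else:
            # Leaf (or array/map, left as-is): alias is the path joined by "_"
            pairs.append((path, path.replace(".", "_")))
    return pairs


def flatten(df):
    """Select every leaf field of df under its underscore-joined alias."""
    # Imported here so flat_columns stays usable without pyspark installed
    from pyspark.sql.functions import col
    return df.select([col(p).alias(a) for p, a in flat_columns(df.schema.jsonValue())])


# Hand-written schema for the sample record {"id":1,"attributes":{"id":10}}
sample_schema = {
    "type": "struct",
    "fields": [
        {"name": "attributes", "nullable": True, "metadata": {},
         "type": {"type": "struct", "fields": [
             {"name": "id", "type": "long", "nullable": True, "metadata": {}}]}},
        {"name": "id", "type": "long", "nullable": True, "metadata": {}},
    ],
}
print(flat_columns(sample_schema))
# [('attributes.id', 'attributes_id'), ('id', 'id')]
```

Calling flatten(df) on the sample DataFrame should then give the columns attributes_id and id. Note that an alias built this way can still collide with an existing top-level column that happens to be named attributes_id, so a real implementation may want to check for duplicates.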

qacovj5a2#

After selecting the nested column, you can rename it with .alias("attributes_id").
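A minimal sketch of the alias approach, assuming df is the DataFrame from the question. The helper names nested_alias and select_with_aliases are hypothetical and exist only for this example. They build the alias from the full dotted path, so attributes.id becomes attributes_id while a top-level id keeps its name.

```python
def nested_alias(path):
    """Turn a dotted column path such as 'attributes.id' into 'attributes_id'."""
    return path.replace(".", "_")


def select_with_aliases(df, *paths):
    """Select each dotted path under an alias built from its full path."""
    # Imported here so nested_alias stays usable without pyspark installed
    from pyspark.sql.functions import col
    return df.select([col(p).alias(nested_alias(p)) for p in paths])
```

For a single column the helper is not needed: df.select("id", col("attributes.id").alias("attributes_id")) does the same thing, and select_with_aliases(df, "id", "attributes.id") is just a shorthand when several nested paths are involved.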
