Scala - Convert all columns of a Spark DataFrame to JSON, then include the JSON data as a column in another/parent DataFrame

mw3dktmi  posted on 2021-05-27 in Spark

I am converting a DataFrame (call it the child DataFrame) to JSON using df.toJSON.
After the JSON conversion, the schema looks like this:

root
 |-- value: string (nullable = true)
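
For reference, a minimal sketch of what produces that schema (childDF here is a hypothetical three-column DataFrame, not taken from my actual data):

// Hypothetical child DataFrame; toJSON serializes each row into a single JSON string column named "value".
val childDF = List((1, "a1", "b1"), (2, "a2", "b2")).toDF("id", "col1", "col2")
childDF.toJSON.printSchema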

I used the following suggestion to wrap the child DataFrame into an intermediate parent schema/DataFrame:

scala> parentDF.toJSON.select(struct($"value").as("data")).printSchema
root
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)

Now I still need to build up the parentDF schema further so that it looks like:

root
 |-- id
 |-- version 
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)

Q1) How do I build the id column from the value (i.e. value.id needs to be surfaced as a top-level id)?
Q2) I need to get the version from a different DataFrame (call it versionDF), where version is a constant (the same across all rows). Should I extract one row from versionDF to read the value of its version column, and then populate it into parentDF as a literal?
Any code snippets would be much appreciated.


5kgi1eie1#

Instead of toJSON, use to_json inside a select statement, selecting the required columns along with the to_json function.
Check the code below.

val version = versionDF.first.get(0) // get the constant version value from the single-row versionDF
parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version"))

scala> parentDF.select($"id",struct(to_json(struct($"*")).as("value")).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
 |-- version: double (nullable = false)

Update (wrapping the columns in an extra "value" object inside the JSON string):

scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)
 |-- version: integer (nullable = false)

scala> df.select($"id",to_json(struct(struct($"*").as("value"))).as("data"),lit(version).as("version")).show(false)
+---+------------------------------------------+-------+
|id |data                                      |version|
+---+------------------------------------------+-------+
|1  |{"value":{"id":1,"col1":"a1","col2":"b1"}}|1      |
|2  |{"value":{"id":2,"col1":"a2","col2":"b2"}}|1      |
|3  |{"value":{"id":3,"col1":"a3","col2":"b3"}}|1      |
+---+------------------------------------------+-------+

Update - 1 (producing the JSON without the extra "value" wrapper):

scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).printSchema
root
 |-- id: integer (nullable = false)
 |-- data: string (nullable = true)
 |-- version: integer (nullable = false)

scala> df.select($"id",to_json(struct($"*").as("value")).as("data"),lit(version).as("version")).show(false)
+---+--------------------------------+-------+
|id |data                            |version|
+---+--------------------------------+-------+
|1  |{"id":1,"col1":"a1","col2":"b1"}|1      |
|2  |{"id":2,"col1":"a2","col2":"b2"}|1      |
|3  |{"id":3,"col1":"a3","col2":"b3"}|1      |
+---+--------------------------------+-------+

dsekswqp2#

Try this:

scala> val versionDF = List((1.0)).toDF("version")
versionDF: org.apache.spark.sql.DataFrame = [version: double]

scala> versionDF.show
+-------+
|version|
+-------+
|    1.0|
+-------+

scala> val version = versionDF.first.get(0)
version: Any = 1.0
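
Note that first.get(0) returns Any. If a typed value is preferred, Row.getAs can be used instead (a small alternative, assuming the version column is a double as above):

// Typed extraction: returns Double instead of Any.
val typedVersion: Double = versionDF.first.getAs[Double]("version")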

scala>

scala> val childDF = List((1,"a1","b1"),(2,"a2","b2"),(3,"a3","b3")).toDF("id","col1","col2")
childDF: org.apache.spark.sql.DataFrame = [id: int, col1: string ... 1 more field]

scala> childDF.show
+---+----+----+
| id|col1|col2|
+---+----+----+
|  1|  a1|  b1|
|  2|  a2|  b2|
|  3|  a3|  b3|
+---+----+----+

scala>

scala> val parentDF =  childDF.toJSON.select(struct($"value").as("data")).withColumn("id",from_json($"data.value",childDF.schema).getItem("id")).withColumn("version",lit(version))
parentDF: org.apache.spark.sql.DataFrame = [data: struct<value: string>, id: int ... 1 more field]

scala> parentDF.printSchema
root
 |-- data: struct (nullable = false)
 |    |-- value: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- version: double (nullable = false)

scala> parentDF.show(false)
+----------------------------------+---+-------+
|data                              |id |version|
+----------------------------------+---+-------+
|[{"id":1,"col1":"a1","col2":"b1"}]|1  |1.0    |
|[{"id":2,"col1":"a2","col2":"b2"}]|2  |1.0    |
|[{"id":3,"col1":"a3","col2":"b3"}]|3  |1.0    |
+----------------------------------+---+-------+
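
If the original columns ever need to be recovered from the JSON payload, the same from_json pattern with childDF.schema can expand the string back out (a sketch reusing the parentDF built above):

// Parse the JSON string held in data.value back into a struct, then flatten it into columns.
parentDF
  .select(from_json($"data.value", childDF.schema).as("parsed"))
  .select($"parsed.*")
  .show(false)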
