Spark: flattening a complex nested DataFrame

q43xntqr  published 2021-07-12 in Spark
Follow (0) | Answers (1) | Views (270)

I am trying to convert complex nested data into a flat DataFrame format.
My data schema:

root
 |-- column_names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- values: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: string (containsNull = true)
 |-- id: array (nullable = true)
 |    |-- element: string (containsNull = true)

My data file (JSON format):

{"column_names":["2_col_name","3_col_name"],"id":["a","b","c","d","e"],"values":[["2_col_1",1],["2_col_2",2],["2_col_3",9],["2_col_4",10],["2_col_5",11]]}

I am trying to transform the above data into the following format:

+----------+----------+----------+
|1_col_name|2_col_name|3_col_name|
+----------+----------+----------+
|         a|   2_col_1|         1|
|         b|   2_col_2|         2|
|         c|   2_col_3|         9|
|         d|   2_col_4|        10|
|         e|   2_col_5|        11|
+----------+----------+----------+

I tried using the explode function on id and values, but I get a different output, shown below:

+---+-------------+
| id|       values|
+---+-------------+
|  a| [2_col_1, 1]|
|  a| [2_col_2, 2]|
|  a| [2_col_3, 9]|
|  a|[2_col_4, 10]|
|  a|[2_col_5, 11]|
|  b| [2_col_1, 1]|
|  b| [2_col_2, 2]|
|  b| [2_col_3, 9]|
|  b|[2_col_4, 10]|
+---+-------------+
only showing top 9 rows

Not sure where I am going wrong.
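As an aside, the behavior above can be reproduced in plain Scala with a hypothetical miniature of the data (no Spark needed): exploding id and then exploding values acts like a cross product, so every id is paired with every values row.

```scala
// Miniature of the data (hypothetical values; plain Scala, no Spark)
val id = Seq("a", "b")
val values = Seq(Seq("2_col_1", "1"), Seq("2_col_2", "2"))

// Two explodes in sequence behave like a cross product:
// every id gets paired with every values row
val crossJoined = for (i <- id; v <- values) yield (i, v)
// 2 ids x 2 value rows = 4 rows, instead of the desired 2
```

This is why the output repeats each id for the full list of values rather than pairing element 0 of id with element 0 of values.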

czq61nw1 1#

You can use the arrays_zip + inline functions to flatten the arrays, then pivot on the column names:

import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = df.select(
    $"column_names",
    // pair each id with its values row by position, then explode
    // the zipped array into one row per (id, values) pair
    expr("inline(arrays_zip(id, values))")
  ).select(
    $"id".as("1_col_name"),
    // pair each column name with its value by position and explode
    expr("inline(arrays_zip(column_names, values))")
  )
  .groupBy("1_col_name")
  .pivot("column_names")
  .agg(first("values"))

df1.show
//+----------+----------+----------+
//|1_col_name|2_col_name|3_col_name|
//+----------+----------+----------+
//|e         |2_col_5   |11        |
//|d         |2_col_4   |10        |
//|c         |2_col_3   |9         |
//|b         |2_col_2   |2         |
//|a         |2_col_1   |1         |
//+----------+----------+----------+
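The positional pairing that arrays_zip performs can also be sketched end to end in plain Scala (a hypothetical miniature of the record, no Spark), which may make it clearer why zipping recovers exactly one output row per index:

```scala
// Miniature of the JSON record (plain Scala, no Spark)
val columnNames = Seq("2_col_name", "3_col_name")
val id = Seq("a", "b", "c")
val values = Seq(
  Seq("2_col_1", "1"),
  Seq("2_col_2", "2"),
  Seq("2_col_3", "9")
)

// arrays_zip(id, values) pairs elements by position; prepending each
// id to its values row yields one flat row per index, matching the
// desired header
val header = "1_col_name" +: columnNames
val rows = id.zip(values).map { case (i, vs) => i +: vs }
```

The pivot in the answer then does the remaining step of spreading the (column name, value) pairs back out into named columns.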
