I have a DataFrame with the following structure:
| a | b | c |
-----------------------------------------------------------------------------
|01 |ABC | {"key1":"valueA","key2":"valueC"} |
|02 |ABC | {"key1":"valueA","key2":"valueC"} |
|11 |DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |
|12 |DEF | {"key1":"valueB","key2":"valueD", "key3":"valueE"} |
I want to transform it into this:
| a | b | key | value |
--------------------------------------------------------
|01 |ABC | key1 | valueA |
|01 |ABC | key2 | valueC |
|02 |ABC | key1 | valueA |
|02 |ABC | key2 | valueC |
|11 |DEF | key1 | valueB |
|11 |DEF | key2 | valueD |
|11 |DEF | key3 | valueE |
|12 |DEF | key1 | valueB |
|12 |DEF | key2 | valueD |
|12 |DEF | key3 | valueE |
This needs to be done efficiently, since the dataset can be quite large.
1 Answer
Try using `from_json` to parse the JSON column, then `explode` the resulting map. Example:
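```
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
// Outside spark-shell, also `import spark.implicits._` (needed for toDF).

val df = Seq(("01", "ABC", """{"key1":"valueA","key2":"valueC"}""")).toDF("a", "b", "c")

// Parse the JSON string as a map, so rows may carry different sets of keys.
val mapSchema = MapType(StringType, StringType)

// Exploding a map column yields one row per entry, in columns `key` and `value`.
df.withColumn("d", from_json(col("c"), mapSchema))
  .selectExpr("a", "b", "explode(d)")
  .show(10, false)
//+---+---+----+------+
//|a  |b  |key |value |
//+---+---+----+------+
//|01 |ABC|key1|valueA|
//|01 |ABC|key2|valueC|
//+---+---+----+------+
```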