I have a Spark DataFrame with two array columns, as shown below:
df = spark.createDataFrame(
    [(["Person", "Company", "Person", "Person"],
      ["John", "Company1", "Jenny", "Jessica"])],
    ["Type", "Value"])
df.show()
+--------------------+--------------------+
| Type| Value|
+--------------------+--------------------+
|[Person, Company,...|[John, Company1, ...|
+--------------------+--------------------+
I want to convert it into a tidy (one row per element) version, like this:
df = spark.createDataFrame(
    [
        ("Person", "John"),
        ("Company", "Company1"),
        ("Person", "Jenny"),
        ("Person", "Jessica"),
    ],
    ["Type", "Value"])
df.show()
+-------+--------+
| Type| Value|
+-------+--------+
| Person| John|
|Company|Company1|
| Person| Jenny|
| Person| Jessica|
+-------+--------+
A PySpark or Spark SQL solution is welcome. TIA.
1 Answer
mcdcgff01#
From Spark 2.4.0 on, use the arrays_zip function to zip the two arrays (lists) together, then explode the result. For Spark < 2.4, use a udf to create the zip. Example:
```
df = spark.createDataFrame(
    [(["Person", "Company", "Person", "Person"],
      ["John", "Company1", "Jenny", "Jessica"])],
    ["Type", "Value"])

from pyspark.sql.functions import arrays_zip, col, explode

df.withColumn("az", explode(arrays_zip(col("Type"), col("Value")))) \
  .select("az.*") \
  .show()
```
+-------+--------+
| Type| Value|
+-------+--------+
| Person| John|
|Company|Company1|
| Person| Jenny|
| Person| Jessica|
+-------+--------+
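For the Spark < 2.4 path mentioned above, the udf just pairs the two arrays element-wise, the same way Python's built-in `zip` does. A minimal sketch of that pairing logic (the `zip_arrays` name, the return schema, and the registration lines are illustrative assumptions, not from the original answer):

```python
# Plain-Python pairing logic that a Spark < 2.4 udf would wrap.
# The zip_arrays name is illustrative; any name works.
def zip_arrays(types, values):
    # Pair each Type with the Value at the same index.
    return list(zip(types, values))

# In PySpark this would be registered roughly as follows
# (requires an active SparkSession; schema is an assumption):
# from pyspark.sql.functions import udf, explode, col
# from pyspark.sql.types import ArrayType, StructType, StructField, StringType
# schema = ArrayType(StructType([
#     StructField("Type", StringType()),
#     StructField("Value", StringType()),
# ]))
# zip_udf = udf(zip_arrays, schema)
# df.withColumn("az", explode(zip_udf(col("Type"), col("Value")))) \
#   .select("az.*").show()

print(zip_arrays(["Person", "Company"], ["John", "Company1"]))
# → [('Person', 'John'), ('Company', 'Company1')]
```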
Using Spark SQL:
```
df.createOrReplaceTempView("tmp")
spark.sql("select col.* from (select explode(arrays_zip(Type, Value)) from tmp) q").show()
```
+-------+--------+
| Type| Value|
+-------+--------+
| Person| John|
|Company|Company1|
| Person| Jenny|
| Person| Jessica|
+-------+--------+