如何用列值联接表

agyaoht7  于 2021-05-27  发布在  Spark
关注(0)|答案(1)|浏览(250)

下面有两个Dataframe:
课程:

-------------------------—
Name    | Course
-------------------------—
Tom     | English
Tom     | Mathematics
Lily    | English
Lily    | Chinese

分数:

-------------------------—-------—-------—
Name    | English | Chinese | Mathematics
-------------------------—-------—-------—
Tom     | 78      | 0       | 90
Lily    | 85      | 88      | 0

我想使用course表中的course value连接score表中指定的列以获取值。然后得到如下结果,只显示2个课程列。
结果:

-------------------------—--
Name    | Course1 | Course2
-------------------------—--
Tom     | 78      | 90
Lily    | 85      | 88
rkttyhzu

rkttyhzu1#

您可以通过取消激活Dataframe,然后连接两个Dataframe,然后旋转结果Dataframe来实现这一点

df1 = spark.createDataFrame([('Tom','English'),
('Tom','Mathematics'),
('Lily','English'),
('Lily','Chinese')],['Name','Course'])

df2 = spark.createDataFrame([('Tom',78,0,90),('Lily',85,88,0)],['Name','English','Chinese','Mathematics'])

df3 = df2.select('Name', expr('''stack(3,'English',English,'Chinese',Chinese,'Mathematics',Mathematics) as (Course, score)'''))

df4 = df3.join(df1, (df1.Name==df3.Name) & (df1.Course==df3.Course), 'inner').select(df1.Name,df1.Course,df3.score).\
groupBy('Name').pivot('Course').agg(min('score'))

df4.show()

+----+-------+-------+-----------+
|Name|Chinese|English|Mathematics|
+----+-------+-------+-----------+
| Tom|   null|     78|         90|
|Lily|     88|     85|       null|
+----+-------+-------+-----------+

相关问题