scala - Referencing a column raises an ambiguous-reference error when joining two DataFrames, one of which holds an array of reference keys

niknxzdl  posted on 2021-07-09 in Spark

I have two DataFrames, shown below.
DataFrame one

+----------------------------+--------------+
| subject                    | marks        |
+----------------------------+--------------+
| Maths                      | 89           |
| English                    | 90           |
| Religion                   | 80           |
+----------------------------+--------------+

DataFrame two

+----------------------------+-------------------------------+
| name                       | subject                       |
+----------------------------+-------------------------------+
| Liza                       | [Maths]                       |
| Inter                      | [Religion, English]           |
| Ovin                       | [Maths, Religion, English]    |
+----------------------------+-------------------------------+

Expected output

+----------------------------+-------------------------------+
| name                       | marks                         |
+----------------------------+-------------------------------+
| Liza                       | [89]                          |
| Inter                      | [80, 90]                      |
| Ovin                       | [89, 80, 90]                  |
+----------------------------+-------------------------------+

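To get the output above, I need to join dataframeOne and dataframeTwo. However, the subject column in dataframeTwo holds arrays, while in dataframeOne it holds string values. I tried the code below, which fails with the error that follows.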

val newDataframe = dataframeTwo.withColumn("myMarks", struct('marks))
val studentMarksDataframe = dataframeOne.join(newDataframe, array_contains(subject, subject)).agg(collect_list('myMarks))

Error
Exception in thread "main" org.apache.spark.sql.AnalysisException: Reference 'subject' is ambiguous, could be: subject, subject.
How can I resolve this?

tyg4sfes  #1

You can try qualifying each subject column with the DataFrame it belongs to:

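import org.apache.spark.sql.functions.{array_contains, collect_list}

// dataframeTwo("subject") is the array column, dataframeOne("subject") the string column,
// so neither reference to "subject" is ambiguous.
val studentMarksDataframe = dataframeOne.join(
    dataframeTwo,
    array_contains(dataframeTwo("subject"), dataframeOne("subject"))
).groupBy("name").agg(collect_list("marks").as("marks"))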
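For anyone who wants to reproduce this end to end, the following is a minimal sketch of the same join run against the sample data from the question. The local SparkSession settings, the app name, and the names studentMarks and marksByName are assumptions made for illustration; they do not come from the original post. It is written in spark-shell or script style (in spark-shell a `spark` session already exists, so that line can be dropped).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, collect_list}

// Assumed local session, for illustration only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ambiguous-subject-demo") // hypothetical app name
  .getOrCreate()
import spark.implicits._

// Sample data copied from the question.
val dataframeOne = Seq(("Maths", 89), ("English", 90), ("Religion", 80))
  .toDF("subject", "marks")
val dataframeTwo = Seq(
  ("Liza", Seq("Maths")),
  ("Inter", Seq("Religion", "English")),
  ("Ovin", Seq("Maths", "Religion", "English"))
).toDF("name", "subject")

// Both inputs have a column named "subject", so each reference is
// qualified with its parent DataFrame to avoid the ambiguity error.
val studentMarks = dataframeOne
  .join(dataframeTwo, array_contains(dataframeTwo("subject"), dataframeOne("subject")))
  .groupBy("name")
  .agg(collect_list(dataframeOne("marks")).as("marks"))

// Pull the small result to the driver as name -> marks for inspection.
val marksByName = studentMarks
  .collect()
  .map(row => row.getString(0) -> row.getSeq[Int](1).toList)
  .toMap
```

Note that collect_list does not guarantee the order of the collected values after a shuffle, so the marks may not follow the order of the subjects in the array. If the order matters, the list would need to be sorted explicitly or rebuilt from the array positions.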
