First of all, sorry if my English is not good. I'm new to Spark. I have a DataFrame "raw":
+------------------------+----+------------------------+---+------+
|id |name|phone |sex|source|
+------------------------+----+------------------------+---+------+
|gEzIl5K+6n6GPLD0pAQKFA==|alex|na |M |1 |
|gEzIl5K+6n6GPLD0pAQKFA==|alex|+Uy8Ol77OWiSuuapn5FOUg==|na |2 |
+------------------------+----+------------------------+---+------+
"na" is the default placeholder string. source is the priority: 1 takes precedence over 2.
The result I expect is:
+------------------------+----+------------------------+---+------+
|id |name|phone |sex|source|
+------------------------+----+------------------------+---+------+
|gEzIl5K+6n6GPLD0pAQKFA==|alex|+Uy8Ol77OWiSuuapn5FOUg==|M |1 |
+------------------------+----+------------------------+---+------+
What I tried:
val rs = raw.orderBy(source)
  .groupBy(col("id"))
  .agg(
    first(when(col("phone") === "na" || col("phone") === "", col("phone"))).as("phone"),
    first(when(col("sex") === "na" || col("sex") === "", col("sex"))).as("sex"),
    first(when(col("source") === "na" || col("source") === "", col("source"))).as("source")
  )
But this does not give the expected result. I hope you can help. Thanks a lot!
1 Answer
tcbh2hod1#
Try this.
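Two things go wrong in the attempt from the question: the `when` condition is inverted (it keeps a value only when it is "na" or empty), and `orderBy` before `groupBy` does not guarantee which row `first` sees. One possible fix, as a minimal sketch: turn placeholders into nulls, pair each remaining value with its `source`, and take the `min` of that struct so the value from the lowest `source` wins. The names `MergeById`, `pick`, `mergeById` and `placeholders` are made up for illustration, and the sketch assumes `source` is an integer column (with a string column, "10" would sort before "2").

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{coalesce, col, lit, min, struct, when}

object MergeById {
  // Strings that count as "no value" (taken from the question).
  val placeholders: Seq[String] = Seq("na", "")

  // For column `c`: the real value coming from the row with the smallest `source`.
  // Placeholder (or null) values become a null struct, which min() ignores.
  // If a group has no real value at all, fall back to "na" (an assumption).
  def pick(c: String): Column = {
    val candidate = when(!col(c).isin(placeholders: _*), struct(col("source"), col(c)))
    coalesce(min(candidate).getField(c), lit("na")).as(c)
  }

  // One output row per id, with each column filled independently.
  def mergeById(raw: DataFrame): DataFrame =
    raw.groupBy(col("id"))
      .agg(pick("name"), pick("phone"), pick("sex"), min(col("source")).as("source"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("merge-by-id")
      .getOrCreate()
    import spark.implicits._

    // Sample data from the question; `source` is assumed to be an Int.
    val raw = Seq(
      ("gEzIl5K+6n6GPLD0pAQKFA==", "alex", "na", "M", 1),
      ("gEzIl5K+6n6GPLD0pAQKFA==", "alex", "+Uy8Ol77OWiSuuapn5FOUg==", "na", 2)
    ).toDF("id", "name", "phone", "sex", "source")

    mergeById(raw).show(false)
    spark.stop()
  }
}
```

Because the ordering lives inside the aggregate itself, the result does not depend on how Spark happens to order rows within a group. Note that `source` in the output is the smallest `source` for the id, not necessarily the row each individual value came from.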