Getting the first value of a column that satisfies a condition when grouping a Spark DataFrame

b4lqfgs4  posted on 2021-05-27  in Spark

First of all, I apologize if my English is poor. I am a beginner with Spark. I have a DataFrame "raw":

+------------------------+----+------------------------+---+------+
|id                      |name|phone                   |sex|source|
+------------------------+----+------------------------+---+------+
|gEzIl5K+6n6GPLD0pAQKFA==|alex|na                      |M  |1     |
|gEzIl5K+6n6GPLD0pAQKFA==|alex|+Uy8Ol77OWiSuuapn5FOUg==|na |2     |
+------------------------+----+------------------------+---+------+

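"na" is the default string value. source is the priority, with 1 > 2 (source 1 takes precedence over source 2).
The result I expect is: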

+------------------------+----+------------------------+---+------+
|id                      |name|phone                   |sex|source|
+------------------------+----+------------------------+---+------+
|gEzIl5K+6n6GPLD0pAQKFA==|alex|+Uy8Ol77OWiSuuapn5FOUg==|M  |1     |
+------------------------+----+------------------------+---+------+

I tried:

val rs = raw.orderBy(source)
  .groupBy(col("id"))
  .agg(
    first(when(col("phone") === "na" || col("phone") === "", col("phone"))).as("phone"),
    first(when(col("sex") === "na" || col("sex") === "", col("sex"))).as("sex"),
    first(when(col("source") === "na" || col("source") === "", col("source"))).as("source")
  )

But the result is not correct. I hope you can help me. Thank you very much!

tcbh2hod  #1

Try this.

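import org.apache.spark.sql.functions.{col, min, when}

// df is the question's DataFrame (named raw there).
// when(...) without otherwise(...) yields null for "na" or "", and min skips nulls,
// so each aggregate keeps only a real value from the group.
df.orderBy("source")
  .groupBy(col("id"))
  .agg(min(when(!col("phone").isin("na", ""), col("phone"))).as("phone"),
    min(when(!col("sex").isin("na", ""), col("sex"))).as("sex"),
    min(when(!col("source").isin("na", ""), col("source"))).as("source"))
  .show()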

+--------------------+--------------------+---+------+
|                  id|               phone|sex|source|
+--------------------+--------------------+---+------+
|gEzIl5K+6n6GPLD0p...|+Uy8Ol77OWiSuuapn...|  M|     1|
+--------------------+--------------------+---+------+
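
One caveat with the snippet above: min picks the smallest surviving value per column, not the value from the highest-priority source, and an orderBy before groupBy is not guaranteed to control what an aggregate sees. If several rows in a group carry different real values, a variation that respects the source priority could look like the sketch below. It is only an illustration under some assumptions: the helper name firstValid is hypothetical, source is assumed to be numeric (a string column would compare lexicographically and may need a cast), and the session setup assumes a local run.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, min, struct, when}

// Assumption: local session for experimenting; getOrCreate reuses an existing one (e.g. in spark-shell).
val spark = SparkSession.builder().master("local[*]").appName("first-valid-per-group").getOrCreate()
import spark.implicits._

// Sample rows copied from the question.
val raw = Seq(
  ("gEzIl5K+6n6GPLD0pAQKFA==", "alex", "na", "M", 1),
  ("gEzIl5K+6n6GPLD0pAQKFA==", "alex", "+Uy8Ol77OWiSuuapn5FOUg==", "na", 2)
).toDF("id", "name", "phone", "sex", "source")

// Hypothetical helper: the value of column `name` taken from the row with the
// lowest source among the rows where that column is neither "na" nor "".
def firstValid(name: String): Column = {
  val valid = !col(name).isin("na", "")
  // Structs are compared field by field, so min orders by source first.
  // Rows that fail `valid` become null and are ignored by min.
  min(when(valid, struct(col("source"), col(name).as("value"))))
    .getField("value")
    .as(name)
}

val rs = raw.groupBy(col("id"))
  .agg(
    firstValid("name"),
    firstValid("phone"),
    firstValid("sex"),
    min(col("source")).as("source")
  )

rs.show(false)
```

For the two sample rows this should give the single row the question asks for. If every value of a column in a group is "na" or empty, the column comes back as null rather than "na".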
