How do I groupby and extract the n highest values of a different column for each group in Scala?

7vhp5slm  posted on 2021-05-24  in Spark

Hi, how's it going? Here is a sample DataFrame:

val team_df = Seq(("yankees","aaron judge",24),("yankees","giancarlo stanton",20),("yankees","brett gardner",11),("dodgers","cody bellinger",20),("dodgers","jock pederson",10),
    ("dodgers","justin turner",15)).toDF("team","player","hits")

Here is a screenshot of it in table form:

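Suppose I want to return one DataFrame per team, each holding the 2 players with the most hits on that team (or the n highest).
So in this toy example it should return one DataFrame for the Yankees with Aaron Judge (24) and Giancarlo Stanton (20), and one DataFrame for the Dodgers with Cody Bellinger (20) and Justin Turner (15).
Thanks, and have a great day!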

dgiusagp  #1

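import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

def findMultipleDF(df: DataFrame, nHighest: Int): Map[String, DataFrame] = {
  // Rank players within each team by hits, highest first.
  // rank() gives tied rows the same rank, so a tie can return more than nHighest rows.
  val byTeam = Window.partitionBy("team").orderBy(col("hits").desc)
  val rankedDF = df.withColumn("Rank", rank().over(byTeam))
  // Collect the distinct team names to the driver
  val teams = df.select("team").distinct().collect().map(_.getString(0))
  // Build an immutable Map with one filtered DataFrame per team
  teams.map { team =>
    team -> rankedDF.filter(col("team") === team && col("Rank") <= nHighest)
  }.toMap
}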

val output = findMultipleDF(team_df, 2)
// show() only has a side effect, so foreach is used instead of map
output.foreach { case (_, teamDF) => teamDF.show() }
+-------+--------------+----+----+
|   team|        player|hits|Rank|
+-------+--------------+----+----+
|dodgers|cody bellinger|  20|   1|
|dodgers| justin turner|  15|   2|
+-------+--------------+----+----+
+-------+-----------------+----+----+
|   team|           player|hits|Rank|
+-------+-----------------+----+----+
|yankees|      aaron judge|  24|   1|
|yankees|giancarlo stanton|  20|   2|
+-------+-----------------+----+----+

You can try it as shown above, though I'm not sure why you want the output in separate DataFrames.
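If separate DataFrames are not actually required, the same window idea can keep everything in a single DataFrame. The following is only a minimal sketch under a few assumptions: `topNPerGroup` is a hypothetical helper name (not part of Spark), the local `SparkSession` is created just so the snippet is self-contained (in spark-shell, `spark` already exists), and `row_number()` is swapped in for `rank()` so that ties never produce more than n rows per team.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Assumption: a local session for this sketch only.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("top-n-per-team")
  .getOrCreate()
import spark.implicits._

// Same sample data as in the question.
val team_df = Seq(
  ("yankees", "aaron judge", 24), ("yankees", "giancarlo stanton", 20),
  ("yankees", "brett gardner", 11), ("dodgers", "cody bellinger", 20),
  ("dodgers", "jock pederson", 10), ("dodgers", "justin turner", 15)
).toDF("team", "player", "hits")

// Hypothetical helper: keep the n rows with the most hits per team,
// all in one DataFrame.
def topNPerGroup(df: DataFrame, n: Int): DataFrame = {
  val byTeam = Window.partitionBy("team").orderBy(col("hits").desc)
  df.withColumn("rn", row_number().over(byTeam)) // 1, 2, 3, ... within each team
    .filter(col("rn") <= n)                      // keep only the first n rows
    .drop("rn")                                  // remove the helper column
}

val top2 = topNPerGroup(team_df, 2)
top2.show()
```

The trade-off is in how ties are handled: `row_number()` returns at most n rows per team but picks arbitrarily among players with equal hits, while `rank()` (as in the answer above) keeps every tied player and may therefore return more than n rows. If per-team DataFrames are still needed later, the single result can be filtered by team afterwards.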
