Replace values based on a condition

m0rkklqb  asked on 2021-05-27  in Spark

I have a dataset, and within each (id, date) group I want to replace the result column with the result value of the row that has the minimum quantity:

id,date,quantity,result
1,2016-01-01,245,1
1,2016-01-01,345,3
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1

Expected output below: in each groupby(id, date) group, every row's result is replaced by the result of the minimum-quantity row. Row order does not matter here; any order is fine.

id,date,quantity,result
1,2016-01-01,245,2
1,2016-01-01,345,2
1,2016-01-01,123,2
1,2016-01-02,120,5
2,2016-01-01,567,1
2,2016-01-01,568,1
2,2016-01-02,453,1
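The intended transformation can be sketched in plain Python (the function and variable names here are illustrative, not from the original post): find the minimum-quantity row per (id, date) group, then broadcast its result to every row of that group.

```python
# Sample data as (id, date, quantity, result) tuples, from the question.
rows = [
    (1, "2016-01-01", 245, 1),
    (1, "2016-01-01", 345, 3),
    (1, "2016-01-01", 123, 2),
    (1, "2016-01-02", 120, 5),
    (2, "2016-01-01", 567, 1),
    (2, "2016-01-01", 568, 1),
    (2, "2016-01-02", 453, 1),
]

def replace_with_min_quantity_result(rows):
    # Track (quantity, result) of the minimum-quantity row per (id, date).
    best = {}
    for id_, date, qty, res in rows:
        key = (id_, date)
        if key not in best or qty < best[key][0]:
            best[key] = (qty, res)
    # Rewrite every row's result with its group's stored result.
    return [(id_, date, qty, best[(id_, date)][1])
            for id_, date, qty, _ in rows]

out = replace_with_min_quantity_result(rows)
```

This is only a reference implementation of the grouping logic; the accepted answer below does the same thing distributed, with a Spark window.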

jexiocij1#

Use a Window: keep result only on the row with the minimum quantity (other rows become null), then take max over the same window — max ignores nulls, so it fills the whole group with that value.

import pyspark.sql.functions as f
from pyspark.sql import Window

w = Window.partitionBy('id', 'date')

# Step 1: keep result only where quantity equals the group minimum; others become null.
# Step 2: max over the window ignores nulls, broadcasting that result to the whole group.
df.withColumn('result', f.when(f.col('quantity') == f.min('quantity').over(w), f.col('result'))) \
  .withColumn('result', f.max('result').over(w)).show(10, False)

+---+----------+--------+------+
|id |date      |quantity|result|
+---+----------+--------+------+
|1  |2016-01-02|120     |5     |
|1  |2016-01-01|245     |2     |
|1  |2016-01-01|345     |2     |
|1  |2016-01-01|123     |2     |
|2  |2016-01-02|453     |1     |
|2  |2016-01-01|567     |1     |
|2  |2016-01-01|568     |1     |
+---+----------+--------+------+
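For comparison, the same logic in pandas — a sketch under the assumption that the data fits in memory; `min_rows` is a hypothetical helper name, not part of the original answer:

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2, 2],
    "date": ["2016-01-01"] * 3 + ["2016-01-02"]
            + ["2016-01-01"] * 2 + ["2016-01-02"],
    "quantity": [245, 345, 123, 120, 567, 568, 453],
    "result": [1, 3, 2, 5, 1, 1, 1],
})

# Index of the minimum-quantity row in each (id, date) group,
# then its result, merged back onto every row of that group.
min_rows = df.loc[df.groupby(["id", "date"])["quantity"].idxmin(),
                  ["id", "date", "result"]]
df = df.drop(columns="result").merge(min_rows, on=["id", "date"], how="left")
```

The left merge keeps the original row order while broadcasting each group's result.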
