Spark: get max consecutive decrease in value

Posted by 0h4hbjxa on 2021-07-12 in Spark


My requirement is to get the maximum number of consecutive decreases in a value.
Below is my input dataset:

+---+-------+
| id| amount|
+---+-------+
|  1|   10.0|
|  1|    9.0|
|  1|    7.0|
|  1|    6.0|
|  2|   50.0|
|  2|   60.0|
|  2|   70.0|
|  3|   90.0|
|  3|   80.0|
|  3|   90.0|
+---+-------+

The result I need is:

+---+--------+
| id| outcome|
+---+--------+
|  1|       3|
|  2|       0|
|  3|       2|
+---+--------+

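My outcome (the new column) is computed per id (group by id) and counts how many times the value decreased consecutively, up to a maximum of 3. For id 1, even though it decreased 4 times, I only want a maximum of 3.
Any suggestions or help would be appreciated, using Spark SQL or the Spark DataFrame API (Scala).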

scala DataFrame apache-spark pyspark apache-spark-sql

Source: https://stackoverflow.com/questions/66523742/spark-get-max-consecutive-decrease-in-value

2 answers


e0bqpujr #1

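First, you need an ordering column to compute the decreases. Your example does not have one, so we can build an index column with monotonically_increasing_id. Then we can use a window with lag and lead to get what you need: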

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val win = Window.partitionBy("id").orderBy("index")

df
    .withColumn("index", monotonically_increasing_id())
    // a row is part of a decrease if its amount is lower than the previous one
    // or higher than the next one
    .withColumn("decrease", (lag(col("amount"), 1).over(win) > col("amount")) ||
                            (lead(col("amount"), 1).over(win) < col("amount")))
    .groupBy("id")
    // cast the boolean to an int so it can be summed
    .agg(sum(col("decrease").cast("int")) as "outcome")
    // cap the outcome at 3
    .withColumn("outcome", when(col("outcome") > 3, lit(3)).otherwise(col("outcome")))
    .orderBy("id").show()

+---+-------+
| id|outcome|
+---+-------+
|  1|      3|
|  2|      0|
|  3|      2|
+---+-------+
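
To make the window logic easier to follow, here is a minimal plain-Python sketch (no Spark involved) of the same rule: within each id, a row is counted when its amount is lower than the previous one or higher than the next one, and the per-id total is capped at 3. The function name count_decrease_rows and the list-of-tuples input are assumptions made for illustration only. The sketch also assumes the rows are already in the intended order and that rows with the same id are adjacent.

```python
from itertools import groupby
from operator import itemgetter


def count_decrease_rows(rows, cap=3):
    """rows: list of (id, amount) tuples, already ordered, same ids adjacent.

    Hypothetical helper that mirrors the lag/lead rule in plain Python.
    """
    result = {}
    for key, group in groupby(rows, key=itemgetter(0)):
        amounts = [amount for _, amount in group]
        count = 0
        for i, amount in enumerate(amounts):
            # lag(amount) > amount: the previous value was higher
            below_previous = i > 0 and amounts[i - 1] > amount
            # lead(amount) < amount: the next value is lower
            above_next = i + 1 < len(amounts) and amounts[i + 1] < amount
            if below_previous or above_next:
                count += 1
        # cap the per-id total, as the when/otherwise step does
        result[key] = min(count, cap)
    return result


rows = [(1, 10.0), (1, 9.0), (1, 7.0), (1, 6.0),
        (2, 50.0), (2, 60.0), (2, 70.0),
        (3, 90.0), (3, 80.0), (3, 90.0)]
print(count_decrease_rows(rows))  # {1: 3, 2: 0, 3: 2}
```

Note that this rule counts the rows on both sides of a drop, which is why id 3 (a single drop from 90 to 80) yields 2.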

Answered 2021-07-12

6za6bjd0 #2

Here is an approach using PySpark that you can try to replicate in Scala or SQL:

from pyspark.sql import Window, functions as F

# there is no ordering column, so order each id by a generated increasing id
w = Window.partitionBy("id").orderBy(F.monotonically_increasing_id())

(df.withColumn("Diff", F.col("amount") - F.lag("amount").over(w))
   .withColumn("k", F.lead("Diff").over(w))
   .fillna(0, subset=["k"])
   .groupBy("id")
   .agg(F.sum(F.when((F.isnull("Diff") & (F.col("k") < 0)) | (F.col("Diff") < 0), 1)
               .otherwise(0)).alias("outcome"))
   # cap the count at 3
   .withColumn("outcome", F.when(F.col("outcome") >= 3, 3).otherwise(F.col("outcome")))
   .show())

+---+-------+
| id|outcome|
+---+-------+
|  1|      3|
|  2|      0|
|  3|      2|
+---+-------+
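
The rule in this answer is subtly different from counting the rows on both sides of every drop: it counts each drop (Diff < 0) and additionally counts the first row of an id when the very next step is a drop. The plain-Python sketch below (no Spark) mirrors the Diff / k logic to show this. The function name count_drops is hypothetical, and the input is assumed to be one id's amounts in the intended order.

```python
def count_drops(amounts, cap=3):
    """amounts: the values of a single id, in order.

    Hypothetical helper that mirrors the Diff / k columns in plain Python.
    """
    count = 0
    for i in range(len(amounts)):
        # Diff: current minus previous; None for the first row (lag is null)
        diff = amounts[i] - amounts[i - 1] if i > 0 else None
        # k: the next row's Diff; 0 when there is no next row, like fillna(0)
        k = amounts[i + 1] - amounts[i] if i + 1 < len(amounts) else 0
        starts_a_drop = diff is None and k < 0
        is_a_drop = diff is not None and diff < 0
        if starts_a_drop or is_a_drop:
            count += 1
    # cap the total, as the final when/otherwise step does
    return min(count, cap)


print(count_drops([10.0, 9.0, 7.0, 6.0]))  # 3 (4 before capping)
print(count_drops([50.0, 60.0, 70.0]))     # 0
print(count_drops([90.0, 80.0, 90.0]))     # 2
print(count_drops([10.0, 20.0, 10.0]))     # 1
```

On the sample data this agrees with the expected outcome. On an input such as 10, 20, 10 it yields 1, whereas the lag/lead rule from the first answer would count 2 rows, so it is worth checking which behavior is actually wanted when a drop does not start at the first row of an id.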

Answered 2021-07-12
