如何使用pysparkDataframe窗口函数

htzpubme 于 2021-05-18 发布在 Spark

关注(0)|答案(3)|浏览(446)

我有一个如下的Dataframe

我想得到一个Dataframe，它将具有最新版本和最新日期。第一个筛选条件将是最新版本，然后是最新日期生成的Dataframe如下所示

我是用窗口函数来实现的。我写了下面的一段代码。

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

我不确定我错过了哪里。我只得到一个id为100的输出。请帮我解决这个问题

python DataFrame apache-spark pyspark

来源：https://stackoverflow.com/questions/64717425/how-to-use-pyspark-dataframe-window-function

3条答案

按热度按时间

xxb16uws1#

正如您所提到的，在您的操作中有一个顺序：首先是版本然后是dt基本上，您只需要选择最大版本（删除所有其他内容），然后选择最大dt并删除所有其他内容。您只需切换两行，如下所示：

wind = Window.partitionBy("id")
data = data.withColumn("maxVersion", F.max("version").over(wind)) \
               .where(F.col("version") == F.col("maxVersion")) \
               .withColumn("maxDt", F.max("dt").over(wind)) \
               .where(F.col("maxDt") == F.col("dt")) \
               .drop(F.col("maxVersion")) \
               .drop(F.col("maxDt"))

ID100只有一行的原因是因为在这种情况下，最大版本和最大dt发生在同一行上（你很幸运）。但对于身份证号码200来说不是这样的。

赞(0）回复(0）举报 2021-05-19

t0ybt7op2#

基本上你的公式有几个问题。首先，您需要将日期从字符串更改为正确的日期格式。然后，pyspark中的窗口允许您逐个指定列的顺序。然后就是了 rank() 函数，允许您在窗口中对结果进行排序。最后剩下的就是选择第一个等级。

from pyspark.sql.types import *
from pyspark import SparkContext, SQLContext
import pyspark.sql.functions as F
from pyspark.sql import Window

sc = SparkContext('local')
sqlContext = SQLContext(sc)

data1 = [
        (100,1,"2020-03-19","Nil1"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-04-19","Nil2"),
        (100,2,"2020-05-19","Ni13"),
        (200,1,"2020-09-19","Jay1"),
        (200,2,"2020-07-19","Jay2"),
        (200,2,"2020-08-19","Jay3"),

      ]

df1Columns = ["id", "version", "dt",  "Name"]
df1 = sqlContext.createDataFrame(data=data1, schema = df1Columns)
df1 = df1.withColumn("dt",F.to_date(F.to_timestamp("dt", 'yyyy-MM-dd')).alias('dt'))
print("Schema.")
df1.printSchema()
print("Actual initial data")
df1.show(truncate=False)

wind = Window.partitionBy("id").orderBy(F.desc("version"), F.desc("dt"))

df1 = df1.withColumn("rank", F.rank().over(wind))
print("Ranking over the window spec specified")
df1.show(truncate=False)

final_df = df1.filter(F.col("rank") == 1).drop("rank")
print("Filtering the final result by applying the rank == 1 condition")
final_df.show(truncate=False)

输出：

Schema.
root
 |-- id: long (nullable = true)
 |-- version: long (nullable = true)
 |-- dt: date (nullable = true)
 |-- Name: string (nullable = true)

Actual initial data
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|1      |2020-03-19|Nil1|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-04-19|Nil2|
|100|2      |2020-05-19|Ni13|
|200|1      |2020-09-19|Jay1|
|200|2      |2020-07-19|Jay2|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

Ranking over the window spec specified
+---+-------+----------+----+----+
|id |version|dt        |Name|rank|
+---+-------+----------+----+----+
|100|2      |2020-05-19|Ni13|1   |
|100|2      |2020-04-19|Nil2|2   |
|100|2      |2020-04-19|Nil2|2   |
|100|1      |2020-03-19|Nil1|4   |
|200|2      |2020-08-19|Jay3|1   |
|200|2      |2020-07-19|Jay2|2   |
|200|1      |2020-09-19|Jay1|3   |
+---+-------+----------+----+----+

Filtering the final result by applying the rank == 1 condition
+---+-------+----------+----+
|id |version|dt        |Name|
+---+-------+----------+----+
|100|2      |2020-05-19|Ni13|
|200|2      |2020-08-19|Jay3|
+---+-------+----------+----+

赞(0）回复(0）举报 2021-05-19

tquggr8v3#

一个更整洁的方法可能是：

w = Window.partitionBy("id").orderBy(F.col('version').desc(), F.col('dt').desc())
df1.withColumn('maximum', F.row_number().over(w)).filter('maximum = 1').drop('maximum').show()

赞(0）回复(0）举报 2021-05-18

我来回答

如何使用pysparkDataframe窗口函数

3条答案

相关问题

热门标签

最新问答