How do I fill missing values in order in a PySpark dataframe?

hfsqlsce · asked 5 months ago · in Spark

I have a DataFrame in PySpark that looks like this:

row_id, group_id
1,      1
2,      null
3,      null
4,      null
5,      5
6,      null
7,      null
8,      8
9,      null
10,     null
11,     null
12,     null

and so on, where row_id is a sequence number (increasing and unique) and group_id is the unique id of a group, present on the group's first row and null until the next group starts. The task is to fill all the nulls in the DataFrame like this:

row_id, group_id
1,      1
2,      1
3,      1
4,      1
5,      5
6,      5
7,      5
8,      8
9,      8
10,     8
11,     8
12,     8


Each group contains an unknown number of records (the example shows only a few, but in practice it will be in the hundreds), and the DataFrame is millions of rows long.

kxeu7u2r1#

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Create a Spark session
spark = SparkSession.builder.appName("example").getOrCreate()

# Your original dataframe
data = [(1, 1), (2, None), (3, None), (4, None), (5, 5), (6, None), (7, None), (8, 8), (9, None), (10, None), (11, None), (12, None)]
columns = ["row_id", "group_id"]
df = spark.createDataFrame(data, columns)

# Define a running window over all rows ordered by row_id
# (no partitionBy: Spark will warn and evaluate this on a single partition)
windowSpec = Window.orderBy("row_id").rowsBetween(Window.unboundedPreceding, 0)

# Use the last window function to fill null values with the last non-null value
filled_df = df.withColumn("group_id", F.last("group_id", ignorenulls=True).over(windowSpec))

# Show the resulting dataframe
filled_df.show()
+------+--------+
|row_id|group_id|
+------+--------+
|     1|       1|
|     2|       1|
|     3|       1|
|     4|       1|
|     5|       5|
|     6|       5|
|     7|       5|
|     8|       8|
|     9|       8|
|    10|       8|
|    11|       8|
|    12|       8|
+------+--------+
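For a quick sanity check of the forward-fill logic without a Spark cluster, the same last-non-null carry can be expressed in plain Python with `itertools.accumulate` (a minimal sketch using the question's sample data; it mirrors what `F.last(..., ignorenulls=True)` does over the running window):

```python
from itertools import accumulate

rows = [(1, 1), (2, None), (3, None), (4, None), (5, 5), (6, None),
        (7, None), (8, 8), (9, None), (10, None), (11, None), (12, None)]

# Carry the last non-null group_id forward down the ordered rows
filled = list(accumulate((g for _, g in rows),
                         lambda prev, cur: cur if cur is not None else prev))
result = [(row_id, g) for (row_id, _), g in zip(rows, filled)]
# result == [(1, 1), (2, 1), (3, 1), (4, 1), (5, 5), (6, 5),
#            (7, 5), (8, 8), (9, 8), (10, 8), (11, 8), (12, 8)]
```

Note that this single-machine version relies on the rows already being sorted by `row_id`, just as the window specification's `orderBy("row_id")` does in the Spark answer.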

