PySpark - grouping Description column details into an array

h43kikqp · asked 5 months ago · in Spark

In the sample dataset below there are two groups, "A" and "B". The "Description" column holds details belonging to group "A" rows. Whenever a group "B" row appears, the descriptions of the preceding group "A" rows should be collected into an array and attached to that group "B" row in the new dataset.
Sample dataset:
| Description | Group |
| -- | -- |
| XYZ | A |
| PQR | A |
|  | B |
| DEF | A |
| HIJ | A |
| KLM | A |
| NOP | A |
|  | B |
Expected output:
| Group | Description |
| -- | -- |
| B | [XYZ, PQR] |
| B | [DEF, HIJ, KLM, NOP] |


Answer 1 (by 8oomwypt):

Assume there is a column id that determines the row order.
Compute a group number as the running count of group "B" occurrences, then aggregate with collect_list; see the code below. It is Scala, but the same spark.sql works in PySpark:

```scala
println("Initial data:")
val df1 = Seq(
  (1, "XYZ", "A"),
  (2, "PQR", "A"),
  (3, null, "B"),
  (4, "DEF", "A"),
  (5, "HIJ", "A"),
  (6, "KLM", "A"),
  (7, "NOP", "A"),
  (8, null, "B")
).toDF("Id", "Description", "Group")

df1.createOrReplaceTempView("df1")
df1.show(100, false)

println("Result:")
spark.sql("""
select 'B' Group, collect_list(Description) Description
from
(
  select id, Description, Group,
         -- calculate group number: running count of group 'B' rows
         count(case when Group = 'B' then 1 else null end) over (order by id) as grp_num
  from df1
) s
group by grp_num
having size(collect_list(Description)) > 0
order by grp_num
""").show(100, false)
```

Initial data:

```
+---+-----------+-----+
|Id |Description|Group|
+---+-----------+-----+
|1  |XYZ        |A    |
|2  |PQR        |A    |
|3  |null       |B    |
|4  |DEF        |A    |
|5  |HIJ        |A    |
|6  |KLM        |A    |
|7  |NOP        |A    |
|8  |null       |B    |
+---+-----------+-----+
```


Result:

```
+-----+--------------------+
|Group|Description         |
+-----+--------------------+
|B    |[XYZ, PQR]          |
|B    |[DEF, HIJ, KLM, NOP]|
+-----+--------------------+
```
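To see why the running count produces these groups, here is a minimal pure-Python sketch of the same idea (no Spark required): the increment-on-"B" counter plays the role of the `count(...) over (order by id)` window, `itertools.groupby` plays the role of `group by grp_num` + `collect_list`, and the empty-group filter mirrors the `having size(...) > 0` clause. The function name `collect_groups` is just for illustration.

```python
from itertools import groupby

# Same sample data as the Scala example: (id, description, group),
# already ordered by id.
rows = [
    (1, "XYZ", "A"), (2, "PQR", "A"), (3, None, "B"),
    (4, "DEF", "A"), (5, "HIJ", "A"), (6, "KLM", "A"),
    (7, "NOP", "A"), (8, None, "B"),
]

def collect_groups(rows):
    # Running count of 'B' rows, incremented on the 'B' row itself,
    # matches count(case when Group='B' ...) over (order by id):
    # each 'B' row shares a group number with the 'A' rows that follow it,
    # while the 'A' rows *before* it keep the previous number.
    grp_num = 0
    numbered = []
    for _id, desc, grp in rows:
        if grp == "B":
            grp_num += 1
        numbered.append((grp_num, desc))
    # Group by the computed number; collect_list skips nulls, so drop
    # the None descriptions, and drop empty groups (the trailing 'B'
    # row alone), as the "having size(...) > 0" clause does.
    result = []
    for _, items in groupby(numbered, key=lambda t: t[0]):
        descs = [d for _, d in items if d is not None]
        if descs:
            result.append(("B", descs))
    return result

print(collect_groups(rows))
# → [('B', ['XYZ', 'PQR']), ('B', ['DEF', 'HIJ', 'KLM', 'NOP'])]
```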
