Combining MX values with the same name into one row in PySpark

Posted by l3zydbqr on 2021-05-19 in Spark

I want to convert these records:

{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx2.googlemail.com"}
{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"30 aspmx3.googlemail.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt1.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"20 alt2.aspmx.l.google.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx2.googlemail.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"30 aspmx3.googlemail.com"}

test.printSchema()
root
 |-- name: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- type: string (nullable = true)
 |-- value: string (nullable = true)

I want to combine the MX values that share the same name into a single row. The result I am looking for is:

{"timestamp":"1601093713","name":"exmple1.com","type":"mx","value":"alt1.aspmx.l.google.com, alt2.aspmx.l.google.com, aspmx2.googlemail.com, aspmx3.googlemail.com"}
{"timestamp":"1601093713","name":"exmple2.com","type":"mx","value":"alt1.aspmx.l.google.com, alt2.aspmx.l.google.com, aspmx2.googlemail.com, aspmx3.googlemail.com"}

Answer #1, by 8dtrkrch

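You can use groupBy, agg, and collect_list (see the PySpark documentation). Note that this gives you a list of values per name, not a single string. If you need a string, the question "Convert PySpark dataframe column from list to string" explains how to do the conversion.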

from pyspark.sql import functions as F
df_grouped = df.groupby('name').agg(F.collect_list('value').alias('values'))

The next question is how to handle the other columns, e.g. timestamp or type.
