Scala - how to assign a unique id for each corresponding unique column value?

qmelpv7a · asked 2021-05-29 · in Spark
Answers (2) · Views (460)

Consider a DataFrame like the one below:

+-------+--------+---------+---------+--------+-----------------+---+
|address|lastname|firstname|patientid|policyno|visitid          |id |
+-------+--------+---------+---------+--------+-----------------+---+
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_1| 1 |
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_2| 1 |
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_3| 1 |
|addr2  |Dobs    |Aliya    |patid2   |policy2 |visituid_aliya_1 | 2 |
|addr2  |Dobs    |Aliya    |patid2   |policy2 |visituid_aliya_2 | 2 |
|addr2  |Dobs    |OP       |patid3   |policy3 |visituid_OP_1    | 3 |
+-------+--------+---------+---------+--------+-----------------+---+

I want to assign the same id whenever the name (firstname + lastname) is the same, and a new id whenever the name differs.
I need this because I want to de-identify a dataset (it contains sensitive details) so that I can use the same data, with changed values, in my application. The id field will serve as the index key for joining with other DataFrames.
One more point about id: the other dummy DataFrames will also carry a similar id, possibly generated with monotonically_increasing_id(), so ideally id should start at 0 or 1 and increase steadily.
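Roughly, this is how id would be used for the joins; just a sketch, where otherDf and deidentifiedDf stand in for my real DataFrames:

import org.apache.spark.sql.functions._

// sketch only: otherDf and deidentifiedDf are stand-ins for the real frames
val otherWithId = otherDf.withColumn("id", monotonically_increasing_id())
val joined      = deidentifiedDf.join(otherWithId, Seq("id"))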
How can I achieve this in Spark with Scala?


ldioqlga1#

Use a window function.

scala> df.show(false)
+-------+--------+---------+---------+--------+-----------------+
|address|lastname|firstname|patientid|policyno|visitid          |
+-------+--------+---------+---------+--------+-----------------+
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_1|
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_2|
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_3|
|addr2  |Dobs    |Aliya    |patid2   |policy2 |visituid_aliya_1 |
|addr2  |Dobs    |Aliya    |patid2   |policy2 |visituid_aliya_2 |
|addr2  |Dobs    |OP       |patid3   |policy3 |visituid_OP_1    |
|addr4  |AN      |OTHER    |patid4   |policy4 |visituid_OP_1    |
|addr2  |ANO     |THER     |patid5   |policy5 |visituid_OP_1    |
+-------+--------+---------+---------+--------+-----------------+

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// hash each name column, concatenate the hashes into a single sort key, then dense-rank it
val expr = Seq("lastname", "firstname")
            .map(c => hash(col(c)).as(c))
            .reduce(concat(_, _))
            .asc
val winSpec = dense_rank().over(Window.orderBy(expr))

Output

scala> df.withColumn("id",winSpec).show(false)

+-------+--------+---------+---------+--------+-----------------+---+
|address|lastname|firstname|patientid|policyno|visitid          |id |
+-------+--------+---------+---------+--------+-----------------+---+
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_1|1  |
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_2|1  |
|addr1  |Pits    |Rodney   |patid1   |policy1 |visituid_rodney_3|1  |
|addr4  |AN      |OTHER    |patid4   |policy4 |visituid_OP_1    |2  |
|addr2  |Dobs    |OP       |patid3   |policy3 |visituid_OP_1    |3  |
|addr2  |Dobs    |Aliya    |patid2   |policy2 |visituid_aliya_1 |4  |
|addr2  |Dobs    |Aliya    |patid2   |policy2 |visituid_aliya_2 |4  |
|addr2  |ANO     |THER     |patid5   |policy5 |visituid_OP_1    |5  |
+-------+--------+---------+---------+--------+-----------------+---+
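
A minimal variant of the above, assuming the same df: if the hashes are not needed, the window can be ordered by the name columns directly, and since dense_rank() starts at 1, subtracting 1 gives the 0-based id mentioned in the question.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// variant sketch: order by the raw name columns; subtract 1 for a 0-based id
val byName  = Window.orderBy(col("lastname"), col("firstname"))
val withId0 = df.withColumn("id", dense_rank().over(byName) - 1)
withId0.show(false)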

mtb9vblg2#

import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
  ("Xxx", "yyy"),
  ("xxx", "yyy"),
  ("aaa", "yyy")
)).toDF("c1", "c2")

df.withColumn("hashName", hash(concat($"c1", $"c2")))//.show(false)

So what are the obvious improvements? Put a separator between the two fields, and normalise to upper or lower case before hashing. If the data is not already cleansed, also replace all spaces and stray characters with the separator.
So a better approach is:

df.withColumn("preHashName", lower(trim(concat($"c1", lit("|"), $"c2"))))
  .withColumn("hashName", hash(lower(concat($"c1", lit("|"), $"c2")))).show(false)

Not doing this means that, with data like the other answer's, AN + OTHER would hash to the same value as ANO + THER, since both concatenate to ANOTHER.
Output:

+---+------+-----------+-----------+
|c1 |c2    |preHashName|hashName   |
+---+------+-----------+-----------+
|Xxx|y yy  |xxx|y yy   |907198499  |
|xxx|yyy   |xxx|yyy    |-1167597858|
|aaa|yyy   |aaa|yyy    |495090835  |
+---+------+-----------+-----------+
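
To see why the delimiter matters, here is a small check (assuming the same spark-shell session) using the AN/OTHER and ANO/THER names from the other answer's sample data:

import org.apache.spark.sql.functions._

// without a delimiter both rows concatenate to "ANOTHER" and hash identically;
// with a delimiter they remain distinct
val pairs = Seq(("AN", "OTHER"), ("ANO", "THER")).toDF("c1", "c2")
pairs
  .withColumn("noSep",   hash(concat($"c1", $"c2")))
  .withColumn("withSep", hash(concat($"c1", lit("|"), $"c2")))
  .show(false)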
