Using regex in Scala to replace multiple consecutive occurrences of a string in a Spark DataFrame column

llycmphe, posted on 2021-07-14 in Spark

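I have a column in which a particular string occurs several times in a row. The number of repetitions is not fixed; the string can be repeated any number of times.
Example: the column description contains the following data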

The account account has been cancelled for the account account account and with the account

Basically, I want to replace each run of consecutive "account" words with a single "account".
Expected output:

The account has been cancelled for the account and with the account

Answer #1 (bqucvtff)

You can use a regex pattern (source: "Java regular expression to remove duplicate words") together with regexp_replace to collapse the repeated words:

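import org.apache.spark.sql.functions.{col, regexp_replace}

// Build a one-row DataFrame holding the sample sentence
val df = spark.sql("select 'The account account has been cancelled for the account account account and with the account' col")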

df.show(false)
+-------------------------------------------------------------------------------------------+
|col                                                                                        |
+-------------------------------------------------------------------------------------------+
|The account account has been cancelled for the account account account and with the account|
+-------------------------------------------------------------------------------------------+

// Match a word followed by any number of repeats of itself, and keep only the first copy ($1)
val df2 = df.withColumn("col", regexp_replace(col("col"), "\\b(\\w+)(\\b\\W+\\b\\1\\b)*", "$1"))

df2.show(false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|The account has been cancelled for the account and with the account|
+-------------------------------------------------------------------+
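
To see how the pattern works without starting a Spark session, it can be tried on a plain string first. The sketch below assumes that Spark's regexp_replace uses Java regex syntax, so String.replaceAll should behave the same way on this input. The object name DedupeWords and the helpers collapseAll and collapseOnly are made up for illustration; collapseOnly is a hypothetical variant, not part of the answer above, that collapses repeats of a single given word only.

```scala
import java.util.regex.Pattern

object DedupeWords {
  // Same pattern as in the answer above:
  //   \b(\w+)          a whole word, captured as group 1
  //   (\b\W+\b\1\b)*   zero or more repeats of: separator(s) followed by the same word
  val anyRepeatedWord: String = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*"

  // Collapse every run of an identical repeated word down to one copy.
  def collapseAll(s: String): String =
    s.replaceAll(anyRepeatedWord, "$1")

  // Hypothetical variant: collapse only runs of one specific word.
  // Pattern.quote guards against regex metacharacters in `word`.
  def collapseOnly(s: String, word: String): String = {
    val w = Pattern.quote(word)
    s.replaceAll("\\b(" + w + ")(\\s+" + w + "\\b)+", "$1")
  }

  def main(args: Array[String]): Unit = {
    val input =
      "The account account has been cancelled for the account account account and with the account"
    println(collapseAll(input))
    println(collapseOnly("the the account account", "account"))
  }
}
```

The comparison is case-sensitive, so "The the" would not be collapsed by either helper. If only one known word should be deduplicated, the pattern from collapseOnly can be passed to regexp_replace in the same way as the general one.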
