Spark: replace a substring in a column using a regular expression

drkbr07n, posted 2021-05-17 in Spark

I have a Spark DataFrame in Scala:

val df = Seq(("1ST ST","NICK"),("2ND STREET","SAM"),("3RD AVE","ERIC"),("4TH AVENUE","SARAH")).toDF("STREET_NAME","NAME")

I want to replace the substring STREET with ST and AVENUE with AVE in the column STREET_NAME. I tried the following, but it did not work:

df.withColumn(STREET_NAME,
  regexp_replace(
    $"STREET_NAME",
    lit("STREET"),
    "ST"
  )
)

Or is there a better way to replace substrings?


jjjwad0x  #1

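lit is not needed: the second argument here should be the string pattern to match, not a Column. regexp_replace also has an overload that takes Columns, but then the replacement must be a Column as well, so mixing lit("STREET") with the String "ST" matches neither signature. The column name passed to withColumn must also be a quoted String.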

df.withColumn("STREET_NAME", regexp_replace($"STREET_NAME", "STREET", "ST"))
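
The one-liner above only handles STREET. To cover AVENUE as well, and to avoid touching longer words that merely contain one of these strings, the call can be chained with word-boundary anchors. This is a minimal sketch, not the answerer's code: the local SparkSession setup, the app name, and the name cleaned are assumptions added for illustration (in spark-shell the session lines can be skipped).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("street-abbrev").getOrCreate()
import spark.implicits._

val df = Seq(("1ST ST","NICK"),("2ND STREET","SAM"),("3RD AVE","ERIC"),("4TH AVENUE","SARAH"))
  .toDF("STREET_NAME","NAME")

// \b is a regex word boundary, so only the whole words STREET and AVENUE are rewritten.
val cleaned = df.withColumn(
  "STREET_NAME",
  regexp_replace(
    regexp_replace(col("STREET_NAME"), "\\bSTREET\\b", "ST"),
    "\\bAVENUE\\b", "AVE"
  )
)

cleaned.show()
```

Rows that already use the short forms (1ST ST, 3RD AVE) contain neither whole word, so they should pass through unchanged.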

ohtdti5x  #2

Try the built-in Spark functions replace (in Spark SQL) or regexp_replace().

df.show()
//+-----------+-----+
//|STREET_NAME| NAME|
//+-----------+-----+
//|     1ST ST| NICK|
//| 2ND STREET|  SAM|
//|    3RD AVE| ERIC|
//| 4TH AVENUE|SARAH|
//+-----------+-----+

df.createOrReplaceTempView("tmp")

spark.sql("select replace(replace(STREET_NAME,'STREET','ST'),'AVENUE','AVE') as STREET_NAME,NAME from tmp").show()
//+-----------+-----+
//|STREET_NAME| NAME|
//+-----------+-----+
//|     1ST ST| NICK|
//|     2ND ST|  SAM|
//|    3RD AVE| ERIC|
//|    4TH AVE|SARAH|
//+-----------+-----+

//or using regexp_replace function
df.withColumn("STREET_NAME",regexp_replace(regexp_replace(col("STREET_NAME"),"STREET","ST"),"AVENUE","AVE")).show()
//+-----------+-----+
//|STREET_NAME| NAME|
//+-----------+-----+
//|     1ST ST| NICK|
//|     2ND ST|  SAM|
//|    3RD AVE| ERIC|
//|    4TH AVE|SARAH|
//+-----------+-----+
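
If the list of abbreviations grows, nesting regexp_replace calls by hand gets unwieldy. One possible alternative is to fold over a list of pairs and build the nested expression programmatically. This is a sketch under assumptions: the abbreviations list, the names abbreviated and result, and the local SparkSession setup are hypothetical and not part of the original answer.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("street-abbrev-fold").getOrCreate()
import spark.implicits._

val df = Seq(("1ST ST","NICK"),("2ND STREET","SAM"),("3RD AVE","ERIC"),("4TH AVENUE","SARAH"))
  .toDF("STREET_NAME","NAME")

// Hypothetical lookup list: long form -> abbreviation.
// Each key is interpreted as a regular expression, not as a literal string.
val abbreviations = Seq("STREET" -> "ST", "AVENUE" -> "AVE")

// Start from the raw column and wrap it in one regexp_replace per pair.
val abbreviated: Column = abbreviations.foldLeft(col("STREET_NAME")) {
  case (acc, (from, to)) => regexp_replace(acc, from, to)
}

val result = df.withColumn("STREET_NAME", abbreviated)
result.show()
```

The fold produces the same nested expression as the hand-written version above, so the output should match it; only the way the expression is assembled differs.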
