Removing punctuation and non-ASCII characters in Scala

Posted by a64a0gku on 2021-05-27 in Spark


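I have a dataset of tweets that I am reading in, and I need to remove punctuation and non-ASCII characters and convert the text to lowercase. How can I do this in a DataFrame? Is there a way to do it with Spark SQL?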

scala> data.show

+-----+--------------------+
|   id|               tweet|
+-----+--------------------+
|31963|#studiolife #aisl...|
|31964| @user #white #su...|
|31965|safe ways to heal...|
|31966|is the hp and the...|
|31967|  3rd #bihday to ...|
|31968|choose to be   :)...|
|31969|something inside ...|
|31970|#finished#tattoo#...|
|31971| @user @user @use...|

scala apache-spark apache-spark-sql

Source: https://stackoverflow.com/questions/62953699/removing-punctuation-and-no-ascii-characters-in-scala

2 answers


8mmmxcuj1#

A more general approach is to replace every non-word character except whitespace, as shown below:

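import org.apache.spark.sql.functions.regexp_replace
import spark.implicits._ // toDF and $"..." (already in scope in spark-shell)
val df = Seq("#studiolife #aisl", "@user #white #su", "oh! yeah #123 #su.").toDF("tweet")
// [\W&&[^\s]] is a Java character-class intersection: non-word AND non-whitespace characters
df.withColumn("clean_tweet", regexp_replace($"tweet", "[\\W&&[^\\s]]", ""))
  .show(false)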

    /**
      * +------------------+---------------+
      * |tweet             |clean_tweet    |
      * +------------------+---------------+
      * |#studiolife #aisl |studiolife aisl|
      * |@user #white #su  |user white su  |
      * |oh! yeah #123 #su.|oh yeah 123 su |
      * +------------------+---------------+
      */
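
The pattern above does not lower-case the text, and because `\W` treats the underscore as a word character, underscores are left in place. A minimal sketch that covers all three requirements from the question (punctuation, non-ASCII characters, lower-casing) might look like the following. It assumes a local SparkSession and made-up sample rows; the `clean_tweet` column and the `tweets` view name are illustrative only, not part of the original data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lower, regexp_replace}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("clean-tweets").getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the tweet dataset.
val data = Seq(
  (1, "Oh! Yeah #123 café ☕"),
  (2, "@User #White  #Su.")
).toDF("id", "tweet")

// Keep only ASCII letters, digits and spaces, then lower-case.
// Everything else (punctuation, underscores, non-ASCII) is dropped.
val cleaned = data.withColumn(
  "clean_tweet",
  lower(regexp_replace(col("tweet"), "[^A-Za-z0-9 ]", ""))
)
cleaned.show(false)

// The same expression through Spark SQL.
data.createOrReplaceTempView("tweets")
val cleanedSql = spark.sql(
  "SELECT id, lower(regexp_replace(tweet, '[^A-Za-z0-9 ]', '')) AS clean_tweet FROM tweets"
)
cleanedSql.show(false)
```

Stripping before lower-casing avoids surprises from non-ASCII characters whose lower-case form contains an ASCII letter. Runs of spaces left behind by removed tokens are not collapsed here; a further `regexp_replace` with `" +"` and `" "` could do that if needed.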

2021-05-27

zyfwsgd62#

For a DataFrame column, try the following. To replace characters in a string column with a single character:

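import org.apache.spark.sql.functions._
// Replace each matched character with "." ("some_col" is a placeholder column name)
regexp_replace(df("some_col"), "[\\?,\\.,\\$]", ".")
...
// Or drop the matched characters entirely
val res = df.withColumn("some_col_cleaned", regexp_replace(df("some_col"), "[\\_,\\*,\\$,\\#,\\@]", ""))
...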

For the string-typed tweet column:

val res = df.withColumn("cleansed", regexp_replace(df("tweet"), "[\\_,\\*,\\$,\\#,\\@,\\&]", ""))

This works fine.
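
One detail worth checking in the patterns above: inside square brackets a comma is not a separator but a member of the character class, so commas in the text are removed as well. Spark's `regexp_replace` takes Java regular expressions, so the behaviour can be tried out with plain `String.replaceAll`. The sketch below uses a made-up sample string.

```scala
// Hypothetical input containing every character listed in the class, plus a comma.
val sample = "a_b, c#d @e & f*g $h"

// Pattern from the answer: the commas inside [...] are matched too.
val withCommas = sample.replaceAll("[\\_,\\*,\\$,\\#,\\@,\\&]", "")

// Same class without the commas (and without the redundant escapes);
// commas in the text now survive.
val keepCommas = sample.replaceAll("[_*$#@&]", "")

println(withCommas) // ab cd e  fg h
println(keepCommas) // ab, cd e  fg h
```

If commas should be stripped along with the other symbols, the original pattern already does that; otherwise the shorter class is the safer choice.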

2021-05-27
