Scala: run a regex on a DataFrame and store the results in a new DataFrame

Posted by 0x6upsns on 2021-07-14 in Spark


I have the following DataFrame:

+--------------------------------+
|value                           |
+--------------------------------+
|I am going to school ?          |
|why are you crying ? ?          |
|You are not very good my friend |
+--------------------------------+

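I want to filter the rows that contain emojis and put them into a new DataFrame. I wrote the code below, which converts the DataFrame to a list and then iterates over the list to find the sentences that contain emojis. What I don't know is how to apply these regular expressions to the DataFrame itself.
Existing code: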

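import org.apache.spark.sql.DataFrame

// Collect the "value" column to the driver as a list of strings.
// List[String] rather than List[Any]: findAllIn needs a CharSequence.
def convertDataFrameToList(combinedDataFrame: DataFrame): List[String] =
  combinedDataFrame.select("value").rdd.map(r => r.getString(0)).collect.toList

val listOutput = convertDataFrameToList(myDaframe)
for (element <- listOutput) {
  // Characters from each Unicode block found in this sentence
  val emojiValues = raw"\p{block=Emoticons}".r.findAllIn(element).toSeq
  val y = raw"\p{block=Miscellaneous Symbols and Pictographs}".r.findAllIn(element).toSeq
  val p = emojiValues ++ y

  // process further
}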

Update
I tried the following regular expression:

val emoticonResult = myKafkaDataFrame.filter(regexp_extract(col("value"), raw"([\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs},\uuD83E\uDD00-\uD83E\uDDFF])", 1) =!= "")

As a result, the rows that contain emojis are returned, but so are rows that contain no emoji at all. What is wrong with my code?
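One thing worth checking in a pattern shaped like the one in the update: inside a character class `[...]`, a comma is not a separator but a literal member of the class, so any row that contains a comma matches. Whether that explains the rows seen here depends on the actual data, which is not shown. The sketch below uses plain Scala regexes without Spark; the object name, the value names and the sample strings are made up for illustration, and the class is trimmed to the two named blocks.

```scala
object CharClassCommaDemo {
  // Shaped like the pattern in the update: items inside [...] separated by commas.
  val withCommas =
    raw"[\p{block=Emoticons},\p{block=Miscellaneous Symbols and Pictographs}]".r

  // The same class with the commas removed.
  val withoutCommas =
    raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}]".r

  def main(args: Array[String]): Unit = {
    // Hypothetical sample text: a comma but no emoji.
    val plain = "Hello, my friend"
    // U+1F600 lies in the Emoticons block; built from its code point.
    val smiley = "Hello " + new String(Character.toChars(0x1F600))

    println(withCommas.findFirstIn(plain).isDefined)     // true: the comma itself matches
    println(withoutCommas.findFirstIn(plain).isDefined)  // false
    println(withoutCommas.findFirstIn(smiley).isDefined) // true
  }
}
```

If commas turn out to be the cause, dropping them from the class passed to `regexp_extract` should be enough; nothing else about the filter needs to change.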

scala DataFrame apache-spark apache-spark-sql emoji

Source: https://stackoverflow.com/questions/66728870/run-a-regex-on-a-dataframe-and-store-the-results-in-a-new-dataframe

1 answer

wnvonmuf1#

You can apply the regular expression with regexp_extract and compare the extracted group with the empty string:

val emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) =!= "")
val no_emojis = df.filter(regexp_extract($"value", raw"(\p{block=Emoticons})", 1) === "")

emojis.show(false)
+--------------------------+
|value                     |
+--------------------------+
|I am going to school ?   |
|why are you crying ? ?  |
+--------------------------+

no_emojis.show(false)
+-------------------------------+
|value                          |
+-------------------------------+
|You are not very good my friend|
+-------------------------------+
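The filters above test only the Emoticons block. As a hedged variation, the sketch below covers both blocks named in the question plus the U+1F900 to U+1F9FF range that its surrogate-pair escapes spell out, and uses `Column.rlike` so the match is a boolean rather than a comparison with an empty string. The local SparkSession, the sample rows, and the names `EmojiSplitSketch`, `emojiPattern` and `splitByEmoji` are assumptions for illustration, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object EmojiSplitSketch {
  // One character class; items inside [...] need no separators.
  val emojiPattern: String =
    raw"[\p{block=Emoticons}\p{block=Miscellaneous Symbols and Pictographs}\x{1F900}-\x{1F9FF}]"

  // Returns (rows with at least one match, rows with none).
  // Rows whose "value" is null end up in neither DataFrame.
  def splitByEmoji(df: DataFrame): (DataFrame, DataFrame) = {
    val hasEmoji = col("value").rlike(emojiPattern)
    (df.filter(hasEmoji), df.filter(!hasEmoji))
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("emoji-split-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows; U+1F600 lies in the Emoticons block.
    val grin = new String(Character.toChars(0x1F600))
    val df = Seq(
      s"I am going to school $grin",
      "You are not very good my friend"
    ).toDF("value")

    val (emojis, noEmojis) = splitByEmoji(df)
    emojis.show(false)
    noEmojis.show(false)

    spark.stop()
  }
}
```

Keeping the pattern in a single value means the two filters cannot drift apart, which is easy to do when the same regex literal is typed twice.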

2021-07-14
