Stop words remover in Scala Spark

p4tfgftt  posted on 2021-05-16 in Spark

I have a problem in Scala. I need to remove stop words from a text file that I load as an RDD[String].

val sc = new SparkContext(conf)

val tweetsPath = args(0)
val outputDataset = args(1)

val tweetsRaw: RDD[String] = sc.textFile(tweetsPath)

val stopWords = Array("a","able","about","across","after","all","almost","also","am","among","an","and","any","are","as","at","be","because","been","but","by","can","cannot","could","dear","did","do","does","either","else","ever","every","for","from","get","got","had","has","have","he","her","hers","him","his","how","however","i","if","in","into","is","it","its","just","least","let","like","likely","may","me","might","most","must","my","neither","no","nor","not","of","off","often","on","only","or","other","our","own","rather","said","say","says","she","should","since","so","some","than","that","the","their","them","then","there","these","they","this","tis","to","too","twas","us","wants","was","we","were","what","when","where","which","while","who","whom","why","will","with","would","yet","you","your","ain't","aren't","can't","could've","couldn't","didn't","doesn't","don't","hasn't","he'd","he'll","he's","how'd","how'll","how's","i'd","i'll","i'm","i've","isn't","it's","might've","mightn't","must've","mustn't","shan't","she'd","she'll","she's","should've","shouldn't","that'll","that's","there's","they'd","they'll","they're","they've","wasn't","we'd","we'll","we're","weren't","what'd","what's","when'd","when'll","when's","where'd","where'll","where's","who'd","who'll","who's","why'd","why'll","why's","won't","would've","wouldn't","you'd","you'll","you're","you've")

val cleanTxt = tweetsRaw.
  filter(x => x.startsWith("San Francisco") || x.startsWith("Chicago") || !stopWords.contains(x));

cleanTxt.saveAsTextFile(outputDataset)

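I tried this, but it does not work. I have to keep the same structure (using SparkConf rather than moving to SparkSession). How can I select all tweets that start with "Chicago" or "San Francisco", remove the stop words from their text, and output each whole tweet, line by line, without those words?
I also tried a flatMap on tweetsRaw, but the output of flatMap is only the individual words with the stop words removed. What I need is each whole line without the stop words, not just the words.
I hope it is clear what I want, and I hope you can help me solve this!
Thank you all.
P.S. I made many attempts with the StopWordsRemover class from the Spark library, but I could not work out how to use it with SparkConf and without initializing a SparkSession.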

e0uiprwp  #1

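How can I select all tweets that start with "Chicago" or "San Francisco", remove the stop words from their text, and output each whole tweet, line by line, without those words?
The following line in your Spark script only filters tweets; it does not remove stop words from within a line. Note that stopWords.contains(x) compares the whole line with the array, so it only matches a line that consists of a single stop word.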

val cleanTxt = tweetsRaw.
  filter(x => x.startsWith("San Francisco") || x.startsWith("Chicago") || !stopWords.contains(x));
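To see what that predicate actually keeps, here is a small sketch on a plain Scala collection instead of an RDD. The sample lines and the shortened stop word list are made up for illustration and are not from the original dataset.

```scala
// Hypothetical sample data, standing in for the tweets file and the full stop word list.
val stopWords = Array("a", "the", "is", "in")
val lines = Seq("Chicago is the windy city", "the", "Boston is in the east")

// Same predicate as in the question. contains(x) compares the WHOLE line
// against the array, so only a line that is exactly one stop word is dropped.
// Because the three tests are joined with ||, lines from other cities are kept too.
val kept = lines.filter(x =>
  x.startsWith("San Francisco") || x.startsWith("Chicago") || !stopWords.contains(x))

kept.foreach(println)
```

In this sketch only the line "the" is dropped. The Boston line is kept, and the stop words inside the remaining lines are untouched.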

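If you want to remove the stop words, you have to use a map transformation that removes the stop words from each line; you can then save the result to a file.
Assuming each line represents one tweet whose words are separated by spaces, here is how I would remove the stop words.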

// Split each tweet into words, drop the stop words, and join the rest back into one line.
cleanTxt.map(tweet => tweet.split(" ").filterNot(x => stopWords.contains(x)).mkString(" ")).saveAsTextFile(outputDataset)
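Putting the two steps together, here is a minimal sketch that assumes the goal is to keep only tweets starting with one of the two city names and then strip stop words from each kept line. The helper names (cityPrefixes, isCityTweet, removeStopWords), the short stop word set and the sample lines are all hypothetical. Lower-casing each word before the lookup is a choice made here, not something the original answer does. The helpers are plain Scala, so they can be tried without a cluster; the comment shows roughly how they might be wired to the RDD from the question.

```scala
// Illustrative names and data; none of these come from the original post.
val cityPrefixes = Seq("San Francisco", "Chicago")
val stopWordSet: Set[String] = Set("a", "the", "is", "in") // stand-in for the full list

// True when the line starts with one of the city names.
def isCityTweet(line: String): Boolean =
  cityPrefixes.exists(p => line.startsWith(p))

// Splits on whitespace, drops stop words (compared in lower case),
// and joins the remaining words back into a single line.
def removeStopWords(line: String, stop: Set[String]): String =
  line.split("\\s+")
    .filter(_.nonEmpty)
    .filterNot(w => stop.contains(w.toLowerCase))
    .mkString(" ")

// With the RDD from the question this would look roughly like:
//   tweetsRaw.filter(isCityTweet).map(removeStopWords(_, stopWordSet)).saveAsTextFile(outputDataset)
val sample = Seq("Chicago is the windy city", "Boston is in the east")
val cleaned = sample.filter(isCityTweet).map(removeStopWords(_, stopWordSet))

cleaned.foreach(println)
```

A Set is used instead of an Array so that each lookup does not scan the whole list. Punctuation attached to a word (for example "the,") is not handled by this sketch and would need extra tokenization.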
