How do I compute the lengths of all NP (noun) words using PySpark and NLTK?

Posted by 1l5u6lss on 2021-07-12 in Spark

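Using PySpark and NLTK, I want to get the lengths of all "NP" words and sort them in descending order. I am currently stuck on navigating the subtree.
Example subtree output: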


# >>>[(Tree('NP', [Tree('NBAR', [('WASHINGTON', 'NN')])]), 1)

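I am trying to get the length of every NP word and then sort those lengths in descending order.
The first element should be the word length 1 together with the number of words of that length, and so on.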

example: 

# [(1, 6157),  6157 words of length 1
#  (2, 1833),  1833 words of length 2
#  (3, 654),
#  (4, 204),
#  (5, 65)]

import nltk
import re

textstring = """This is just a bunch of words to use for this example.  
John gave them to me last night but Kim took them to work.  
Hi Stacy. URL:http://example.com. Jessica, Mark, Tiger, Book, Crow, Airplane, SpaceShip"""

TOKEN_RE = re.compile(r"\b[\w']+\b")

grammar = r"""
    NBAR:
        {<NN.*|JJS>*<NN.*>}

    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}
"""
chunker = nltk.RegexpParser(grammar)
# `sc` is the SparkContext, as provided by the pyspark shell
text = sc.parallelize(textstring.split(' '))

dropURL=text.filter(lambda x: "URL" not in x)

words = dropURL.flatMap(lambda dropURL: dropURL.split(" "))
tree = words.flatMap(lambda words: chunker.parse(nltk.tag.pos_tag(nltk.regexp_tokenize(words, TOKEN_RE))))

# data=tree.map(lambda word: (word,len(word))).filter(lambda t : t.label() =='NBAR') -- error

# data=tree.map(lambda x: (x,len(x)))##.filter(lambda t : t[0] =='NBAR')

# >>>[(Tree('NP', [Tree('NBAR', [('WASHINGTON', 'NN')])]), 1)  Trying to get the length of all NP's and in descending order.

# data=tree.map(lambda x: (x,len(x))).reduceByKey(lambda x: x=='NBAR') ##this is an error but I am getting close I think

data=tree.map(lambda x: (x[0][0],len(x[0][0][0])))#.reduceByKey(lambda x : x[1] =='NP') ##Long run time.

things = data.collect()
things

Answer 1, by 2fjabf4q:

You can add a type check on each entry to prevent the errors:

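result = (
    tree
    # keep only NP subtrees; plain (word, tag) tuples have no label()
    .filter(lambda t: isinstance(t, nltk.tree.Tree) and t.label() == 'NP')
    # key each NP by the length of its first word
    .map(lambda t: (len(t[0][0][0]), 1))
    # count how many words share each length
    .reduceByKey(lambda x, y: x + y)
    # order the pairs by word length, ascending
    .sortByKey()
)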

print(result.collect())

# [(2, 1), (3, 2), (4, 5), (5, 5), (7, 2), (8, 1), (9, 1)]
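
To see what the filter, map, and reduceByKey steps are doing without starting Spark, the same logic can be sketched in plain Python. This is only an illustration under stated assumptions: the `Tree` class below is a hypothetical minimal stand-in for `nltk.tree.Tree` (which also subclasses `list`), the `entries` list is made-up sample data shaped like the chunker output, and `first_word_length` is a helper name introduced here.

```python
from collections import Counter


class Tree(list):
    """Hypothetical stand-in for nltk.tree.Tree: a list of children plus a label."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


# Made-up sample entries: NP subtrees mixed with plain (word, tag) tuples,
# which is the mix that makes an unguarded t.label() call fail.
entries = [
    Tree('NP', [Tree('NBAR', [('bunch', 'NN')])]),
    ('is', 'VBZ'),
    Tree('NP', [Tree('NBAR', [('words', 'NNS')])]),
    ('to', 'TO'),
    Tree('NP', [Tree('NBAR', [('night', 'NN')])]),
    Tree('NP', [Tree('NBAR', [('Kim', 'NNP')])]),
]


def first_word_length(np_tree):
    # np_tree[0] is the NBAR subtree, np_tree[0][0] its first (word, tag)
    # pair, and np_tree[0][0][0] the word itself.
    return len(np_tree[0][0][0])


# Equivalent of filter + map + reduceByKey: count NP words per length.
length_counts = Counter(
    first_word_length(e)
    for e in entries
    if isinstance(e, Tree) and e.label() == 'NP'
)

# Equivalent of sortByKey(): ascending by word length.
by_length = sorted(length_counts.items())

# Alternative ordering: most frequent length first.
by_count_desc = sorted(length_counts.items(), key=lambda kv: kv[1], reverse=True)

print(by_length)
print(by_count_desc)
```

Like the indexing in the answer, this only measures the first word of each NP, which is enough here because the question parses one word at a time.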
