How to correctly set up groupByKey for non-pair RDDs using PySpark

bnlyeluc  posted on 2021-06-01 in Hadoop

I am new to Python, and also new to PySpark. I am trying to run code that takes tuples of the form (id, (span, mention)) and applies .map(lambda (id, (span, text)): (id, text)) to them.
The code I am using is:

m = (text
     .map(lambda (id, (span, text)): (id, text))
     .mapValues(lambda v: ngrams(v, self.max_ngram))  # error triggered here
     .flatMap(lambda (target, tokens): (((target, t), 1) for t in tokens)))

This is how the raw data is formatted (id, source, span, text):

[{'_id': u'en.wikipedia.org/wiki/Cerambycidae',
  'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens',
  'span': (61, 73),
  'text': u'"Plinthocoelium virens" is a species of beetle in the family Cerambycidae.'},
 {'_id': u'en.wikipedia.org/wiki/Dru_Drury',
  'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens',
  'span': (20, 29),
  'text': u'It was described by Dru Drury in 1770.'}]

I get this error:

for k, v in iterator:
TypeError: tuple indices must be integers, not str

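I know that groupByKey works on pair RDDs, so I would like to know how to perform groupByKey correctly to solve this problem.
Any help or guidance would be greatly appreciated.
I am using Python 2.7 and PySpark 2.3.0.
Thanks in advance.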

a5g8bdjr  #1

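You first need to map the data into a form that has a key and a value, and only then call groupByKey.
A key-value pair is always a tuple (a, b), where a is the key and b is the value. Both a and b may themselves be tuples.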

rdd = sc.parallelize([{'_id': u'en.wikipedia.org/wiki/Cerambycidae',
  'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens',
  'span': (61, 73),
  'text': u'"Plinthocoelium virens" is a species of beetle in the family Cerambycidae.'},
 {'_id': u'en.wikipedia.org/wiki/Dru_Drury',
  'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens',
  'span': (20, 29),
  'text': u'It was described by Dru Drury in 1770.'},
 {'_id': u'en.wikipedia.org/wiki/Dru_Drury',
  'source': 'en.wikipedia.org/wiki/Plinthocoelium_virens2',
  'span': (20, 29, 2),
  'text': u'It was described by Dru Drury in 1770.2'}])

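# Reshape each dict into (key, value) = (_id, (span, text)), then group by key.
grouped = (rdd.map(lambda x: (x["_id"], (x["span"], x["text"])))
              .groupByKey()
              .map(lambda x: (x[0], list(x[1]))))  # materialize each group as a list
print(grouped.collect())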

[(u'en.wikipedia.org/wiki/Dru_Drury', [((20, 29), u'It was described by Dru Drury in 1770.'), ((20, 29, 2), u'It was described by Dru Drury in 1770.2')]), (u'en.wikipedia.org/wiki/Cerambycidae', [((61, 73), u'"Plinthocoelium virens" is a species of beetle in the family Cerambycidae.')])]
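
The same reshaping idea can be carried through to the n-gram step from the question. The following is only a minimal sketch, assuming the records arrive as dicts like the ones shown above. The asker's real `ngrams` helper and `self.max_ngram` are not shown in the question, so the `ngrams` function here is a hypothetical stand-in, and `to_pair` and `target_token_pairs` are illustrative names. It uses named functions with ordinary unpacking instead of tuple-parameter lambdas, which exist only in Python 2. The logic is run on a plain list so the intermediate shapes are easy to inspect.

```python
from collections import Counter


def to_pair(record):
    """Reshape one raw dict record into a (key, value) tuple: (id, text)."""
    return (record["_id"], record["text"])


def ngrams(text, max_n):
    """Hypothetical stand-in for the asker's helper: word n-grams up to max_n."""
    words = text.split()
    return [" ".join(words[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(words) - n + 1)]


def target_token_pairs(pair, max_n=2):
    """Emit ((target, token), 1) for every n-gram of the record's text."""
    target, text = pair  # unpack inside the body, valid in Python 2 and 3
    return [((target, t), 1) for t in ngrams(text, max_n)]


records = [
    {"_id": "en.wikipedia.org/wiki/Dru_Drury",
     "source": "en.wikipedia.org/wiki/Plinthocoelium_virens",
     "span": (20, 29),
     "text": "It was described by Dru Drury in 1770."},
]

# Plain-Python equivalent of map -> flatMap -> sum per key.
pairs = [to_pair(r) for r in records]
counts = Counter()
for pair in pairs:
    for key, one in target_token_pairs(pair):
        counts[key] += one

# With an RDD holding the same dicts, the corresponding chain would be roughly:
#   rdd.map(to_pair).flatMap(target_token_pairs).reduceByKey(lambda a, b: a + b)
```

Because every element is a two-item tuple after `to_pair`, pair operations such as `mapValues`, `groupByKey`, or `reduceByKey` have the shape they expect.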
