I have a pair RDD like this:
rdd = sc.parallelize([{'f':[1,2,3]},{'f':[1,2]}])
pair_rdd = rdd.flatMap(lambda x: x.keys()).zip(rdd.flatMap(lambda x:x.values()))
reduce_rdd = pair_rdd.reduceByKey(lambda x,y: x+y)
The output, as reduce_rdd:
[('f', [1, 2, 3, 1, 2])]
The value list in reduce_rdd may be very large, so I want to turn it into an RDD of its own. I tried:
reduce_rdd.map(lambda x: sc.parallelize(x[1]))
But this fails with:
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
1 Answer
You can use the values in reduce_rdd by collecting them back to the driver. SparkContext (sc) only exists on the driver, so calling sc.parallelize inside a map function running on workers is what triggers the PicklingError; instead, bring the values to the driver with collect() and call sc.parallelize there.