How do I convert the values of a pair RDD into an RDD?

e37o9pze · asked 2021-05-27 · in Spark

I have a pair RDD like this:

rdd = sc.parallelize([{'f':[1,2,3]},{'f':[1,2]}])
pair_rdd = rdd.flatMap(lambda x: x.keys()).zip(rdd.flatMap(lambda x:x.values()))
reduce_rdd = pair_rdd.reduceByKey(lambda x,y: x+y)

The output is:

[('f', [1, 2, 3, 1, 2])]

Since reduce_rdd may be very large, I want to convert the value of reduce_rdd into its own RDD, so I tried:

reduce_rdd.map(lambda x: sc.parallelize(x[1]))

But this fails with:

PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

brvekthn1#

SparkContext only exists on the driver, so sc.parallelize cannot be called inside a transformation that runs on the workers (which is what the SPARK-5063 error is telling you). Instead, you can build a new RDD from the values of reduce_rdd by collecting them to the driver first:

sc.parallelize(reduce_rdd.flatMap(lambda x:x[1]).collect())

# using flatMap transformation

sc.parallelize(reduce_rdd.flatMap(lambda x:x[1]).collect()).collect()

# [1, 2, 3, 1, 2]

# using map

sc.parallelize(reduce_rdd.map(lambda x:x[1]).collect()).collect()

# [[1, 2, 3, 1, 2]]
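Note that if the concern is that reduce_rdd is large, the collect()/parallelize round-trip through the driver can be skipped entirely: flatMap is a transformation and already returns a new RDD, so the values never have to leave the cluster. A minimal sketch, reusing the reduce_rdd defined in the question:

# flatMap alone already yields an RDD of the flattened values,
# with no round-trip through the driver
values_rdd = reduce_rdd.flatMap(lambda x: x[1])

values_rdd.collect()

# [1, 2, 3, 1, 2]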

from pyspark.sql.types import IntegerType

# if you want to convert the RDD to a DataFrame

spark.createDataFrame(reduce_rdd.flatMap(lambda x:x[1]).collect(),IntegerType()).show()
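createDataFrame also accepts an RDD in place of a collected list, which again keeps the values off the driver; a minimal sketch of the same conversion under that approach:

# build the DataFrame directly from the RDD, without collect()
spark.createDataFrame(reduce_rdd.flatMap(lambda x: x[1]), IntegerType()).show()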
