我问过一个类似的问题,如何计算rdd中每个等距间隔的数目。下面是我实现的代码。
import numpy as np
def count_interval_rdd(input_rdd, buckets):
'''
:param input_rdd: RDD[float]
:param buckets: int, the number of intervals.
:return: array
'''
# reduce, cost much time
max_val = input_rdd.max()
# reduce again!
min_val = input_rdd.min()
# get the size of each intervals.
delta = (max_val - min_val) / buckets
bucket_list = [0 for _ in range(buckets)]
# reduce to get result.
count_dict = input_rdd.map(
lambda x: min(np.floor((x - min_val) / delta), buckets - 1)).countByValue()
for index in sorted(count_dict.keys()):
bucket_list[int(index)] = (count_dict[index])
return bucket_list
rdd = sc.parallelize([1,2,3,4,5,6,6,6,10])
count_interval_rdd(rdd,5)
预期结果:
[2, 2, 4, 0, 1]
但是,我发现代码在计算上花费了很多时间 input_rdd.max()
以及 input_rdd.min()
. 那么如何解决这个问题呢?
暂无答案!
目前还没有任何答案,快来回答吧!