Spark: how to efficiently count the number of elements in each equal-width interval of an RDD

tcbh2hod · posted 2021-05-27 in Spark

I previously asked a similar question about how to count the number of elements falling into each equal-width interval of an RDD. Below is the code I implemented.

import numpy as np

def count_interval_rdd(input_rdd, buckets):
    '''
    :param input_rdd: RDD[float]
    :param buckets: int, the number of intervals.
    :return: list of counts, one per interval.
    '''
    # First full pass over the RDD (a reduce), which is expensive.
    max_val = input_rdd.max()
    # Second full pass (another reduce)!
    min_val = input_rdd.min()
    # Width of each interval; guard against a constant-valued RDD,
    # which would otherwise cause a division by zero below.
    delta = (max_val - min_val) / buckets
    if delta == 0:
        return [input_rdd.count()] + [0] * (buckets - 1)
    bucket_list = [0 for _ in range(buckets)]
    # Third pass: map each value to its bucket index (clamping the
    # maximum onto the last bucket) and count by index.
    count_dict = input_rdd.map(
        lambda x: min(np.floor((x - min_val) / delta), buckets - 1)).countByValue()
    for index in sorted(count_dict.keys()):
        bucket_list[int(index)] = count_dict[index]
    return bucket_list

# Assumes a live SparkContext `sc`, e.g. from the pyspark shell.
rdd = sc.parallelize([1, 2, 3, 4, 5, 6, 6, 6, 10])
count_interval_rdd(rdd, 5)

Expected result:

[2, 2, 4, 0, 1]
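(To spell out the arithmetic: min = 1, max = 10, so delta = (10 - 1) / 5 = 1.8 and the intervals are [1, 2.8), [2.8, 4.6), [4.6, 6.4), [6.4, 8.2), [8.2, 10]. Values 1 and 2 land in the first bucket, 3 and 4 in the second, 5, 6, 6, 6 in the third, nothing in the fourth, and 10, clamped onto the last bucket, in the fifth.)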

However, I found that the code spends a lot of time in input_rdd.max() and input_rdd.min(), since each one triggers a separate full pass (reduce) over the RDD. How can this be avoided?
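One possible direction, as a minimal sketch rather than a tested solution: fold the two reductions into a single aggregate() pass, so min and max come out of one job and the whole histogram needs two scans of the RDD instead of three. The name count_interval_rdd_fast below is made up for illustration. PySpark's built-in RDD.histogram(buckets) covers the same use case and, as far as I know, also computes min and max together in one pass.

# Sketch: fuse min and max into one aggregate() pass (two RDD scans total).
def count_interval_rdd_fast(input_rdd, buckets):
    # Single pass: each partition folds into a (min, max) pair,
    # then the per-partition pairs are merged on the driver.
    min_val, max_val = input_rdd.aggregate(
        (float('inf'), float('-inf')),
        lambda acc, x: (min(acc[0], x), max(acc[1], x)),
        lambda a, b: (min(a[0], b[0]), max(a[1], b[1])))
    delta = (max_val - min_val) / buckets
    if delta == 0:
        return [input_rdd.count()] + [0] * (buckets - 1)
    # Second pass: bucket index per value, clamping max into the last bucket.
    count_dict = input_rdd.map(
        lambda x: min(int((x - min_val) // delta), buckets - 1)).countByValue()
    return [count_dict.get(i, 0) for i in range(buckets)]

count_interval_rdd_fast(rdd, 5)  # expected: [2, 2, 4, 0, 1]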

No answers yet.
