spark：如何有效地计算rdd中每个等距间隔的数目

tcbh2hod 于 2021-05-27 发布在 Spark

关注(0)|答案(0)|浏览(215)

我问过一个类似的问题，如何计算rdd中每个等距间隔的数目。下面是我实现的代码。

import numpy as np
def count_interval_rdd(input_rdd, buckets):
    '''
    :param input_rdd: RDD[float]
    :param buckets: int, the number of intervals.
    :return: array
    '''
    # reduce, cost much time
    max_val = input_rdd.max()
    # reduce again!
    min_val = input_rdd.min()
    # get the size of each intervals.
    delta = (max_val - min_val) / buckets
    bucket_list = [0 for _ in range(buckets)]
    # reduce to get result.
    count_dict = input_rdd.map(
        lambda x: min(np.floor((x - min_val) / delta), buckets - 1)).countByValue()
    for index in sorted(count_dict.keys()):
        bucket_list[int(index)] = (count_dict[index])
    return bucket_list

rdd = sc.parallelize([1,2,3,4,5,6,6,6,10])
count_interval_rdd(rdd,5)

预期结果：