How can Pig efficiently sample 1 TB of data at a rate such as 0.0001?

Posted by btqmn9zl on 2021-06-21 in Pig

How does Pig implement the SAMPLE method? Can I sample the data while reading all of it only once?
Edit: I found an article on this topic. http://had00b.blogspot.com/2013/07/random-subset-in-mapreduce.html It was very helpful.

Answer #1 (eqzww0vc)

Yes, a single pass over the data is enough to sample it at any ratio (call it r), using reservoir sampling:

Let k = SIZE * r //SIZE is the size of the input array
Let R be the result array (of size k), and S be the original (input) array
//first populate R with the first k elements of S
for each i from 1 to k:
    R[i] = S[i]
//then decide at random whether the new candidate replaces an element of R, and which one
for each i from k+1 to SIZE:
    j = random(1,i) //uniformly distributed integer between 1 and i, inclusive
    //keep the new element with probability k/i, replacing a uniformly chosen existing element
    if j <= k:
        R[j] = S[i]
return R

In the end, each element has probability k/SIZE = r of being picked.
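
For anyone who wants to try this outside of Pig, here is a minimal Python sketch of the reservoir sampling idea described in this answer. It is only an illustration and not how Pig's SAMPLE is implemented; the function name `reservoir_sample` and the optional `rng` parameter are made up for this example. The sketch takes k directly, so to get a ratio r you still need to know the total row count (SIZE) in order to compute k = SIZE * r.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick k items uniformly at random from stream in a single pass.

    Hypothetical helper for illustration; returns fewer than k items
    if the stream itself is shorter than k.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            # The first k items fill the reservoir directly.
            reservoir.append(item)
        else:
            # Item i is kept with probability k/i; when kept, it
            # replaces a uniformly chosen slot of the reservoir.
            j = rng.randint(1, i)  # inclusive on both ends
            if j <= k:
                reservoir[j - 1] = item
    return reservoir


if __name__ == "__main__":
    size, r = 100_000, 0.0001      # assumed values for the demo
    k = round(size * r)            # SIZE must be known to derive k
    sample = reservoir_sample(range(size), k)
    print(len(sample), sample)
```

Only the k-element reservoir is held in memory, so the input can be a file or any other iterator that is read once from start to finish.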
