How to write queries that avoid a single reducer in select distinct and size(collect_set) Hive queries?

Posted by slmsl1lt on 2021-05-29 in Hadoop


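How can I rewrite these queries so that the reduce stage does not run on a single reducer? It takes a very long time, and I lose the benefit of parallelism.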

select id
, count(distinct locations) AS unique_locations
  from
  mytable
;

and

select id
, size(collect_set(locations)) AS unique_locations
  from
  mytable
;

1 Answer

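For count(distinct var), split the work into two queries: an inner query that deduplicates and an outer query that counts the result: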

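-- Outer query: count the rows that are already deduplicated
SELECT
  count(1) AS unique_locations
FROM (
  -- Inner query: deduplicate; this stage can use multiple reducers
  SELECT DISTINCT locations
  FROM mytable
) t;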
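
If the intent in the question was one count per `id` (an assumption, since the original queries select `id` without a GROUP BY), the same two-step idea might be sketched as follows, using the question's table name `mytable`. The inner GROUP BY deduplicates (id, locations) pairs and can be spread over many reducers; the outer query then only counts rows per id.

```sql
-- Sketch only: assumes a per-id count is wanted (not stated in the question).
SELECT
  id,
  count(*) AS unique_locations   -- each row is one distinct location for this id
FROM (
  -- Deduplicate (id, locations) pairs; grouping on both columns
  -- lets Hive distribute the work across reducers.
  SELECT id, locations
  FROM mytable
  GROUP BY id, locations
) t
GROUP BY id;
```

Whether this is actually faster depends on the data distribution and the Hive version, so it is worth comparing the plans of both forms with EXPLAIN.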

I think the same approach applies to size(collect_set(...)):

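-- Outer query: take the size of the collected set
SELECT
  size(unique_locations) AS unique_locations
FROM (
  -- Inner query: build the set of distinct locations
  SELECT collect_set(locations) AS unique_locations
  FROM mytable
) t;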
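
If only the number of distinct values is needed, one possible variant deduplicates in a subquery before collecting, so that collect_set sees each value at most once. This is a sketch and not part of the original answer; the final aggregation has no GROUP BY key and may still end in a single reducer, but it then handles only the deduplicated rows.

```sql
-- Sketch only: uses the question's table name mytable.
SELECT
  size(collect_set(locations)) AS unique_locations
FROM (
  -- Deduplicate first; this stage can be spread over multiple reducers.
  SELECT locations
  FROM mytable
  GROUP BY locations
) t;
```

When the array itself is not needed, counting the rows of the deduplicated subquery with count(*) gives the same number without building the set at all.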