Hadoop SQL: COUNT DISTINCT with a condition based on another column

Posted by sxissh06 on 2023-05-16 in Hadoop

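I'm stuck on this problem and can't find a solution.
I have a table, and I want to count the number of parts per station and error type. However, if a part has several errors at the same station, it should be counted only under the lexicographically highest error.
The data looks like this: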

| station     |   error   |   uniquepart_id   |
| ----------- | --------- | ----------------- |
| A           | ERR_01    | 0001              |
| A           | ERR_01    | 0001              |
| A           | ERR_02    | 0002              |
| A           | ERR_02    | 0002              |
| A           | ERR_03    | 0001              |
| A           | ERR_03    | 0002              |
| A           | ERR_03    | 0003              |
| A           | ERR_03    | 0004              |
| B           | ERR_01    | 0005              |
| B           | ERR_01    | 0006              |
| B           | ERR_02    | 0007              |
| B           | ERR_02    | 0008              |
| B           | ERR_03    | 0009              |
| B           | ERR_03    | 0010              |
| B           | ERR_03    | 0011              |
| B           | ERR_03    | 0012              |

I wrote the following query:

SELECT station, error, COUNT(DISTINCT uniquepart_id) AS num_parts
       FROM Tablename
       WHERE (process_date= 'xx-xx-xxxx')
       GROUP BY station, error

This is the result I get:

| station     |   error   |   num_parts   |
| ----------- | --------- | ------------- |
| A           | ERR_01    | 1             |
| A           | ERR_02    | 1             |
| A           | ERR_03    | 4             |
| B           | ERR_01    | 2             |
| B           | ERR_02    | 2             |
| B           | ERR_03    | 4             |

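This is what I'm looking for:

| station     |   error   |   num_parts   |
| ----------- | --------- | ------------- |
| A           | ERR_03    | 4             |
| B           | ERR_01    | 2             |
| B           | ERR_02    | 2             |
| B           | ERR_03    | 4             |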

I tried using MAX and HAVING to filter the rows in each group, but I got syntax errors. I think it can be solved with an inner query.
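To see why the first query over-counts at station A, it may help to list the parts that carry more than one error type at the same station. The snippet below is only a sketch: it loads the sample rows from the question into an in-memory SQLite database through Python's `sqlite3` module, as a stand-in for the Hadoop SQL engine (an assumption, since the question does not say which engine is used), and it leaves out the `process_date` filter because the sample data has no such column.

```python
import sqlite3

# Sample rows copied from the question: (station, error, uniquepart_id).
rows = [
    ("A", "ERR_01", "0001"), ("A", "ERR_01", "0001"),
    ("A", "ERR_02", "0002"), ("A", "ERR_02", "0002"),
    ("A", "ERR_03", "0001"), ("A", "ERR_03", "0002"),
    ("A", "ERR_03", "0003"), ("A", "ERR_03", "0004"),
    ("B", "ERR_01", "0005"), ("B", "ERR_01", "0006"),
    ("B", "ERR_02", "0007"), ("B", "ERR_02", "0008"),
    ("B", "ERR_03", "0009"), ("B", "ERR_03", "0010"),
    ("B", "ERR_03", "0011"), ("B", "ERR_03", "0012"),
]

# SQLite is used here only as a local stand-in for the real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tablename (station TEXT, error TEXT, uniquepart_id TEXT)"
)
conn.executemany("INSERT INTO Tablename VALUES (?, ?, ?)", rows)

# Parts that appear under more than one error type at the same station.
multi_error_parts = conn.execute("""
    SELECT station, uniquepart_id, COUNT(DISTINCT error) AS num_errors
      FROM Tablename
  GROUP BY station, uniquepart_id
    HAVING COUNT(DISTINCT error) > 1
  ORDER BY station, uniquepart_id
""").fetchall()

print(multi_error_parts)
# [('A', '0001', 2), ('A', '0002', 2)]
```

Parts 0001 and 0002 are the ones counted under ERR_01 and ERR_02 as well as ERR_03 at station A, which is where the unwanted rows in the first result come from.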

hadoop

Source: https://stackoverflow.com/questions/76191239/sql-count-distinct-with-condition-based-on-another-column

2 answers

nwwlzxa71#

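Process the data in two stages.
First, an inner query collapses the data to one error code per part per station.
Then aggregate the result the same way you already do.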

SELECT
  station, error, COUNT(*) AS num_parts
FROM
(
  SELECT station, MAX(error) AS error, uniquepart_id
    FROM Tablename
   WHERE process_date = 'xx-xx-xxxx'
GROUP BY station, uniquepart_id
)
  AS station_error_per_part
GROUP BY
  station, error
ORDER BY
  station, error
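As a quick check, the two-stage query can be run against the sample rows from the question. This is a minimal sketch under a few assumptions: SQLite (via Python's `sqlite3`) stands in for the Hadoop SQL engine, and the `process_date` filter is dropped because the sample data does not include that column.

```python
import sqlite3

# Sample rows copied from the question: (station, error, uniquepart_id).
rows = [
    ("A", "ERR_01", "0001"), ("A", "ERR_01", "0001"),
    ("A", "ERR_02", "0002"), ("A", "ERR_02", "0002"),
    ("A", "ERR_03", "0001"), ("A", "ERR_03", "0002"),
    ("A", "ERR_03", "0003"), ("A", "ERR_03", "0004"),
    ("B", "ERR_01", "0005"), ("B", "ERR_01", "0006"),
    ("B", "ERR_02", "0007"), ("B", "ERR_02", "0008"),
    ("B", "ERR_03", "0009"), ("B", "ERR_03", "0010"),
    ("B", "ERR_03", "0011"), ("B", "ERR_03", "0012"),
]

# SQLite is used here only as a local stand-in for the real engine.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Tablename (station TEXT, error TEXT, uniquepart_id TEXT)"
)
conn.executemany("INSERT INTO Tablename VALUES (?, ?, ?)", rows)

# Stage 1 (inner query): keep the highest error per station and part.
# Stage 2 (outer query): count parts per station and error.
result = conn.execute("""
    SELECT station, error, COUNT(*) AS num_parts
      FROM (
            SELECT station, MAX(error) AS error, uniquepart_id
              FROM Tablename
          GROUP BY station, uniquepart_id
           ) AS station_error_per_part
  GROUP BY station, error
  ORDER BY station, error
""").fetchall()

for row in result:
    print(row)
# ('A', 'ERR_03', 4)
# ('B', 'ERR_01', 2)
# ('B', 'ERR_02', 2)
# ('B', 'ERR_03', 4)
```

Because the inner query already returns one row per station and part, `COUNT(*)` in the outer query is enough and `COUNT(DISTINCT uniquepart_id)` is no longer needed.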

2023-05-16

jum4pzuy2#

select * from
(
SELECT station, error, COUNT(DISTINCT uniquepart_id) AS num_parts
  , dense_rank()over(partition by error order by COUNT(DISTINCT uniquepart_id) desc) rn
FROM test
-- WHERE (process_date= 'xx-xx-xxxx')
GROUP BY station, error
) t
where rn=1
order by station, error

Update 1: @MatBailie, thanks for pointing out the error.

select * from
(
SELECT station, error, COUNT(*) AS num_parts
  , dense_rank()over(partition by station,error order by COUNT(*) desc) rn
FROM (SELECT station,uniquepart_id,max(error) error
    FROM test
    -- WHERE (process_date= 'xx-xx-xxxx')
    GROUP BY station,uniquepart_id
  ) p
GROUP BY station, error
) t
where rn=1
order by station, error

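The rn value is, of course, redundant: after GROUP BY station, error, each (station, error) partition contains exactly one row, so dense_rank() always returns 1.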
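The redundancy of `rn` can be checked on the sample rows from the question. The sketch below assumes SQLite (through Python's `sqlite3`, built against SQLite 3.25 or later for window-function support) as a stand-in for the Hadoop SQL engine. It runs the updated query without the `rn=1` filter, so every computed rank is visible.

```python
import sqlite3

# Sample rows copied from the question: (station, error, uniquepart_id).
rows = [
    ("A", "ERR_01", "0001"), ("A", "ERR_01", "0001"),
    ("A", "ERR_02", "0002"), ("A", "ERR_02", "0002"),
    ("A", "ERR_03", "0001"), ("A", "ERR_03", "0002"),
    ("A", "ERR_03", "0003"), ("A", "ERR_03", "0004"),
    ("B", "ERR_01", "0005"), ("B", "ERR_01", "0006"),
    ("B", "ERR_02", "0007"), ("B", "ERR_02", "0008"),
    ("B", "ERR_03", "0009"), ("B", "ERR_03", "0010"),
    ("B", "ERR_03", "0011"), ("B", "ERR_03", "0012"),
]

# SQLite is used here only as a local stand-in for the real engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test (station TEXT, error TEXT, uniquepart_id TEXT)")
conn.executemany("INSERT INTO test VALUES (?, ?, ?)", rows)

# Updated query from the answer, with the rn=1 filter removed on purpose.
ranked = conn.execute("""
    SELECT station, error, COUNT(*) AS num_parts,
           dense_rank() OVER (PARTITION BY station, error
                              ORDER BY COUNT(*) DESC) AS rn
      FROM (
            SELECT station, uniquepart_id, MAX(error) AS error
              FROM test
          GROUP BY station, uniquepart_id
           ) p
  GROUP BY station, error
  ORDER BY station, error
""").fetchall()

for row in ranked:
    print(row)
# ('A', 'ERR_03', 4, 1)
# ('B', 'ERR_01', 2, 1)
# ('B', 'ERR_02', 2, 1)
# ('B', 'ERR_03', 4, 1)
```

Every row has `rn = 1`, so the `where rn=1` filter removes nothing. Dropping the window function and the outer select leaves the same two-stage aggregation.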

2023-05-16
