Hadoop SQL: COUNT DISTINCT with a condition based on another column

sxissh06, posted 2023-05-16 in Hadoop

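I'm stuck on this problem and can't find a solution.
I have a table, and I want to count the number of parts per station and error type. However, if a part has multiple errors at the same station, it should be counted only under the lexicographically highest error.
The data looks like this: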

| station     |   error   |   uniquepart_id   |
| ----------- | --------- | ----------------- |
| A           | ERR_01    | 0001              |
| A           | ERR_01    | 0001              |
| A           | ERR_02    | 0002              |
| A           | ERR_02    | 0002              |
| A           | ERR_03    | 0001              |
| A           | ERR_03    | 0002              |
| A           | ERR_03    | 0003              |
| A           | ERR_03    | 0004              |
| B           | ERR_01    | 0005              |
| B           | ERR_01    | 0006              |
| B           | ERR_02    | 0007              |
| B           | ERR_02    | 0008              |
| B           | ERR_03    | 0009              |
| B           | ERR_03    | 0010              |
| B           | ERR_03    | 0011              |
| B           | ERR_03    | 0012              |

I wrote the following query:

SELECT station, error, COUNT(DISTINCT uniquepart_id) AS num_parts
  FROM Tablename
 WHERE process_date = 'xx-xx-xxxx'
GROUP BY station, error

This is the result I get:

| station     |   error   |   num_parts   |
| ----------- | --------- | ------------- |
| A           | ERR_01    | 1             |
| A           | ERR_02    | 1             |
| A           | ERR_03    | 4             |
| B           | ERR_01    | 2             |
| B           | ERR_02    | 2             |
| B           | ERR_03    | 4             |

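This is what I'm looking for:

| station     |   error   |   num_parts   |
| ----------- | --------- | ------------- |
| A           | ERR_03    | 4             |
| B           | ERR_01    | 2             |
| B           | ERR_02    | 2             |
| B           | ERR_03    | 4             |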

I tried using MAX and HAVING to filter the rows within each group, but I got syntax errors. I think this could be solved with an inner query.

nwwlzxa7 #1

Process the data in two stages.
First (the inner query), collapse the data to one error code per part per station.
Then aggregate as you already do.

SELECT
  station, error, COUNT(*) AS num_parts
FROM
(
  SELECT station, MAX(error) AS error, uniquepart_id
    FROM Tablename
   WHERE process_date = 'xx-xx-xxxx'
GROUP BY station, uniquepart_id
)
  AS station_error_per_part
GROUP BY
  station, error
ORDER BY
  station, error
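
To check the two-stage query against the sample data from the question, here is a minimal sketch that runs it in an in-memory SQLite database from Python. SQLite stands in for the Hadoop SQL engine here, which is an assumption. The helper name `count_parts_by_highest_error` is hypothetical, and the `process_date` filter is left out because the sample table has no such column.

```python
import sqlite3

# Sample rows from the question: (station, error, uniquepart_id).
ROWS = [
    ("A", "ERR_01", "0001"), ("A", "ERR_01", "0001"),
    ("A", "ERR_02", "0002"), ("A", "ERR_02", "0002"),
    ("A", "ERR_03", "0001"), ("A", "ERR_03", "0002"),
    ("A", "ERR_03", "0003"), ("A", "ERR_03", "0004"),
    ("B", "ERR_01", "0005"), ("B", "ERR_01", "0006"),
    ("B", "ERR_02", "0007"), ("B", "ERR_02", "0008"),
    ("B", "ERR_03", "0009"), ("B", "ERR_03", "0010"),
    ("B", "ERR_03", "0011"), ("B", "ERR_03", "0012"),
]

# Two-stage query from the answer, without the process_date filter
# (assumption: the sample data has no process_date column).
QUERY = """
SELECT station, error, COUNT(*) AS num_parts
FROM (
    -- Stage 1: one row per (station, part), keeping the highest error code.
    SELECT station, MAX(error) AS error, uniquepart_id
    FROM Tablename
    GROUP BY station, uniquepart_id
) AS station_error_per_part
-- Stage 2: count parts per (station, error).
GROUP BY station, error
ORDER BY station, error
"""


def count_parts_by_highest_error(rows):
    """Load rows into an in-memory table and run the two-stage query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Tablename "
            "(station TEXT, error TEXT, uniquepart_id TEXT)"
        )
        conn.executemany("INSERT INTO Tablename VALUES (?, ?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in count_parts_by_highest_error(ROWS):
        print(row)
```

On the sample data, parts 0001 and 0002 at station A collapse to ERR_03 in the inner query, so ERR_01 and ERR_02 drop out of station A's results.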

jum4pzuy #2

select * from
(
SELECT station, error, COUNT(DISTINCT uniquepart_id) AS num_parts
  , dense_rank()over(partition by error order by COUNT(DISTINCT uniquepart_id) desc) rn
FROM test
-- WHERE (process_date= 'xx-xx-xxxx')
GROUP BY station, error
) t
where rn=1
order by station, error

Update 1: @MatBailie, thanks for pointing out the error.

select * from
(
SELECT station, error, COUNT(*) AS num_parts
  , dense_rank()over(partition by station,error order by COUNT(*) desc) rn
FROM (SELECT station,uniquepart_id,max(error) error
    FROM test
    -- WHERE (process_date= 'xx-xx-xxxx')
    GROUP BY station,uniquepart_id
  ) p
GROUP BY station, error
) t
where rn=1
order by station, error

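The rn value is, of course, redundant: the window is partitioned by station and error, which are also the GROUP BY columns, so each partition holds exactly one row and rn is always 1.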
