impala-获取具有多个不同值计数的错误

olmpazwi  于 2021-05-29  发布在  Hadoop
关注(0)|答案(2)|浏览(434)

我使用的是cdh-5.4.4cloudera版本,我在hdfs位置有一个csv文件,我的要求是在hadoop环境(oltp)上执行实时sql查询。
所以我决定使用impala,我创建了一个csv文件的metastore表,然后在impala编辑器中执行查询(在应用程序中)。
当我执行下面的查询时,我得到了如下错误
“analysisexception:所有不同的聚合函数都需要与count(distinct city)具有相同的参数集;偏离函数:count(distinct country)”。

CSV File

OrderID,CustomerID,City,Country
Ord01,Cust01,Aachen,Germany
Ord02,Cust01,Albuquerque,USA
Ord03,Cust01,Aachen,Germany
Ord04,Cust02,Arhus,Denmark
Ord05,Cust02,Arhus,Denmark

Problamatic Query

Select CustomerID,Count(Distinct City),Count(Distinct Country) From CustomerOrders Group by CustomerID

问题:
无法使用查询中的多个不同值执行impala查询。。我在网上搜索过,他们提供ndv()方法作为解决方法,但ndv方法只返回不同值的近似计数,我需要多个字段的精确唯一计数。
期望值:
对多个字段进行精确唯一计数的最佳方法是什么?请修改上述查询,以与 Impala 。
注:这不是我原来的表格,我已经复制了论坛的问题。

ecfsfe2w

ecfsfe2w1#

我认为这可以做得更干净(未经测试):

WITH
countries AS
(
 SELECT CustomerID
       ,COUNT(DISTINCT City) AS nr_of_countries
 FROM CustomerOrders
 GROUP BY 1
)
,
cities AS
(
 SELECT CustomerID
       ,COUNT(DISTINCT City) AS nr_of_cities
 FROM CustomerOrders
 GROUP BY 1
)
SELECT CustomerID
      ,nr_of_cities
      ,nr_of_countries
 FROM cities INNER JOIN countries USING (CustomerID)
tquggr8v

tquggr8v2#

我在 Impala 也有同样的问题。以下是我的解决方法:

SELECT CustomerID
    ,sum(nr_of_cities)
    ,sum(nr_of_countries)
FROM (
    SELECT CustomerID
        ,Count(DISTINCT City) AS nr_of_cities
        ,0 AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID

    UNION ALL

    SELECT CustomerID
        ,0 AS nr_of_cities
        ,Count(DISTINCT Country) AS nr_of_countries
    FROM CustomerOrders
    GROUP BY CustomerID
) AS aa
GROUP BY CustomerID

相关问题