sql—获取分区中某一行(列)的序列号，而不使用函数上的row\ u number()

nwsw7zdq 于 2021-06-26 发布在 Impala

关注(0)|答案(3)|浏览(281)

我需要按分区（或组）对行进行排序，即如果我的源表是：

NAME PRICE
---- -----
AAA  1.59
AAA  2.00
AAA  0.75
BBB  3.48
BBB  2.19
BBB  0.99
BBB  2.50

我想得到目标表：

RANK NAME PRICE
---- ---- -----
1    AAA  0.75
2    AAA  1.59
3    AAA  2.00
1    BBB  0.99
2    BBB  2.19
3    BBB  2.50
4    BBB  3.48

通常我会用 ROW_NUMBER() OVER 函数，所以在apache hive中：

select
  row_number() over (partition by NAME order by PRICE) as RANK,
  NAME,
  PRICE
from
  MY_TABLE
;

不幸的是，cloudera impala不支持（目前） ROW_NUMBER() OVER 函数，所以我正在寻找解决方法。最好不要使用udaf，因为在政治上很难说服将其部署到服务器。
谢谢你的帮助。

sql Hive impala

来源：https://stackoverflow.com/questions/23425484/get-sequential-number-of-a-row-rank-within-a-partition-without-using-row-numbe

3条答案

按热度按时间

3pvhb19x1#

对于不支持窗口功能的系统，通常的解决方法如下：

select name, 
       price,
       (select count(*) 
        from my_table t2 
        where t2.name = t1.name  -- this is the "partition by" replacement
        and t2.price < t1.price) as row_number
from my_table t1
order by name, price;

sqlfiddle示例：http://sqlfiddle.com/#!2/3b027/2号

赞(0）回复(0）举报 2021-06-27

zpgglvta2#

这并不是如何使用impala的答案，但是hadoop上还有其他sql解决方案，它们已经提供了分析和子查询选项。如果没有这些功能，您可能将不得不依赖多步骤流程或一些udaf。
我是英菲尼德的建筑师
infinidb支持分析函数和子查询。
http://infinidb.co
在radiant advisors的基准测试中查看查询8，它是您所追求的类似样式的查询，使用秩分析函数。presto也能够以较慢（80倍）的速度运行这种风格的查询http://radiantadvisors.com/wp-content/uploads/2014/04/radiantadvisors_benchmark_sql-on-hadoop_2014q1.pdf
来自基准的查询（查询8）

SELECT
    sub.visit_entry_idaction_url,
    sub.name,
    lv.referer_url,
    sum(visit_ total_time) total_time,
    count(sub.idvisit),
    RANK () OVER (PARTITION BY sub. visit_entry_idaction_url
ORDER BY
    count(sub.idvisit)) rank_by_visits,
    DENSE_RANK() OVER (PARTITION BY sub.visit_entry_idaction_url
ORDER BY
    count(visit_total_time)) rank_by_ time_spent
FROM
    log_visit lv,
    (
SELECT
    visit_entry_idaction_url,
    name,
    idvisit
FROM
    log_visit JOIN log_ action
        ON
        (visit_entry_idaction_url = log_action.idaction)
WHERE
    visit_ entry_idaction_url between 2301400 AND
    2302400) sub
WHERE
    lv.idvisit = sub.idvisit
GROUP BY
    1, 2, 3
ORDER BY
    1, 6, 7;

结果

Hive 0.12       Not Executable  
Presto 0.57     506.84s  
InfiniDB 4.0    6.37s  
Impala 1.2      Not Executable

赞(0）回复(0）举报 2021-06-26

5jvtdoz23#

如果不能使用相关子查询执行此操作，则仍然可以使用联接执行此操作：

select t1.name, t1.price,
       coalesce(count(t2.name) + 1, 1)
from my_table t1 join
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
order by t1.name, t1.price;

请注意，这并不完全适用 row_number() 除非所有的价格都是不同的 name . 这个公式实际上相当于 rank() .
为了 row_number() ，需要唯一的行标识符。
顺便说一句，以下是相当于 dense_rank() :

select t1.name, t1.price,
       coalesce(count(distinct t2.name) + 1, 1)
from my_table t1 join
     my_table t2
     on t2.name = t1.name and
        t2.price < t1.price
order by t1.name, t1.price;

赞(0）回复(0）举报 2021-06-26

我来回答

sql—获取分区中某一行(列)的序列号，而不使用函数上的row\ u number()

3条答案

相关问题

热门标签

最新问答