rank-over-function在hive中

xv8emn3q 于 2021-06-02 发布在 Hadoop

关注(0)|答案(4)|浏览(387)

我试图在配置单元中运行这个查询，只返回adimpression表中出现频率最高的前10个url。

select
        ranked_mytable.url,
        ranked_mytable.cnt

from
        ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.url,
        ranked_mytable.rnk desc

;

不幸的是，我收到一条错误消息，指出：

FAILED: SemanticException [Error 10002]: Line 26:23 Invalid column reference 'rnk'

我试着调试它直到 ranked_mytable 一切都很顺利。我试着评论一下 where ranked_mytable.rnk <= 10 但错误信息不断出现。

hadoop Hive partitioning rank

来源：https://stackoverflow.com/questions/23703248/rank-over-function-in-hive

4条答案

按热度按时间

w3nuxt5m1#

从rank over（）中删除partition by iq.url子句，然后重新运行查询。
感谢和问候，kamleshkumar Gujarati

赞(0）回复(0）举报 2021-06-03

7xzttuei2#

放 as 在 rnk 变量。它应该很好用。

赞(0）回复(0）举报 2021-06-03

ctzwtxfj3#

配置单元无法按不在select语句“输出”中的列排序。要解决此问题，只需在选定列中包含该列：

select
        ranked_mytable.url,
        ranked_mytable.cnt,
        ranked_mytable.rnk

from
        ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.url,
        ranked_mytable.rnk desc

;

如果您不想在最终输出中使用“rnk”列，我希望您可以将整个内容 Package 到另一个内部查询中，然后只选择“url”和“cnt”字段。

赞(0）回复(0）举报 2021-06-03

bprjcwpo4#

RANK OVER 不是实现这一目标的最佳功能。更好的解决方案是使用 SORT BY 以及 LIMIT . 事实上这是真的 LIMIT 随机选取表中的行，但如果与 SORT BY 功能。从apache wiki：

-- Top k queries. The following query returns the top 5 sales records wrt amount. 
SET mapred.reduce.tasks = 1 SELECT * FROM sales SORT BY amount
DESC LIMIT 5

可以这样重新编写查询：

select
        iq.url,
        iq.cnt

from
        ( select url, count(*) cnt
        from store.adimpression ai
          inner join zuppa.adgroupcreativesubscription agcs
            on agcs.id = ai.adgroupcreativesubscriptionid
          inner join zuppa.adgroup ag
            on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
          and ag.siteid = 1240
        group by url ) iq

sort by
        iq.cnt desc

limit
        10

;

赞(0）回复(0）举报 2021-06-03

我来回答

rank-over-function在hive中

4条答案

相关问题

热门标签

最新问答