rank-over-function在hive中

xv8emn3q  于 2021-06-02  发布在  Hadoop
关注(0)|答案(4)|浏览(387)

我试图在配置单元中运行这个查询,只返回adimpression表中出现频率最高的前10个url。

select
        ranked_mytable.url,
        ranked_mytable.cnt

from
        ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.url,
        ranked_mytable.rnk desc

;

不幸的是,我收到一条错误消息,指出:

FAILED: SemanticException [Error 10002]: Line 26:23 Invalid column reference 'rnk'

我试着调试它直到 ranked_mytable 一切都很顺利。我试着评论一下 where ranked_mytable.rnk <= 10 但错误信息不断出现。

w3nuxt5m

w3nuxt5m1#

从rank over()中删除partition by iq.url子句,然后重新运行查询。
感谢和问候,kamleshkumar Gujarati

7xzttuei

7xzttuei2#

asrnk 变量。它应该很好用。

ctzwtxfj

ctzwtxfj3#

配置单元无法按不在select语句“输出”中的列排序。要解决此问题,只需在选定列中包含该列:

select
        ranked_mytable.url,
        ranked_mytable.cnt,
        ranked_mytable.rnk

from
        ( select iq.url, iq.cnt, rank() over (partition by iq.url order by iq.cnt desc) rnk
        from
                ( select url, count(*) cnt
                from store.adimpression ai
                        inner join zuppa.adgroupcreativesubscription agcs
                                on agcs.id = ai.adgroupcreativesubscriptionid
                        inner join zuppa.adgroup ag
                                on ag.id = agcs.adgroupid
                where ai.datehour >= '2014-05-15 00:00:00'
                        and ag.siteid = 1240
                group by url
                ) iq
        ) ranked_mytable

where
      ranked_mytable.rnk <= 10

order by
        ranked_mytable.url,
        ranked_mytable.rnk desc

;

如果您不想在最终输出中使用“rnk”列,我希望您可以将整个内容 Package 到另一个内部查询中,然后只选择“url”和“cnt”字段。

bprjcwpo

bprjcwpo4#

RANK OVER 不是实现这一目标的最佳功能。更好的解决方案是使用 SORT BY 以及 LIMIT . 事实上这是真的 LIMIT 随机选取表中的行,但如果与 SORT BY 功能。从apache wiki:

-- Top k queries. The following query returns the top 5 sales records wrt amount. 
SET mapred.reduce.tasks = 1 SELECT * FROM sales SORT BY amount
DESC LIMIT 5

可以这样重新编写查询:

select
        iq.url,
        iq.cnt

from
        ( select url, count(*) cnt
        from store.adimpression ai
          inner join zuppa.adgroupcreativesubscription agcs
            on agcs.id = ai.adgroupcreativesubscriptionid
          inner join zuppa.adgroup ag
            on ag.id = agcs.adgroupid
        where ai.datehour >= '2014-05-15 00:00:00'
          and ag.siteid = 1240
        group by url ) iq

sort by
        iq.cnt desc

limit
        10

;

相关问题