SQL/Impala: applying another group by on the first group by output

Posted by cxfofazt on 2021-06-26 in Impala


I need to run a group by on top of the output of another group by. For example, given table1 below:

id   |   timestamp   |  team
----------------------------
1    |   2016-01-02  |   A
2    |   2016-02-01  |   B
1    |   2016-02-04  |   A
1    |   2016-03-05  |   A
3    |   2016-05-12  |   B
3    |   2016-05-15  |   B
4    |   2016-07-07  |   A
5    |   2016-08-01  |   C
6    |   2015-08-01  |   C
1    |   2015-04-01  |   A

If I run this query:

query = 'select id, max(timestamp) as latest_ts from table1' + \
           ' where timestamp > "2016-01-01 00:00:00" group by id'

I get:

id   |   latest_ts   |
---------------------
2    |   2016-02-01  |  
1    |   2016-03-05  |   
3    |   2016-05-15  |   
4    |   2016-07-07  |   
5    |   2016-08-01  |

However, is it possible to also include the team column, like this?

id   |   latest_ts   |  team
----------------------------
2    |   2016-02-01  |   B
1    |   2016-03-05  |   A
3    |   2016-05-15  |   B
4    |   2016-07-07  |   A
5    |   2016-08-01  |   C
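
One way this shape of result might be produced is to select team and add it to the group by alongside id. The sketch below is not run against Impala: it uses Python's built-in sqlite3 module as a stand-in so the query can be tried on the sample rows, and it assumes each id belongs to exactly one team (true for the data above). The ROWS list and the latest_per_id helper are names made up for this illustration, and the string literal uses single quotes because that is what SQLite expects.

```python
import sqlite3

# Sample rows copied from table1 in the question: (id, timestamp, team).
ROWS = [
    (1, "2016-01-02", "A"),
    (2, "2016-02-01", "B"),
    (1, "2016-02-04", "A"),
    (1, "2016-03-05", "A"),
    (3, "2016-05-12", "B"),
    (3, "2016-05-15", "B"),
    (4, "2016-07-07", "A"),
    (5, "2016-08-01", "C"),
    (6, "2015-08-01", "C"),
    (1, "2015-04-01", "A"),
]


def latest_per_id(rows):
    """Return (id, latest_ts, team) per id, using SQLite as a stand-in engine."""
    con = sqlite3.connect(":memory:")
    con.execute("create table table1 (id integer, timestamp text, team text)")
    con.executemany("insert into table1 values (?, ?, ?)", rows)
    # Grouping by both id and team lets team appear in the select list.
    # This assumes an id never changes team; otherwise an id yields several rows.
    sql = (
        "select id, max(timestamp) as latest_ts, team "
        "from table1 "
        "where timestamp > '2016-01-01 00:00:00' "
        "group by id, team "
        "order by latest_ts"
    )
    result = con.execute(sql).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    for row in latest_per_id(ROWS):
        print(row)
```

If an id could move between teams, grouping by id and team would return one row per (id, team) pair, so a different approach (for example a join back to the latest row) would be needed.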

Ultimately, what I really need is the number of distinct ids per team in 2016. My expected result is:

team  |  count(id)
-------------------
 A    |  2
 B    |  2
 C    |  1

I tried running another group by on top of the first group by result with the code below, but I get a syntax error.

import pandas as pd
query = 'select team, count(id) from ' + \
           '(select id, max(timestamp) as latest_ts from table1' + \
           ' where timestamp > "2016-01-01 00:00:00" group by id)' + \
        'group by team'  

cursor = impala_con.cursor()
cursor.execute('USE history')
cursor.execute(query)
df_result = as_pandas(cursor)
df_result

So is this achievable, and if so, what is the right way to do it? Thanks!
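
For what it is worth, two things in the nested query look like plausible causes of the error: the inner select never returns team, so the outer query has nothing to group on, and the subquery in the from clause has no alias, which Impala requires as far as I know. Below is a minimal sketch of two candidate rewrites, again exercised with sqlite3 rather than Impala, so treat the Impala behavior as an assumption to verify. NESTED_SQL, DISTINCT_SQL, ROWS and run are hypothetical names for this example only.

```python
import sqlite3

# Sample rows copied from table1 in the question: (id, timestamp, team).
ROWS = [
    (1, "2016-01-02", "A"), (2, "2016-02-01", "B"), (1, "2016-02-04", "A"),
    (1, "2016-03-05", "A"), (3, "2016-05-12", "B"), (3, "2016-05-15", "B"),
    (4, "2016-07-07", "A"), (5, "2016-08-01", "C"), (6, "2015-08-01", "C"),
    (1, "2015-04-01", "A"),
]

# Variant 1: keep the nested group by, but carry team through the inner
# query and give the subquery an alias (t).
NESTED_SQL = (
    "select team, count(id) as n_ids from "
    "(select id, team, max(timestamp) as latest_ts from table1"
    " where timestamp > '2016-01-01 00:00:00' group by id, team) t "
    "group by team order by team"
)

# Variant 2: skip the nesting and count distinct ids per team directly.
DISTINCT_SQL = (
    "select team, count(distinct id) as n_ids from table1"
    " where timestamp > '2016-01-01 00:00:00' "
    "group by team order by team"
)


def run(sql, rows=ROWS):
    """Load the sample rows into an in-memory SQLite table and run sql."""
    con = sqlite3.connect(":memory:")
    con.execute("create table table1 (id integer, timestamp text, team text)")
    con.executemany("insert into table1 values (?, ?, ?)", rows)
    result = con.execute(sql).fetchall()
    con.close()
    return result


if __name__ == "__main__":
    print(run(NESTED_SQL))
    print(run(DISTINCT_SQL))
```

On the sample data both variants give the same counts. The count(distinct id) form is shorter when latest_ts is not needed downstream; the nested form is the one to keep if the per-id latest timestamp is also wanted.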

sql impala python

Source: https://stackoverflow.com/questions/38946407/sql-impala-applying-another-group-by-on-the-first-group-by-output