How to add rows for missing data combinations and fill the corresponding field with 0

weylhg0b  posted on 2021-06-18  in Mysql

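I have combinations of domain and month, together with the total orders for each month. I want to impute the missing combinations with a value of 0. What are the cheapest aggregation commands in pyspark to achieve this?
I have the following input table: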

domain      month    year   total_orders
google.com  01       2017   20
yahoo.com   02       2017   30
google.com  03       2017   30
yahoo.com   03       2017   40
a.com       04       2017   50
a.com       05       2017   50
a.com       06       2017   50

Expected output:

domain      month    year   total_orders
google.com  01       2017   20
yahoo.com   02       2017   30
google.com  03       2017   30
yahoo.com   03       2017   40
a.com       04       2017   50
a.com       05       2017   50
a.com       06       2017   50
google.com  02       2017   0
google.com  04       2017   0
yahoo.com   04       2017   0
google.com  05       2017   0
yahoo.com   05       2017   0
google.com  06       2017   0
yahoo.com   06       2017   0

The order of the rows in the expected output does not matter.

js4nwp54 #1

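The simplest approach is to cross join every distinct month/year pair with every distinct domain, then left join the original data to pick up the order totals: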

-- my: every month/year seen in the data; d: every domain; t: the original rows
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain;
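
To check the query against the sample table from the question, here is a minimal sketch that runs it in an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in for whatever engine you actually use (an assumption), the helper name `fill_all_combinations` is hypothetical, the table is aliased as `input t`, and an `order by` is added purely to make the printout readable.

```python
import sqlite3

# Sample rows from the question: (domain, month, year, total_orders).
ROWS = [
    ("google.com", "01", 2017, 20),
    ("yahoo.com", "02", 2017, 30),
    ("google.com", "03", 2017, 30),
    ("yahoo.com", "03", 2017, 40),
    ("a.com", "04", 2017, 50),
    ("a.com", "05", 2017, 50),
    ("a.com", "06", 2017, 50),
]

# Cross join of all month/year pairs with all domains, left joined to the data.
QUERY = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
order by d.domain, my.year, my.month
"""


def fill_all_combinations(rows):
    """Load rows into a throwaway SQLite table and return the filled result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table input (domain text, month text, year integer, total_orders integer)"
    )
    conn.executemany("insert into input values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


filled = fill_all_combinations(ROWS)
for row in filled:
    print(row)
```

On the sample data this returns 18 rows (6 months times 3 domains). That is every possible pair, so it also contains rows such as yahoo.com for month 01, which the expected output in the question does not list.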

Note: this assumes that every year/month combination occurs at least once somewhere in the data.
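
The sketch below illustrates that caveat with a smaller, made-up data set in which month 02 never appears. It compares taking the months from the data with taking them from a separate `calendar` table. The `calendar` table and the helper `run` are hypothetical names introduced only for this illustration, and SQLite again stands in for the real engine.

```python
import sqlite3

# Made-up data: month 02 does not occur for any domain.
rows = [("google.com", "01", 2017, 20), ("google.com", "03", 2017, 30)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table input (domain text, month text, year integer, total_orders integer)"
)
conn.executemany("insert into input values (?, ?, ?, ?)", rows)

# Hypothetical calendar table listing every month that should be reported.
conn.execute("create table calendar (month text, year integer)")
conn.executemany(
    "insert into calendar values (?, ?)", [(f"{m:02d}", 2017) for m in range(1, 4)]
)

TEMPLATE = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from ({months}) my cross join
     (select distinct domain from input) d left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
order by d.domain, my.month
"""


def run(months_sql):
    """Run the fill query with the given source of month/year pairs."""
    return conn.execute(TEMPLATE.format(months=months_sql)).fetchall()


from_data = run("select distinct month, year from input")
from_calendar = run("select month, year from calendar")

print(from_data)      # month 02 is missing entirely
print(from_calendar)  # month 02 is present with total_orders = 0
```

If gaps like this can occur in your data, a calendar source of this kind is one way to make sure the missing month still gets a zero row.
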
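Restricting the result to a range of months is painful because the date is split across multiple columns. Let me assume the years are all the same, as in your example: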

-- d: first and last month per domain; only months inside that range are kept
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= d.max_month left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain;
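
As a rough check of the range-limited query, the sketch below runs it on the question's sample data in SQLite (a stand-in engine, as an assumption), with the subquery grouped by domain and the table aliased as `input t`. It also tries one variant that is not part of the answer: using the overall last month as the upper bound instead of each domain's own last month. The helper `run` and the `{upper}` placeholder are hypothetical names for this illustration. The comparison relies on months being zero-padded strings.

```python
import sqlite3

# Sample rows from the question: (domain, month, year, total_orders).
ROWS = [
    ("google.com", "01", 2017, 20),
    ("yahoo.com", "02", 2017, 30),
    ("google.com", "03", 2017, 30),
    ("yahoo.com", "03", 2017, 40),
    ("a.com", "04", 2017, 50),
    ("a.com", "05", 2017, 50),
    ("a.com", "06", 2017, 50),
]

# {upper} is filled in with the expression that bounds the month range.
TEMPLATE = """
select my.year, my.month, d.domain, coalesce(t.total_orders, 0) as total_orders
from (select distinct month, year from input) my join
     (select domain, min(month) as min_month, max(month) as max_month
      from input
      group by domain
     ) d
     on my.month >= d.min_month and my.month <= {upper} left join
     input t
     on t.month = my.month and t.year = my.year and t.domain = d.domain
order by d.domain, my.month
"""


def run(upper):
    """Run the range-limited fill with the given upper-bound expression."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table input (domain text, month text, year integer, total_orders integer)"
    )
    conn.executemany("insert into input values (?, ?, ?, ?)", ROWS)
    result = conn.execute(TEMPLATE.format(upper=upper)).fetchall()
    conn.close()
    return result


# Each domain from its own first month to its own last month.
per_domain = run("d.max_month")
# Variant (assumption): each domain from its first month to the overall last month.
through_last = run("(select max(month) from input)")

print(len(per_domain), len(through_last))
```

With per-domain bounds the sample data gives 8 rows, and only google.com month 02 is filled with 0. The variant bounded by the overall last month gives 14 rows, which happens to be the same set of rows as the expected output in the question, so that may be the reading the asker had in mind.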
