SQL: replace column values based on a condition

uyto3xhc  posted on 2021-05-27  in Spark

Input:

item   loc   month    year    qty_name      qty_value
a       x     8        2020    chocolate      10
a       x     8        2020    gum            15
a       x     8        2020    maggi          11
a       x     8        2020    colgate        18
b       y     8        2020    chocolate      20
b       y     8        2020    gum            30
b       y     8        2020    maggi          40
b       y     8        2020    colgate        9
c       s     8        2020    gum            15
c       s     8        2020    maggi          11
c       s     8        2020    colgate        18

Expected output:

item   loc   month    year    qty_name      qty_value
a       x     8        2020    chocolate      10
a       x     8        2020    gum            15
a       x     8        2020    maggi          0
a       x     8        2020    colgate        0
b       y     8        2020    chocolate      20
b       y     8        2020    gum            30
b       y     8        2020    maggi          0
b       y     8        2020    colgate        0
c       s     8        2020    gum            15
c       s     8        2020    maggi          11
c       s     8        2020    colgate        18

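Explanation:
For each item, loc, month, year combination:
If chocolate > 0, every value except chocolate and gum becomes 0 (this happens for items a and b).
If chocolate is not present, the values stay as they are (this happens for item = c and loc = s).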
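The rule above can be sketched in plain Python before moving to any engine. This is only an illustration of the logic, not part of the original question; the names `Row`, `KEEP`, and `apply_rule` are made up for the example, and only a subset of the sample rows is used.

```python
from collections import namedtuple

# Hypothetical row type mirroring the columns in the question.
Row = namedtuple("Row", "item loc month year qty_name qty_value")

KEEP = {"chocolate", "gum"}  # names that are never zeroed


def apply_rule(rows):
    """Zero every non-chocolate, non-gum quantity in groups that have chocolate > 0."""
    # Pass 1: collect the (item, loc, month, year) keys that contain chocolate > 0.
    flagged = {
        (r.item, r.loc, r.month, r.year)
        for r in rows
        if r.qty_name == "chocolate" and r.qty_value > 0
    }
    # Pass 2: rewrite qty_value only for flagged groups and non-kept names.
    return [
        r._replace(qty_value=0)
        if (r.item, r.loc, r.month, r.year) in flagged and r.qty_name not in KEEP
        else r
        for r in rows
    ]


rows = [
    Row("a", "x", 8, 2020, "chocolate", 10),
    Row("a", "x", 8, 2020, "gum", 15),
    Row("a", "x", 8, 2020, "maggi", 11),
    Row("c", "s", 8, 2020, "gum", 15),
    Row("c", "s", 8, 2020, "maggi", 11),
]
result = apply_rule(rows)
for r in result:
    print(*r)
```

The two passes correspond to "find the groups that have chocolate" and "zero the other values in those groups", which is the same shape the answers take with a join or a window function.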

w1jd8yoj  #1

Here is the PySpark way.

import pyspark.sql.functions as f

df2 = df.filter("qty_name = 'chocolate' and qty_value > 0").select('item', 'loc', 'month', 'year').withColumn('marker', f.lit('Y'))

df.join(df2, ['item', 'loc', 'month', 'year'], 'left') \
  .withColumn('qty_value', f.when(f.expr("marker = 'Y' and qty_name not in ('chocolate', 'gum')"), 0).otherwise(f.col('qty_value'))) \
  .drop('marker').show(12, False)

+----+---+-----+----+---------+---------+
|item|loc|month|year|qty_name |qty_value|
+----+---+-----+----+---------+---------+
|a   |x  |8    |2020|chocolate|10       |
|a   |x  |8    |2020|gum      |15       |
|a   |x  |8    |2020|maggi    |0        |
|a   |x  |8    |2020|colgate  |0        |
|b   |y  |8    |2020|chocolate|20       |
|b   |y  |8    |2020|gum      |30       |
|b   |y  |8    |2020|maggi    |0        |
|b   |y  |8    |2020|colgate  |0        |
|c   |s  |8    |2020|gum      |15       |
|c   |s  |8    |2020|maggi    |11       |
|c   |s  |8    |2020|colgate  |18       |
+----+---+-----+----+---------+---------+

bakd9h0s  #2

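If you are using MySQL version 8 or later, you can use window functions. Here COUNT() OVER() counts the chocolate rows into an extra column, so every row carries that count alongside its own values. An outer query can then work on the result of that inner query.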

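SELECT ITEM,
       LOC,
       MONTH,
       YEAR,
       QTY_NAME,
       -- Zero the value when the group has a chocolate row with quantity > 0
       CASE
          WHEN QTY_NAME NOT IN ('chocolate', 'gum') AND CNT > 0 THEN 0
          ELSE QTY_VALUE
       END
          QTY_VALUE
  FROM (  SELECT ITEM,
                 LOC,
                 MONTH,
                 YEAR,
                 QTY_NAME,
                 QTY_VALUE,
                 -- Count qualifying chocolate rows per ITEM, LOC, MONTH, YEAR group
                 COUNT(CASE WHEN QTY_NAME = 'chocolate' AND QTY_VALUE > 0 THEN 1 ELSE NULL END)
                    OVER (PARTITION BY ITEM, LOC, MONTH, YEAR)
                    CNT
            FROM TEST_TABLE
        GROUP BY ITEM,
                 LOC,
                 MONTH,
                 YEAR,
                 QTY_NAME,
                 QTY_VALUE) T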
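
A window count only flags the right rows if it is computed per item, loc, month, year group; a count over the whole table would also zero the rows of item c. Below is a minimal sketch of the partitioned variant, run through Python's built-in sqlite3 module instead of MySQL. That substitution is an assumption made so the example is self-contained (SQLite 3.25 or later is needed for window functions), and the alias `flagged` and the reduced sample rows are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table "
    "(item TEXT, loc TEXT, month INT, year INT, qty_name TEXT, qty_value INT)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", "x", 8, 2020, "chocolate", 10),
        ("a", "x", 8, 2020, "gum", 15),
        ("a", "x", 8, 2020, "maggi", 11),
        ("c", "s", 8, 2020, "gum", 15),
        ("c", "s", 8, 2020, "maggi", 11),
    ],
)

query = """
SELECT item, loc, month, year, qty_name,
       CASE WHEN qty_name NOT IN ('chocolate', 'gum') AND cnt > 0 THEN 0
            ELSE qty_value
       END AS qty_value
  FROM (SELECT t.*,
               -- per-group count of chocolate rows with a positive quantity
               COUNT(CASE WHEN qty_name = 'chocolate' AND qty_value > 0 THEN 1 END)
                 OVER (PARTITION BY item, loc, month, year) AS cnt
          FROM test_table t) AS flagged
 ORDER BY item, qty_name
"""
window_rows = conn.execute(query).fetchall()
for row in window_rows:
    print(row)
```

Because the count is partitioned, group c (which has no chocolate row) gets a count of 0 and keeps its original values.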

x7yiwoj4  #3

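The solution below assumes there are no multiple 'chocolate' records within a given item, loc, month, year combination, just as in your sample data. Under that assumption there is no need to aggregate per combination.
Simply update the quantity to zero for every record that is not 'chocolate' or 'gum', where a record exists for the same combination with a 'chocolate' quantity greater than 0.
Sample data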

create table quantities
(
  item nvarchar(1),
  loc nvarchar(1),
  month int,
  year int,
  qty_name nvarchar(10),
  qty_value int
);

insert into quantities (item, loc, month, year, qty_name, qty_value) values
('a', 'x', 8, 2020, 'chocolate', 10),
('a', 'x', 8, 2020, 'gum'      , 15),
('a', 'x', 8, 2020, 'maggi'    , 11),
('a', 'x', 8, 2020, 'colgate'  , 18),
('b', 'y', 8, 2020, 'chocolate', 20),
('b', 'y', 8, 2020, 'gum'      , 30),
('b', 'y', 8, 2020, 'maggi'    , 40),
('b', 'y', 8, 2020, 'colgate'  , 9),
('c', 's', 8, 2020, 'gum'      , 15),
('c', 's', 8, 2020, 'maggi'    , 11),
('c', 's', 8, 2020, 'colgate'  , 18);

Solution

update quantities q
join quantities q2
  on  q2.item = q.item
  and q2.loc = q.loc
  and q2.month = q.month
  and q2.year = q.year
  and q2.qty_name = 'chocolate'
  and q2.qty_value > 0
set q.qty_value = 0
where q.qty_name not in ('chocolate', 'gum');

Result

select * from quantities;

item    loc month   year    qty_name    qty_value
------- --- ------- ------- ----------- ----------
a       x   8       2020    chocolate   10
a       x   8       2020    gum         15
a       x   8       2020    maggi       0
a       x   8       2020    colgate     0
b       y   8       2020    chocolate   20
b       y   8       2020    gum         30
b       y   8       2020    maggi       0
b       y   8       2020    colgate     0
c       s   8       2020    gum         15
c       s   8       2020    maggi       11
c       s   8       2020    colgate     18

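SQL Fiddle
Edit: this is a MySQL solution, because the question was previously tagged as such. I do not have an Apache Spark SQL engine at hand to verify this solution.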
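
Since the UPDATE ... JOIN form is MySQL syntax, the same "update where a matching chocolate row exists" idea can also be sketched with a correlated EXISTS subquery. The example below uses Python's built-in sqlite3 module purely as a stand-in engine; that choice, and the reduced sample rows, are assumptions for illustration and have not been checked against Spark SQL either.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quantities "
    "(item TEXT, loc TEXT, month INT, year INT, qty_name TEXT, qty_value INT)"
)
conn.executemany(
    "INSERT INTO quantities VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("b", "y", 8, 2020, "chocolate", 20),
        ("b", "y", 8, 2020, "gum", 30),
        ("b", "y", 8, 2020, "colgate", 9),
        ("c", "s", 8, 2020, "gum", 15),
        ("c", "s", 8, 2020, "colgate", 18),
    ],
)

# Zero every row that is not chocolate or gum, but only when the same
# item/loc/month/year combination has a chocolate row with quantity > 0.
conn.execute("""
UPDATE quantities
   SET qty_value = 0
 WHERE qty_name NOT IN ('chocolate', 'gum')
   AND EXISTS (SELECT 1
                 FROM quantities AS q2
                WHERE q2.item = quantities.item
                  AND q2.loc = quantities.loc
                  AND q2.month = quantities.month
                  AND q2.year = quantities.year
                  AND q2.qty_name = 'chocolate'
                  AND q2.qty_value > 0)
""")

updated = conn.execute(
    "SELECT item, qty_name, qty_value FROM quantities ORDER BY rowid"
).fetchall()
for row in updated:
    print(row)
```

EXISTS only asks whether at least one qualifying chocolate row is present, so this form does not depend on there being a single chocolate record per combination.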
