sql—在postgresql上从union all(不使用union)中删除重复项的最快方法?

i2byvkas  于 2021-07-26  发布在  Java
关注(0)|答案(3)|浏览(286)

我有一个包含数亿行的表,我想从同一个表的两个索引列(没有唯一的行id)中获得唯一值的单个列表。
为了说明这一点,假设我们有一个带有 fruits 列和a veggies 列,我想建立一个 healthy_foods 在两列中列出唯一的值。
我尝试了以下查询:
有工会

WITH cte as (
    SELECT fruit, veggie
    FROM recipes
)
SELECT fruit as healthy_food
         FROM cte
         UNION --  <--- 
         SELECT veggie as healthy_food
         FROM cte;

所有的结合都是不同的

WITH cte as (...)
SELECT DISTINCT ON (healthy_food) healthy_food FROM  --  <--- 
(SELECT fruit as healthy_food
     FROM cte
     UNION ALL --  <--- 
     SELECT veggie as healthy_food
     FROM cte) tb;

与联合所有然后分组

WITH cte as (...)
SELECT fruit as healthy_food
         FROM cte
         UNION ALL --  <--- 
         SELECT veggie as healthy_food
         FROM cte
GROUP BY healthy_food; --  <---

(和添加 HAVING COUNT(*) = 1 以及 GROUP BY (从联合体中选择)
工会都得到执行超快,但所有的重复删除组合我尝试了+15分钟。
考虑到这两个字段/列来自同一个表并被索引,我如何优化这个查询?
(或者,什么是跟踪所有独特价值的最便宜方法?可能是一个触发器插入一个唯一的表,或者一个视图?)

wxclj1h5

wxclj1h51#

如果水果和/或蔬菜之间有许多重复项,但水果和蔬菜之间没有那么多重复项(如您的示例中的名称所示),并且由于您对这两种水果和/或蔬菜都有索引,模拟索引跳过扫描(又称松散索引扫描)将产生奇迹:

WITH RECURSIVE fruit AS (
   (
   SELECT fruit
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT fruit
           FROM   recipes
           WHERE  fruit > t.fruit
           ORDER  BY 1
           LIMIT  1)
   FROM   fruit t
   WHERE  t.fruit IS NOT NULL
   )
 , veggie AS (
   (
   SELECT veggie
   FROM   recipes
   ORDER  BY 1
   LIMIT  1
   )
   UNION ALL
   SELECT (SELECT veggie
           FROM   recipes
           WHERE  veggie > t.veggie
           ORDER  BY 1
           LIMIT  1)
   FROM   veggie t
   WHERE  t.veggie IS NOT NULL
   )
SELECT DISTINCT healthy_food
FROM  (
   SELECT fruit AS healthy_food FROM fruit
   UNION  ALL
   SELECT veggie AS healthy_food FROM veggie
   ) sub
WHERE  healthy_food IS NOT NULL;

只是 DISTINCT 而不是 DISTINCT ON (就像你试过的)在外面 SELECT ,因为我们处理的是单个列。
请参见:
select distinct比postgresql中我的表上预期的慢
按查询优化分组以检索每个用户的最新行
你不妨用 UNION 而不是 UNION ALL + DISTINCT 在外面 SELECT . 只是因为你明确要求才避免这样做。但我不明白重点。

q3aa0525

q3aa05252#

我建议:

select distinct r.fruit
from recipe r
union all
select distinct r.veggie
from recipe r
where not exists (select 1 from recipe r2 where r2.fruit = r.veggie);

然后,对于此查询,您需要索引:
recipe(fruit) recipe(veggie) 如果您的统计数据是最新的,并且您使用的是postgres的最新版本,那么 select distinct 应该使用索引。然后是 not exists 也应该用。查询不应有任何额外的重复消除。

x7yiwoj4

x7yiwoj43#

如果你有单独的索引 fruit 以及 veggie 这个查询应该相当快

SELECT fruit AS healthy FROM recipes
 UNION 
SELECT veggie AS healthy FROM recipes;

但这两列上需要单独的索引,以便联合的每一侧都有一个有用的索引。
您可以为此应用程序使用物化视图。

CREATE OR REPLACE MATERIALIZED VIEW healthy_food AS 
SELECT fruit AS healthy FROM recipes
 UNION 
SELECT veggie AS healthy FROM recipes;
CREATE INDEX healthy ON healthy_food (healthy);
...

SELECT healthy FROM healthy_food ORDER BY healthy;

从视图中选择将很快。但你必须像这样定期更新视图。刷新操作将导致基本查询的开销。

REFRESH MATERIALIZED VIEW healthy_food;

如果您的应用程序可以容忍它在刷新之间过期,那么这是一个很好的方法。如果基本查询仍然太贵,您可以每天零点三十分刷新一次。

相关问题