How can I make Postgres avoid a double sequential scan for a seek pagination query?

Asked by fwzugrvs on 2021-07-26

Schema
I store a bunch of posts in one table (feed_items).
A second table records which user_id liked/disliked which feed_item_id (feed_item_likes_dislikes).
Another table records which user_id loved/was angered by which feed_item_id (feed_item_love_anger).
A fourth table records which feed_item_id carries which tags, where tags is a varchar array (feed_item_tags).
The total number of likes/dislikes for each post is stored in a materialized view (feed_item_likes_dislikes_aggregate).
The total number of love/anger reactions is stored in another materialized view (feed_item_love_anger_aggregate).
Likes/dislikes and love/anger are stored separately because a post can be simultaneously liked/disliked and loved/angered (an unfortunate business requirement).
feed_items has two columns of type tsvector, title_vector and summary_vector, which support finding posts by keyword (Postgres full-text search).
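For reference, a minimal sketch of the schema described above (table and column names come from the question and the query plans; column types are assumptions):

CREATE TABLE feed_items (
  feed_item_id   uuid PRIMARY KEY,
  pubdate        timestamptz NOT NULL,
  link           text,
  title          text,
  summary        text,
  author         text,
  feed_id        integer,
  title_vector   tsvector,
  summary_vector tsvector
);

CREATE TABLE feed_item_tags (
  feed_item_id uuid REFERENCES feed_items (feed_item_id),
  tags         varchar[]
);

-- Per-user vote tables, rolled up into materialized views:
--   feed_item_likes_dislikes  -> feed_item_likes_dislikes_aggregate(feed_item_id, likes, dislikes)
--   feed_item_love_anger      -> feed_item_love_anger_aggregate(feed_item_id, love, anger)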
Problem
I want to fetch all posts ordered by pubdate and feed_item_id, both descending.
Some posts are published at the same time, so I want to paginate with the seek (keyset) pagination method described here, using (pubdate, feed_item_id) < (value1, value2).
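The query plans below reference an index named idx_feed_items_pubdate_feed_item_id_desc; a composite index matching that ORDER BY would look something like this (the definition is inferred from the index name, so treat it as an assumption):

CREATE INDEX idx_feed_items_pubdate_feed_item_id_desc
  ON feed_items (pubdate DESC, feed_item_id DESC);

The row comparison (pubdate, feed_item_id) < (value1, value2) in the page n query can be matched directly against this index, which is what makes seek pagination cheap.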
My page 1 query
Find posts with more than 0 likes whose title or summary contains the word "scam":

SELECT
  fi.feed_item_id,
  pubdate,
  link,
  title,
  summary,
  author,
  feed_id,
  likes,
  dislikes,
  love,
  anger,
  tags 
FROM
  feed_items fi 
  LEFT JOIN
    feed_item_tags t 
    ON fi.feed_item_id = t.feed_item_id 
  LEFT JOIN
    feed_item_love_anger_aggregate bba 
    ON fi.feed_item_id = bba.feed_item_id 
  LEFT JOIN
    feed_item_likes_dislikes_aggregate lda 
    ON fi.feed_item_id = lda.feed_item_id 
WHERE
  (
    title_vector @@ to_tsquery('scam') 
    OR summary_vector @@ to_tsquery('scam')
  )
  AND 'for' = ANY(tags) 
  AND likes > 0 
ORDER BY
  pubdate DESC,
  feed_item_id DESC LIMIT 3;

EXPLAIN ANALYZE for page 1

Limit  (cost=2.83..16.88 rows=3 width=233) (actual time=0.075..0.158 rows=3 loops=1)
   ->  Nested Loop Left Join  (cost=2.83..124.53 rows=26 width=233) (actual time=0.074..0.157 rows=3 loops=1)
         ->  Nested Loop  (cost=2.69..116.00 rows=26 width=217) (actual time=0.067..0.146 rows=3 loops=1)
               Join Filter: (t.feed_item_id = fi.feed_item_id)
               Rows Removed by Join Filter: 73
               ->  Index Scan using idx_feed_items_pubdate_feed_item_id_desc on feed_items fi  (cost=0.14..68.77 rows=76 width=62) (actual time=0.016..0.023 rows=3 loops=1)
                     Filter: ((title_vector @@ to_tsquery('scam'::text)) OR (summary_vector @@ to_tsquery('scam'::text)))
                     Rows Removed by Filter: 1
               ->  Materialize  (cost=2.55..8.56 rows=34 width=187) (actual time=0.016..0.037 rows=25 loops=3)
                     ->  Hash Join  (cost=2.55..8.39 rows=34 width=187) (actual time=0.044..0.091 rows=36 loops=1)
                           Hash Cond: (t.feed_item_id = lda.feed_item_id)
                           ->  Seq Scan on feed_item_tags t  (cost=0.00..5.25 rows=67 width=155) (actual time=0.009..0.043 rows=67 loops=1)
                                 Filter: ('for'::text = ANY ((tags)::text[]))
                                 Rows Removed by Filter: 33
                           ->  Hash  (cost=1.93..1.93 rows=50 width=32) (actual time=0.029..0.029 rows=50 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 12kB
                                 ->  Seq Scan on feed_item_likes_dislikes_aggregate lda  (cost=0.00..1.93 rows=50 width=32) (actual time=0.004..0.013 rows=50 loops=1)
                                       Filter: (likes > 0)
                                       Rows Removed by Filter: 24
         ->  Index Scan using idx_feed_item_love_anger_aggregate on feed_item_love_anger_aggregate bba  (cost=0.14..0.32 rows=1 width=32) (actual time=0.002..0.003 rows=0 loops=3)
               Index Cond: (feed_item_id = fi.feed_item_id)
 Planning Time: 0.601 ms
 Execution Time: 0.195 ms
(23 rows)

Despite what I believe are appropriate indexes on all the tables, it still performs 2 sequential scans.
My page n query
Take the pubdate and feed_item_id of the third result from the query above, then load the next 3 results:

SELECT
  fi.feed_item_id,
  pubdate,
  link,
  title,
  summary,
  author,
  feed_id,
  likes,
  dislikes,
  love,
  anger,
  tags 
FROM
  feed_items fi 
  LEFT JOIN
    feed_item_tags t 
    ON fi.feed_item_id = t.feed_item_id 
  LEFT JOIN
    feed_item_love_anger_aggregate bba 
    ON fi.feed_item_id = bba.feed_item_id 
  LEFT JOIN
    feed_item_likes_dislikes_aggregate lda 
    ON fi.feed_item_id = lda.feed_item_id 
WHERE
  (
    pubdate,
    fi.feed_item_id
  )
  < ('2020-06-19 19:50:00+05:30', 'bc5c8dfe-13a9-d97a-a328-0e5b8990c500') 
  AND 
  (
    title_vector @@ to_tsquery('scam') 
    OR summary_vector @@ to_tsquery('scam')
  )
  AND 'for' = ANY(tags) 
  AND likes > 0 
ORDER BY
  pubdate DESC,
  feed_item_id DESC LIMIT 3;

EXPLAIN ANALYZE for the page n query; it also performs 2 sequential scans:

Limit  (cost=2.83..17.13 rows=3 width=233) (actual time=0.082..0.199 rows=3 loops=1)
   ->  Nested Loop Left Join  (cost=2.83..121.97 rows=25 width=233) (actual time=0.081..0.198 rows=3 loops=1)
         ->  Nested Loop  (cost=2.69..113.67 rows=25 width=217) (actual time=0.073..0.185 rows=3 loops=1)
               Join Filter: (t.feed_item_id = fi.feed_item_id)
               Rows Removed by Join Filter: 183
               ->  Index Scan using idx_feed_items_pubdate_feed_item_id_desc on feed_items fi  (cost=0.14..67.45 rows=74 width=62) (actual time=0.014..0.034 rows=6 loops=1)
                     Index Cond: (ROW(pubdate, feed_item_id) < ROW('2020-06-19 19:50:00+05:30'::timestamp with time zone, 'bc5c8dfe-13a9-d97a-a328-0e5b8990c500'::uuid))
                     Filter: ((title_vector @@ to_tsquery('scam'::text)) OR (summary_vector @@ to_tsquery('scam'::text)))
                     Rows Removed by Filter: 2
               ->  Materialize  (cost=2.55..8.56 rows=34 width=187) (actual time=0.009..0.022 rows=31 loops=6)
                     ->  Hash Join  (cost=2.55..8.39 rows=34 width=187) (actual time=0.050..0.098 rows=36 loops=1)
                           Hash Cond: (t.feed_item_id = lda.feed_item_id)
                           ->  Seq Scan on feed_item_tags t  (cost=0.00..5.25 rows=67 width=155) (actual time=0.009..0.044 rows=67 loops=1)
                                 Filter: ('for'::text = ANY ((tags)::text[]))
                                 Rows Removed by Filter: 33
                           ->  Hash  (cost=1.93..1.93 rows=50 width=32) (actual time=0.028..0.029 rows=50 loops=1)
                                 Buckets: 1024  Batches: 1  Memory Usage: 12kB
                                 ->  Seq Scan on feed_item_likes_dislikes_aggregate lda  (cost=0.00..1.93 rows=50 width=32) (actual time=0.005..0.014 rows=50 loops=1)
                                       Filter: (likes > 0)
                                       Rows Removed by Filter: 24
         ->  Index Scan using idx_feed_item_love_anger_aggregate on feed_item_love_anger_aggregate bba  (cost=0.14..0.32 rows=1 width=32) (actual time=0.003..0.003 rows=1 loops=3)
               Index Cond: (feed_item_id = fi.feed_item_id)
 Planning Time: 0.596 ms
 Execution Time: 0.236 ms
(24 rows)

Link to fiddle
I have put together a fiddle with the required tables and indexes. Can someone tell me how to fix these queries so that they use index scans wherever possible, or at least cut the number of sequential scans down to 1?

utugiqy6 (Answer 1)

Other than the GIN index, you currently have no index at all on the tags table. In your fiddle, if I create index on feed_item_tags (feed_item_id) and then run ANALYZE, both sequential scans go away (see the sketch below). Doing that is probably better than reformatting the query so it can use the GIN index, as in my other answer, because it makes better use of the chance to stop early once the LIMIT is satisfied.
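A minimal sketch of that fix, using the table names from the question:

-- Index the join column so the planner can fetch matching tag rows by
-- index instead of seq-scanning and hashing the whole table:
CREATE INDEX ON feed_item_tags (feed_item_id);

-- Refresh planner statistics so the new index gets realistic cost estimates:
ANALYZE;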
But what is the point of the feed_item_tags table in the first place? If you are going to have a child table listing tags, you would normally store one tag/parent_id combination per row. And if you want an array of tags rather than one tag per row, why not just put the array on the parent table? There are occasionally reasons to keep two tables in a 1:1 relationship with each other, but it is not very common. Either conventional design would look roughly like the sketch below.
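Both alternatives as a sketch (hypothetical DDL, not the asker's actual schema):

-- Option 1: one row per (feed_item_id, tag) pair in the child table
CREATE TABLE feed_item_tags (
  feed_item_id uuid NOT NULL REFERENCES feed_items (feed_item_id),
  tag          varchar NOT NULL,
  PRIMARY KEY (feed_item_id, tag)
);

-- Option 2: drop the child table and keep the tag array on the parent
ALTER TABLE feed_items ADD COLUMN tags varchar[];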

hrysbysz (Answer 2)

The construct 'for' = ANY(tags) cannot use the GIN index. To use the index, you need to rewrite the condition as '{for}' <@ tags.
However, the planner will still choose not to use the index, because the table is so small and the condition is not selective enough. If you want to force the index to be used, just to prove that it is capable of being used, you can first set enable_seqscan=off.
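A sketch of the rewrite and the demonstration, assuming a GIN index on the tags column already exists:

-- GIN-indexable form: "the array containing just 'for' is contained by tags"
SELECT feed_item_id
FROM feed_item_tags
WHERE '{for}' <@ tags;

-- For demonstration only: forbid sequential scans in this session so the
-- planner has to pick the index even on a tiny table. Not for production.
SET enable_seqscan = off;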
