How to boost the score when there is an exact match in a list on Elasticsearch?

uqdfh47h posted 7 months ago in ElasticSearch

I'm new to Elasticsearch and have the following problem.
I have these two records:

POST test/_doc/1
    {
      "id": 1,
      "authors": [
        {
          "name": "Test Name",
          "url": "/url/1/"
        }
      ]
    }

    POST test/_doc/2
    {
      "id": 2,
      "authors": [
        {
          "name": "Test Name",
          "url": "/url/1/"
        },
        {
          "name": "Another author",
          "url": "/url/another/"
        }
      ]
    }

And this query:

GET test/_search
    {
      "query": {
        "function_score": {
          "query": {
            "bool": {
              "should": [
                {
                  "match_phrase": {
                    "authors.name": {
                      "_name": "exact match in authors",
                      "query": "Test Name",
                      "boost": 100,
                      "slop": 1
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }


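Why is the score lower when there are multiple authors? How can I make it higher than, or at least equal to, the score of the record with only one author?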

{
      ...
      "hits": {
        "max_score": 42.221836,
        "hits": [
          {
            "_score": 42.221836,
            "_source": {
              "id": 1,
              "authors": [
                {
                  "name": "Test Name",
                  "url": "/url/1/"
                }
              ]
            },
            "matched_queries": [
              "exact match in authors"
            ]
          },
          {
            "_score": 32.088596,
            "_source": {
              "id": 2,
              "authors": [
                {
                  "name": "Test Name",
                  "url": "/url/1/"
                },
                {
                  "name": "Another author",
                  "url": "/url/another/"
                }
              ]
            },
            "matched_queries": [
              "exact match in authors"
            ]
          }
        ]
      }
    }


I couldn't find anything about this in the documentation.

kqqjbcuj #1

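TL;DR

This is because the authors.name field of your second document is longer. You may want to look at:

  • constant score
  • filter
  • function score

To understand

What does it mean?
When Elasticsearch indexes an array of objects, it flattens them and stores them like this:
Initially: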

{
  "authors": [
    {
      "name": "A0"
    },
    {
      "name": "A1"
    }
  ]
}

To:

{
  "authors.name": ["A0", "A1"]
}


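The document score is computed with TF/IDF (BM25, as the explain output further down shows), and the TF component depends on the length of the field.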

  • Doc 1: authors.name has a length of 2
  • Doc 2: authors.name has a length of 4
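To make the flattening and the two lengths concrete, here is a minimal Python sketch. It is only an illustration: flatten_objects and field_length are hypothetical helpers, not Elasticsearch APIs, and splitting on whitespace is an assumption that stands in for the analyzer (it happens to give the same token counts for these two documents).

```python
# Illustrative only: mimics how an array of objects ends up as one
# multi-valued field, and how the field length grows with each author.
# flatten_objects and field_length are hypothetical helper names.

def flatten_objects(doc, array_field):
    """Collapse a list of objects into "<array_field>.<key>" -> list of values."""
    flat = {}
    for obj in doc[array_field]:
        for key, value in obj.items():
            flat.setdefault(f"{array_field}.{key}", []).append(value)
    return flat


def field_length(values):
    """Total number of whitespace-separated tokens across all values (assumption)."""
    return sum(len(value.split()) for value in values)


doc1 = {"authors": [{"name": "Test Name", "url": "/url/1/"}]}
doc2 = {"authors": [{"name": "Test Name", "url": "/url/1/"},
                    {"name": "Another author", "url": "/url/another/"}]}

names1 = flatten_objects(doc1, "authors")["authors.name"]
names2 = flatten_objects(doc2, "authors")["authors.name"]

len1 = field_length(names1)  # tokens: test, name
len2 = field_length(names2)  # tokens: test, name, another, author
print(names2, len1, len2)
```

Every extra author adds tokens to the same authors.name field, so the matching phrase makes up a smaller share of a longer field.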

Investigate:

You can use the _explain API:

GET 77469343/_explain/1
{
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": {
            "authors.name": {
              "_name": "exact match in authors",
              "query": "Test Name",
              "boost": 100,
              "slop": 1
            }
          }
        }
      ]
    }
  }
}


This gives you the following results:

Document 1

{
  "_index": "77469343",
  "_id": "1",
  "matched": true,
  "explanation": {
    "value": 42.221836,
    "description": """weight(authors.name:"test name"~1 in 0) [PerFieldSimilarity], result of:""",
    "details": [
      {
        "value": 42.221836,
        "description": "score(freq=1.0), computed as boost * idf * tf from:",
        "details": [
          {
            "value": 220,
            "description": "boost",
            "details": []
          },
          {
            "value": 0.36464313,
            "description": "idf, sum of:",
            "details": [
              {
                "value": 0.18232156,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 2,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.18232156,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 2,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              }
            ]
          },
          {
            "value": 0.5263158,
            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details": [
              {
                "value": 1,
                "description": "phraseFreq=1.0",
                "details": []
              },
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              {
                "value": 2,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 3,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}

Document 2

{
  "_index": "77469343",
  "_id": "2",
  "matched": true,
  "explanation": {
    "value": 32.088596,
    "description": """weight(authors.name:"test name"~1 in 1) [PerFieldSimilarity], result of:""",
    "details": [
      {
        "value": 32.088596,
        "description": "score(freq=1.0), computed as boost * idf * tf from:",
        "details": [
          {
            "value": 220,
            "description": "boost",
            "details": []
          },
          {
            "value": 0.36464313,
            "description": "idf, sum of:",
            "details": [
              {
                "value": 0.18232156,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 2,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              },
              {
                "value": 0.18232156,
                "description": "idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:",
                "details": [
                  {
                    "value": 2,
                    "description": "n, number of documents containing term",
                    "details": []
                  },
                  {
                    "value": 2,
                    "description": "N, total number of documents with field",
                    "details": []
                  }
                ]
              }
            ]
          },
          {
            "value": 0.40000004,
            "description": "tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:",
            "details": [
              {
                "value": 1,
                "description": "phraseFreq=1.0",
                "details": []
              },
              {
                "value": 1.2,
                "description": "k1, term saturation parameter",
                "details": []
              },
              {
                "value": 0.75,
                "description": "b, length normalization parameter",
                "details": []
              },
              {
                "value": 4,
                "description": "dl, length of field",
                "details": []
              },
              {
                "value": 3,
                "description": "avgdl, average length of field",
                "details": []
              }
            ]
          }
        ]
      }
    ]
  }
}
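The two explain outputs differ only in dl (2 versus 4), and that alone accounts for the score gap. The sketch below plugs the reported values into the formulas printed in the explanation. The function names are mine, and the boost of 220 is copied from the explain output rather than derived (it equals the query boost of 100 multiplied by k1 + 1 = 2.2).

```python
import math

# Formulas copied from the "description" strings in the _explain output.
# bm25_tf and bm25_idf are hypothetical helper names for this sketch.

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


def bm25_idf(n, N):
    """idf = log(1 + (N - n + 0.5) / (n + 0.5))"""
    return math.log(1 + (N - n + 0.5) / (n + 0.5))


BOOST = 220.0                      # "boost" value reported by _explain
idf = 2 * bm25_idf(n=2, N=2)       # summed over the two phrase terms

tf_doc1 = bm25_tf(freq=1.0, dl=2, avgdl=3)   # shorter field
tf_doc2 = bm25_tf(freq=1.0, dl=4, avgdl=3)   # longer field

score_doc1 = BOOST * idf * tf_doc1
score_doc2 = BOOST * idf * tf_doc2
print(score_doc1, score_doc2)
```

Up to floating-point rounding, the results agree with the 42.221836 and 32.088596 reported above, so the length normalization term (b * dl / avgdl) is what pulls document 2 down.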

Fix

Constant score

If you still want a score, you may want to look at the constant_score query:

GET 77469343/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "constant_score": {
            "filter": {
              "match_phrase": {
                "authors.name": {
                  "_name": "exact match in authors",
                  "query": "Test Name",
                  "boost": 100,
                  "slop": 1
                }
              }
            },
            "boost": 1.2
          }
        }
      ]
    }
  }
}
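One thing to keep in mind with the query above: as far as I understand, the clause inside filter runs in filter context, so its inner boost of 100 does not affect the score, and every matching document simply gets the outer boost (1.2 here). If you want the exact match to be worth 100, put that value on constant_score itself. A minimal sketch that builds such a request body as a plain Python dict follows; constant_score_phrase is a hypothetical helper, and sending the body to the cluster is left out.

```python
import json

# Hypothetical helper: builds a constant_score request body where the
# score of every matching document is the outer "boost" value.

def constant_score_phrase(field, phrase, score, slop=1, name=None):
    phrase_clause = {"query": phrase, "slop": slop}
    if name is not None:
        phrase_clause["_name"] = name  # shows up in matched_queries
    return {
        "query": {
            "constant_score": {
                "filter": {"match_phrase": {field: phrase_clause}},
                "boost": score,  # the score assigned to each match
            }
        }
    }


body = constant_score_phrase(
    "authors.name", "Test Name", score=100, name="exact match in authors"
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted after GET 77469343/_search in the console. Both documents should then come back with the same score, regardless of how many authors they have.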

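Filter instead of should?

If you use filter, the clause still decides which documents match, but it does not contribute to the score: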

GET 77469343/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "match_phrase": {
            "authors.name": {
              "_name": "exact match in authors",
              "query": "Test Name",
              "boost": 100,
              "slop": 1
            }
          }
        }
      ]
    }
  }
}

o4tp2gmn #2

I tried @paulo's solution, but it didn't quite work for me, so I ended up adding a nested field:

"authors": {
        "type": "nested",
        "properties": {
          "name": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          },
          "url": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword",
                "ignore_above": 256
              }
            }
          }
        }
      },

And used this query:

{
    "nested": {
        "path": "authors",
        "_name": "exact match in authors",
        "query": {
            "bool": {
                "must": {
                    "match_phrase": {
                        "authors.name": {
                            "query": "Test Name",
                            "boost": 100,
                            "slop": 1
                        }
                    }
                }
            }
        }
    }
}
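A rough way to see why the nested mapping evens out the scores: each nested author is indexed as its own hidden document, so the length that enters the tf formula is the length of one author's name, not of all names combined. The sketch below reuses the tf formula from the explain output of the first answer. It assumes whitespace tokenization and a single shard holding only these two records, so the real statistics on a live index may differ.

```python
# Sketch under stated assumptions: one hidden document per nested author,
# whitespace tokens, and field statistics taken over these three authors only.

def bm25_tf(freq, dl, avgdl, k1=1.2, b=0.75):
    """tf = freq / (freq + k1 * (1 - b + b * dl / avgdl))"""
    return freq / (freq + k1 * (1 - b + b * dl / avgdl))


authors_by_doc = {
    1: ["Test Name"],
    2: ["Test Name", "Another author"],
}
phrase = "Test Name"

# Field length is now measured per nested author.
name_lengths = [len(name.split())
                for names in authors_by_doc.values()
                for name in names]
avgdl = sum(name_lengths) / len(name_lengths)

tf_by_doc = {}
for doc_id, names in authors_by_doc.items():
    for name in names:
        if name == phrase:
            # Only the matching author's own length counts.
            tf_by_doc[doc_id] = bm25_tf(1.0, len(name.split()), avgdl)

print(avgdl, tf_by_doc)
```

Under these assumptions both parent documents get the same tf for their matching author, which is consistent with the scores no longer dropping when a record has more authors.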


Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
After these changes, it works great!
