Lucene: how to get the sum of similar-tag scores for a text in Elasticsearch

Posted by syqv5f0l on 2022-11-07 in Lucene

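I am trying to use Elasticsearch (version 6.8) to find the tags most similar to a piece of text. Instead of the score produced by Elasticsearch's default formula, I would like to get the sum of the similarity scores of the matching tags.
For example, I created my_test_index and inserted three documents: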

POST my_test_index/_doc/17
{
  "id": 17,
  "tags": ["devops", "server", "hardware"]
}

POST my_test_index/_doc/20
{
  "id": 20,
  "tags": ["software", "application", "developer", "develop"]
}

POST my_test_index/_doc/21
{
  "id": 21,
  "tags": ["electronic", "electric"]
}

I did not define a mapping, so the default one was generated:

{
  "my_test_index" : {
    "aliases" : { },
    "mappings" : {
      "_doc" : {
        "properties" : {
          "id" : {
            "type" : "long"
          },
          "tags" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    },
    "settings" : {
      "index" : {
        "creation_date" : "1585820383702",
        "number_of_shards" : "5",
        "number_of_replicas" : "1",
        "uuid" : "05SgLog6S-GTSShTatrvQw",
        "version" : {
          "created" : "6080199"
        },
        "provided_name" : "my_test_index"
      }
    }
  }
}

Then I ran the following query:

GET my_test_index/_search
{
  "query": {
    "more_like_this": {
      "fields": [
        "tags"
      ],
      "like": [
        "i like electric devices and develop some softwares."
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

and got this response:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "21",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 21,
          "tags" : [
            "electronic",
            "electric"
          ]
        }
      },
      {
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 20,
          "tags" : [
            "software",
            "application",
            "developer",
            "develop"
          ]
        }
      }
    ]
  }
}

With explain: true set, the result is:

{
  "took" : 3,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_shard" : "[my_test_index][1]",
        "_node" : "maQL1REnQHaff51ekrqMxA",
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "21",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 21,
          "tags" : [
            "electronic",
            "electric"
          ]
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "weight(tags:electric in 0) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "docFreq",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "docCount",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 1.0,
                  "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "termFreq=1.0",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "parameter k1",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "parameter b",
                      "details" : [ ]
                    },
                    {
                      "value" : 2.0,
                      "description" : "avgFieldLength",
                      "details" : [ ]
                    },
                    {
                      "value" : 2.0,
                      "description" : "fieldLength",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      },
      {
        "_shard" : "[my_test_index][2]",
        "_node" : "maQL1REnQHaff51ekrqMxA",
        "_index" : "my_test_index",
        "_type" : "_doc",
        "_id" : "20",
        "_score" : 0.2876821,
        "_source" : {
          "id" : 20,
          "tags" : [
            "software",
            "application",
            "developer",
            "develop"
          ]
        },
        "_explanation" : {
          "value" : 0.2876821,
          "description" : "weight(tags:develop in 0) [PerFieldSimilarity], result of:",
          "details" : [
            {
              "value" : 0.2876821,
              "description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
              "details" : [
                {
                  "value" : 0.2876821,
                  "description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "docFreq",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.0,
                      "description" : "docCount",
                      "details" : [ ]
                    }
                  ]
                },
                {
                  "value" : 1.0,
                  "description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
                  "details" : [
                    {
                      "value" : 1.0,
                      "description" : "termFreq=1.0",
                      "details" : [ ]
                    },
                    {
                      "value" : 1.2,
                      "description" : "parameter k1",
                      "details" : [ ]
                    },
                    {
                      "value" : 0.75,
                      "description" : "parameter b",
                      "details" : [ ]
                    },
                    {
                      "value" : 4.0,
                      "description" : "avgFieldLength",
                      "details" : [ ]
                    },
                    {
                      "value" : 4.0,
                      "description" : "fieldLength",
                      "details" : [ ]
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    ]
  }
}
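As a cross-check, the arithmetic in the explain output above can be reproduced by hand. The sketch below plugs the reported values (docFreq, docCount, k1, b, fieldLength, avgFieldLength) into the two formulas quoted in the `description` fields. The function names are mine, not part of Elasticsearch. Note that docCount is 1 for both hits and the hits come from different shards ([1] and [2]), which suggests the statistics here are computed per shard.

```python
import math

def bm25_idf(doc_count, doc_freq):
    # idf formula as quoted in the explain output
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf_norm(freq, field_length, avg_field_length, k1=1.2, b=0.75):
    # tfNorm formula as quoted in the explain output
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )

# Values reported for _id 21 (term "electric"): docCount=1, docFreq=1,
# fieldLength=2, avgFieldLength=2
score_21 = bm25_idf(1, 1) * bm25_tf_norm(1.0, 2.0, 2.0)

# Values reported for _id 20 (term "develop"): docCount=1, docFreq=1,
# fieldLength=4, avgFieldLength=4
score_20 = bm25_idf(1, 1) * bm25_tf_norm(1.0, 4.0, 4.0)

print(score_21, score_20)  # both approximately 0.2876821
```

Because each document is the only one counted in its statistics, fieldLength equals avgFieldLength, tfNorm is 1.0, and both documents end up with the same idf-only score.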

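However, this is not the result I want. I would like the score to be the sum of the similarities of the matching tags, like this: the word "electric" appears in the text, so the "electric" tag has a similarity of 1.0 and the "electronic" tag has a similarity of 0.7; the word "develop" appears in the text, so the "develop" tag has a similarity of 1.0 and the "developer" tag gets about 0.8; the "software" tag gets about 0.9; and so on.
So I expect this result: total score for _id:20 = ~2.7, for _id:21 = ~1.7, and so on.
I would appreciate it if someone could give an example of how to do this, or at least point me in the right direction.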
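To make the requested scoring concrete, here is a minimal client-side sketch of "sum of per-tag similarities". It is not an Elasticsearch feature and not the asker's actual similarity measure: `difflib.SequenceMatcher` is a stand-in string similarity, and the `tag_score_sum` helper and the 0.8 threshold are assumptions chosen for illustration, so the numbers differ from the ~0.7 / ~0.8 / ~0.9 values quoted above.

```python
import re
from difflib import SequenceMatcher

def similarity(a, b):
    # Stand-in similarity in [0, 1]; the question does not say which
    # measure is wanted, so this choice is an assumption.
    return SequenceMatcher(None, a, b).ratio()

def tag_score_sum(text, tags, threshold=0.8):
    # For each tag, take its best similarity against any word of the text
    # and add it to the total if it reaches the (hypothetical) threshold.
    words = re.findall(r"[a-z0-9]+", text.lower())
    total = 0.0
    for tag in tags:
        best = max((similarity(word, tag) for word in words), default=0.0)
        if best >= threshold:
            total += best
    return total

docs = {
    17: ["devops", "server", "hardware"],
    20: ["software", "application", "developer", "develop"],
    21: ["electronic", "electric"],
}
text = "i like electric devices and develop some softwares."

scores = {doc_id: tag_score_sum(text, tags) for doc_id, tags in docs.items()}
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(doc_id, round(score, 3))
```

With this stand-in measure, document 20 ranks first and document 21 second, which matches the ordering asked for, though the absolute values depend entirely on the similarity function chosen.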

Thanks.

Answer #1, by rvpgvaaj

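I think the problem is that the tags field is not mapped as a text field in your mapping, which causes ids 20 and 21 to get the same score. I defined it as text in my mapping, and id 21 gets the higher score, as expected.
Here is my solution.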

Index definition (note the text type on the tags field)

{
    "mappings": {
        "properties": {
            "id": {
                "type": "integer"
            },
            "tags": {
                "type": "text"
            }
        }
    }
}

I indexed the sample documents you provided and used the same search query.

Search query

{
  "query": {
    "more_like_this": {
      "fields": [
        "tags"
      ],
      "like": [
        "i like electric devices and develop some softwares."
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}

Search result (note the _score values)

"hits": [
    {
        "_index": "so_array",
        "_type": "_doc",
        "_id": "3",
        "_score": 1.135697,
        "_source": {
            "id": 21,
            "tags": [
                "electronic",
                "electric"
            ]
        }
    },
    {
        "_index": "so_array",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.86312973,
        "_source": {
            "id": 20,
            "tags": [
                "software",
                "application",
                "developer",
                "develop"
            ]
        }
    }
]
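For readers who want to see where these two scores could come from, the sketch below applies the BM25 formula quoted in the question's explain output, but with statistics taken over all three sample documents at once (docCount = 3, average tags length = 3). This is my reading, not something the answer states. The assumptions: each tag is a single token, "electric" matches only id 21, "develop" matches only id 20, and the function name is mine.

```python
import math

def bm25_score(doc_count, doc_freq, freq, field_length, avg_field_length,
               k1=1.2, b=0.75):
    # BM25 as quoted in the question's explain output: idf * tfNorm
    idf = math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))
    tf_norm = (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )
    return idf * tf_norm

# Number of tags (tokens) per sample document; assumes one token per tag.
field_lengths = {17: 3, 20: 4, 21: 2}
doc_count = len(field_lengths)
avg_length = sum(field_lengths.values()) / doc_count  # 3.0

# Each matching term occurs once, in exactly one document (docFreq = 1).
score_21 = bm25_score(doc_count, 1, 1.0, field_lengths[21], avg_length)
score_20 = bm25_score(doc_count, 1, 1.0, field_lengths[20], avg_length)

print(score_21, score_20)  # approximately 1.135697 and 0.86312973
```

The values agree with the `_score` fields above, so the higher score for id 21 is consistent with its shorter tags field being compared against index-wide statistics. If the five-shard index in the question keeps each document on its own shard, a single-shard index or `search_type=dfs_query_then_fetch` may be worth trying there as well.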
