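I'm trying to use Elasticsearch (version 6.8) to find the tags most similar to a piece of text. Instead of Elasticsearch's default scoring formula, I want each document's score to be the sum of the similarity scores of its tags.
For example, I create my_test_index and insert three documents: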
POST my_test_index/_doc/17
{
  "id": 17,
  "tags": ["devops", "server", "hardware"]
}
POST my_test_index/_doc/20
{
  "id": 20,
  "tags": ["software", "application", "developer", "develop"]
}
POST my_test_index/_doc/21
{
  "id": 21,
  "tags": ["electronic", "electric"]
}
I did not define a mapping, so the index was created with the defaults:
{
"my_test_index" : {
"aliases" : { },
"mappings" : {
"_doc" : {
"properties" : {
"id" : {
"type" : "long"
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1585820383702",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"uuid" : "05SgLog6S-GTSShTatrvQw",
"version" : {
"created" : "6080199"
},
"provided_name" : "my_test_index"
}
}
}
}
Then I run the following query:
GET my_test_index/_search
{
  "query": {
    "more_like_this": {
      "fields": [
        "tags"
      ],
      "like": [
        "i like electric devices and develop some softwares."
      ],
      "min_term_freq": 1,
      "min_doc_freq": 1
    }
  }
}
and get this response:
{
"took" : 4,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.2876821,
"_source" : {
"id" : 21,
"tags" : [
"electronic",
"electric"
]
}
},
{
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.2876821,
"_source" : {
"id" : 20,
"tags" : [
"software",
"application",
"developer",
"develop"
]
}
}
]
}
}
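A rough way to see why only documents 20 and 21 come back, each through a single term: more_like_this matches on exact analyzed terms, so "softwares" in the text does not match the "software" tag, and "develop" does not match "developer". The sketch below assumes a crude tokenizer (`simple_tokens`, a hypothetical stand-in for the standard analyzer, not Elasticsearch code) and simply intersects the text's terms with each document's tags.

```python
import re
from typing import Dict, List

# The three documents from the question, keyed by id.
DOCS: Dict[int, List[str]] = {
    17: ["devops", "server", "hardware"],
    20: ["software", "application", "developer", "develop"],
    21: ["electronic", "electric"],
}

LIKE_TEXT = "i like electric devices and develop some softwares."


def simple_tokens(text: str) -> List[str]:
    """Lowercase and split on non-alphanumerics.

    Assumption: this only approximates the standard analyzer for
    plain ASCII English text like the example sentence.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def matching_tags(text: str, docs: Dict[int, List[str]]) -> Dict[int, List[str]]:
    """For each document, list the tags that appear verbatim as terms in the text."""
    terms = set(simple_tokens(text))
    return {doc_id: [tag for tag in tags if tag in terms]
            for doc_id, tags in docs.items()}


if __name__ == "__main__":
    for doc_id, tags in matching_tags(LIKE_TEXT, DOCS).items():
        print(doc_id, tags)
```

Under this approximation, document 17 has no exact term in common with the text, which is consistent with it being absent from the hits.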
With explain: true set, the result is:
{
"took" : 3,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 2,
"max_score" : 0.2876821,
"hits" : [
{
"_shard" : "[my_test_index][1]",
"_node" : "maQL1REnQHaff51ekrqMxA",
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "21",
"_score" : 0.2876821,
"_source" : {
"id" : 21,
"tags" : [
"electronic",
"electric"
]
},
"_explanation" : {
"value" : 0.2876821,
"description" : "weight(tags:electric in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 2.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
},
{
"_shard" : "[my_test_index][2]",
"_node" : "maQL1REnQHaff51ekrqMxA",
"_index" : "my_test_index",
"_type" : "_doc",
"_id" : "20",
"_score" : 0.2876821,
"_source" : {
"id" : 20,
"tags" : [
"software",
"application",
"developer",
"develop"
]
},
"_explanation" : {
"value" : 0.2876821,
"description" : "weight(tags:develop in 0) [PerFieldSimilarity], result of:",
"details" : [
{
"value" : 0.2876821,
"description" : "score(doc=0,freq=1.0 = termFreq=1.0\n), product of:",
"details" : [
{
"value" : 0.2876821,
"description" : "idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:",
"details" : [
{
"value" : 1.0,
"description" : "docFreq",
"details" : [ ]
},
{
"value" : 1.0,
"description" : "docCount",
"details" : [ ]
}
]
},
{
"value" : 1.0,
"description" : "tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:",
"details" : [
{
"value" : 1.0,
"description" : "termFreq=1.0",
"details" : [ ]
},
{
"value" : 1.2,
"description" : "parameter k1",
"details" : [ ]
},
{
"value" : 0.75,
"description" : "parameter b",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "avgFieldLength",
"details" : [ ]
},
{
"value" : 4.0,
"description" : "fieldLength",
"details" : [ ]
}
]
}
]
}
]
}
}
]
}
}
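For reference, the 0.2876821 in both explanations can be reproduced by hand from the two formulas printed in the explain output. The sketch below is only a transcription of those formulas into Python (the function names are my own, not an Elasticsearch API), fed with the docFreq, docCount, termFreq and field-length values shown above.

```python
import math


def bm25_idf(doc_count: float, doc_freq: float) -> float:
    """idf as printed in the explain output."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq: float, field_length: float, avg_field_length: float,
                 k1: float = 1.2, b: float = 0.75) -> float:
    """tfNorm as printed in the explain output (k1 and b taken from it)."""
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def bm25_score(doc_count: float, doc_freq: float, freq: float,
               field_length: float, avg_field_length: float) -> float:
    """Score of a single term: idf * tfNorm."""
    return bm25_idf(doc_count, doc_freq) * bm25_tf_norm(
        freq, field_length, avg_field_length
    )


if __name__ == "__main__":
    # _id 21, term "electric": docCount=1, docFreq=1, fieldLength=avgFieldLength=2
    print(bm25_score(1, 1, 1.0, 2.0, 2.0))
    # _id 20, term "develop": docCount=1, docFreq=1, fieldLength=avgFieldLength=4
    print(bm25_score(1, 1, 1.0, 4.0, 4.0))
```

Both hits report docCount = 1 and an avgFieldLength equal to their own fieldLength, even though the index holds three documents. That suggests the statistics were computed separately on each of the five shards, with each document alone on its shard, which would make the two scores identical regardless of the tags.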
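However, this is not the result I want. I would like the score to be the sum of per-tag similarities, like this: the text contains the word "electric", which has a similarity of 1.0 with the "electric" tag and ~0.7 with the "electronic" tag; the text also contains the word "develop", which has a similarity of 1.0 with the "develop" tag, ~0.8 with the "developer" tag, and ~0.9 with the "software" tag, and so on.
So I expect a result like this: total score for _id 20 = ~2.7, for _id 21 = ~1.7, and so on.
I hope someone can provide an example of how to do this, or at least point me in the right direction.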
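To make the desired scoring concrete, here is a minimal sketch of the sum outside Elasticsearch. The per-tag similarities are the illustrative numbers from this question, assigned to tags so that they add up to the stated totals. How they would actually be computed (embeddings, edit distance, something else) is left open, and `TAG_SIMILARITY`, `summed_score` and `rank` are hypothetical names, not part of any library.

```python
from typing import Dict, List, Tuple

# Illustrative similarity of each tag to the query text (assumed values
# from the question; tags not listed count as 0.0).
TAG_SIMILARITY: Dict[str, float] = {
    "electric": 1.0,
    "electronic": 0.7,
    "develop": 1.0,
    "developer": 0.8,
    "software": 0.9,
}

DOCS: Dict[int, List[str]] = {
    17: ["devops", "server", "hardware"],
    20: ["software", "application", "developer", "develop"],
    21: ["electronic", "electric"],
}


def summed_score(tags: List[str], similarity: Dict[str, float]) -> float:
    """Document score = sum of the similarities of all its tags."""
    return sum(similarity.get(tag, 0.0) for tag in tags)


def rank(docs: Dict[int, List[str]],
         similarity: Dict[str, float]) -> List[Tuple[int, float]]:
    """Return (doc_id, score) pairs with a positive score, best first."""
    scored = [(doc_id, summed_score(tags, similarity))
              for doc_id, tags in docs.items()]
    return sorted((pair for pair in scored if pair[1] > 0),
                  key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in rank(DOCS, TAG_SIMILARITY):
        print(doc_id, round(score, 2))
```

With these numbers, _id 20 ranks first with about 2.7 and _id 21 second with about 1.7, which is the ordering I am after.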
Thanks.
1 Answer
rvpgvaaj #1
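I think you did not map the tags field as a text field, which is why ids 20 and 21 get the same score. I defined it as text in the mapping, and id 21 got the expected higher score. Below is my solution.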
Index definition
Index the sample documents you provided and use the same search query.
Search query
Search results