I'm trying to implement Explicit Semantic Analysis (ESA) with Lucene. When matching documents, how can I take the TF-IDF of the terms in the query into account? For example: the query should match Doc1 better than it matches Doc2. I'd like this to work without hurting performance. I'm currently implementing it via query boosting, boosting each term in proportion to its TF-IDF. Is there a better way?
1 answer

ukxgm1gy1#
Lucene already supports TF/IDF scoring, of course, by default, so I'm not quite sure I understand what you're looking for. It actually sounds a bit like you want to weight the query terms by the TF/IDF of the query itself.
With a query like:

field:(a a b)

repeated terms will be given more weight, roughly (though not exactly) equivalent to a boost of 2. So IDF doesn't really make much sense in this case, and TF is already handled for you; you don't actually need to do anything.
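To see why repetition behaves like a boost, here is a toy model (not Lucene code; the idf and fieldWeight numbers are made up for illustration) of how classic TF-IDF clause scores add up:

```python
import math

# Hypothetical model of Lucene's classic scoring: each query clause
# contributes boost * idf * queryNorm * fieldWeight, and clause scores
# are summed per document. The idf/fieldWeight values are illustrative.
idf = {"a": 0.59, "b": 0.59}
fw = {"a": 0.42, "b": 0.30}  # fieldWeight of each term in some document

def score(clauses):
    """clauses: list of (term, boost) pairs making up the query."""
    query_norm = 1.0 / math.sqrt(sum((b * idf[t]) ** 2 for t, b in clauses))
    return sum(b * idf[t] * query_norm * fw[t] for t, b in clauses)

repeated = score([("a", 1.0), ("a", 1.0), ("b", 1.0)])  # field:(a a b)
boosted = score([("a", 2.0), ("b", 1.0)])               # field:(a^2 b)

# The a contribution doubles in both cases; only queryNorm differs.
print(round(repeated, 4), round(boosted, 4))
```

The two absolute scores differ only through queryNorm, which is constant across all documents for a given query and so cannot change their order, which is why repeating a term is "equivalent, though not exactly" to boosting it.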
Keep in mind, though, that there are other scoring elements! The coord factor is important here. The document a b a matches four of the query terms (a, b, a, and a, but not c or d), while a b c matches five (a, b, a, c, and a, but not d), so that particular scoring element will score the second document more strongly.

Here is the explain output (see IndexSearcher.explain) for the a b a document:
0.26880693 = (MATCH) product of:
  0.40321037 = (MATCH) sum of:
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0, freq=2.0 = termFreq=2.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.07690979 = (MATCH) weight(text:b in 0) [DefaultSimilarity], result of:
      0.07690979 = score(doc=0, freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0, freq=2.0 = termFreq=2.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0, freq=2.0 = termFreq=2.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
  0.6666667 = coord(4/6)
And for the document a b c:
0.43768594 = (MATCH) product of:
  0.52522314 = (MATCH) sum of:
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1, freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:b in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1, freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1, freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.217584 = (MATCH) weight(text:c in 1) [DefaultSimilarity], result of:
      0.217584 = score(doc=1, freq=1.0 = termFreq=1.0), product of:
        0.435168 = queryWeight, product of:
          1.0 = idf(docFreq=1, maxDocs=2)
          0.435168 = queryNorm
        0.5 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=1, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1, freq=1.0 = termFreq=1.0), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
  0.8333333 = coord(5/6)
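As a quick sanity check on the arithmetic (a standalone sketch, not Lucene itself): each final score is coord multiplied by the sum of queryWeight × fieldWeight over the matching clauses.

```python
# Recompute the two explain totals from their printed components:
# score = coord(matched/total) * sum(queryWeight * fieldWeight per clause).
w_a0 = 0.25872254 * 0.42039964  # text:a in doc 0 ("a b a"), 3 a-clauses in query
w_b0 = 0.25872254 * 0.29726744  # text:b in doc 0
score0 = (3 * w_a0 + w_b0) * (4 / 6)

w_a1 = 0.25872254 * 0.29726744  # text:a in doc 1 ("a b c"), 3 a-clauses in query
w_b1 = 0.25872254 * 0.29726744  # text:b in doc 1
w_c1 = 0.435168 * 0.5           # text:c in doc 1, with its higher idf
score1 = (3 * w_a1 + w_b1 + w_c1) * (5 / 6)

# Agrees with 0.26880693 and 0.43768594 up to float rounding.
print(round(score0, 6), round(score1, 6))
```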
Note that, as desired, the match on term a is weighted more heavily in the first document, and you can also see each separate a in the query evaluated independently and added to the score. Note, though, the difference in coord and in the idf of term c for the second document. Those scoring effects simply counterbalance the lift gained from adding the same term multiple times. If you add enough a's to the query, the two documents will eventually trade places; the match on c is just being judged a *far* stronger result.
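The "add enough a's and they swap" claim can be checked with the same arithmetic. This is a rough model, assuming the query is three a's plus b, c, and d (as the coord(4/6) and coord(5/6) values imply) and reusing the idf and fieldWeight values from the explain output; queryNorm is omitted since it is identical for both documents and cannot change their order.

```python
def scores(n_a):
    """Score both documents for a query of n_a copies of 'a' plus b, c, d."""
    total_clauses = n_a + 3
    # doc "a b a" matches every a clause plus b -> coord (n_a + 1) / total
    doc_aba = (n_a * 0.5945349 * 0.42039964
               + 0.5945349 * 0.29726744) * (n_a + 1) / total_clauses
    # doc "a b c" matches every a clause plus b and c -> coord (n_a + 2) / total
    doc_abc = (n_a * 0.5945349 * 0.29726744
               + 0.5945349 * 0.29726744
               + 1.0 * 0.5) * (n_a + 2) / total_clauses
    return doc_aba, doc_abc

aba, abc = scores(3)    # the original query: "a b c" still wins
print(aba < abc)
aba, abc = scores(12)   # with many more a's, "a b a" overtakes it
print(aba > abc)
```

In this model the crossover happens around ten copies of a: the per-clause advantage of a in "a b a" (tf 2 versus tf 1) eventually outweighs both the coord penalty and the idf-heavy match on c.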