How to use the TF-IDF of query terms as a factor in document similarity scoring in Lucene

fnvucqvd · posted 2022-11-07 · in Lucene

I am trying to implement Explicit Semantic Analysis (ESA) with Lucene.
When matching documents, how can I take into account the TF-IDF of the terms within the query itself?
For example:

  • Query: "a b c a d a"
  • Doc1: "a b a"
  • Doc2: "a b c"

The query should match Doc1 better than it matches Doc2.
I would like this to work without hurting performance.
I implemented it with query boosting, boosting each term in proportion to its TF-IDF.
Is there a better way?
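
For reference, the boosting approach described above could look roughly like the sketch below: each distinct term is boosted by its frequency within the query, using Lucene's standard term^boost syntax. Only the TF side is shown (an IDF component would additionally need corpus document frequencies), and the class and method names are purely illustrative, not from the original post.

// Illustrative sketch only: boost each distinct query term by how often it
// occurs in the query text. Assumes whitespace-separated terms.
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryTfBoost {
    static String boostByQueryTf(String queryText) {
        // count how often each term occurs in the query
        Map<String, Integer> tf = new LinkedHashMap<>();
        for (String term : queryText.split("\\s+")) {
            tf.merge(term, 1, Integer::sum);
        }
        // emit each distinct term once, boosted by its in-query frequency
        StringBuilder boosted = new StringBuilder();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            boosted.append(e.getKey()).append('^').append(e.getValue()).append(' ');
        }
        return boosted.toString().trim();
    }

    public static void main(String[] args) {
        // "a b c a d a" becomes "a^3 b^1 c^1 d^1"
        System.out.println(boostByQueryTf("a b c a d a"));
    }
}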


ukxgm1gy #1

Lucene already supports TF/IDF scoring by default, of course, so I'm not quite sure I understand what you are looking for.
It actually sounds a bit like you want to weight the query terms by the TF/IDF of the query itself:

  • TF: Lucene sums the scores of each query term, so if the same term appears twice in a query (such as field:(a a b)), the repeated term carries more weight, roughly (though not exactly) equivalent to a boost of 2.
  • IDF: idf refers to statistics over a corpus of many documents. Since there is only one query, it does not apply here; or, if you want to get technical, every term has an idf of 1.

So IDF is fairly meaningless in this context, and TF is already handled for you; you really don't need to do anything.
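
To see this concretely, here is a short sketch (assuming a Lucene 5.x-era API, where the classic TF/IDF similarity is still the default; the class name is hypothetical) showing that a repeated query term simply becomes a repeated SHOULD clause, and that an explicit boost gives a roughly comparable effect:

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;

public class DuplicateTermDemo {
    public static void main(String[] args) throws Exception {
        QueryParser parser = new QueryParser("text", new WhitespaceAnalyzer());
        // the duplicated term is kept as two separate text:a clauses,
        // so its score is simply added twice
        System.out.println(parser.parse("a a b"));
        // a single text:a clause with an explicit boost of 2 has a roughly
        // comparable, though not identical, effect
        System.out.println(parser.parse("a^2 b"));
    }
}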
Keep in mind, though, that there are other scoring elements as well! The coord factor matters here:

  • a b a matches four of the six query terms (the three occurrences of a plus b, but not c or d)
  • a b c matches five of the six query terms (the three occurrences of a plus b and c, but not d)

So this particular scoring element favors the second document more strongly.
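
Both explain listings below can be reproduced with a short program along these lines (a minimal sketch, assuming a Lucene 5.x-era API where DefaultSimilarity, classic TF/IDF with the coord factor, is still the default similarity; the class name is hypothetical, and the field name text matches the output):

import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.RAMDirectory;

public class ExplainDemo {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
        // index the two example documents from the question
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            for (String text : new String[] { "a b a", "a b c" }) {
                Document doc = new Document();
                doc.add(new TextField("text", text, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        DirectoryReader reader = DirectoryReader.open(dir);
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new QueryParser("text", analyzer).parse("a b c a d a");
        // print the full scoring breakdown for every matching document
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.explain(query, hit.doc));
        }
        reader.close();
    }
}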
Here is the explain output (see IndexSearcher.explain) for the a b a document:

0.26880693 = (MATCH) product of:
  0.40321037 = (MATCH) sum of:
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.07690979 = (MATCH) weight(text:b in 0) [DefaultSimilarity], result of:
      0.07690979 = score(doc=0,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 0, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
    0.10876686 = (MATCH) weight(text:a in 0) [DefaultSimilarity], result of:
      0.10876686 = score(doc=0,freq=2.0 = termFreq=2.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.42039964 = fieldWeight in 0, product of:
          1.4142135 = tf(freq=2.0), with freq of:
            2.0 = termFreq=2.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=0)
  0.6666667 = coord(4/6)

And for the document a b c:

0.43768594 = (MATCH) product of:
  0.52522314 = (MATCH) sum of:
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:b in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.217584 = (MATCH) weight(text:c in 1) [DefaultSimilarity], result of:
      0.217584 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.435168 = queryWeight, product of:
          1.0 = idf(docFreq=1, maxDocs=2)
          0.435168 = queryNorm
        0.5 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          1.0 = idf(docFreq=1, maxDocs=2)
          0.5 = fieldNorm(doc=1)
    0.07690979 = (MATCH) weight(text:a in 1) [DefaultSimilarity], result of:
      0.07690979 = score(doc=1,freq=1.0 = termFreq=1.0
), product of:
        0.25872254 = queryWeight, product of:
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.435168 = queryNorm
        0.29726744 = fieldWeight in 1, product of:
          1.0 = tf(freq=1.0), with freq of:
            1.0 = termFreq=1.0
          0.5945349 = idf(docFreq=2, maxDocs=2)
          0.5 = fieldNorm(doc=1)
  0.8333333 = coord(5/6)

Note that, as desired, the matches on the term a carry more weight in the first document, and you can see each separate a in the query being evaluated independently and added to the score.
However, also note the difference that coord and the idf of the term c make for the second document. Those scoring effects simply outweigh the boost gained by repeating the same term; if you added enough copies of a to the query, the two documents would eventually swap places. The match on c is just being counted as a far more significant result.
