Lucene: calculate term vectors for an existing index

Posted by um6iljoc on 2022-11-07 in Lucene


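With Lucene.net, I would like to obtain term vectors as described in this Stack Overflow question.
The problem is that the index has already been generated with fields that are indexed and stored, but without term vectors.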

FieldType type = new FieldType();
type.setIndexed(true);
type.setStored(true);
type.setStoreTermVectors(false);

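In theory, it should be possible to recompute the term vectors for each document and then store them in the index.
Do you know how this can be done without deleting the complete Lucene index?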

lucene

Source: https://stackoverflow.com/questions/72508138/lucene-calculate-term-vectors-for-existing-index

1 Answer


8tntrjer1#

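As I mentioned in the comments on the question, you can generate term vector data on the fly, which may help you avoid completely rebuilding your index data.
In my scenario, I wanted to find the offset positions of my search terms in the matching documents.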

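I don't want to oversell this approach. It is definitely not a replacement for re-indexing, but it may help if your queries are basic.

Step 1: Execute whatever query you are currently executing.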

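For each document in the hit list, you will need to re-process the relevant fields in that document. So you either need to have the field data already stored in your existing index, or you need to retrieve it from its original source.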
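A minimal sketch of step 1, assuming Lucene.NET 4.8 and a stored field. The field name "body", the two sample documents, the in-memory index, and the single-term TermQuery are all illustrative assumptions and not part of the original setup; a real application would open its existing index and run its own query.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;
using Lucene.Net.Store;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;
const string FieldName = "body"; // hypothetical field name

using var analyzer = new StandardAnalyzer(AppLuceneVersion);
using var directory = new RAMDirectory();

// Build a tiny demo index: the field is indexed and stored, without term vectors.
using (var writer = new IndexWriter(directory, new IndexWriterConfig(AppLuceneVersion, analyzer)))
{
    foreach (var text in new[] { "Foo Bar Baz Bar Bat", "Nothing to see here" })
    {
        var doc = new Document();
        doc.Add(new TextField(FieldName, text, Field.Store.YES));
        writer.AddDocument(doc);
    }
}

using var reader = DirectoryReader.Open(directory);
var searcher = new IndexSearcher(reader);

// Step 1: run the query as usual (a single lowercase term in this sketch).
var hits = searcher.Search(new TermQuery(new Term(FieldName, "bar")), 10);

foreach (var scoreDoc in hits.ScoreDocs)
{
    // Load the stored text so that it can be re-analyzed in step 2.
    string storedText = searcher.Doc(scoreDoc.Doc).Get(FieldName);
    Console.WriteLine($"doc {scoreDoc.Doc}: {storedText}");
}
```

If the field was indexed but not stored, Get returns null for it, and the text would have to come from the original source instead.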

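Step 2: For each such field, you can re-use the same analyzer to build a token stream on the fly. The token stream can be configured with different attributes, such as: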

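the term attribute (ICharTermAttribute)
the offset attribute (IOffsetAttribute)
and others (linked in the original answer)

Example: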

using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

const LuceneVersion AppLuceneVersion = LuceneVersion.LUCENE_48;

// The field name is not needed for this example, so it is left as null.
string? fieldName = null;
string fieldContent = "Foo Bar Baz Bar Bat";
// The search term must already be in analyzed form (StandardAnalyzer lowercases tokens).
string searchTerm = "bar";

// Both the analyzer and the token stream are disposable; "using" releases them at the end.
using var analyzer = new StandardAnalyzer(AppLuceneVersion);
using var ts = analyzer.GetTokenStream(fieldName, fieldContent);
var charTermAttr = ts.AddAttribute<ICharTermAttribute>();
var offsetAttr = ts.AddAttribute<IOffsetAttribute>();

// Reset() must be called before the first IncrementToken().
ts.Reset();
Console.WriteLine("");
Console.WriteLine("Token: " + searchTerm);
while (ts.IncrementToken())
{
    // Compare each analyzed token with the search term.
    if (searchTerm.Equals(charTermAttr.ToString()))
    {
        var start = offsetAttr.StartOffset;
        var end = offsetAttr.EndOffset;
        Console.WriteLine(String.Format("  > offset: {0}-{1}", start, end));
    }
}
// End() signals that the stream has been fully consumed.
ts.End();

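The example above assumes that one of the hits from step 1 is a field containing "Foo Bar Baz Bar Bat", and that the search term is bar.
The generated output is: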

Token: bar
  > offset: 4-7
  > offset: 12-15

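So, as you can see, you are not re-executing the query. You are only re-processing the token stream. The more complex the original search terms are, the harder it becomes to use this approach in the way you may need.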
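To illustrate how this extends slightly beyond a single term, here is a sketch that looks for several terms at once. It assumes Lucene.NET 4.8; the Analyze helper, the field name "f", and the raw input "Bar BAT" are hypothetical names introduced only for this illustration. The raw search text is passed through the same analyzer so that it is normalized in the same way as the field content.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

using var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

// Normalize the raw search terms with the same analyzer used for the field.
var wanted = new HashSet<string>();
foreach (var (term, _, _) in Analyze(analyzer, "Bar BAT"))
{
    wanted.Add(term);
}

// Re-analyze the field content and report the offsets of every wanted term.
foreach (var (term, start, end) in Analyze(analyzer, "Foo Bar Baz Bar Bat"))
{
    if (wanted.Contains(term))
    {
        Console.WriteLine($"{term}: {start}-{end}");
    }
}

// Hypothetical helper: returns each analyzed token with its character offsets.
static List<(string Term, int Start, int End)> Analyze(Analyzer analyzer, string text)
{
    var tokens = new List<(string Term, int Start, int End)>();
    // "f" is a placeholder field name for this sketch.
    using var ts = analyzer.GetTokenStream("f", text);
    var termAttr = ts.AddAttribute<ICharTermAttribute>();
    var offsetAttr = ts.AddAttribute<IOffsetAttribute>();

    ts.Reset();
    while (ts.IncrementToken())
    {
        tokens.Add((termAttr.ToString(), offsetAttr.StartOffset, offsetAttr.EndOffset));
    }
    ts.End();
    return tokens;
}
```

For this input, the sketch should report bar at 4-7 and 12-15, and bat at 16-19. It only matches individual terms; phrase, wildcard, or boolean queries would need additional logic that this sketch does not attempt.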

Answered on 2022-11-07
