How to highlight only the PrefixQuery match in Lucene, rather than the whole word?

Posted by zf9nrax1 on 2022-11-07 in Lucene

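I'm new to Lucene and may be doing something wrong, so please correct me if that's the case. I've been looking for an answer for several days and don't know what to try next.
The goal is to search user names with Lucene.NET using partial matching (like StartsWith) and to highlight only the matched part.
Here is how I went about it.
First, creating the index: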

using var indexDir = FSDirectory.Open(Path.Combine(IndexDirectory, IndexName));
using var standardAnalyzer = new StandardAnalyzer(CurrentVersion);

var indexConfig = new IndexWriterConfig(CurrentVersion, standardAnalyzer);
indexConfig.OpenMode = OpenMode.CREATE_OR_APPEND;

using var indexWriter = new IndexWriter(indexDir, indexConfig);
if (indexWriter.NumDocs == 0)
{
    //fill the index with Documents
}

Documents are built like this:

static Document BuildClientDocument(int id, string surname, string name)
{
    var document = new Document()
    {
        new StringField("Id", id.ToString(), Field.Store.YES),

        new TextField("Surname", surname, Field.Store.YES),
        new TextField("Surname_sort", surname.ToLower(), Field.Store.NO),

        new TextField("Name", name, Field.Store.YES),
        new TextField("Name_sort", name.ToLower(), Field.Store.NO),
    };

    return document;
}

The search is performed like this:

using var multiReader = new MultiReader(indexWriter.GetReader(true)); //the plan was to use multiple indexes per entity types
var indexSearcher = new IndexSearcher(multiReader);

var queryString = "abc"; //just as a sample
var queryWords = queryString.SplitWords();

var query = new BooleanQuery();
queryWords
    .Process((word, index) =>
    {
        var boolean = new BooleanQuery()
        {
            { new PrefixQuery(new Term("Surname", word)) { Boost = 100 }, Occur.SHOULD }, //surnames are most important to match
            { new PrefixQuery(new Term("Name", word)) { Boost = 50 }, Occur.SHOULD }, //names are less important
        };
        boolean.Boost = (queryWords.Count() - index); //first words in a search query are more important than others

        query.Add(boolean, Occur.MUST);
    })
;

var topDocs = indexSearcher.Search(query, 50, new Sort( //sort by relevance and then in lexicographical order
    SortField.FIELD_SCORE,
    new SortField("Surname_sort", SortFieldType.STRING),
    new SortField("Name_sort", SortFieldType.STRING)
));

And the highlighting:

var htmlFormatter = new SimpleHTMLFormatter();
var queryScorer = new QueryScorer(query);
var highlighter = new Highlighter(htmlFormatter, queryScorer);
foreach (var found in topDocs.ScoreDocs)
{
    var document = indexSearcher.Doc(found.Doc);
    var surname = document.Get("Surname"); //just for simplicity
    var surnameFragment = highlighter.GetBestFragment(standardAnalyzer, "Surname", surname);
    Console.WriteLine(surnameFragment);
}

The problem is that the highlighter returns results like this:

<b>abc</b>
<b>abcd</b>
<b>abcde</b>
<b>abcdef</b>

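So it "highlights" the whole word even though I'm searching for only part of it. Explain returns NON-MATCH all the way down, so I'm not sure whether it helps.
Is it possible to highlight only the part that was searched for, as in my example (only "abc" in "abcdef")?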

kpbwa7wx #1

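After looking into this further, I came to the conclusion that for this kind of highlighting to work, the index generation has to be adjusted so that the text is split into parts when indexed and the offsets are calculated correctly. Otherwise the highlighter only highlights whole words (fragments) around the match.
So, based on this, I managed to build a simple highlighter of my own.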

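using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Regex-based highlighter: wraps the matched part of a word in start/end tokens.
public class Highlighter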
{
    private const string TempStartToken = "\x02";
    private const string TempEndToken = "\x03";

    private const string SearchPatternTemplate = $"[{TempStartToken}{TempEndToken}]*{{0}}";
    private const string ReplacePattern = $"{TempStartToken}$&{TempEndToken}";

    private readonly ConcurrentDictionary<HighlightKey, Regex> _regexPatternsCache = new();

    private static string GetHighlightTypeTemplate(HighlightType highlightType) =>
        highlightType switch
        {
            HighlightType.Starts => "^{0}",
            HighlightType.Contains => "{0}",
            HighlightType.Ends => "{0}$",
            HighlightType.Equals => "^{0}$",
            _ => throw new ArgumentException($"Unsupported {nameof(HighlightType)}: '{highlightType}'", nameof(highlightType)),
        };

    public string Highlight(string text, IReadOnlySet<string> words, string startToken, string endToken, HighlightType highlightType)
    {
        foreach (var word in words)
        {
            var key = new HighlightKey
            {
                Word = word,
                HighlightType = highlightType,
            };

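            // Build (and cache) one regex per word and highlight type. Every character may be
            // preceded by temporary markers left by earlier replacements, so later words
            // still match text that has already been marked.
            var regex = _regexPatternsCache.GetOrAdd(key, _ =>
            {
                var parts = word.Select(w => string.Format(SearchPatternTemplate, Regex.Escape(w.ToString())));
                var pattern = string.Concat(parts);
                var highlightPattern = string.Format(GetHighlightTypeTemplate(highlightType), pattern);

                return new Regex(highlightPattern, RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);
            });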

            text = regex.Replace(text, ReplacePattern);
        }

        return text
            .Replace(TempStartToken, startToken)
            .Replace(TempEndToken, endToken)
        ;
    }

    private record HighlightKey
    {
        public string Word { get; init; }
        public HighlightType HighlightType { get; init; }
    }
}

public enum HighlightType
{
    Starts,
    Contains,
    Ends,
    Equals,
}
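
To make the pattern construction above easier to follow, here is a minimal sketch that rebuilds only the "starts with" case and applies two overlapping words in a fixed order. The class MarkerPatternDemo and its helpers BuildStartsPattern and Highlight are hypothetical names introduced for illustration; they are not part of the answer's class or of Lucene.NET, and the markers are written as \u escapes rather than \x escapes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class MarkerPatternDemo
{
    // Same temporary markers as the Highlighter class, written as \u escapes.
    private const string Start = "\u0002";
    private const string End = "\u0003";

    // Hypothetical helper: builds the "starts with" pattern for one word,
    // allowing any run of markers before every character of the word.
    public static Regex BuildStartsPattern(string word)
    {
        var parts = word.Select(c => $"[{Start}{End}]*{Regex.Escape(c.ToString())}");
        return new Regex("^" + string.Concat(parts),
            RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);
    }

    // Hypothetical helper: applies the words in the given order,
    // then swaps the temporary markers for real tags.
    public static string Highlight(string text, params string[] words)
    {
        foreach (var word in words)
        {
            // "$&" is the whole match, so the match is wrapped in markers.
            text = BuildStartsPattern(word).Replace(text, Start + "$&" + End);
        }

        return text.Replace(Start, "<b>").Replace(End, "</b>");
    }
}
```

With this sketch, MarkerPatternDemo.Highlight("abcd", "abc") gives `<b>abc</b>d`, while MarkerPatternDemo.Highlight("abcd", "ab", "abc") gives `<b><b>ab</b>c</b>d`: the second pattern still matches because it skips the markers left by the first replacement, at the cost of nested tags.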

It is used like this:

var queries = new[] { "abc" }.ToHashSet();
var search = "a ab abc abcd abcde";

var highlighter = new Highlighter();
var outputs = search
    .Split((string[])null, StringSplitOptions.RemoveEmptyEntries | StringSplitOptions.TrimEntries)
    .Select(w => highlighter.Highlight(w, queries, "<b>", "</b>", HighlightType.Starts))
;

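// Dump() and Util.RawHtml() are LINQPad helpers; outside LINQPad, print the string with Console.WriteLine instead.
var result = string.Join(" ", outputs).Dump();
Util.RawHtml(result).Dump();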

The output looks like this:

a ab <b>abc</b> <b>abc</b>d <b>abc</b>de

I'm open to other, better solutions.
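
For the StartsWith case specifically, one possible alternative is to skip regular expressions and wrap the prefix by index. The following is only a sketch under that assumption: PrefixMarker and MarkPrefix are hypothetical names, and the code handles prefix matches only, not the Contains, Ends, or Equals modes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixMarker
{
    // Hypothetical helper: wraps the longest matching prefix of `word`
    // in the given tokens, or returns the word unchanged if none matches.
    public static string MarkPrefix(string word, IEnumerable<string> prefixes, string startToken, string endToken)
    {
        // Ordinal comparison keeps the matched length equal to the prefix length,
        // so Substring below cannot cut the word in the wrong place.
        var best = prefixes
            .Where(p => p.Length > 0 && word.StartsWith(p, StringComparison.OrdinalIgnoreCase))
            .OrderByDescending(p => p.Length)
            .FirstOrDefault();

        return best is null
            ? word
            : startToken + word.Substring(0, best.Length) + endToken + word.Substring(best.Length);
    }

    // Hypothetical helper: splits on spaces and marks each word separately.
    public static string MarkAll(string text, IEnumerable<string> prefixes)
    {
        var list = prefixes.ToList();
        var words = text.Split(' ', StringSplitOptions.RemoveEmptyEntries);
        return string.Join(" ", words.Select(w => MarkPrefix(w, list, "<b>", "</b>")));
    }
}
```

Calling PrefixMarker.MarkAll("a ab abc abcd abcde", new[] { "abc" }) returns `a ab <b>abc</b> <b>abc</b>d <b>abc</b>de`. Picking the longest prefix avoids nested tags when several query words match the same word.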
