Apache Lucene: search phrases in files

Posted by pw9qyyiw 8 months ago in Lucene

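I am working on a project that uses Apache Lucene to search text files. The code below works well for a single word. When I search for a phrase, it returns the results listed below:
Phrase: ABC DEF

File 1: contains "ABC DEF" as a phrase
File 2: contains "ABC DEF" as a phrase
File 3: contains ABC and DEF as separate words in different places
File 4: contains ABC and DEF as separate words in different places
File 5: contains ABC
File 6: contains DEF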
Results

File name - score

File 1 1.3502092
File 4 1.0447767
File 2 1.0047288
File 3 0.97969353
Search

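import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public static void search(String keyword, String path) throws IOException, ParseException {
    // Open the existing index; try-with-resources closes the reader when done.
    try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(path)))) {
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();

        // Parse the keyword against the "contents" field; every term is required (AND).
        QueryParser queryParser = new QueryParser("contents", analyzer);
        queryParser.setDefaultOperator(QueryParser.Operator.AND);
        Query query = queryParser.parse(keyword);

        // Fetch up to 1000 hits, best score first.
        TopDocs hits = searcher.search(query, 1000);
        ScoreDoc[] document = hits.scoreDocs;
        System.out.println("Total no of hits for content: " + hits.totalHits);

        // Print the stored "title" field and the score of each hit.
        for (int i = 0; i < document.length; i++) {
            Document doc = searcher.doc(document[i].doc);
            String filePath = doc.get("title");
            System.out.println(filePath + " " + document[i].score);
        }
    }
}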

As the results show, File 4 does not contain the phrase, yet it scores higher than File 2, which does. How can I fix this?
Thanks in advance.

lucene

Source: https://stackoverflow.com/questions/76871298/apache-lucene-search-phrases-in-files

1 Answer

vs91vp4v #1

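The fact that File 4 matches shows that you are not searching for the phrase "ABC DEF" but for the two separate terms ABC and DEF (no quotes). With the AND operator, any document that contains both terms anywhere matches. By default, BM25 does not take term proximity into account. The difference in scores is probably due to document length (for example, File 4 may be shorter than the others), term frequency, or other factors.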
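To illustrate the document-length point, here is a minimal sketch of the textbook BM25 formula with Lucene's default parameters k1 = 1.2 and b = 0.75. The class name, the corpus statistics and the two file lengths are invented for illustration, and Lucene's own BM25Similarity differs in implementation details, so the absolute numbers will not match the scores in the question. The point is only that word positions never enter the formula, while document length does.

```java
final class Bm25LengthSketch {
    // Default parameters of Lucene's BM25Similarity.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Textbook BM25 score that one query term contributes to one document. */
    static double termScore(int termFreq, int docLen, double avgDocLen, int docCount, int docFreq) {
        // Terms that occur in fewer documents get a higher inverse document frequency.
        double idf = Math.log(1.0 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
        // Documents longer than average are penalised, shorter ones are rewarded.
        double lengthNorm = 1.0 - B + B * (docLen / avgDocLen);
        return idf * termFreq * (K1 + 1.0) / (termFreq + K1 * lengthNorm);
    }

    /** Score of the unquoted query ABC DEF: the sum of both term scores. */
    static double twoTermScore(int docLen) {
        // Hypothetical corpus: 6 files, average length 250 tokens,
        // each term occurring once per document and in 5 of the 6 files.
        double abc = termScore(1, docLen, 250.0, 6, 5);
        double def = termScore(1, docLen, 250.0, 6, 5);
        return abc + def;
    }

    public static void main(String[] args) {
        System.out.println("long file, 400 tokens:  " + twoTermScore(400));
        System.out.println("short file, 100 tokens: " + twoTermScore(100));
    }
}
```

Under these assumed numbers the 100-token file scores higher than the 400-token file even though both contain each term exactly once, which is one way a file without the phrase can outrank a file with it.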
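Use a quoted phrase query, "ABC DEF", so that File 4 no longer matches. Add the proximity operator "~" to allow a few words between ABC and DEF, for example "ABC DEF"~3.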
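Since the search method in the question passes the raw keyword straight to the query parser, one way to apply this is to wrap the keyword in quotes before parsing. The helper below is a hypothetical sketch (the class and method names are not part of Lucene); it only builds the query string, using the quoting, backslash-escaping and "~" slop syntax of the classic query parser.

```java
final class PhraseQueryText {
    /** Wraps raw user input in double quotes so the parser treats it as one phrase. */
    static String asPhrase(String keyword) {
        // Escape backslashes and embedded quotes so they cannot end the phrase early.
        String escaped = keyword.trim().replace("\\", "\\\\").replace("\"", "\\\"");
        return "\"" + escaped + "\"";
    }

    /** Same as asPhrase(keyword), plus a proximity slop appended with "~". */
    static String asPhrase(String keyword, int slop) {
        if (slop < 0) {
            throw new IllegalArgumentException("slop must be >= 0");
        }
        return asPhrase(keyword) + "~" + slop;
    }

    public static void main(String[] args) {
        System.out.println(asPhrase("ABC DEF"));    // prints "ABC DEF" (with the quotes)
        System.out.println(asPhrase("ABC DEF", 3)); // prints "ABC DEF"~3
    }
}
```

In the question's code this would be used as queryParser.parse(PhraseQueryText.asPhrase(keyword)). With the exact phrase, only File 1 and File 2 should match; with a slop, files where the two words are close together can match as well.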
If you really want to rank by proximity, you may want to explore a similarity model other than the default BM25. I am not sure whether Lucene has one built in.

Answered 8 months ago
