Using a Lucene Analyzer without indexing - is my approach reasonable?

Posted by px9o7tmv on 2022-11-07 in Lucene

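My goal is to use some of Lucene's tokenizers and filters to transform input text, without creating any index.
For example, given this (contrived) input string...
someone's texte goes here foo
...and a Lucene analyzer like this one...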

Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("icuFolding")
        .build();

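I want to get the following output:
someone's texte goes here foo
The Java method below does what I want.

But is there a better (i.e. more typical and/or more concise) way to do it?

I am thinking specifically of the way I use TokenStream and CharTermAttribute, since I have never used them like this before. It feels clunky.
Here is the code:
Lucene 8.3.0 imports: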

import java.io.IOException;
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

My method:

private String transform(String input) throws IOException {

    // ICU tokenizer, then lowercasing, then ICU folding (e.g. accent removal).
    Analyzer analyzer = CustomAnalyzer.builder()
            .withTokenizer("icu")
            .addTokenFilter("lowercase")
            .addTokenFilter("icuFolding")
            .build();

    // The field name is arbitrary here because nothing is indexed.
    TokenStream ts = analyzer.tokenStream("myField", new StringReader(input));
    CharTermAttribute charTermAtt = ts.addAttribute(CharTermAttribute.class);

    StringBuilder sb = new StringBuilder();
    try {
        // reset() must be called before the first incrementToken().
        ts.reset();
        while (ts.incrementToken()) {
            sb.append(charTermAtt.toString()).append(" ");
        }
        ts.end();
    } finally {
        ts.close();
    }
    // Drop the trailing space left by the loop.
    return sb.toString().trim();
}
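
As a point of comparison for the output the method above is aiming at, here is a rough sketch that approximates the same result with the JDK alone. It is not the analyzer chain from the question: `FoldingSketch` and `fold` are hypothetical names, `java.text.Normalizer` is swapped in for `icuFolding`, and splitting on whitespace is swapped in for the ICU tokenizer. Because of that, punctuation and non-Latin scripts will be handled differently from Lucene, so treat it only as an illustration of the lowercase-and-fold idea.

```java
import java.text.Normalizer;
import java.util.Locale;

public class FoldingSketch {

    // Hypothetical helper, not part of Lucene: lowercases the input, strips
    // combining marks (accents), and collapses runs of whitespace.
    static String fold(String input) {
        // Decompose accented characters into base letter + combining mark.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        // Remove the combining marks that the decomposition exposed.
        String stripped = decomposed.replaceAll("\\p{M}+", "");
        // Lowercase, then join the words with single spaces.
        return stripped.toLowerCase(Locale.ROOT).trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        // Assumed sample input; \u00e9 is a lowercase e with an acute accent.
        System.out.println(fold("Someone's  Text\u00e9 goes here FOO"));
    }
}
```

Whether this is an acceptable substitute depends on how much of the ICU tokenizer's word-boundary behaviour is needed. When it is, the Lucene method above remains the way to get it.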

Source: https://stackoverflow.com/questions/59723144/using-lucene-analyzer-without-indexing-is-my-approach-reasonable