Does the Elasticsearch "pattern" analyzer support other languages?

Posted by nwwlzxa7 7 months ago in ElasticSearch


When I use "analyzer": "pattern" on a "text" field, it only returns tokens for English text:

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes."
}

Response:

{
    "tokens": [
        {
            "token": "the",
            "start_offset": 0,
            "end_offset": 3,
            "type": "word",
            "position": 0
        },
        {
            "token": "2",
            "start_offset": 4,
            "end_offset": 5,
            "type": "word",
            "position": 1
        },
        {
            "token": "quick",
            "start_offset": 6,
            "end_offset": 11,
            "type": "word",
            "position": 2
        },
        {
            "token": "brown",
            "start_offset": 12,
            "end_offset": 17,
            "type": "word",
            "position": 3
        },
        {
            "token": "foxes",
            "start_offset": 18,
            "end_offset": 23,
            "type": "word",
            "position": 4
        }
    ]
}

But when the text is in another language, it returns no tokens:

POST _analyze
{
  "analyzer": "pattern",
  "text": "БЫСТРЫХ бурых лисицы."
}

Response:

{
    "tokens": []
}

The pattern analyzer should support all languages.


Source: https://stackoverflow.com/questions/77300037/does-analyzer-pattern-support-other-languages

1 Answer


9q78igpj1#

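"The pattern analyzer should support all languages."
That was never the intent of the pattern analyzer. It is designed to match a regular-expression pattern against the input string and split the string into tokens at every match, discarding the parts that matched. It is also not meant to be used in its default configuration: you are expected to specify your own pattern. The only reason it works without one is that Elasticsearch requires every analyzer to work with no configuration, so the developers had to pick a default for each parameter, and they happened to pick \W+.
If you look at the Java regex documentation, \W+ simply means one or more characters other than the Latin letters a-z and A-Z, the digits 0-9, and the underscore. Every other character, Cyrillic letters included, is treated as a token separator.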
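The effect of the default pattern can be reproduced outside Elasticsearch. The following is a minimal Python sketch, not the analyzer's actual implementation: `pattern_analyze` is a hypothetical helper, and `re.ASCII` is passed because Python's `\W` is Unicode-aware by default, whereas Java's default `\W` is ASCII-only.

```python
import re

# Java's default \W only treats [a-zA-Z0-9_] as word characters.
# re.ASCII makes Python's \W behave the same way for this illustration.
DEFAULT_PATTERN = re.compile(r"\W+", re.ASCII)

def pattern_analyze(text, pattern=DEFAULT_PATTERN):
    """Hypothetical helper: split on pattern matches, drop empty
    pieces, and lowercase what is left."""
    return [tok.lower() for tok in pattern.split(text) if tok]

print(pattern_analyze("The 2 QUICK Brown-Foxes."))
# ['the', '2', 'quick', 'brown', 'foxes']
print(pattern_analyze("БЫСТРЫХ бурых лисицы."))
# []
```

In the second call the whole string is one long separator match, so nothing is left to emit as a token.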
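"Does the "pattern" analyzer support other languages?"
That is a good question. The pattern analyzer does not care about languages; it cares about patterns. So I can specify a pattern that matches every non-Cyrillic character: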

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_cyrillic_analyzer": {
          "type":      "pattern",
          "pattern":   "[^\u0400-\u04ff]"
        }
      }
    }
  }
}

POST test/_analyze
{
  "analyzer": "my_cyrillic_analyzer",
  "text": "The 2 бурые лисы jumped over the lazy dog's bone."
}

In this case it produces:

{
  "tokens": [
    {
      "token": "бурые",
      "start_offset": 6,
      "end_offset": 11,
      "type": "word",
      "position": 0
    },
    {
      "token": "лисы",
      "start_offset": 12,
      "end_offset": 16,
      "type": "word",
      "position": 1
    }
  ]
}
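The same split can be checked quickly in Python. This is a rough sketch under the assumption that splitting on the character class and dropping empty pieces mirrors what the analyzer does; `split_tokens` is a hypothetical helper name.

```python
import re

# Same character class as the "pattern" setting above: anything outside
# the basic Cyrillic block U+0400..U+04FF is treated as a separator.
NON_CYRILLIC = re.compile("[^\u0400-\u04ff]")

def split_tokens(text, pattern):
    """Hypothetical helper: split on matches, drop empties, lowercase."""
    return [tok.lower() for tok in pattern.split(text) if tok]

text = "The 2 бурые лисы jumped over the lazy dog's bone."
print(split_tokens(text, NON_CYRILLIC))
# ['бурые', 'лисы']
```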

Or you could do something like:

DELETE test
PUT test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_alphabetic_analyzer": {
          "type":      "pattern",
          "pattern":   "[^\\p{IsAlphabetic}]"
        }
      }
    }
  }
}

POST test/_analyze
{
  "analyzer": "my_alphabetic_analyzer",
  "text": "The 2 бурые лисы jumped over the lazy dog's bone."
}

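In this case it discards every character that is not a Unicode alphabetic character. This is a powerful tool, but you need to learn how to use it. If you are just getting started with Elasticsearch, I recommend the standard analyzer unless you have a specific problem you want to solve; in that case, describe the problem and we will try to suggest an appropriate solution.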
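To get a feel for what an "alphabetic only" split keeps, here is an approximate Python sketch. Python's `re` module has no `\p{IsAlphabetic}`, so the hypothetical helper `alphabetic_tokens` uses `str.isalpha()` instead; that is close to, but not identical to, Java's Unicode Alphabetic property, so treat the result as an illustration rather than as Elasticsearch's exact output.

```python
from itertools import groupby

def alphabetic_tokens(text):
    """Hypothetical helper: keep maximal runs of letters, lowercased.
    str.isalpha() stands in for Java's \\p{IsAlphabetic} (an approximation)."""
    return [
        "".join(run).lower()
        for is_letter, run in groupby(text, key=str.isalpha)
        if is_letter
    ]

print(alphabetic_tokens("The 2 бурые лисы jumped over the lazy dog's bone."))
# ['the', 'бурые', 'лисы', 'jumped', 'over', 'the', 'lazy', 'dog', 's', 'bone']
```

Note that the digit is dropped and the apostrophe splits "dog's" into two tokens, which is the kind of side effect to check for before settling on a custom pattern.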

