Elasticsearch 7.7.1 shards unassigned

Asked by ocebsuys on 2021-06-10, tagged Elasticsearch

We recently upgraded our Elasticsearch cluster from 5.6.16 to 7.7.1.
Since then, I have occasionally observed that some shards do not get assigned.
My node stats are attached here.
The allocation explain output for the unassigned shard is shown below:

ubuntu@platform2:~$      curl -X GET "localhost:9200/_cluster/allocation/explain?pretty" -H 'Content-Type: application/json' -d'
> {
>   "index": "denorm",
>   "shard": 14,
>   "primary": false
> }
> '
{
  "index" : "denorm",
  "shard" : 14,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2020-11-19T13:09:42.072Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "0_00hk5IRcmgrHGYjpV1jA",
      "node_name" : "platform2",
      "transport_address" : "10.62.70.178:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of 
[4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    },
    {
      "node_id" : "9ltF-KXGRk-xMF_Ef1DAng",
      "node_name" : "platform3",
      "transport_address" : "10.62.70.179:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of 
[4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[denorm][14], node[9ltF-KXGRk-xMF_Ef1DAng], [P], s[STARTED], a[id=SNyCoFUzSwaiIE4187Tfig]]"
        }
      ]
    },
    {
      "node_id" : "ocKks7zJT7OODhse-yveyg",
      "node_name" : "platform1",
      "transport_address" : "10.62.70.177:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of 
[4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        }
      ]
    }
  ]
}

As the node stats show, out of the 4.3gb heap only about 85mb is used for holding data structures in memory.
As discussed here, after setting indices.breaker.total.use_real_memory: false I no longer see the Data too large exception.
Can anyone help me confirm whether I am hitting the same issue discussed here?
I did not see this problem on Elasticsearch 5.6.16.
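For reference, this is how I applied the setting; a minimal sketch of the elasticsearch.yml change on each node (the comments describe the documented default behavior):

```yaml
# elasticsearch.yml (on each data node, requires a node restart)
# Disables real-memory accounting for the parent circuit breaker.
# With real-memory accounting on (the 7.x default), the parent limit
# is 95% of heap based on actual JVM memory use; with it off, the
# limit falls back to 70% of heap based on tracked reservations only.
indices.breaker.total.use_real_memory: false
```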

rwqw0loc1#

As @Val pointed out in the comments, because of the circuit-breaker behavior in your ES 7.x cluster, ES could not allocate the shard on the other data nodes, and the only remaining node already holds the primary shard.

{
      "node_id" : "9ltF-KXGRk-xMF_Ef1DAng",
      "node_name" : "platform3",
      "transport_address" : "10.62.70.179:9300",
      "node_decision" : "no",
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:09:42.072Z], failed_attempts[5], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [ocKks7zJT7OODhse-yveyg]: failed to perform indices:data/write/bulk[s] on replica [denorm][14], node[ocKks7zJT7OODhse-yveyg], [R], recovery_source[peer recovery], s[INITIALIZING], a[id=WpZNgGsuSoeyHE-Hg_HBSw], unassigned_info[[reason=ALLOCATION_FAILED], at[2020-11-19T13:08:41.492Z], failed_attempts[4], failed_nodes[[ocKks7zJT7OODhse-yveyg, 0_00hk5IRcmgrHGYjpV1jA]], delayed=false, details[failed shard on node [0_00hk5IRcmgrHGYjpV1jA]: failed recovery, failure RecoveryFailedException[[denorm][14]: Recovery failed from {platform3}{9ltF-KXGRk-xMF_Ef1DAng}{hdd3KH53Sg6Us8Ow2rVY-A}{10.62.70.179}{10.62.70.179:9300}{dimr} into {platform2}{0_00hk5IRcmgrHGYjpV1jA}{0jbndos9TQq9s-DoSMNjgA}{10.62.70.178}{10.62.70.178:9300}{dimr}]; nested: RemoteTransportException[[platform3][10.62.70.179:9300][internal:index/shard/recovery/start_recovery]]; nested: RemoteTransportException[[platform2][10.62.70.178:9300][internal:index/shard/recovery/file_chunk]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb], which is larger than the limit of [4476560998/4.1gb], real usage: [4479322256/4.1gb], new bytes reserved: [1048858/1mb], usages [request=0/0b, fielddata=1619769/1.5mb, in_flight_requests=2097692/2mb, accounting=119863546/114.3mb]]; ], allocation_status[no_attempt]], expected_shard_size[1168665714], failure RemoteTransportException[[platform1][10.62.70.177:9300][indices:data/write/bulk[s][r]]]; nested: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4516429702/4.2gb], which is larger than the limit of 
[4476560998/4.1gb], real usage: [4516131576/4.2gb], new bytes reserved: [298126/291.1kb], usages [request=65648/64.1kb, fielddata=1412022/1.3mb, in_flight_requests=29783730/28.4mb, accounting=121642746/116mb]]; ], allocation_status[no_attempt]]]"
        },
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "the shard cannot be allocated to the same node on which a copy of the shard already exists [[denorm][14], node[9ltF-KXGRk-xMF_Ef1DAng], [P], s[STARTED], a[id=SNyCoFUzSwaiIE4187Tfig]]"
        }
      ]
    },

Note the error message: CircuitBreakingException[[parent] Data too large, data for [<transport_request>] would be [4480371114/4.1gb].
Also, for high-availability reasons Elasticsearch never allocates a primary shard and its replica on the same node, so try to fix the circuit-breaker exception by tuning its settings.
The default number of allocation retries is only 5; once you have resolved the underlying issue, you can retry allocation with the following command.
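As a sketch of what tuning the breaker could look like (the 80% value here is purely illustrative, not a recommendation; giving the nodes more heap is usually the better long-term fix):

```shell
# Sketch: adjust the parent circuit-breaker limit dynamically via the
# cluster settings API. indices.breaker.total.limit defaults to 95% of
# heap when real-memory accounting is on, 70% when it is off.
curl -X PUT "localhost:9200/_cluster/settings" -H 'Content-Type: application/json' -d'
{
  "transient": {
    "indices.breaker.total.limit": "80%"
  }
}'
```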

curl -X POST 'localhost:9200/_cluster/reroute?retry_failed=true'

After running the command above, if some shards still fail to allocate, you may have to reroute them manually using the reroute API.
For more background and detailed reading, please follow this link.
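For this particular shard, a manual reroute might look like the following sketch (index, shard, and node names are taken from the explain output above; pick a target node that does not already hold the primary):

```shell
# Sketch: explicitly allocate the unassigned replica [denorm][14] to
# platform1 using the reroute API's allocate_replica command.
curl -X POST "localhost:9200/_cluster/reroute" -H 'Content-Type: application/json' -d'
{
  "commands": [
    {
      "allocate_replica": {
        "index": "denorm",
        "shard": 14,
        "node": "platform1"
      }
    }
  ]
}'
```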
