incubator-doris [BUG Report] Schema change job may fail when tablet balance happens

q43xntqr  posted on 2022-04-22 in Java

Describe the bug

The alter table job failed:

JobId: 14779188
    TableName: xxxxx
   CreateTime: 2020-07-28 15:39:14
   FinishTime: 2020-07-28 15:45:09
    IndexName: xxxxx
      IndexId: 14779189
OriginIndexId: 14465552
SchemaVersion: 3:803621913
TransactionId: 1006404
        State: CANCELLED
          Msg: errCode = 2, detailMessage = schema change task failed after try three times: task type: ALTER, status_code: RUNTIME_ERROR, backendId: Backend [id=10004, host=xxxxx, heartbeatPort=9050, alive=true], signature: 14782162
     Progress: N/A
      Timeout: 86400

The alter job was submitted at '15:39:14'.

Problem Location

1 log in BE 10004

I0728 15:43:21.105201 22930 task_worker_pool.cpp:451] get alter table task, signature: 14782162
I0728 15:43:21.105213 22930 schema_change.cpp:1203] begin to do request alter tablet: base_tablet_id=14468525, base_schema_hash=825165665, new_tablet_id=14782162, new_schema_hash=803621913, alter_version=1, alter_version_hash=0
W0728 15:43:21.105237 22930 tablet_manager.cpp:1109] tablet does not exists. tablet_id=14468525
W0728 15:43:21.113729 22930 engine_alter_tablet_task.cpp:42] failed to do alter task. res=-216 base_tablet_id=14468525, base_schema_hash=825165665, new_tablet_id=14782162, new_schema_hash=803621913

1.1 Doris wants to create a new tablet (id=14782162) based on the base tablet (id=14468525)
1.2 The base tablet does not exist on this BE, so the alter task fails
1.3 The failure time is '15:43:21'

2 log in FE
2.1 The clone task for tablet 14468525 begins at '15:39:12'

2020-07-28 15:39:12,285 INFO 40 [TabletScheduler.schedulePendingTablets():412] add clone task to agent task queue: tablet id: 14468525, schema hash: 825165665, storageMedium: HDD, visible version(hash): 1-0, src backend: xxx, src path hash: 796686472853631190, dest backend: 3632012, dest path hash: 8397208614267550439
2020-07-28 15:39:12,314 INFO 826360 [TabletSchedCtx.finishCloneTask():871] clone finished: tablet id: 14468525, status: HEALTHY, state: FINISHED, type: BALANCE. from backend: 3632016, src path hash: 796686472853631190. to backend: 3632012, dest path hash: 8397208614267550439

2.2 At '15:39:38', the replica of tablet 14468525 on BE 10004 is deleted from the FE's catalog

2020-07-28 15:39:38,910 INFO 40 [TabletScheduler.deleteReplicaInternal():911] delete replica. tablet id: 14468525, backend id: 10004. reason: DECOMMISSION state, force: false

2.3 The alter job starts running at '15:39:50'

2020-07-28 15:39:36,880 WARN 28 [AlterJobV2.checkTableStable():196] wait table 13908258 to be stable before doing SCHEMA_CHANGE job
2020-07-28 15:39:50,260 INFO 28 [SchemaChangeJobV2.runPendingJob():309] transfer schema change job 14779188 state to WAITING_TXN, watershed txn id: 1006404

Clearly, even after a replica has been deleted from the FE's catalog, the AlterJob still sends a request to a stale BE that no longer contains the wanted tablet.
The main reason is that the AlterJob's partitionIndexMap is generated when the job is created, and partitionIndexMap contains BE and replica info.
But the BE and replica info may change while the job stays in the pending state.

Solution

I think the best solution is to generate shadowReplica when SchemaChangeJobV2.runPendingJob executes, rather than when the job is created.
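
To illustrate the difference, here is a minimal sketch in plain Java. It is not Doris FE code: Catalog, EagerJob, DeferredJob and the backend ids 10005 and 10006 are hypothetical names made up for this example. It only shows why a replica list captured at job creation can go stale, while a list resolved at run time reflects the balance that happened in between.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model. None of these classes exist in Doris.
class StaleReplicaDemo {

    // Toy catalog: tablet id -> ids of the backends that currently hold a replica.
    static class Catalog {
        private final Map<Long, Set<Long>> replicas = new HashMap<>();

        void addReplica(long tabletId, long backendId) {
            replicas.computeIfAbsent(tabletId, k -> new HashSet<>()).add(backendId);
        }

        void deleteReplica(long tabletId, long backendId) {
            Set<Long> backends = replicas.get(tabletId);
            if (backends != null) {
                backends.remove(backendId);
            }
        }

        // Returns a copy, so the caller holds a snapshot of the current state.
        List<Long> backendsOf(long tabletId) {
            return new ArrayList<>(replicas.getOrDefault(tabletId, new HashSet<>()));
        }
    }

    // Eager variant: target backends are captured when the job is created.
    static class EagerJob {
        private final List<Long> targets;

        EagerJob(Catalog catalog, long baseTabletId) {
            this.targets = catalog.backendsOf(baseTabletId);
        }

        List<Long> runPendingJob() {
            return targets; // may be stale by the time the job runs
        }
    }

    // Deferred variant: target backends are resolved only when the job runs.
    static class DeferredJob {
        private final Catalog catalog;
        private final long baseTabletId;

        DeferredJob(Catalog catalog, long baseTabletId) {
            this.catalog = catalog;
            this.baseTabletId = baseTabletId;
        }

        List<Long> runPendingJob() {
            return catalog.backendsOf(baseTabletId);
        }
    }

    public static void main(String[] args) {
        Catalog catalog = new Catalog();
        long baseTablet = 14468525L;
        catalog.addReplica(baseTablet, 10004L);
        catalog.addReplica(baseTablet, 10005L); // hypothetical second backend

        // Both jobs are created while the replica is still on backend 10004.
        EagerJob eager = new EagerJob(catalog, baseTablet);
        DeferredJob deferred = new DeferredJob(catalog, baseTablet);

        // While the jobs are pending, balance clones the replica to another
        // backend and deletes it from 10004.
        catalog.addReplica(baseTablet, 10006L); // hypothetical destination
        catalog.deleteReplica(baseTablet, 10004L);

        System.out.println("eager job still targets 10004: "
                + eager.runPendingJob().contains(10004L));
        System.out.println("deferred job still targets 10004: "
                + deferred.runPendingJob().contains(10004L));
    }
}
```

In this toy model the eager job still targets backend 10004 after the replica is gone, which matches the failure in the BE log above, while the deferred job only sees the backends that hold the tablet at run time. A real fix would also have to handle replicas that move after runPendingJob has sent its tasks, which this sketch does not cover.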
