群集计算—jobmanager不会自动将所有请求重定向到剩余/正在运行的taskmanager

vmpqdwk3  于 2021-06-21  发布在  Flink
关注(0)|答案(1)|浏览(383)

问题描述

2台计算机(203204)
创建了独立模式ha-flinkv1.6.1集群
在每台计算机上都运行jobmanager和taskmanager(2个任务槽)
开始作业后(示例socketwindowwordcount.jar) ./flink run ../examples/streaming/SocketWindowWordCount.jar --hostname 10.1.2.9 --port 9000 )在jobmanager节点上,我杀死了正在工作的taskmanager示例。
我可以看到作业被取消,然后失败。web Jmeter 板图像

state.backend: filesystem
state.checkpoints.dir: hdfs://10.1.2.109:8020/wulin/flink-checkpoints
rest.port: 9081
blob.server.port: 6124
query.server.port: 6125
web.tmpdir: /home/flink/deploy/webTmp
web.log.path: /home/flink/deploy/log
io.tmp.dirs: /home/flink/deploy/taskManagerTmp
high-availability: zookeeper
high-availability.zookeeper.quorum: 10.0.1.79:2181
high-availability.zookeeper.path.root: /flink
high-availability.cluster-id: flink
high-availability.storageDir: hdfs://10.1.2.109:8020/wulin
security.kerberos.login.principal: xxxx
security.kerberos.login.keytab: /home/ctu/flink/flink-1.6/conf/user.keytab

完整日志

log-STANDALONESSESSION-203日志
日志-taskexecutor-203
log-STANDALONESSESSION-204日志

例外情况

杀了工作tm,得到这样的豁免

2018-12-28 11:04:27,877 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,660 WARN  akka.remote.transport.netty.NettyTransport                    - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hz203/10.0.0.203:42861
2018-12-28 11:04:28,660 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@hz203:42861] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@hz203:42861]] Caused by: [Connection refused: hz203/10.0.0.203:42861]
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f timed out.
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Closing TaskExecutor connection 0f41bca09600cd25000e19801076fa1f because: The heartbeat of TaskManager with id 0f41bca09600cd25000e19801076fa1f  timed out.
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Unregister TaskManager dcf3bb5b7ed2208cf45b658d212fd8d2 from the SlotManager.
2018-12-28 11:04:28,678 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (88aa62ad152f4df6b39a969dd32c0249) switched from RUNNING to FAILED.
org.apache.flink.util.FlinkException: The assigned slot 0f41bca09600cd25000e19801076fa1f_0 was removed.
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlot(SlotManager.java:786)
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.removeSlots(SlotManager.java:756)
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.internalUnregisterTaskManager(SlotManager.java:948)
        at org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager.unregisterTaskManager(SlotManager.java:372)
        at org.apache.flink.runtime.resourcemanager.ResourceManager.closeTaskManagerConnection(ResourceManager.java:803)
        at org.apache.flink.runtime.resourcemanager.ResourceManager$TaskManagerHeartbeatListener$1.run(ResourceManager.java:1116)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:70)
        at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
        at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.onReceive(FencedAkkaRpcActor.java:40)
        at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
        at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
        at akka.actor.ActorCell.invoke(ActorCell.scala:495)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
        at akka.dispatch.Mailbox.run(Mailbox.scala:224)
        at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2018-12-28 11:04:28,680 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Socket Window WordCount (61f55876e79934d515c163d095d706a6) switched from state RUNNING to FAILING.

提交作业

./bin/flink run -d ./examples/streaming/SocketWindowWordCount.jar --port 9000 --hostname 10.1.2.9 ,像这样获取jm日志

2018-12-28 19:20:01,354 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Starting execution of job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291)
2018-12-28 19:20:01,354 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job Socket Window WordCount (5cdb91c15ee12ec6e74256eed10b5291) switched from state CREATED to RUNNING.
2018-12-28 19:20:01,356 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,359 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from CREATED to SCHEDULED.
2018-12-28 19:20:01,364 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{e33a40832a3922897470fb76bcf76b29}]
2018-12-28 19:20:01,367 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Connecting to ResourceManager akka.tcp://flink@hz203:46596/user/resourcemanager(b22f96303e74df23645fe4567f884b9e)
2018-12-28 19:20:01,370 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Resolved ResourceManager address, beginning registration
2018-12-28 19:20:01,370 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - Registration at ResourceManager attempt 1 (timeout=100ms)
2018-12-28 19:20:01,371 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService /leader/5cdb91c15ee12ec6e74256eed10b5291/job_manager_lock.
2018-12-28 19:20:01,371 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,431 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered job manager 9a31e8b4e8dfbf7b31d6ed3d227648b6@akka.tcp://flink@hz203:46596/user/jobmanager_0 for job 5cdb91c15ee12ec6e74256eed10b5291.
2018-12-28 19:20:01,432 INFO  org.apache.flink.runtime.jobmaster.JobMaster                  - JobManager successfully registered at ResourceManager, leader id: b22f96303e74df23645fe4567f884b9e.
2018-12-28 19:20:01,433 INFO  org.apache.flink.runtime.jobmaster.slotpool.SlotPool          - Requesting new slot [SlotRequestId{e33a40832a3922897470fb76bcf76b29}] and profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} from resource manager.
2018-12-28 19:20:01,434 INFO  org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - Request slot with profile ResourceProfile{cpuCores=-1.0, heapMemoryInMB=-1, directMemoryInMB=0, nativeMemoryInMB=0, networkMemoryInMB=0} for job 5cdb91c15ee12ec6e74256eed10b5291 with allocation id AllocationID{f7a24e609e2ec618ccb456076049fa3b}.
2018-12-28 19:20:01,510 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,511 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Source: Socket Stream -> Flat Map (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,515 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from SCHEDULED to DEPLOYING.
2018-12-28 19:20:01,515 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Deploying Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (attempt #0) to hz203
2018-12-28 19:20:01,674 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Window(TumblingProcessingTimeWindows(5000), ProcessingTimeTrigger, ReduceFunction$1, PassThroughWindowFunction) -> Sink: Print to Std. Out (1/1) (102d04f5aa6fc50cfe5088e20902c72e) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:01,708 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Source: Socket Stream -> Flat Map (1/1) (e30439b9f548c6013d8b8689e30d0dd7) switched from DEPLOYING to RUNNING.
2018-12-28 19:20:43,267 INFO  org.apache.flink.runtime.blob.BlobClient                      - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-513fbe1e6ddf69d10689eccf4c65da97 from hz203/10.0.0.203:6124
2018-12-28 19:20:48,339 INFO  org.apache.flink.runtime.blob.BlobClient                      - Downloading null/t-dd915bb9821ff6ced34dd5e489966b674de5a48f-7ea2600930e5fc5a4fbb7d47ee198789 from hz203/10.0.0.203:6124
2018-12-28 19:20:52,623 INFO  org.apache.flink.runtime.blob.BlobClient                      - Downloading null/t-61808afb630553305c73a0a23f9231ffd6b2b448-0bd1ab86fa4cc54daeb472079bfbea8c from hz203/10.0.0.203:6124

杀死tm

正文限制为30000个字符。杀死tm时请阅读jm日志

aiazj4mn

aiazj4mn1#

日志显示你的 RestartStrategy 已经耗尽了它的重启尝试或者没有 RestartStrategy 已配置。请检查您是否指定了 RestartStrategy 在您的程序中通过 env.setRestartStrategy(RestartStrategies.fixedDelayRestart(10, 0L)) 或在 flink-conf.yaml 通过 restart-strategy: fixed-delay . 如果您想了解更多关于flink的重启策略,请查看文档。

相关问题