Zookeeper: why doesn't the datanode disappear from the Hadoop web UI when the datanode process is killed?

raogr8fs · posted 2022-12-09 in Apache

I have a 3-node HA cluster in a CentOS 8 VM. I am using ZK 3.7.0 and Hadoop 3.3.1. The cluster has 2 namenodes: node1 is the active namenode and node2 is the standby namenode in case node1 fails. The third node is the datanode. I start everything with the command:

start-dfs.sh

On node1 the following processes were running: NameNode, Jps, QuorumPeerMain and JournalNode. On node2 the following processes were running: NameNode, Jps, QuorumPeerMain, JournalNode and DataNode.
My hdfs-site.xml configuration is the following:

    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/datos/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/datos/datanode</value>
    </property>
    <property>
        <name>dfs.nameservices</name>
        <value>ha-cluster</value>   
    </property>
    <property>
        <name>dfs.ha.namenodes.ha-cluster</name>
        <value>nodo1,nodo2</value>
    </property>
    <property>
        <name>dfs.permissions</name>
        <value>false</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ha-cluster.nodo1</name>
        <value>nodo1:8020</value>
    </property>
    <property>
        <name>dfs.namenode.rpc-address.ha-cluster.nodo2</name>
        <value>nodo2:8020</value>
    </property>
    <property>
        <name>dfs.namenode.http-address.ha-cluster.nodo1</name>
        <value>nodo1:9870</value>
    </property> 
    <property>
        <name>dfs.namenode.http-address.ha-cluster.nodo2</name>
        <value>nodo2:9870</value>
    </property>
    <property>
        <name>dfs.namenode.shared.edits.dir</name>
        <value>qjournal://nodo3:8485;nodo2:8485;nodo1:8485/ha-cluster</value>
    </property>

The problem is that, since node2 is the standby namenode, I didn't want it to have the DataNode process running, so I killed it. I used kill -9 (I know it's not the best way; I should have used hdfs --daemon stop datanode, a minimal sketch of which is shown after the table below). Then I opened the Hadoop web UI to check how many datanodes I had. On the web UI of node1 (the active namenode), the Datanodes section showed only 1 datanode: node3. But on the web UI of node2 (the standby namenode) it looked like this:

In case you can't see the image:

Node                                             Http Address         Last contact
/default-rack/nodo2:9866 (192.168.0.102:9866)    http://nodo2:9864    558s
/default-rack/nodo3:9866 (192.168.0.103:9866)    http://nodo3:9864    1s

The node2 datanode has had no contact for 558s, yet it is still not marked as dead. Does anybody know why this happens?
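For reference, the graceful alternative to kill -9 mentioned above would look roughly like this (a sketch, assuming it is run as the HDFS user on nodo2 with the Hadoop binaries on the PATH):

# on nodo2: stop the DataNode daemon gracefully instead of kill -9
hdfs --daemon stop datanode

# confirm the process is gone
jps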

dphi5xsq


In your hdfs-site.xml, check the values of the following:

  • dfs.heartbeat.interval (determines the datanode heartbeat interval, in seconds)
  • dfs.namenode.heartbeat.recheck-interval (determines how often the namenode checks for expired datanodes; together with dfs.heartbeat.interval it also determines the timeout after which a datanode is considered stale. This value is in milliseconds.)

See here for the default values and more information: https://hadoop.apache.org/docs/r2.7.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml
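For reference, the documented defaults are 3 (seconds) for dfs.heartbeat.interval and 300000 (milliseconds, i.e. 5 minutes) for dfs.namenode.heartbeat.recheck-interval. Spelled out explicitly in hdfs-site.xml they would look like this (a sketch showing the default values, not a recommended change):

    <!-- datanode heartbeat interval, in seconds (default: 3) -->
    <property>
        <name>dfs.heartbeat.interval</name>
        <value>3</value>
    </property>
    <!-- interval at which the namenode rechecks for expired datanodes,
         in milliseconds (default: 300000 = 5 minutes) -->
    <property>
        <name>dfs.namenode.heartbeat.recheck-interval</name>
        <value>300000</value>
    </property>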
There is a formula that determines when a node is declared dead:

2 * dfs.namenode.heartbeat.recheck-interval + 10 * (1000 * dfs.heartbeat.interval)

That is:

2 * 300000 + 10 * (1000 * 3) = 630000 milliseconds = 10 minutes 30 seconds, or **630 seconds**.

Source: Hadoop 2.x Administration Guide (Packt), "Configuring datanode heartbeat":

Datanode removal time = (2 x dfs.namenode.heartbeat.recheck-interval) + (10 x dfs.heartbeat.interval)
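In other words, with the default values the namenode keeps listing a silent datanode as alive for 10 minutes 30 seconds before marking it dead, so the 558s of last contact shown for nodo2 in the question is simply still below that threshold. Purely as an illustration (15000 is an assumed value, not a recommendation), lowering dfs.namenode.heartbeat.recheck-interval to 15000 while keeping dfs.heartbeat.interval at 3 would give:

2 * 15000 + 10 * (1000 * 3) = 60000 milliseconds = 60 seconds

after which the killed datanode would be reported as dead.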
