Don't underestimate the number of blocks on your cluster.
Well we had a feeling that the cluster became heavier and slower since we passed the 700 K blocks per node threshold, but things kept working and we have been waited for the right time to fix it. Until the day the Namenode didn't start up because of JVM pause.
It all started with a human mistake that ended with hdfs restarting. The active namenode and the standby, in an unexpected behavior, were both in standby state and went down after several minutes. Our automatic failover is disabled and the manual failover didn't work out. We tried many combinations of restarting and stopping of the 2 namenodes, but the error log showed that the namenode (the one we wanted to activate) can't tell at what state it is (standby or active). Weird.
We enabled the automatic failover controller. Then we saw at its logs that it can't figure out the namenode service id ip (the one who should be active). Then we saw that the namenode service id that the failover controller mentioned is different than what we see on the hdfs-site.xml (at dfs.namenode.rpc-address confs) or in the zookeeper's server confs. Weird.
We didn't really know where this id comes from so we tried restarting the zookeeper that is responsible for the hadoop-ha, playing with the zookkeper /hadoop-ha directory,and deploying configurations, but nothing helped.
We decided to disable high availability. We installed a secondary namenode and started the hdfs. The namenode got started! But then it crushed! We saw this message: Detected pause in JVM or host machine (eg GC). This is bad.. what should we do?
We decided to go back to high availability mode. It all went good and at this time, there was an elected active namenode and a standby namenode. I guess the configurations got fixed somehow. The problem was that the jvm pause was still happening.
Nimrod ( https://www.linkedin.com/profile... ), suggested increasing the namenode jvm heap size. We found the configuration on the cloudera manager and saw a warning about a ratio between the number of blocks and the namenode heap size: 1 G per 1 M blocks. We had 6 M blocks, and 4 GB! After increasing the heap to 12 GB, the namenode got started and stayed up. Victory :)
There are many unsolved questions about what happend. Why couldn’t the namenode determined its state, and why the zookeeper, at the stage we moved to automatic failover, was not able to elect an active namenode. Why did the failover controller tried communicate with wrong namenode service id. We will look into the hdfs core-site.xml that we didn't check during the problems and will read more about the failover process (who elects the active in a manul state, where did the zookeeper took the namenodes ids).
But there is 1 thing we are certain about and it's the potential disaster of the too many blocks alerts.