HDFS - How to recover corrupted HDFS metadata in Hadoop 1.2.X?

Datetime: 2016-08-23 01:47:54          Topic: HDFS, Hadoop

You might be running Hadoop in production, sometimes with terabytes of data residing in HDFS. The HDFS metadata can get corrupted, and when that happens the NameNode won't start. When you check the NameNode logs you might see exceptions like this:

ERROR org.apache.hadoop.dfs.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:106)
    at org.apache.hadoop.io.ArrayWritable.readFields(ArrayWritable.java:90)
    at org.apache.hadoop.dfs.FSEditLog.loadFSEdits(FSEditLog.java:433)
    at org.apache.hadoop.dfs.FSImage.loadFSEdits(FSImage.java:759)
    at org.apache.hadoop.dfs.FSImage.loadFSImage(FSImage.java:639)
    at org.apache.hadoop.dfs.FSImage.recoverTransitionRead(FSImage.java:222)
    at org.apache.hadoop.dfs.FSDirectory.loadFSImage(FSDirectory.java:79)
    at org.apache.hadoop.dfs.FSNamesystem.initialize(FSNamesystem.java:254)
    at org.apache.hadoop.dfs.FSNamesystem.<init>(FSNamesystem.java:235)
    at org.apache.hadoop.dfs.NameNode.initialize(NameNode.java:131)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:176)
    at org.apache.hadoop.dfs.NameNode.<init>(NameNode.java:162)
    at org.apache.hadoop.dfs.NameNode.createNameNode(NameNode.java:846)
    at org.apache.hadoop.dfs.NameNode.main(NameNode.java:855)
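If you want to see this for yourself, look at the NameNode log directly. A minimal sketch, assuming a default Hadoop 1.x layout under /usr/local/hadoop and the usual hadoop-&lt;user&gt;-namenode-&lt;host&gt;.log naming (both are assumptions; adjust to your installation):

# Show recent NameNode log entries (log path is an assumption; adjust as needed)
tail -n 200 /usr/local/hadoop/logs/hadoop-*-namenode-*.log

# Narrow the output to the exceptions thrown while loading the image/edit log
grep -n -A 5 "EOFException" /usr/local/hadoop/logs/hadoop-*-namenode-*.log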

If you are working in a development environment, you can always format HDFS and continue. This blog post even suggests that - https://autofei.wordpress.com/2011/03/27/hadoop-namenode-failed-and-reset/
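For reference, that reset on a development cluster boils down to formatting the NameNode, roughly like the sketch below (it assumes the Hadoop 1.x scripts are on your PATH, and it is destructive):

# DEVELOPMENT ONLY - formatting wipes the HDFS namespace and all file metadata
stop-all.sh                # stop the cluster
hadoop namenode -format    # answer "Y" when prompted
start-all.sh               # start with an empty HDFS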

BUT IF YOU FORMAT HDFS YOU LOSE ALL THE FILES IN HDFS!!!

So Hadoop administrators can't simply format HDFS in production. Instead, you can recover HDFS to its last checkpoint. You might lose some data files written after that checkpoint, but more than 90% of the data might still be safe. Let's see how to recover corrupted HDFS metadata.
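How much you lose depends on how recently the SecondaryNameNode took a checkpoint. In Hadoop 1.x this is controlled by fs.checkpoint.period (seconds, default 3600) and fs.checkpoint.size; you can check whether your cluster overrides them (the config path below is an assumption):

# Show any checkpoint-related overrides; if nothing is set, the defaults apply
grep -B 1 -A 2 "fs.checkpoint" /usr/local/hadoop/conf/core-site.xml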

Hadoop creates checkpoints periodically in the NameNode directory. You might see three folders in the NameNode directory:

  1. current
  2. image
  3. previous.checkpoint

The current folder is most probably the corrupted one. The recovery steps are:
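If you are not sure where the NameNode directory is, on Hadoop 1.x it is the location configured as dfs.name.dir in hdfs-site.xml. A quick sketch to find and inspect it (the config path is an assumption; &lt;namenode.dir&gt; stands for the value you find):

# Find the configured NameNode metadata directory
grep -B 1 -A 2 "dfs.name.dir" /usr/local/hadoop/conf/hdfs-site.xml

# List its contents; you should see current, image and previous.checkpoint
ls -l <namenode.dir>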

  • Stop all the Hadoop services on all the nodes.
  • Back up both the "current" and "previous.checkpoint" directories.
  • Delete the "current" directory.
  • Rename "previous.checkpoint" to "current".
  • Restart the Hadoop services.

The steps I followed are listed above. The commands below are what I ran to recover HDFS; they might change slightly depending on your installation.

/usr/local/hadoop/stop-all.sh        # stop all Hadoop services

cd <namenode.dir>                    # go to the NameNode metadata directory

cp -r current current.old                            # back up the corrupted metadata
cp -r previous.checkpoint previous.checkpoint.old    # back up the last checkpoint

rm -rf current                       # delete the corrupted "current" directory
mv previous.checkpoint current       # promote the last checkpoint to "current"

/usr/local/hadoop/start-all.sh       # restart Hadoop services
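Once the NameNode is back up, it is worth verifying what state the file system is in. A quick check using standard Hadoop 1.x commands (run as your HDFS user; adjust paths to your installation):

# Confirm the DataNodes have reported in and the NameNode has left safe mode
hadoop dfsadmin -report
hadoop dfsadmin -safemode get

# Run a file system check to see which files, if any, were lost or corrupted
hadoop fsck / | tail -n 20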

That's all!!!! Everything was okay after that!




