Setting up Hadoop HDFS in Pseudodistributed Mode

I'm new to the big data world.

Following Appendix A in the book Hadoop: The Definitive Guide, 4th Ed., I just got it to work. I'm running Ubuntu 14.04.

1. Download and unpack the Hadoop package, and set environment variables in your ~/.bashrc:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=~/hadoop-2.5.2
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Verify with:

# hadoop version
Hadoop 2.5.2
Subversion -r cc72e9b000545b86b75a61f4835eb86d57bfafc0
Compiled by jenkins on 2014-11-14T23:45Z
Compiled with protoc 2.5.0
From source with checksum df7537a4faa4658983d397abf4514320
This command was run using /home/gonwan/hadoop-2.5.2/share/hadoop/common/hadoop-common-2.5.2.jar

The 2.5.2 distribution package is built for 64-bit systems. Use the 2.4.1 package if you are running a 32-bit OS.
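The choice between the two packages can be scripted. This is only a sketch: the function name pick_hadoop_version is made up for illustration, and the mapping (64 to 2.5.2, 32 to 2.4.1) simply restates the note above.

```shell
# Map the OS word size to the Hadoop release suggested above.
# pick_hadoop_version is a hypothetical helper, not part of Hadoop.
pick_hadoop_version() {
    case "$1" in
        64) echo "2.5.2" ;;
        32) echo "2.4.1" ;;
        *)  echo "unknown word size: $1" >&2; return 1 ;;
    esac
}

# Usage: getconf LONG_BIT prints the userland word size (32 or 64) on Linux.
#   pick_hadoop_version "$(getconf LONG_BIT)"
```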

2. Edit the config files in $HADOOP_HOME/etc/hadoop.
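The file contents are not reproduced here, so the following is a sketch of what a minimal HDFS-only setup usually looks like. It assumes the common single-node values (fs.defaultFS pointing at localhost in core-site.xml, dfs.replication set to 1 in hdfs-site.xml); write_hdfs_conf is a hypothetical helper name.

```shell
# Write minimal pseudodistributed HDFS settings into a config directory.
# Assumption: NameNode on localhost and a replication factor of 1.
write_hdfs_conf() {
    conf_dir="$1"
    mkdir -p "$conf_dir"

    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost/</value>
  </property>
</configuration>
EOF

    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}

# Usage (overwrites existing files):
#   write_hdfs_conf "$HADOOP_HOME/etc/hadoop"
```

With a single data node, a replication factor above 1 can never be satisfied, which is why 1 is the usual choice here.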

3. Configure SSH: Hadoop starts daemons on the hosts of a cluster over SSH connections. Generate a key pair with an empty passphrase so that no password input is needed.

# ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Verify with:

# ssh localhost
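Running the cat command twice leaves a duplicate entry in authorized_keys. A small sketch of an idempotent variant follows; authorize_key is a hypothetical helper name, and the chmod assumes sshd is running with its default strict permission checks.

```shell
# Append a public key to an authorized_keys file only if it is not there yet,
# then restrict the file to the owner.
authorize_key() {
    pub_key="$1"
    auth_file="$2"
    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"
    # -x: whole-line match, -F: fixed string, -f: read the pattern from the key file
    if ! grep -qxFf "$pub_key" "$auth_file"; then
        cat "$pub_key" >> "$auth_file"
    fi
    chmod 600 "$auth_file"
}

# Usage:
#   authorize_key ~/.ssh/id_rsa.pub ~/.ssh/authorized_keys
#   ssh -o BatchMode=yes localhost true   # fails instead of prompting for a password
```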

4. Format the HDFS filesystem:

# hdfs namenode -format

5. Start HDFS:

# start-dfs.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /home/gonwan/hadoop-2.5.2/logs/hadoop-gonwan-namenode-gonwan-mate17.out
localhost: starting datanode, logging to /home/gonwan/hadoop-2.5.2/logs/hadoop-gonwan-datanode-gonwan-mate17.out
Starting secondary namenodes []
starting secondarynamenode, logging to /home/gonwan/hadoop-2.5.2/logs/hadoop-gonwan-secondarynamenode-gonwan-mate17.out

Verify that the daemons are running with the jps command:

# jps
2535 NameNode
2643 DataNode
2931 Jps
2828 SecondaryNameNode
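The same check can be automated. This is a sketch only: check_hdfs_daemons is a hypothetical helper that reads jps-style output ("pid name" per line) on stdin and reports which of the three HDFS daemons shown above are absent.

```shell
# Report any expected HDFS daemon missing from jps output read on stdin.
# Returns non-zero if at least one daemon is absent.
check_hdfs_daemons() {
    jps_output="$(cat)"
    missing=0
    for daemon in NameNode DataNode SecondaryNameNode; do
        # Compare the second column exactly, so NameNode does not
        # match SecondaryNameNode.
        if ! printf '%s\n' "$jps_output" |
            awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }'; then
            echo "missing: $daemon"
            missing=1
        fi
    done
    return "$missing"
}

# Usage:
#   jps | check_hdfs_daemons && echo "HDFS is up"
```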

6. Some tests:

# hadoop fs -ls /
# hadoop fs -mkdir /test
# hadoop fs -put ~/.bashrc /
# hadoop fs -ls /
Found 2 items
-rw-r--r--   1 gonwan supergroup        215 2016-04-19 16:07 /.bashrc
drwxr-xr-x   - gonwan supergroup          0 2016-04-19 16:06 /test
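Listing the file shows that it exists, but not that its contents survived the trip. A sketch of a round-trip check, using only the hadoop fs -put and hadoop fs -cat subcommands; hdfs_roundtrip is a hypothetical helper name.

```shell
# Upload a local file to HDFS, read it back, and compare byte for byte.
# Returns 0 only if the upload succeeds and the contents match.
hdfs_roundtrip() {
    local_file="$1"
    remote_path="$2"
    hadoop fs -put "$local_file" "$remote_path" || return 1
    # cmp reads the HDFS copy from stdin ("-") and compares it to the local file.
    hadoop fs -cat "$remote_path" | cmp -s - "$local_file"
}

# Usage (the remote path must not exist yet, or -put fails):
#   hdfs_roundtrip ~/.bashrc /test/bashrc-copy && echo "contents match"
```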

7. Stop HDFS:

# stop-dfs.sh

8. You can also set HADOOP_CONF_DIR to use a separate config directory for convenience:

# export HADOOP_CONF_DIR=~/github/hadoop-book/conf/hadoop/pseudo-distributed

NOTE: Make sure to also copy $HADOOP_HOME/etc/hadoop/slaves into that directory, or you will get an error when starting the data node, like:

cat: /home/gonwan/github/hadoop-book/conf/hadoop/pseudo-distributed/slaves: No such file or directory
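A small guard avoids this error. It is a sketch: ensure_slaves is a hypothetical helper that copies the slaves file from the default config directory only when the custom directory lacks one.

```shell
# Copy the slaves file into a custom config directory if it is missing there.
# An existing slaves file in the custom directory is left untouched.
ensure_slaves() {
    default_conf="$1"
    custom_conf="$2"
    if [ ! -f "$custom_conf/slaves" ]; then
        cp "$default_conf/slaves" "$custom_conf/slaves"
    fi
}

# Usage:
#   ensure_slaves "$HADOOP_HOME/etc/hadoop" "$HADOOP_CONF_DIR"
```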