Hadoop – HDFS

In Hadoop, Tutorial on 23/02/2012 by pier0w

So you want to have a go at playing with Hadoop, but are a little apprehensive because it just looks so complicated and hard to set up. Well DON’T BE! It’s super easy, and I’ll prove it by walking through the simple setup of a fully working HDFS cluster below.

To get started let’s first explain the HDFS structure. It’s a very simple master/slave architecture. There is a master, the Namenode, that manages all the slaves, the Datanodes. See the super duper diagram below:

           Namenode
          /        \
Datanode One     Datanode Two

The Namenode manages the distribution of the data within HDFS and the Datanodes just store the data. That is HDFS put very, very simply. As you can see, it doesn’t provide the best redundancy because the Namenode is a single point of failure. There is a Secondary Namenode that can be used in a time of emergency, but how to do that is WAY out of the scope of this tutorial. HDFS may not have the most resilient architecture, but it is at least easy to understand.

So how does one set one’s self up the HDFS? Well very easily actually…

Step One: Create a Namenode
To do this get yourself a server; an Ubuntu Server VM will do nicely. Make sure the VM is running within a host-only network so that it can talk to your host machine and the Datanode VMs.

When using unix it is always best practice to create a user for the task at hand, so let’s create a hadoop user and use that from now on. Super users are dangeresque.

#> sudo useradd -m -G users -s /bin/bash hadoop
#> sudo passwd hadoop
# Set the password to whatever you like.
#> su - hadoop

Now make sure that the server is running an SSH server. A good way to check is to try to log in to localhost.

#> ssh localhost

If that doesn’t succeed then install the ssh server.

#> sudo apt-get install openssh-server

Now Hadoop uses ssh to run its commands, even when they are on the local machine. Because of this you have to make sure that you can ssh into localhost without being prompted for a password. To do this create an ssh key and add it to your .ssh/authorized_keys2 file.

#> ssh-keygen
# Just press ENTER for all the input requests.
#> cat .ssh/id_rsa.pub >> .ssh/authorized_keys2

That’s it, you should now be able to ssh to localhost without having to type in your password.

The next thing to do is to make sure your server has a static IP address, so set the host-only network interface to a static IP within the host-only network.

#> sudo vim /etc/network/interfaces
# Host Only Network
auto eth0
iface eth0 inet static
    # Example values: use an address from your own host-only subnet.
    address 192.168.56.101
    netmask 255.255.255.0

#> sudo /etc/init.d/networking restart
#> ifconfig
eth0      Link encap:Ethernet HWaddr 00:00:00:00:00:00
          inet addr:192.168.56.101 Bcast:192.168.56.255 Mask:255.255.255.0

Lastly, Hadoop is written in Java so a JDK will need to be installed on the server. It seems that it is now only possible to install OpenJDK on Ubuntu, but this is OK because Hadoop runs fine on that JVM.

#> sudo apt-get install openjdk-6-jdk

Now that the server is all ready, extract the Hadoop tarball somewhere onto the filesystem; /opt/ is as good a place as any. Then do all the configuration that is required for a working Namenode. That is, edit two files.
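For example, the extraction might look like the following. The tarball filename here is just an assumption, substitute whichever release you actually downloaded:

```shell
# Extract the tarball into /opt/ and hand it over to the hadoop user.
# hadoop-0.20.2.tar.gz is an example filename.
sudo tar -xzf hadoop-0.20.2.tar.gz -C /opt/
sudo mv /opt/hadoop-0.20.2 /opt/hadoop
sudo chown -R hadoop:hadoop /opt/hadoop
```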

First edit the conf/core-site.xml file and set the fs.default.name property to be the IP address that you gave the server, with a port of your choosing; 9000 seems to be a good default. You’d think that you’d be able to just set the property to localhost, but unfortunately this does not work. I’m guessing it is because the server will only accept requests for the provided URI, and nothing but the local machine can make a request to localhost.

#> vim /opt/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <!-- Example IP: use your Namenode's static IP. -->
    <value>hdfs://192.168.56.101:9000</value>
  </property>
</configuration>
Then lastly edit the conf/hadoop-env.sh file and uncomment and set the JAVA_HOME variable to the directory that contains the JDK files. On Ubuntu Server this is the /usr/lib/jvm/java-6-openjdk/ directory.

#> vim /opt/hadoop/conf/hadoop-env.sh
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

# Extra Java CLASSPATH elements. Optional.

And that’s it, you can now start a perfectly running Namenode.

#> /opt/hadoop/bin/hadoop namenode -format # The Namenode must be formatted before it can start up.
#> /opt/hadoop/bin/start-dfs.sh

To check that it is running, browse to the Namenode’s IP address on port 50070, e.g. http://192.168.56.101:50070/ (substituting your own static IP). This will take you to the Namenode’s monitoring page.
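If you’d rather stay in the terminal, the standard dfsadmin report gives you the same information, including the number of live Datanodes:

```shell
# Ask the Namenode for a cluster summary. Run this on the Namenode VM
# as the hadoop user.
/opt/hadoop/bin/hadoop dfsadmin -report
```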

Step Two: Create a Datanode

The preliminary setup of a Datanode VM is very simple: you just have to create the hadoop user and switch to it. Leave the host-only network interface as DHCP; the Datanodes do not require static IP addresses, and below you will see why it is actually really good if their IP addresses are allocated dynamically.

Now extract the same Hadoop tarball that you used on the Namenode into the /opt/ directory on the Datanode. Then all you have to do is configure the Datanode’s conf/core-site.xml file with the exact same configuration as the Namenode. By exact I mean character for character, to the point where, if you wanted to, you could just copy the Namenode’s conf/core-site.xml file over the top of the newly extracted one on the Datanode. The Datanode is now configured to be aware of the Namenode.
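Since you already have ssh working, one way to do that copy is with scp. The IP below is just the example Namenode address used earlier in this tutorial, swap in your own:

```shell
# Run on the Datanode as the hadoop user. 192.168.56.101 is an example
# Namenode IP, use the static IP you actually configured.
scp hadoop@192.168.56.101:/opt/hadoop/conf/core-site.xml \
    /opt/hadoop/conf/core-site.xml
```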

So I know what you’re now thinking: “Hold on a minute there. How does the Datanode know to act as a slave if it has the exact same configuration as the Namenode?”. Well, it does because we start it in a completely different way to the Namenode. Where the Namenode is started with the bin/start-dfs.sh script, we are going to start the Datanode with bin/hadoop-daemon.sh, as follows:

#> /opt/hadoop/bin/hadoop-daemon.sh --config /opt/hadoop/conf start datanode

That’s it, your new Datanode should now be up and running. You can confirm this by refreshing the Namenode’s monitoring page, where you should see that the Live Nodes number has increased from 0 to 1.
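With the cluster up, you can give it a quick smoke test by storing a file in HDFS and reading it back using the standard `hadoop fs` shell (`-put`, `-ls`, and `-cat`). Run this from the Namenode VM:

```shell
# Create a small local file, copy it into HDFS, then read it back out.
echo "hello hdfs" > /tmp/hello.txt
/opt/hadoop/bin/hadoop fs -put /tmp/hello.txt /hello.txt
/opt/hadoop/bin/hadoop fs -ls /
/opt/hadoop/bin/hadoop fs -cat /hello.txt
```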

Note: The Datanode could take a little while to register with the Namenode; you can check on its progress in the logs/hadoop-hadoop-datanode-*.log file.

So as you can see, all in all setting up Hadoop is a very easy process. You can now add extra Datanodes to your heart’s content by simply cloning your first Datanode and starting it up. Since you have left the host-only network interface as DHCP, each Datanode will automatically have its own unique IP address.

You may also notice, if you have read the Hadoop documentation, that I don’t mention anything about the conf/slaves file. This is because I think that a Hadoop cluster should be configured in an extensible way from the beginning, so I have shown you how to start a Datanode so that it registers itself with an existing Hadoop cluster, instead of listing the Datanodes within the Namenode’s conf/slaves file and having the Namenode start the Datanodes for you.

