Archive for the ‘Hadoop’ Category


Hadoop – HDFS – Extra

In Hadoop, Tutorial on 25/02/2012 by pier0w

This extra hadoop setup is for those of you who are a little bit more adventurous. It requires some very small edits to the hadoop startup scripts, so if you would rather just run a vanilla hadoop install then don’t bother reading the rest of this extra tutorial.

The goal of this tutorial is to have hadoop running in what some would consider a slightly cleaner way. That is it will run without having to rely on ssh.

To do this first run through the previous hadoop hdfs tutorial, but leave out anything to do with ssh and also don’t edit the conf/hadoop-env.sh file. Having the JAVA_HOME variable set in the conf/hadoop-env.sh file is only required when hadoop executes remote commands through ssh.

The only part of our hadoop cluster that requires ssh by default is the Namenode, so we will be editing the files on this server only.

Now since we have not set the JAVA_HOME variable in the conf/hadoop-env.sh file it will need to be set as a global environment variable for the server. On Ubuntu Server this can be done by exporting the environment variables in a Java specific script in the /etc/profile.d/ directory.

#> sudo vim /etc/profile.d/java.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JDK_HOME=/usr/lib/jvm/java-6-openjdk
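As a quick sanity check that the exports behave as expected, here's a sketch that simulates what a login shell does with a /etc/profile.d/ script. It uses a stand-in file in /tmp (an assumption, since writing the real file needs root):

```shell
# Stand-in for /etc/profile.d/java.sh (the real path needs root to write to).
cat > /tmp/java_profile_demo.sh <<'EOF'
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JDK_HOME=/usr/lib/jvm/java-6-openjdk
EOF

# /etc/profile.d/ scripts are sourced by login shells, so simulate that here.
sh -c '. /tmp/java_profile_demo.sh; echo "$JAVA_HOME"'
# → /usr/lib/jvm/java-6-openjdk
```

If the echo prints the JDK path, a fresh login shell (and therefore hadoop, started from one) will see JAVA_HOME.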

Once that is setup the first hadoop file to edit is bin/hadoop-daemon.sh. This script uses rsync to make sure that the hadoop files are on the server it is about to start. The reason it does this is so that you can have a central server that can start up hadoop daemons on remote servers that don’t actually have to have hadoop installed on them.

We are going to edit line 119 in this file and remove the text -e ssh, which forces rsync to run through ssh. The snippet below shows the line after the edit.

#> vim /opt/hadoop/bin/hadoop-daemon.sh
    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
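If you'd rather script the change than edit it by hand, a one-line sed does the job. The sketch below runs against a scratch copy in /tmp so it is safe to try anywhere; the stock line contents are an assumption based on the Hadoop 1.x hadoop-daemon.sh, so check your own copy before pointing sed at /opt/hadoop/bin/hadoop-daemon.sh:

```shell
# Scratch copy holding the rsync line as it ships in Hadoop 1.x (assumed).
cat > /tmp/hadoop_daemon_demo.sh <<'EOF'
      rsync -a -e ssh --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
EOF

# Strip the "-e ssh" option so rsync no longer tunnels through ssh.
sed -i 's/ -e ssh//' /tmp/hadoop_daemon_demo.sh

grep -o 'rsync -a --delete' /tmp/hadoop_daemon_demo.sh
# → rsync -a --delete
```

The same sed, aimed at the real script, produces exactly the line shown above.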

Next we are going to edit the bin/start-dfs.sh and bin/stop-dfs.sh scripts and make it so that the Secondary Namenode is started up in the same way as the primary Namenode: locally through bin/hadoop-daemon.sh rather than over ssh through bin/hadoop-daemons.sh.

#> vim /opt/hadoop/bin/start-dfs.sh
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/ --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/ --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/ --config $HADOOP_CONF_DIR start secondarynamenode

#> vim /opt/hadoop/bin/stop-dfs.sh
. "$bin"/hadoop-config.sh

"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop namenode
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR stop datanode
"$bin"/hadoop-daemon.sh --config $HADOOP_CONF_DIR stop secondarynamenode

Lastly we will remove any host names from the conf/slaves file, even localhost. This stops hadoop from trying to start a Datanode on the Namenode server.

#> vim /opt/hadoop/conf/slaves
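The same thing can be done from the command line by truncating the file. Sketched here on a scratch copy in /tmp (the real target is /opt/hadoop/conf/slaves):

```shell
# A freshly extracted slaves file typically contains just "localhost".
echo localhost > /tmp/slaves_demo

# Truncate it, so no Datanode is started on the Namenode server.
: > /tmp/slaves_demo

wc -c < /tmp/slaves_demo
# → 0
```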

And that’s it, you should now be able to start a Namenode on the server without having to have any special ssh setup.



Hadoop – HDFS

In Hadoop, Tutorial on 23/02/2012 by pier0w

So you want to have a go at playing with hadoop, but are a little apprehensive because it just looks so complicated and hard to setup. Well DON’T BE! It’s super easy and I’ll show this by first demonstrating below the simple setup for a fully working HDFS cluster.

To get started let’s first explain the HDFS structure. It’s a very simple master slave architecture. There is a master, the Namenode, that manages all the slaves, the Datanodes. See the super duper diagram below:

           Namenode
          /        \
Datanode One     Datanode Two

The Namenode manages the distribution of the data within HDFS and the Datanodes just store the data. That is HDFS put very very simply. As you can see it doesn’t provide the best redundancy because the Namenode is a single point of failure. There is a Secondary Namenode, but it only takes periodic checkpoints of the Namenode’s metadata to help recovery in a time of emergency, and the way to do that is WAY out of the scope of this tutorial. HDFS may not have the most resilient architecture, but it is at least easy to understand.

So how does one set one’s self up the HDFS? Well very easily actually…

Step One: Create a Namenode
To do this get yourself a server, an Ubuntu Server VM will do nicely. Make sure the VM is running within a host only network so that it can talk to your host machine and the Datanode VMs.

When using unix it is always best practice to create a user for the task at hand, so let’s create a hadoop user and use that from now on. Super users are dangeresque.

#> sudo useradd -m -G users -s /bin/bash hadoop
#> sudo passwd hadoop
# Set the password to whatever you like.
#> su - hadoop

Now make sure that the server is running an ssh server. A good way to check this is to log in to localhost.

#> ssh localhost

If that doesn’t succeed then install the ssh server.

#> sudo apt-get install openssh-server

Now hadoop uses ssh to run its commands even when they are on the local machine. Because of this you have to make sure that you can automatically ssh into localhost without the need for a password. To do this create an ssh key and add it to your .ssh/authorized_keys2 file.

#> ssh-keygen
# Just press ENTER for all the input requests.
#> cat .ssh/id_rsa.pub >> .ssh/authorized_keys2
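To see the whole key dance end to end without touching your real ~/.ssh, here's a sketch that runs in a scratch directory (the directory path and the -N "" flag, which gives the same empty passphrase as pressing ENTER, are the only things that differ from the real commands):

```shell
# Scratch directory so we don't disturb the real ~/.ssh.
rm -rf /tmp/hdfs_key_demo && mkdir -p /tmp/hdfs_key_demo

# Generate a key pair with an empty passphrase (-N ""); the tutorial's plain
# `ssh-keygen` achieves the same thing by pressing ENTER at every prompt.
ssh-keygen -q -t rsa -N "" -f /tmp/hdfs_key_demo/id_rsa

# Authorise the new public key, as done for ~/.ssh/authorized_keys2 above.
cat /tmp/hdfs_key_demo/id_rsa.pub >> /tmp/hdfs_key_demo/authorized_keys2
```

Because the private key has no passphrase, ssh can use it non-interactively, which is exactly what hadoop needs.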

That’s it, you should now be able to ssh to localhost without having to type in your password.

The next thing to do is make sure your server has a static ip address. So set the IP of the host only network interface to be a static IP within the host only network.

#> sudo vim /etc/network/interfaces
# Host Only Network
auto eth0
iface eth0 inet static
    address 192.168.56.101 # for example; any free address within your host only network
    netmask 255.255.255.0

#> sudo /etc/init.d/networking restart
#> ifconfig
eth0      Link encap:Ethernet  HWaddr 00:00:00:00:00:00
          inet addr:192.168.56.101  Bcast:192.168.56.255  Mask:255.255.255.0

Lastly hadoop is written in Java so this will need to be installed on the server. It seems that it is only possible now to install the OpenJDK on Ubuntu but this is ok because hadoop runs fine on that JVM.

#> sudo apt-get install openjdk-6-jdk

Now that the server is all ready extract the hadoop tarball somewhere onto the filesystem. /opt/ is as good a place as any. Then do all the configuration that is required for a working Namenode. That is, edit two files.
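The extract step looks like the sketch below. It is demonstrated with a throwaway archive built in /tmp so it is runnable anywhere; the real tarball name depends on the release you downloaded, so `hadoop-x.y.z.tar.gz` is a placeholder:

```shell
# Build a tiny stand-in tarball so the demo runs anywhere.
rm -rf /tmp/hadoop_extract_demo && mkdir -p /tmp/hadoop_extract_demo/opt
mkdir -p /tmp/hadoop_extract_demo/src/hadoop/conf
touch /tmp/hadoop_extract_demo/src/hadoop/conf/core-site.xml
tar -czf /tmp/hadoop_extract_demo/hadoop.tar.gz -C /tmp/hadoop_extract_demo/src hadoop

# The real step would be along the lines of:
#   sudo tar -xzf hadoop-x.y.z.tar.gz -C /opt/
tar -xzf /tmp/hadoop_extract_demo/hadoop.tar.gz -C /tmp/hadoop_extract_demo/opt

ls /tmp/hadoop_extract_demo/opt/hadoop/conf
# → core-site.xml
```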

First edit the conf/core-site.xml file and set the fs.default.name property to be the ip address that you gave the server with a port of your choosing. 9000 seems to be a good default. You’d think that you’d be able to just set the property to localhost, but unfortunately this does not work. I’m guessing it is because the server will only accept requests for the provided URI and nothing but the local machine can carry out a request to localhost.

#> vim /opt/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.56.101:9000</value> <!-- your Namenode's static ip -->
  </property>
</configuration>
Then lastly edit the conf/hadoop-env.sh file and uncomment and set the JAVA_HOME variable to the directory that contains the JDK files. On Ubuntu Server this is the /usr/lib/jvm/java-6-openjdk/ directory.

#> vim /opt/hadoop/conf/hadoop-env.sh
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.

# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk

# Extra Java CLASSPATH elements. Optional.

And that’s it, you can now start a perfectly running Namenode.

#> /opt/hadoop/bin/hadoop namenode -format # The Namenode must be formatted before it can start up.
#> /opt/hadoop/bin/start-dfs.sh

To check that it is running, browse to the Namenode’s ip address on port 50070, e.g. http://192.168.56.101:50070/.
This will take you to the Namenode’s monitoring page.

Step Two: Create a Datanode

The preliminary setup of a Datanode VM is very simple, you just have to create the hadoop user and switch into it. Leave the host only network interface as DHCP; the Datanodes do not require static ip addresses, and below you will see why it is actually really good if their ip addresses are allocated dynamically.

Now extract the same hadoop tarball that you used on the Namenode into the /opt/ directory on the Datanode. Now all you have to do is configure the Datanode’s conf/core-site.xml file with the exact same configuration as the Namenode. By exact I mean character for character, to the point where if you wanted to you could just copy the Namenode’s conf/core-site.xml file over the top of the newly extracted one on the Datanode. This now means that the Datanode is configured to be aware of the Namenode.

So I know what you’re now thinking: “Hold on a minute there. How does the Datanode know to act as a slave if it has the exact same configuration as the Namenode?”. Well it does because we start it in a completely different way to the Namenode. Where the Namenode is started with the bin/start-dfs.sh script, we are going to start the Datanode with the bin/hadoop-daemon.sh script, as follows:

#> /opt/hadoop/bin/hadoop-daemon.sh --config /opt/hadoop/conf start datanode

That’s it, your new Datanode should now be up and running. You can confirm this by refreshing the Namenode’s monitoring page, where you should now see that the Live Nodes number has increased from 0 to 1.

Note: The Datanode could take a little while to register with the Namenode; you can check on its progress in the logs/hadoop-hadoop-datanode-*.log file.

So as you can see, all in all, setting up hadoop is a very easy process. You can now add extra Datanodes to your heart’s content by simply cloning your first Datanode and starting it up. Since you have left the host only network interface as DHCP, each Datanode will automatically get its own unique ip address.

You may also notice, if you have read the hadoop documentation, that I don’t mention anything about the conf/slaves file. This is because I think a hadoop cluster should be configured in an extensible way from the beginning, so I have shown you how to start a Datanode so that it registers itself with an existing hadoop cluster, instead of listing the Datanodes within the Namenode’s conf/slaves file and having the Namenode start the Datanodes for you.