So you want to have a go at playing with hadoop, but are a little apprehensive because it just looks so complicated and hard to setup. Well DON’T BE! It’s super easy and I’ll show this by first demonstrating below the simple setup for a fully working HDFS cluster.
To get started lets first explain the HDFS structure. It’s a very simple master slave architecture. There is a master, the Namenode, that manages all the slaves, the Datanodes. See the super duper diagram bellow:
Namenode
________|_______
| |
Datanode One Datanode Two
The Namenode manages the distribution of the data within HDFS and the Databodes just store the data. That is HDFS put very very simply. As you can see it doesn’t provide the best redundancy because of the single point of failure. Though, there is a Secondary Namenode that can be used in a time of emergency, but the way to do that is WAY out of the scope of this tutorial. HDFS may not have the most resilient architecture, but it is at least easy to understand.
So how does one set one’s self up the HDFS? Well very easily actually…
Step One: Create a Namenode
To do this get your self a server, an Ubuntu Server VM will do nicely. Make sure the VM is running within a host only network so that it can talk to your host machine and the Datanode VM’s.
When using unix it is always best practice to create a user for the task at hand so lets create a hadoop user and use that from now on. Super users are dangeresque.
#> sudo useradd -m -G users -s /bin/bash hadoop
#> sudo passwd hadoop
# Set the password to what every you like.
#> su - hadoop
Now make sure that the server is running an ssh server. A good way to do this is to login to local host.
#> ssh localhost
If that doesn’t succeed then install the ssh server.
#> sudo apt-get install openssh-server
Now hadoop uses ssh to run it’s commands even when they are on the local machine, because of this you have to make sure that you can automatically ssh into localhost without the need for a password. To do this create an ssh key and add it to you .ssh/authorized_keys2 file.
#> ssh-keygen
# Just press ENTER for all the input request.
#> cat .ssh/id_rsa.pub >> .ssh/authorized_keys2
That’s it, you should now be able to ssh to localhost without having to type in your password.
The next thing to do is make sure your server has a static ip address. So set the IP of the host only network interface to be a static IP within the host only network.
#> sudo vim /etc/network/interfaces
...
# Host Only Network
auto eth0
iface eth0 inet static
address 192.168.69.201
netmask 255.255.255.0
network 192.168.69.0
broadcast 192.168.69.255
~
~
#> sudo /etc/init.d/networking restart
#> ifconfig
eth0 Link encap:Ethernet HWaddr 00:00:00:00:00:00
inet addr:192.168.69.201 Bcast:192.168.69.255 Mask:255.255.255.0
...
Lastly hadoop is written in Java so this will need to be installed on the server. It seems that it is only possible now to install the OpenJDK on Ubuntu but this is ok because hadoop runs fine on that JVM.
#> sudo apt-get install openjdk-6-jdk
Now that the server is all ready extract the hadoop tarball somewhere onto the filesystem. /opt/ is as good a place as any. Then do all the configuration that is required for a working Namenode. That is, edit two files.
First edit the conf/core-site.xml file and set the fs.default.name property to be the ip address that you gave the server with a port of your choosing. 9000 seems to be a good default. You’d think that you’d be able to just set the fs.default.name property to be localhost, but unfortunately this does not work. I’m guessing it is because the server will only accept requests for the provided URI and nothing but the local machine can carry out a request to localhost.
#> vim /opt/hadoop/conf/core-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://192.168.69.201:9000</value>
</property>
</configuration>
~
~
Then lastly edit the conf/hadoop-env.sh file and uncomment and set the JAVA_HOME variable to the directory that contains the JDK files. On Ubuntu Server this is the /usr/lib/jvm/java-6-openjdk/ directory.
#> vim /opt/hadoop/conf/hadoop-env.sh
...
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
# Extra Java CLASSPATH elements. Optional.
# export HADOOP_CLASSPATH=
...
And that’s it, you can now start a perfectly running Namenode.
#> /opt/hadoop/bin/hadoop namenode -format # The Namenode must be formatted before it can start up.
#> /opt/hadoop/bin/start-dfs.sh
To check that it is running, browse to the Namenodes ip address on port 50070 e.g. http://192.168.69.201:50070
This will take you to the Namenodes monitoring page.
Step Two: Create a Datanode
The preliminary setup of a Datanode VM is very simple, you just have create the hadoop user and switch into it. Leave the Host only network interface as DHCP, the Datanodes do not require static ip addresses, also below you will see why it is actually really good if their ip addresses are allocated dynamically.
Now extract the same hadoop tarball that you used on the name node into the /opt/ directory on the Datanode. Now all you have to do is configure the Datanodes conf/core-site.xml file with the exact same configuration as the Namenode. By exact I mean character for character, to the point where if you wanted to you could just copy the Namenodes conf/core-site.xml file over the top of the newly extracted one on the Datanode. This now means that the Datanode is configured to be aware of the Namenode.
So I know what you’re now thinking “Hold on a minute there. How does the Datanode know to act as a slave if it has the exact same configuration as the Namenode?”. Well it does because we start it in a completely different way to the Namenode. Where the name node is started with the bin/start-dfs.sh script we are going to start the Datanode with the bin/hadoop-daemon.sh. As follows:
#> /opt/hadoop/bin/hadoop-daemon.sh --config /opt/hadoop/conf start datanode
That’s it your new Datanode should now be up and running. You can confirm this by refreshing the Namenodes monitoring page and you should now see that the Live Nodes number has increased from 0 to 1.
Note: The Datanode could take a little while to register to the name node, you can check on it’s progress in the logs/hadoop-hadoop-datanode-*.log file.
So as you can see all and all setting up hadoop is a very easy process. You can now add extra Datanodes to you hearts content by simply cloning and your first Datanode and starting it up. Since you have left the Host only network interface as DHCP each Datanode will automatically have it’s own unique ip address.
You may also notice if you have read the hadoop documentation that I don’t mention anything about the conf/slaves file. This is because I think that a hadoop cluster should be configured in an extendible way from the beginning so I have shown you how to start a Datanode so that it registers itself with an existing hadoop cluster. Instead of listing the Datanodes within the Namenodes conf/slaves file and having the Namenode start the Datanodes for you.