Hadoop – HDFS – Extra

In Hadoop, Tutorial on 25/02/2012 by pier0w

This extra hadoop setup is for those of you who are a little bit more adventurous. It requires some very small edits to the hadoop startup scripts, so if you would rather just run a vanilla hadoop install then don’t bother reading the rest of this extra tutorial.

The goal of this tutorial is to have hadoop running in what some would consider a slightly cleaner way. That is, it will run without having to rely on ssh.

To do this, first run through the previous hadoop hdfs tutorial, but leave out anything to do with ssh and also don’t edit the conf/ file. Having the JAVA_HOME variable set in the conf/ file is only required when hadoop executes remote commands through ssh.

The only part of our hadoop cluster that requires ssh by default is the Namenode, so we will be editing the files on this server only.

Now, since we have not set the JAVA_HOME variable in the conf/ file, it will need to be set as a global environment variable for the server. On Ubuntu server this can be done by exporting the environment variables in a Java-specific script in the /etc/profile.d/ directory.

#> sudo vim /etc/profile.d/
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk
export JDK_HOME=/usr/lib/jvm/java-6-openjdk
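After logging in again, it is worth confirming that the variable is actually picked up. The check below is only a sketch: the function name check_java_home is made up for illustration, and it assumes a JDK layout where the java binary lives under bin/.

```shell
#!/bin/sh
# Hypothetical helper (not part of hadoop): succeeds only if the given
# directory looks like a Java home, i.e. it contains an executable bin/java.
check_java_home() {
  dir="$1"
  if [ -z "$dir" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$dir/bin/java" ]; then
    echo "no executable java found under $dir/bin" >&2
    return 1
  fi
  echo "JAVA_HOME looks usable: $dir"
}

# Check whatever the current shell has exported; report but do not abort.
check_java_home "${JAVA_HOME:-}" || true
```

If the check complains that JAVA_HOME is not set, the profile script has probably not been sourced yet in the current session.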

Once that is set up, the first hadoop file to edit is the bin/ script, which uses rsync to make sure that the hadoop files are on the server it is about to start. It does this so that you can have a central server that can start up multiple Namenodes on remote servers that don’t actually have to have hadoop installed on them.

We are going to edit line 119 in this file and remove the text -e ssh, which forces rsync to run through ssh.

#> vim /opt/hadoop/bin/
    if [ "$HADOOP_MASTER" != "" ]; then
      echo rsync from $HADOOP_MASTER
      rsync -a --delete --exclude=.svn --exclude='logs/*' --exclude='contrib/hod/logs/*' $HADOOP_MASTER/ "$HADOOP_HOME"
    fi
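If you would rather not make this change by hand, the same edit can be sketched with sed. The sample line below assumes the unedited script passes -e ssh straight after -a; check your own copy first, since the exact text and line number may differ between hadoop versions. The function name strip_ssh_flag is purely illustrative.

```shell
#!/bin/sh
# Illustrative helper: remove the "-e ssh" option from an rsync command line.
strip_ssh_flag() {
  printf '%s\n' "$1" | sed 's/ -e ssh//'
}

# Assumed form of the unedited line (verify against your own script).
before='rsync -a -e ssh --delete --exclude=.svn $HADOOP_MASTER/ "$HADOOP_HOME"'
strip_ssh_flag "$before"
```

The same sed expression could be applied to the script itself, but it is safer to try it on a copy first and compare the result with the block shown above.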

Next we are going to edit the bin/ and bin/ scripts so that the Secondary Namenode is started and stopped in the same way as the primary Namenode.

#> vim /opt/hadoop/bin/
# start dfs daemons
# start namenode after datanodes, to minimize time namenode is up w/o data
# note: datanodes will log connection errors until namenode starts
"$bin"/ --config $HADOOP_CONF_DIR start namenode $nameStartOpt
"$bin"/ --config $HADOOP_CONF_DIR start datanode $dataStartOpt
"$bin"/ --config $HADOOP_CONF_DIR start secondarynamenode

#> vim /opt/hadoop/bin/
. "$bin"/

"$bin"/ --config $HADOOP_CONF_DIR stop namenode
"$bin"/ --config $HADOOP_CONF_DIR stop datanode
"$bin"/ --config $HADOOP_CONF_DIR stop secondarynamenode

Lastly, we will remove any host names from the conf/slaves file, even localhost. This stops hadoop from trying to start a Datanode up on the Namenode server.

#> vim /opt/hadoop/conf/slaves
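A quick way to confirm that the file really lists no hosts is to count the lines that are neither blank nor comments. This is a small sketch with a made-up function name, count_slave_hosts, and it assumes that anything after a # is treated as a comment.

```shell
#!/bin/sh
# Illustrative helper: print how many host entries a slaves file contains,
# ignoring blank lines and anything after a '#'.
count_slave_hosts() {
  sed 's/#.*$//; /^[[:space:]]*$/d' "$1" | wc -l | tr -d ' '
}

slaves_file=/opt/hadoop/conf/slaves
if [ -r "$slaves_file" ]; then
  echo "host entries in $slaves_file: $(count_slave_hosts "$slaves_file")"
fi
```

For the setup described here the count should be 0.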

And that’s it. You should now be able to start a Namenode on the server without any special ssh setup.

