This morning I finally got Hadoop installed on my work laptop, and I wanted to write it all down so that I could repeat this when necessary. As I found out, it's not at all like installing CouchDB which is about as simple as anything could be. No… Hadoop is a far more difficult beast, and I guess I can understand why, but still, it'd be nice to have a simple Homebrew install that set it up in single-node mode and started everything with Launch Control, but that's a wish, not a necessity.
So let's get into it. First, make sure that you have the SSH daemon running on your box. This is controlled in System Preferences -> Sharing -> Remote Login - make sure it's checked, save this, and it should be running just fine. Make sure you can ssh into your own box - if necessary, generate SSH keys and put them in your ~/.ssh directory.
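If you don't already have a key pair, something like the following sets up passwordless ssh to localhost (skip the ssh-keygen step if ~/.ssh/id_rsa already exists):
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ ssh localhost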
Next, you certainly need to install Homebrew, and once that's all going, you need to install the basic Hadoop package:
$ brew install hadoop
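You can verify that the install worked, and see exactly which version Homebrew picked up (1.1.2 in my case), with:
$ hadoop version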
At this point, you will need to edit a few of the config files and make a few directories. Let's start with the directories. These will hold the actual HDFS data, the Map/Reduce data, and the NameNode metadata. I chose to place them next to the Homebrew install of Hadoop so that it's all in one place:
$ cd /usr/local/Cellar/hadoop
$ mkdir data
$ cd data
$ mkdir dfs
$ mkdir mapred
$ mkdir nn
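If you prefer, the same layout can be created in one shot with brace expansion:
$ mkdir -p /usr/local/Cellar/hadoop/data/{dfs,mapred,nn}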
At this point we can go to the directory with the configuration files and update them:
$ cd /usr/local/Cellar/hadoop/1.1.2/libexec/conf
The first update works around a known Kerberos bug in Hadoop on OS X. Do this by editing hadoop-env.sh to include:
export HADOOP_OPTS="-Djava.security.krb5.realm= -Djava.security.krb5.kdc="
Next, edit the hdfs-site.xml file to include the following:
<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/usr/local/Cellar/hadoop/data/dfs</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/usr/local/Cellar/hadoop/data/nn</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
Next, edit the core-site.xml file to include the following:
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hdfs-${user.name}</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>The name of the default file system. A URI whose
    scheme and authority determine the FileSystem implementation. The
    uri's scheme determines the config property (fs.SCHEME.impl) naming
    the FileSystem implementation class. The uri's authority is used to
    determine the host, port, etc. for a filesystem.</description>
  </property>
</configuration>
Finally, edit the mapred-site.xml file to include the following:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>The host and port that the MapReduce job tracker runs
    at. If "local", then jobs are run in-process as a single map
    and reduce task.</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/usr/local/Cellar/hadoop/data/mapred/</value>
  </property>
</configuration>
We are finally all configured. At this point, you need to format the NameNode:
$ hadoop namenode -format
and then you can start all the necessary processes on the box:
$ start-all.sh
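To sanity-check that everything actually came up, jps (which ships with the JDK) should show the five Hadoop 1.x daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker), and a quick filesystem command should succeed (the /tmp/smoke-test path is just an example):
$ jps
$ hadoop fs -mkdir /tmp/smoke-test
$ hadoop fs -ls /tmp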
At this point, you will be able to hit the web endpoints (by default the NameNode UI at http://localhost:50070 and the JobTracker UI at http://localhost:50030), and using the WebHDFS REST endpoint you can use any standard REST client to submit files, delete files, make directories, and generally manipulate the filesystem as needed.
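For example, assuming the default NameNode web port of 50070, listing the root of the filesystem and creating a directory (the /tmp/demo path is just an example) look like this with curl:
$ curl -i "http://localhost:50070/webhdfs/v1/?op=LISTSTATUS&user.name=$(whoami)"
$ curl -i -X PUT "http://localhost:50070/webhdfs/v1/tmp/demo?op=MKDIRS&user.name=$(whoami)"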
This was interesting, and digging around for what was needed was non-trivial, but it was well worth it. I'll now be able to run my code against the PostgreSQL and Hadoop installs on my box.
Sweet!