Hadoop has been hyped quite a bit in the trade press lately, where it is pitched as a way to analyze data residing across many different servers.
The curious can also set up a Hadoop instance on a single Linux server.
On a single machine, Hadoop offers no distinct advantage in data processing over using any one of a number of other Unix tools, such as awk.
Nonetheless, installing a single-node Hadoop instance will give the administrator a feeling for how Hadoop works. It will also come in handy as more MapReduce-based programs become available. And if you are developing MapReduce jobs, testing them on a single node is the easiest way to go.
Looking to set up my own Hadoop instance on my Ubuntu Linux server, I used Michael Noll's guide. Tom White's O'Reilly book, "Hadoop: The Definitive Guide," also came in handy.
All in all, installing the Hadoop package is about as easy as setting up any other program from the Linux command line. You download the tarball from an Apache mirror, unpack it in the desired parent directory, tweak a few configuration files and you are set. I went with the latest stable version, which was 0.20.2 at the time of this writing.
Myself, I placed the contents of the tarball in the /usr/local/hadoop directory.
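For the curious, the whole sequence looked roughly like this on my machine (the archive URL is just an example; substitute whichever mirror the Apache download page hands you):

$ cd /usr/local
$ sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz
$ sudo tar xzf hadoop-0.20.2.tar.gz    # unpack the release
$ sudo mv hadoop-0.20.2 hadoop         # rename to /usr/local/hadoop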
Prerequisites
You do need to do some preparatory work first. A Java Virtual Machine (JVM) needs to be installed on your machine, if one isn't there already.
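On Ubuntu, something along these lines does the trick; the package name varies by release, and I am assuming OpenJDK 6 here rather than the Sun Java packages Noll's guide suggests:

$ sudo apt-get update
$ sudo apt-get install openjdk-6-jdk
$ java -version    # confirm the JVM is installed and on the path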
Secondly, Noll suggests setting up a separate hadoop user account for running Hadoop jobs. Keep in mind, however, that setting up a user account named hadoop provides another possible point of entry for malicious attackers, who are always trying to log in to servers' SSH ports with the user names of well-known programs, such as MySQL, Nagios and Oracle.
I haven't seen any attempted log-ins using the hadoop account name yet on my own server, but it's just a matter of time, no doubt. (Another security red flag with this set-up is that you create an SSH key pair with an empty passphrase. SSH is needed so that Hadoop can log into other nodes, but the empty passphrase also makes break-ins less traceable.)
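For reference, the account and key setup Noll describes boils down to something like this (the hadoop user and group names are his convention, not a requirement):

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
$ su - hadoop
$ ssh-keygen -t rsa -P ""                           # empty passphrase -- the caveat noted above
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # authorize the key for local logins
$ ssh localhost                                     # confirm passwordless login works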
Another peculiarity of Noll's installation is that he asks you to disable the server's IPv6 support, due to the way Hadoop interacts with IPv6. He argues that if your server is not using IPv6, then this shouldn't be a problem.
And this is probably true for most systems now, though disabling IPv6 is exactly the sort of thing you forget about a few years down the road, when you are trying to hook your server up to an Internet Service Provider via IPv6.
As an alternative, Noll suggests disabling IPv6 on Hadoop itself, which seems like a more reasonable idea.
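For the record, the two approaches look like this; the sysctl keys are the ones Noll lists for Ubuntu, so double-check them against your own release:

# system-wide: add to /etc/sysctl.conf and reboot (or run sysctl -p)
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

# Hadoop-only: add to conf/hadoop-env.sh
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true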
Lastly, you need to create a dedicated directory where Hadoop can store its working files, and give the hadoop user full permissions on that directory. I chose, for instance, /usr/local/hadoop/data-store, which resides inside the directory of my Hadoop instance.
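Assuming the hadoop user and group from earlier, that is a two-liner:

$ sudo mkdir -p /usr/local/hadoop/data-store
$ sudo chown -R hadoop:hadoop /usr/local/hadoop/data-store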
Configuration
Once Hadoop is downloaded and unzipped, the first thing you need to do is make some changes to the configuration files.
In the hadoop-env.sh file, you must specify where the JVM resides.
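That amounts to a single line in conf/hadoop-env.sh. The path below assumes the OpenJDK 6 package mentioned earlier; yours will differ if Java lives somewhere else:

# in conf/hadoop-env.sh
export JAVA_HOME=/usr/lib/jvm/java-6-openjdk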
In the core-site.xml file, you must specify the working directory and the user name that will be running Hadoop. In my case, it was /usr/local/hadoop/data-store/hadoop-${user.name}, with ${user.name} filled in with "hadoop".
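As a rough sketch, the relevant properties in my core-site.xml (inside the <configuration> element) ended up looking like this; the port in fs.default.name follows Noll's guide and is otherwise arbitrary:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/usr/local/hadoop/data-store/hadoop-${user.name}</value>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
</property>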
Noll also offers some configuration additions for mapred-site.xml (for MapReduce settings) and hdfs-site.xml (for file system settings). All of these files are found in the Hadoop "conf" subdirectory.
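The additions to the other two files are short. Again, the job tracker port is the one from Noll's write-up, and a replication factor of 1 is the sensible choice on a single node:

<!-- mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
</property>

<!-- hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>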
Finally, you need to format the working directory as HDFS (the Hadoop Distributed File System). This file system is laid over your existing file system for that directory. The command for the operation is:
$ /usr/local/hadoop/bin/hadoop namenode -format
This command will format the directory specified in the core-site.xml file (/usr/local/hadoop/data-store in this case).
And that is pretty much it, installation-wise.
Starting and Stopping
Starting Hadoop can be done through the command line:
$ /usr/local/hadoop/bin/start-all.sh
This will start a number of services, namely a NameNode, a DataNode, a JobTracker and a TaskTracker. If all is working properly, the command line will respond with a set of messages indicating each service has been started.
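A quick way to double-check is jps, which ships with the JDK and lists the Java processes running under the current user:

$ jps

If everything came up, you should see the NameNode, DataNode, JobTracker and TaskTracker in the listing (along with a SecondaryNameNode and Jps itself).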
Stopping Hadoop can be done thusly:
$ /usr/local/hadoop/bin/stop-all.sh