Hadoop at the onlNN pool

Moving to the online Linux pool and trying a 3-node cluster

onl14 = NameNode + worker node

onl13 = JobTracker + worker node

onl12 = worker node

My home directory is shared across these nodes, so if I install Hadoop in my home directory, the configuration files will be shared across all of them.

on each node, "mkdir /scratch/hadoop; mkdir /scratch/hadoop_local_tmp"
and on onl14, "mkdir /scratch/hadoop_logs"
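
For Hadoop to actually use those directories, the conf has to point at them. One plausible mapping (the property names are the standard 0.20 ones, but pairing them with these particular paths is my guess, not a record of the actual conf):

# conf/hdfs-site.xml : dfs.data.dir   = /scratch/hadoop            (DataNode block storage)
# conf/core-site.xml : hadoop.tmp.dir = /scratch/hadoop_local_tmp  (local scratch/temp space)
# conf/hadoop-env.sh : daemon log directory, on onl14 only
export HADOOP_LOG_DIR=/scratch/hadoop_logs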

 

Scratch that.  I got that 3-node cluster working, but it was not multi-user friendly, nor administration-friendly.

To address both user- and administration-friendliness (though both are still rather limited) and to expand further, I have set up a Hadoop cluster with onl01-onl12 serving as worker nodes (DataNodes and TaskTrackers), onl13 as the JobTracker, and onl14 as the NameNode.
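
For reference, that layout shows up in the shared conf roughly as follows; the port numbers are the common defaults and are an assumption, not necessarily what is actually configured here:

# conf/core-site.xml (all nodes):   fs.default.name    = hdfs://onl14:9000  (NameNode)
# conf/mapred-site.xml (all nodes): mapred.job.tracker = onl13:9001         (JobTracker)
# conf/slaves (read by the start scripts): the twelve worker hostnames, one per line, i.e.
seq -w 1 12 | sed 's/^/onl/'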

Here's a rundown of the configuration details:

There is a NIS user named "hadoop", which is a member of the hadoop and rhstar groups (side note: it is also a member of the deniedgateway group).  Its home directory is a standard NFS-shared home directory (/ldaphome/hadoop).  Within the home directory, there is a Hadoop installation under hadoop/hadoop-0.20.203.0, which is world-accessible.

So a user might find something like this useful (tcsh): 

setenv HADOOP_VERSION 0.20.203.0
setenv HADOOP_HOME /ldaphome/hadoop/hadoop/hadoop-$HADOOP_VERSION
setenv PATH $PATH":$HADOOP_HOME/bin"
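
With those set, a quick sanity check from any of the pool nodes (the examples jar name is what I'd expect 0.20.203.0 to ship with; adjust if yours differs):

hadoop version
hadoop fs -ls /
hadoop jar $HADOOP_HOME/hadoop-examples-$HADOOP_VERSION.jar pi 4 1000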

Security choices that could be investigated/tightened up:

1) iptables firewalls are wide open amongst the onl01-14 nodes.  Could be tightened up with some research and experimentation to limit ports/protocols to those actually needed.

   Based on the info here: http://icb.med.cornell.edu/wiki/index.php/Hadoop#Firewall, I added firewall rules to each of onl01-14 according to their roles (a sketch of the kind of rules involved appears after this list).

2) any onlNN user can submit jobs, and probably even monitor/manage other users' jobs

3) HDFS file permissions are not enforced.  I think any user can read pretty much any file in HDFS at this point.

4) the hadoop executables (such as the start and stop scripts) can be run by anyone

5) the hadoop user has a passphrase-less SSH key.
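
Regarding item 1, the rules follow the port list on that wiki page; per role they look roughly like the following. The source subnet and the 9000/9001 RPC ports are illustrative assumptions, not copied from the live rules:

# workers onl01-onl12: DataNode (50010 data, 50020 IPC, 50075 HTTP) and TaskTracker (50060 HTTP)
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 50010,50020,50060,50075 -j ACCEPT
# onl14 (NameNode): RPC plus the 50070 web UI
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 9000,50070 -j ACCEPT
# onl13 (JobTracker): RPC plus the 50030 web UI
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 9001,50030 -j ACCEPT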

 

TODO:

Set it to start automatically - currently the Hadoop daemons (HDFS and MapReduce) have to be started by hand.
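
One low-effort way to do that would be rc.local entries on the two master nodes that start the daemons as the hadoop user; this is only a sketch of the idea, nothing like it is in place yet:

# on onl14 (e.g. in /etc/rc.d/rc.local): bring up HDFS across the cluster at boot
su - hadoop -c "/ldaphome/hadoop/hadoop/hadoop-0.20.203.0/bin/start-dfs.sh"
# on onl13: bring up the JobTracker and TaskTrackers once HDFS is available
su - hadoop -c "/ldaphome/hadoop/hadoop/hadoop-0.20.203.0/bin/start-mapred.sh"

Both start scripts reach the worker nodes over SSH, so this leans on the hadoop user's passphrase-less key noted in item 5 above.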