From My Archives: Hadoop Install


Well, this is an install document that I am picking from my archives. Hadoop is now at version 0.20, but the one described here is 0.18; I presume the installation has not changed much in the meantime.

First I will explain a single-machine install and then extend it with another slave node. Forgive me if this is too raw to digest.

Good luck and enjoy.

Setting it up on a single machine

Prerequisites
  • Java 1.5 or later; I did it with a 1.6 install (JAVA_HOME=/opt/SDK/jdk)
  • ssh and rsync must be installed.
Assumption

You are doing this on a Linux box :) It is possible to do it on Windows as well, but you will need to issue the appropriate commands there.

Going Hadoop

Create user “hadoop” on your master machine “hadoopm”

Add the user to the “hadoop” group.
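
If your distro does not create the group and user for you, a minimal sketch with the standard groupadd/useradd tools (adjust to your distro's conventions, e.g. adduser on Debian) would be:

root@hadoopm#  groupadd hadoop
root@hadoopm#  useradd -m -g hadoop hadoop   # create the user with a home directory and the hadoop primary group
root@hadoopm#  passwd hadoop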

Set up a passwordless ssh session for the hadoop user.

hadoop@hadoopm$  ssh-keygen -t rsa
hadoop@hadoopm$  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
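
If ssh still prompts for a password after this, the permissions on ~/.ssh are the usual culprit; tightening them and testing a local login (which is roughly what the hadoop start scripts do) looks something like:

hadoop@hadoopm$  chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys
hadoop@hadoopm$  ssh localhost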

Get hadoop stable version from http://ftp.wayne.edu/apache/hadoop/core/stable/

hadoop@hadoopm$  wget http://ftp.wayne.edu/apache/hadoop/core/stable/hadoop-0.18.3.tar.gz

Extract it in your home directory and create a convenient symlink to it.

hadoop@hadoopm$  tar -xzvf hadoop-0.18.3.tar.gz
hadoop@hadoopm$  ln -s hadoop-0.18.3 hadoopc
hadoop@hadoopm$  cd hadoopc

Edit the conf/hadoop-env.sh file and update JAVA_HOME appropriately.
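
With the JDK path from the prerequisites above, the relevant line in conf/hadoop-env.sh would look something like:

export JAVA_HOME=/opt/SDK/jdk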

Modify the conf/hadoop-site.xml file to look like the following.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
 <name>hadoop.tmp.dir</name>
 <value>/home/hadoop/hadoop-${user.name}</value>
 <description>A base for other temporary directories.</description>
 </property>
 <property>
 <name>fs.default.name</name>
 <value>hdfs://hadoopm:9000</value>
 </property>
 <property>
 <name>mapred.job.tracker</name>
 <value>hadoopm:9001</value>
 </property>
 <property>
 <name>dfs.replication</name>
 <value>1</value>
 </property>
</configuration>

Oops, now you need to edit the /etc/hosts file to have the following entry.

xxx.yyy.zzz.aaa        hadoopm

Run the following commands…

hadoop@hadoopm$  bin/hadoop namenode -format
hadoop@hadoopm$  bin/start-all.sh
hadoop@hadoopm$  jps
28982 JobTracker
28737 DataNode
28615 NameNode
30570 Jps
29109 TaskTracker
28870 SecondaryNameNode

Congrats, your single-node hadoop installation is complete!!!
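
For a quick sanity check, you can list the (still empty) HDFS root; something like:

hadoop@hadoopm$  bin/hadoop dfs -ls /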

To stop the daemons, issue the following:

hadoop@hadoopm$  bin/stop-all.sh

Do a similar setup and test on another machine. I have called that machine “hadoops” [s for slave], and slave it is going to be!

Setting it up on a multinode hadoop cluster

Create /etc/hosts entries so the machines can resolve each other. They eventually look something like this:

xxx.yyy.zzz.aaa        hadoopm
nnn.ooo.ppp.sss        hadoops

Set up passwordless ssh between hadoopm and hadoops.

hadoop@hadoops$  ssh-keygen -t rsa
hadoop@hadoops$  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
hadoop@hadoopm$  scp ~/.ssh/id_rsa.pub hadoop@hadoops:~/.ssh/id_rsa_hadoopm.pub # copy the master's public key to the slave
hadoop@hadoops$  cat ~/.ssh/id_rsa_hadoopm.pub >> ~/.ssh/authorized_keys # and authorize it there
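
It is worth confirming that the master can now reach the slave without a password prompt, since the start scripts depend on it:

hadoop@hadoopm$  ssh hadoops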

Make sure both setups are clean. That is, the daemons are stopped and the ~/hadoop-hadoop directory is empty.

hadoop@hadoops$ rm -rf ~/hadoop-hadoop/*
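
And the same on the master, so both nodes start from a clean slate:

hadoop@hadoopm$ rm -rf ~/hadoop-hadoop/*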

Note: All the commands are run from the ~/hadoopc directory, which is hadoop’s install directory.

Edit the conf/masters on hadoopm and put an entry like:

hadoopm

Edit the conf/slaves on hadoopm and put entries like the following (hadoopm is listed too, so the master will also run a DataNode and a TaskTracker):

hadoopm
hadoops

Now edit conf/hadoop-site.xml on hadoopm to look like this…

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
 <property>
 <name>hadoop.tmp.dir</name>
 <value>/home/hadoop/hadoop-${user.name}</value>
 <description>A base for other temporary directories.</description>
 </property>
 <property>
 <name>fs.default.name</name>
 <value>hdfs://hadoopm:9000</value>
 </property>
 <property>
 <name>mapred.job.tracker</name>
 <value>hadoopm:9001</value>
 </property>
 <property>
 <name>dfs.replication</name>
 <value>2</value>
 </property>
</configuration>

Please note the change: dfs.replication is now 2, since there will be two datanodes.

Copy that xml file from hadoopm to hadoops

hadoop@hadoopm$  scp conf/hadoop-site.xml \
> hadoop@hadoops:~/hadoopc/conf/hadoop-site.xml

Recreate/Reformat the name node

hadoop@hadoopm$  bin/hadoop namenode -format

Start the cluster

hadoop@hadoopm$  bin/start-dfs.sh
starting namenode, logging to /home/hadoop/hadoopc/bin/../logs/hadoop-hadoop-namenode-hadoopm.out
hadoopm: starting datanode, logging to /home/hadoop/hadoopc/bin/../logs/hadoop-hadoop-datanode-hadoopm.out
hadoops: starting datanode, logging to /home/hadoop/hadoopc/bin/../logs/hadoop-hadoop-datanode-hadoops.out
hadoopm: starting secondarynamenode, logging to /home/hadoop/hadoopc/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopm.out

Check if the cluster is running

hadoop@hadoopm$  jps
9070 SecondaryNameNode
13825 Jps
8963 DataNode
8867 NameNode
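
You can also ask the namenode directly how many datanodes have checked in; with both boxes up, the report should list two live datanodes (output omitted here):

hadoop@hadoopm$  bin/hadoop dfsadmin -report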

Check the processes on the slave (hadoops):

hadoop@hadoops$  jps
22256 Jps
3567 DataNode

Congrats, now you have your distributed filesystem running!

Now start your mapreduce daemons

hadoop@hadoopm$  bin/start-mapred.sh
starting jobtracker, logging to /home/hadoop/hadoopc/bin/../logs/hadoop-hadoop-jobtracker-hadoopm.out
hadoops: starting tasktracker, logging to /home/hadoop/hadoopc/bin/../logs/hadoop-hadoop-tasktracker-hadoops.out
hadoopm: starting tasktracker, logging to /home/hadoop/hadoopc/bin/../logs/hadoop-hadoop-tasktracker-hadoopm.out

Check the running processes

hadoop@hadoopm$ jps
8963 DataNode
9169 JobTracker
9070 SecondaryNameNode
8867 NameNode
13825 Jps
9270 TaskTracker

Check for processes on the slave.

hadoop@hadoops$  jps
7706 Jps
3567 DataNode
3652 TaskTracker

Let's experiment…
Create a ~/test directory and do the following from that directory on the master (hadoopm):

hadoop@hadoopm$ wget http://www.gutenberg.org/files/20417/20417-8.txt
hadoop@hadoopm$ wget http://www.gutenberg.org/dirs/etext04/7ldvc10.txt
hadoop@hadoopm$ wget http://www.gutenberg.org/files/4300/4300-8.txt
hadoop@hadoopm$ wget http://www.gutenberg.org/dirs/etext99/advsh12.txt

Populate the HDFS filesystem with the files:

hadoop@hadoopm$ bin/hadoop dfs -copyFromLocal ~/test/ test

List your HDFS home directory:

hadoop@hadoopm$ bin/hadoop dfs -ls
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2008-10-20 12:37 /user/hadoop/test

List the contents of test:

hadoop@hadoopm$ bin/hadoop dfs -ls test
Found 4 items
-rw-r--r-- 2 hadoop supergroup 674425 2009-03-26 12:37 /user/hadoop/test/20417-8.txt
-rw-r--r-- 2 hadoop supergroup 1573048 2009-03-26 12:37 /user/hadoop/test/4300-8.txt
-rw-r--r-- 2 hadoop supergroup 1423808 2009-03-26 12:37 /user/hadoop/test/7ldvc10.txt
-rw-r--r-- 2 hadoop supergroup 590093 2009-03-26 12:37 /user/hadoop/test/advsh12.txt

Let's run an example and see what happens… we will run the word count example.

hadoop@hadoopm$ bin/hadoop jar hadoop-0.18.3-examples.jar wordcount test test-op
09/03/26 12:44:06 INFO mapred.FileInputFormat: Total input paths to process : 4
09/03/26 12:44:06 INFO mapred.FileInputFormat: Total input paths to process : 4
09/03/26 12:44:07 INFO mapred.JobClient: Running job: job_200903261241_0001
09/03/26 12:44:08 INFO mapred.JobClient:  map 0% reduce 0%
09/03/26 12:44:21 INFO mapred.JobClient:  map 50% reduce 0%
09/03/26 12:44:22 INFO mapred.JobClient:  map 91% reduce 0%
09/03/26 12:44:24 INFO mapred.JobClient:  map 100% reduce 0%
09/03/26 12:44:39 INFO mapred.JobClient: Job complete: job_200903261241_0001
09/03/26 12:44:39 INFO mapred.JobClient: Counters: 16
09/03/26 12:44:39 INFO mapred.JobClient:   File Systems
09/03/26 12:44:39 INFO mapred.JobClient:     HDFS bytes read=4261379
09/03/26 12:44:39 INFO mapred.JobClient:     HDFS bytes written=949205
09/03/26 12:44:39 INFO mapred.JobClient:     Local bytes read=2051855
09/03/26 12:44:39 INFO mapred.JobClient:     Local bytes written=3757916
09/03/26 12:44:39 INFO mapred.JobClient:   Job Counters
09/03/26 12:44:39 INFO mapred.JobClient:     Launched reduce tasks=1
09/03/26 12:44:39 INFO mapred.JobClient:     Launched map tasks=4
09/03/26 12:44:39 INFO mapred.JobClient:     Data-local map tasks=4
09/03/26 12:44:39 INFO mapred.JobClient:   Map-Reduce Framework
09/03/26 12:44:39 INFO mapred.JobClient:     Reduce input groups=88308
09/03/26 12:44:39 INFO mapred.JobClient:     Combine output records=205892
09/03/26 12:44:39 INFO mapred.JobClient:     Map input records=90949
09/03/26 12:44:39 INFO mapred.JobClient:     Reduce output records=88308
09/03/26 12:44:39 INFO mapred.JobClient:     Map output bytes=7077681
09/03/26 12:44:39 INFO mapred.JobClient:     Map input bytes=4261379
09/03/26 12:44:39 INFO mapred.JobClient:     Combine input records=853603
09/03/26 12:44:39 INFO mapred.JobClient:     Map output records=736019
09/03/26 12:44:39 INFO mapred.JobClient:     Reduce input records=88308

What is all this… what happened? Let's check the output.

hadoop@hadoopm$ bin/hadoop dfs -ls test-op
Found 2 items
drwxr-xr-x   - hadoop supergroup 0 2009-03-26 12:44 /user/hadoop/test-op/_logs
-rw-r--r--   2 hadoop supergroup 949205 2009-03-26 12:44 /user/hadoop/test-op/part-00000

hadoop@hadoopm$ bin/hadoop dfs -copyToLocal \
> /user/hadoop/test-op/part-00000 test-op-part-00000

hadoop@hadoopm$ head test-op-part-00000
"'A     1
"'About 1
"'Absolute      1
"'Ah!'  2
"'Ah,   2
"'Ample.'       1
"'And   10
"'Arthur!'      1
"'As    1
"'At    1

That's the word counts :)
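
If you would rather not copy the file out of HDFS first, you can also peek at it in place; something like:

hadoop@hadoopm$ bin/hadoop dfs -cat test-op/part-00000 | head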

Congrats, you have just finished your first hadoop multinode setup.
For real-time status and output, please check the following links (the JobTracker and NameNode web UIs respectively):

http://hadoopm:50030/
http://hadoopm:50070/

And BTW, whenever you need to stop the cluster, issue the following commands to do so…

hadoop@hadoopm$ ./bin/stop-mapred.sh # to stop mapreduce daemons
hadoop@hadoopm$ ./bin/stop-dfs.sh # to stop the hdfs daemons

You are now HADOOPing….
