= [wiki:GENIExperimenter/Tutorials/HadoopInASlice Hadoop in a Slice] =

== Part II: Execute Experiment: Login to the nodes and execute the Hadoop experiment ==
= Instructions =

Now that you have reserved your resources, you are ready to log in to the slice and run some Hadoop examples.

== 1. Log in to the Hadoop Master ==
  1. Login (ssh) to the hadoop-master using the credentials associated with the GENI Portal and the IP address displayed by Flack. The ssh application you use will depend on the configuration of your laptop/desktop.
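For example, from a Linux or macOS terminal with OpenSSH, the login might look like the sketch below. The key path, user name, and address are placeholders only; use the private key associated with your GENI Portal account and the hadoop-master IP address displayed by Flack.

{{{
# Placeholders: substitute your own key file, login name, and hadoop-master address.
ssh -i ~/.ssh/your_geni_key <your-username>@<hadoop-master-IP>
}}}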
== 2. Check the status/properties of the VMs ==

=== A. Observe the properties of the network interfaces ===
{{{
# /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr fa:16:3e:72:ad:a6
          inet addr:10.103.0.20  Bcast:10.103.0.255  Mask:255.255.255.0
          inet6 addr: fe80::f816:3eff:fe72:ada6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1982 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1246 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:301066 (294.0 KiB)  TX bytes:140433 (137.1 KiB)
          Interrupt:11 Base address:0x2000

eth1      Link encap:Ethernet  HWaddr fe:16:3e:00:6d:af
          inet addr:172.16.1.1  Bcast:172.16.1.255  Mask:255.255.255.0
          inet6 addr: fe80::fc16:3eff:fe00:6daf/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:21704 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4562 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:3100262 (2.9 MiB)  TX bytes:824572 (805.2 KiB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:19394 errors:0 dropped:0 overruns:0 frame:0
          TX packets:19394 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:4010954 (3.8 MiB)  TX bytes:4010954 (3.8 MiB)
}}}

=== B. Observe the contents of the NEuca user data file ===

This file includes the boot script that downloads and executes the script you configured for the VM.

{{{
# neuca-user-data
[global]
actor_id=67C4EFB4-7CBF-48C9-8195-934FF81434DC
slice_id=39672f6e-610a-4d86-8810-30e02d20cc99
reservation_id=55676541-5221-483d-bb60-429de025f275
unit_id=902709a4-32f2-41fc-b85c-b4791c779580
;router= Not Specified
;iscsi_initiator_iqn= Not Specified
slice_name=urn:publicid:IDN+ch.geni.net:ADAMANT+slice+pruth-winter-camp
unit_url=http://geni-orca.renci.org/owl/8210b4d7-4afc-4838-801f-c20a8f1f75ae#hadoop-master
host_name=hadoop-master
[interfaces]
fe163e006daf=up:ipv4:172.16.1.1/24
[storage]
[routes]
[scripts]
bootscript=#!/bin/bash
# Automatically generated boot script
# wget or curl must be installed on the image
mkdir -p /tmp
cd /tmp
if [ -x `which wget 2>/dev/null` ]; then
   wget -q -O `basename http://geni-images.renci.org/images/GENIWinterCamp/master.sh` http://geni-images.renci.org/images/GENIWinterCamp/master.sh
else if [ -x `which curl 2>/dev/null` ]; then
   curl http://geni-images.renci.org/images/GENIWinterCamp/master.sh > `basename http://geni-images.renci.org/images/GENIWinterCamp/master.sh`
fi
fi
eval "/bin/sh -c \"chmod +x /tmp/master.sh; /tmp/master.sh\""
}}}
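If you want to verify that the boot script referenced above was actually fetched, you can look for it under /tmp, where the user data script downloads it. A quick, optional check:

{{{
# The NEuca boot script downloads master.sh to /tmp and runs it.
ls -l /tmp/master.sh
head -5 /tmp/master.sh
}}}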
=== C. Observe the contents of the script that was installed and executed on the VM ===
{{{
# cat /tmp/master.sh
#!/bin/bash
echo "Hello from neuca script" > /home/hadoop/log
MY_HOSTNAME=hadoop-master
hostname $MY_HOSTNAME
echo 172.16.1.1 hadoop-master >> /etc/hosts
echo 172.16.1.10 hadoop-worker-0 >> /etc/hosts
echo 172.16.1.11 hadoop-worker-1 >> /etc/hosts
echo 172.16.1.12 hadoop-worker-2 >> /etc/hosts
echo 172.16.1.13 hadoop-worker-3 >> /etc/hosts
echo 172.16.1.14 hadoop-worker-4 >> /etc/hosts
echo 172.16.1.15 hadoop-worker-5 >> /etc/hosts
echo 172.16.1.16 hadoop-worker-6 >> /etc/hosts
echo 172.16.1.17 hadoop-worker-7 >> /etc/hosts
echo 172.16.1.18 hadoop-worker-8 >> /etc/hosts
echo 172.16.1.19 hadoop-worker-9 >> /etc/hosts
echo 172.16.1.20 hadoop-worker-10 >> /etc/hosts
echo 172.16.1.21 hadoop-worker-11 >> /etc/hosts
echo 172.16.1.22 hadoop-worker-12 >> /etc/hosts
echo 172.16.1.23 hadoop-worker-13 >> /etc/hosts
echo 172.16.1.24 hadoop-worker-14 >> /etc/hosts
echo 172.16.1.25 hadoop-worker-15 >> /etc/hosts
while true; do
    PING=`ping -c 1 172.16.1.1 > /dev/null 2>&1`
    if [ "$?" = "0" ]; then
        break
    fi
    sleep 5
done
echo '/home/hadoop/hadoop-euca-init.sh 172.16.1.1 -master' >> /home/hadoop/log
/home/hadoop/hadoop-euca-init.sh 172.16.1.1 -master
echo "Done starting daemons" >> /home/hadoop/log
}}}

=== D. Test for connectivity between the VMs ===
{{{
# ping hadoop-worker-0
PING hadoop-worker-0 (172.16.1.10) 56(84) bytes of data.
64 bytes from hadoop-worker-0 (172.16.1.10): icmp_req=1 ttl=64 time=0.747 ms
64 bytes from hadoop-worker-0 (172.16.1.10): icmp_req=2 ttl=64 time=0.459 ms
64 bytes from hadoop-worker-0 (172.16.1.10): icmp_req=3 ttl=64 time=0.411 ms
^C
--- hadoop-worker-0 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1998ms
rtt min/avg/max/mdev = 0.411/0.539/0.747/0.148 ms

# ping hadoop-worker-1
PING hadoop-worker-1 (172.16.1.11) 56(84) bytes of data.
64 bytes from hadoop-worker-1 (172.16.1.11): icmp_req=1 ttl=64 time=0.852 ms
64 bytes from hadoop-worker-1 (172.16.1.11): icmp_req=2 ttl=64 time=0.468 ms
64 bytes from hadoop-worker-1 (172.16.1.11): icmp_req=3 ttl=64 time=0.502 ms
^C
--- hadoop-worker-1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.468/0.607/0.852/0.174 ms
}}}
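Pinging each worker individually becomes tedious as the slice grows. The short script below is only a sketch that automates the same check: it relies on the hadoop-worker-N names that master.sh added to /etc/hosts, and NUM_WORKERS is an assumed value you should set to the number of workers you actually reserved.

{{{
#!/bin/bash
# Ping every worker name that master.sh added to /etc/hosts.
# NUM_WORKERS is an assumption -- set it to the number of workers in your slice.
NUM_WORKERS=2
for i in $(seq 0 $((NUM_WORKERS - 1))); do
    if ping -c 1 -W 2 hadoop-worker-$i > /dev/null 2>&1; then
        echo "hadoop-worker-$i is reachable"
    else
        echo "hadoop-worker-$i is NOT reachable"
    fi
done
}}}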
== 3. Check the status of the Hadoop filesystem ==

=== A. Query for the status of the filesystem and its associated workers ===
{{{
# hadoop dfsadmin -report
Configured Capacity: 54958481408 (51.18 GB)
Present Capacity: 48681934878 (45.34 GB)
DFS Remaining: 48681885696 (45.34 GB)
DFS Used: 49182 (48.03 KB)
DFS Used%: 0%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 172.16.1.11:50010
Rack: /default/rack0
Decommission Status : Normal
Configured Capacity: 27479240704 (25.59 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 3137957873 (2.92 GB)
DFS Remaining: 24341258240(22.67 GB)
DFS Used%: 0%
DFS Remaining%: 88.58%
Last contact: Sat Jan 04 21:49:32 UTC 2014

Name: 172.16.1.10:50010
Rack: /default/rack0
Decommission Status : Normal
Configured Capacity: 27479240704 (25.59 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 3138588657 (2.92 GB)
DFS Remaining: 24340627456(22.67 GB)
DFS Used%: 0%
DFS Remaining%: 88.58%
Last contact: Sat Jan 04 21:49:33 UTC 2014
}}}

== 4. Test the filesystem with a small file ==

=== A. Create a small test file ===
{{{
# echo Hello GENI World > hello.txt
}}}

=== B. Push the file into the Hadoop filesystem ===
{{{
# hadoop fs -put hello.txt hello.txt
}}}

=== C. Check for the file's existence ===
{{{
# hadoop fs -ls
Found 1 items
-rw-r--r--   3 root supergroup         12 2014-01-04 21:59 /user/root/hello.txt
}}}

=== D. Check the contents of the file ===
{{{
# hadoop fs -cat hello.txt
Hello GENI World
}}}

== 5. Run the Hadoop Sort Testcase ==

Test the true power of the Hadoop filesystem by creating and sorting a large random dataset. It may be useful/interesting to log in to the master and/or worker VMs and use tools like top, iotop, and iftop to observe the resource utilization on each of the VMs during the sort test. Note: on these VMs iotop and iftop must be run as root.

=== A. Create a 1 GB random data set ===

After the data is created, use the `ls` functionality (`hadoop fs -ls`) to confirm the data exists; a short example follows the output below. Note that the data is composed of several files in a directory.

{{{
# hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar teragen 10000000 random.data.1G
Generating 10000000 using 2 maps with step of 5000000
14/01/05 18:47:58 INFO mapred.JobClient: Running job: job_201401051828_0003
14/01/05 18:47:59 INFO mapred.JobClient:  map 0% reduce 0%
14/01/05 18:48:14 INFO mapred.JobClient:  map 35% reduce 0%
14/01/05 18:48:17 INFO mapred.JobClient:  map 57% reduce 0%
14/01/05 18:48:20 INFO mapred.JobClient:  map 80% reduce 0%
14/01/05 18:48:26 INFO mapred.JobClient:  map 100% reduce 0%
14/01/05 18:48:28 INFO mapred.JobClient: Job complete: job_201401051828_0003
14/01/05 18:48:28 INFO mapred.JobClient: Counters: 6
14/01/05 18:48:28 INFO mapred.JobClient:   Job Counters
14/01/05 18:48:28 INFO mapred.JobClient:     Launched map tasks=2
14/01/05 18:48:28 INFO mapred.JobClient:   FileSystemCounters
14/01/05 18:48:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
14/01/05 18:48:28 INFO mapred.JobClient:   Map-Reduce Framework
14/01/05 18:48:28 INFO mapred.JobClient:     Map input records=10000000
14/01/05 18:48:28 INFO mapred.JobClient:     Spilled Records=0
14/01/05 18:48:28 INFO mapred.JobClient:     Map input bytes=10000000
14/01/05 18:48:28 INFO mapred.JobClient:     Map output records=10000000
}}}
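To confirm the generated data is there and to see that it is a directory of part files (one per map task), a quick check along these lines should work; the exact listing will differ on your slice:

{{{
# List the directory that teragen created; expect one part-* file per map task.
hadoop fs -ls random.data.1G

# Report the total size in bytes (roughly 1 GB for 10,000,000 100-byte records).
hadoop fs -dus random.data.1G
}}}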
=== B. Sort the dataset ===

Note: you can use Hadoop's `cat` and/or `get` functionality to look at the random and sorted files to confirm their size and that the sort actually worked (a short sketch follows the output below).

{{{
# hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar terasort random.data.1G sorted.data.1G
14/01/05 18:50:49 INFO terasort.TeraSort: starting
14/01/05 18:50:49 INFO mapred.FileInputFormat: Total input paths to process : 2
14/01/05 18:50:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/01/05 18:50:50 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/01/05 18:50:50 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
14/01/05 18:50:50 INFO mapred.JobClient: Running job: job_201401051828_0004
14/01/05 18:50:51 INFO mapred.JobClient:  map 0% reduce 0%
14/01/05 18:51:05 INFO mapred.JobClient:  map 6% reduce 0%
14/01/05 18:51:08 INFO mapred.JobClient:  map 20% reduce 0%
14/01/05 18:51:11 INFO mapred.JobClient:  map 33% reduce 0%
14/01/05 18:51:14 INFO mapred.JobClient:  map 37% reduce 0%
14/01/05 18:51:29 INFO mapred.JobClient:  map 55% reduce 0%
14/01/05 18:51:32 INFO mapred.JobClient:  map 65% reduce 6%
14/01/05 18:51:35 INFO mapred.JobClient:  map 71% reduce 6%
14/01/05 18:51:38 INFO mapred.JobClient:  map 72% reduce 8%
14/01/05 18:51:44 INFO mapred.JobClient:  map 74% reduce 8%
14/01/05 18:51:47 INFO mapred.JobClient:  map 74% reduce 10%
14/01/05 18:51:50 INFO mapred.JobClient:  map 87% reduce 12%
14/01/05 18:51:53 INFO mapred.JobClient:  map 92% reduce 12%
14/01/05 18:51:56 INFO mapred.JobClient:  map 93% reduce 12%
14/01/05 18:52:02 INFO mapred.JobClient:  map 100% reduce 14%
14/01/05 18:52:05 INFO mapred.JobClient:  map 100% reduce 22%
14/01/05 18:52:08 INFO mapred.JobClient:  map 100% reduce 29%
14/01/05 18:52:14 INFO mapred.JobClient:  map 100% reduce 33%
14/01/05 18:52:23 INFO mapred.JobClient:  map 100% reduce 67%
14/01/05 18:52:26 INFO mapred.JobClient:  map 100% reduce 70%
14/01/05 18:52:29 INFO mapred.JobClient:  map 100% reduce 75%
14/01/05 18:52:32 INFO mapred.JobClient:  map 100% reduce 80%
14/01/05 18:52:35 INFO mapred.JobClient:  map 100% reduce 85%
14/01/05 18:52:38 INFO mapred.JobClient:  map 100% reduce 90%
14/01/05 18:52:46 INFO mapred.JobClient:  map 100% reduce 100%
14/01/05 18:52:48 INFO mapred.JobClient: Job complete: job_201401051828_0004
14/01/05 18:52:48 INFO mapred.JobClient: Counters: 18
14/01/05 18:52:48 INFO mapred.JobClient:   Job Counters
14/01/05 18:52:48 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/05 18:52:48 INFO mapred.JobClient:     Launched map tasks=16
14/01/05 18:52:48 INFO mapred.JobClient:     Data-local map tasks=16
14/01/05 18:52:48 INFO mapred.JobClient:   FileSystemCounters
14/01/05 18:52:48 INFO mapred.JobClient:     FILE_BYTES_READ=2382257412
14/01/05 18:52:48 INFO mapred.JobClient:     HDFS_BYTES_READ=1000057358
14/01/05 18:52:48 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3402255956
14/01/05 18:52:48 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
14/01/05 18:52:48 INFO mapred.JobClient:   Map-Reduce Framework
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce input groups=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Combine output records=0
14/01/05 18:52:48 INFO mapred.JobClient:     Map input records=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce shuffle bytes=951549012
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce output records=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Spilled Records=33355441
14/01/05 18:52:48 INFO mapred.JobClient:     Map output bytes=1000000000
14/01/05 18:52:48 INFO mapred.JobClient:     Map input bytes=1000000000
14/01/05 18:52:48 INFO mapred.JobClient:     Combine input records=0
14/01/05 18:52:48 INFO mapred.JobClient:     Map output records=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce input records=10000000
14/01/05 18:52:48 INFO terasort.TeraSort: done
}}}
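To follow up on the note above, you can confirm that the sorted output was written and, if you like, copy it out of HDFS for local inspection. The commands below are a sketch using the paths from the terasort run above; each record contains a random binary key, so the raw contents are not fully human-readable:

{{{
# List the sorted output directory and report its total size in bytes.
hadoop fs -ls sorted.data.1G
hadoop fs -dus sorted.data.1G

# Optionally copy the sorted output to the local filesystem for inspection with local tools.
hadoop fs -get sorted.data.1G /tmp/sorted.data.1G
}}}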
== 6. Advanced Example ==

Re-do the tutorial with a different number of workers, amount of bandwidth, and/or worker instance types. Warning: be courteous to other users and do not use too many of the resources.

=== A. Time the performance of runs with different resources ===

=== B. Observe the largest size file you can create with different resources ===

----

= [wiki:GENIExperimenter/Tutorials/HadoopInASlice Introduction] =

= [wiki:GENIExperimenter/Tutorials/HadoopInASlice/TeardownExperiment Next: Teardown Experiment] =