[[PageOutline]]

= [.. Hadoop in a Slice] =
Now that you have reserved your resources, you are ready to log in to the slice and run some Hadoop examples.

== 1. Configure the Hadoop cluster ==

=== 1.1. Login to Hadoop Master ===
 1. Log in (ssh) to the master using the output of the readyToLogin script, or gather the necessary information from whichever tool you used for the reservation. The ssh application you use will depend on the configuration of your laptop/desktop.
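For example, from a Linux or macOS terminal the login typically looks like the following. This is only a sketch: the key file, user name, and host below are placeholders, so substitute the values reported by readyToLogin or by your reservation tool.
{{{
# Placeholders only -- use the key path, user name, and host from your own reservation
ssh -i ~/.ssh/your_geni_key your_username@your.master.hostname.example.net
}}}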
=== 1.2. Look around your node ===

 a. Observe the properties of the network interfaces.
{{{
# ifconfig
ens3: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 10.103.0.17  netmask 255.255.255.0  broadcast 10.103.0.255
        inet6 fe80::f816:3eff:fe31:db4  prefixlen 64  scopeid 0x20<link>
        ether fa:16:3e:31:0d:b4  txqueuelen 1000  (Ethernet)
        RX packets 99237  bytes 12928515 (12.3 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 55213  bytes 44478177 (42.4 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 172.16.1.1  netmask 255.255.255.0  broadcast 172.16.1.255
        inet6 fe80::fc16:3eff:fe00:5b82  prefixlen 64  scopeid 0x20<link>
        ether fe:16:3e:00:5b:82  txqueuelen 1000  (Ethernet)
        RX packets 765818  bytes 1450907680 (1.3 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 511355  bytes 42217300 (40.2 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 0  (Local Loopback)
        RX packets 13502  bytes 1189668 (1.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 13502  bytes 1189668 (1.1 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
}}}
 a. Observe the contents of the NEuca user data file. This file includes the boot script that downloads and runs the configuration script you specified for the VM.
{{{
# neuca-user-data
[global]
actor_id=b0e0e413-77ef-4775-b476-040ca8377d1d
slice_id=e61547ac-a45d-4228-8483-dfe51945288d
reservation_id=7163823e-cf26-45ec-960b-a833522d2693
unit_id=8679cd5d-7cf2-4fea-9f51-6832e6ce0358
;router= Not Specified
;iscsi_initiator_iqn= Not Specified
slice_name=urn:publicid:IDN+ch.geni.net:GRW15Illinois+slice+testhadoop
unit_url=http://geni-orca.renci.org/owl/f3427516-3356-4525-97a6-752df5a7ec68#master
host_name=master
management_ip=128.120.83.37
physical_host=ucd-w4
nova_id=b7a54b3d-1410-4540-b57f-e5cf4827b3c3
[users]
....
[interfaces]
fe163e005b82=up:ipv4:172.16.1.1/24
[storage]
[routes]
[scripts]
bootscript=#!/bin/bash
# Automatically generated boot script
# wget or curl must be installed on the image
mkdir -p /home/hadoop/
cd /home/hadoop/
if [ -x `which wget 2>/dev/null` ]; then
  wget -q -O `basename http://geni-images.renci.org/images/tutorials/GENI-hadoop/hadoop_config_dynamic.sh.tgz` http://geni-images.renci.org/images/tutorials/GENI-hadoop/hadoop_config_dynamic.sh.tgz
else if [ -x `which curl 2>/dev/null` ]; then
  curl http://geni-images.renci.org/images/tutorials/GENI-hadoop/hadoop_config_dynamic.sh.tgz > `basename http://geni-images.renci.org/images/tutorials/GENI-hadoop/hadoop_config_dynamic.sh.tgz`
fi
fi
# untar
tar -zxf `basename http://geni-images.renci.org/images/tutorials/GENI-hadoop/hadoop_config_dynamic.sh.tgz`
eval "/bin/sh -c \"/home/hadoop/hadoop_config_dynamic.sh 2\""
}}}
 a. Observe the contents of the script that was installed and executed on the VM.
{{{
# sudo cat /home/hadoop/hadoop_config_dynamic.sh
#!/bin/bash

HADOOP_LOG='/home/hadoop/hadoop_boot.log'
echo "Hello from neuca script" > $HADOOP_LOG
....
}}}
 a. Test for connectivity between the VMs.
{{{
# ping worker-0
PING worker-0 (172.16.1.10) 56(84) bytes of data.
64 bytes from worker-0 (172.16.1.10): icmp_seq=1 ttl=64 time=1.09 ms
64 bytes from worker-0 (172.16.1.10): icmp_seq=2 ttl=64 time=0.772 ms
64 bytes from worker-0 (172.16.1.10): icmp_seq=3 ttl=64 time=0.888 ms

# ping worker-1
PING worker-1 (172.16.1.11) 56(84) bytes of data.
... etc
}}}

=== 1.3. Configure the Hadoop filesystem ===

 a. Switch to the ‘hadoop’ user.
{{{
sudo su hadoop -
}}}
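Before formatting anything, it can be worth confirming that you are now the hadoop user and that the Hadoop commands are on your PATH. This is only an optional sanity check; the exact install location depends on the image.
{{{
whoami        # should print: hadoop
which hdfs    # should print the path of the hdfs command
}}}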
 b. Format the Hadoop filesystem:
{{{
hdfs namenode -format
}}}
 c. Start the Hadoop services:
{{{
start-dfs.sh
}}}
{{{
start-yarn.sh
}}}

== 2. Check the status of the Hadoop filesystem ==

=== A. Query for the status of the filesystem and its associated workers. ===
{{{
# hadoop dfsadmin -report
Configured Capacity: 54958481408 (51.18 GB)
Present Capacity: 48681934878 (45.34 GB)
DFS Remaining: 48681885696 (45.34 GB)
DFS Used: 49182 (48.03 KB)
DFS Used%: 0%
Under replicated blocks: 1
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)

Name: 172.16.1.11:50010
Rack: /default/rack0
Decommission Status : Normal
Configured Capacity: 27479240704 (25.59 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 3137957873 (2.92 GB)
DFS Remaining: 24341258240(22.67 GB)
DFS Used%: 0%
DFS Remaining%: 88.58%
Last contact: Sat Jan 04 21:49:32 UTC 2014

Name: 172.16.1.10:50010
Rack: /default/rack0
Decommission Status : Normal
Configured Capacity: 27479240704 (25.59 GB)
DFS Used: 24591 (24.01 KB)
Non DFS Used: 3138588657 (2.92 GB)
DFS Remaining: 24340627456(22.67 GB)
DFS Used%: 0%
DFS Remaining%: 88.58%
Last contact: Sat Jan 04 21:49:33 UTC 2014
}}}

== 3. Test the filesystem with a small file ==

=== A. Create a small test file ===
{{{
# echo Hello GENI World > hello.txt
}}}
=== B. Push the file into the Hadoop filesystem ===
{{{
# hadoop fs -put hello.txt hello.txt
}}}
=== C. Check for the file's existence ===
{{{
# hadoop fs -ls
Found 1 items
-rw-r--r--   3 root supergroup         12 2014-01-04 21:59 /user/root/hello.txt
}}}
=== D. Check the contents of the file ===
{{{
# hadoop fs -cat hello.txt
Hello GENI World
}}}

== 4. Run the Hadoop Sort Testcase ==

Test the true power of the Hadoop filesystem by creating and sorting a large random dataset. It may be useful and interesting to log in to the master and/or worker VMs and use tools like top, iotop, and iftop to observe the resource utilization on each of the VMs during the sort test (a monitoring sketch follows step A below). Note: on these VMs, iotop and iftop must be run as root.

=== A. Create a 1 GB random data set. ===

After the data is created, use the ls functionality to confirm that the data exists. Note that the data is composed of several files in a directory.
{{{
# hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar teragen 10000000 random.data.1G
Generating 10000000 using 2 maps with step of 5000000
14/01/05 18:47:58 INFO mapred.JobClient: Running job: job_201401051828_0003
14/01/05 18:47:59 INFO mapred.JobClient:  map 0% reduce 0%
14/01/05 18:48:14 INFO mapred.JobClient:  map 35% reduce 0%
14/01/05 18:48:17 INFO mapred.JobClient:  map 57% reduce 0%
14/01/05 18:48:20 INFO mapred.JobClient:  map 80% reduce 0%
14/01/05 18:48:26 INFO mapred.JobClient:  map 100% reduce 0%
14/01/05 18:48:28 INFO mapred.JobClient: Job complete: job_201401051828_0003
14/01/05 18:48:28 INFO mapred.JobClient: Counters: 6
14/01/05 18:48:28 INFO mapred.JobClient:   Job Counters
14/01/05 18:48:28 INFO mapred.JobClient:     Launched map tasks=2
14/01/05 18:48:28 INFO mapred.JobClient:   FileSystemCounters
14/01/05 18:48:28 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
14/01/05 18:48:28 INFO mapred.JobClient:   Map-Reduce Framework
14/01/05 18:48:28 INFO mapred.JobClient:     Map input records=10000000
14/01/05 18:48:28 INFO mapred.JobClient:     Spilled Records=0
14/01/05 18:48:28 INFO mapred.JobClient:     Map input bytes=10000000
14/01/05 18:48:28 INFO mapred.JobClient:     Map output records=10000000
}}}
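While the sort in step B runs, you can watch resource usage from a second ssh session on the master or a worker, as suggested above. This is only a sketch: it assumes top, iotop, and iftop are present on the image, and that eth0 is the dataplane interface on your VM (check with ifconfig).
{{{
top                   # overall CPU and memory usage
sudo iotop            # per-process disk I/O (must be run as root on these VMs)
sudo iftop -i eth0    # network traffic on the dataplane interface
}}}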
=== B. Sort the dataset. ===
{{{
# hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar terasort random.data.1G sorted.data.1G
14/01/05 18:50:49 INFO terasort.TeraSort: starting
14/01/05 18:50:49 INFO mapred.FileInputFormat: Total input paths to process : 2
14/01/05 18:50:50 INFO util.NativeCodeLoader: Loaded the native-hadoop library
14/01/05 18:50:50 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
14/01/05 18:50:50 INFO compress.CodecPool: Got brand-new compressor
Making 1 from 100000 records
Step size is 100000.0
14/01/05 18:50:50 INFO mapred.JobClient: Running job: job_201401051828_0004
14/01/05 18:50:51 INFO mapred.JobClient:  map 0% reduce 0%
14/01/05 18:51:05 INFO mapred.JobClient:  map 6% reduce 0%
14/01/05 18:51:08 INFO mapred.JobClient:  map 20% reduce 0%
14/01/05 18:51:11 INFO mapred.JobClient:  map 33% reduce 0%
14/01/05 18:51:14 INFO mapred.JobClient:  map 37% reduce 0%
14/01/05 18:51:29 INFO mapred.JobClient:  map 55% reduce 0%
14/01/05 18:51:32 INFO mapred.JobClient:  map 65% reduce 6%
14/01/05 18:51:35 INFO mapred.JobClient:  map 71% reduce 6%
14/01/05 18:51:38 INFO mapred.JobClient:  map 72% reduce 8%
14/01/05 18:51:44 INFO mapred.JobClient:  map 74% reduce 8%
14/01/05 18:51:47 INFO mapred.JobClient:  map 74% reduce 10%
14/01/05 18:51:50 INFO mapred.JobClient:  map 87% reduce 12%
14/01/05 18:51:53 INFO mapred.JobClient:  map 92% reduce 12%
14/01/05 18:51:56 INFO mapred.JobClient:  map 93% reduce 12%
14/01/05 18:52:02 INFO mapred.JobClient:  map 100% reduce 14%
14/01/05 18:52:05 INFO mapred.JobClient:  map 100% reduce 22%
14/01/05 18:52:08 INFO mapred.JobClient:  map 100% reduce 29%
14/01/05 18:52:14 INFO mapred.JobClient:  map 100% reduce 33%
14/01/05 18:52:23 INFO mapred.JobClient:  map 100% reduce 67%
14/01/05 18:52:26 INFO mapred.JobClient:  map 100% reduce 70%
14/01/05 18:52:29 INFO mapred.JobClient:  map 100% reduce 75%
14/01/05 18:52:32 INFO mapred.JobClient:  map 100% reduce 80%
14/01/05 18:52:35 INFO mapred.JobClient:  map 100% reduce 85%
14/01/05 18:52:38 INFO mapred.JobClient:  map 100% reduce 90%
14/01/05 18:52:46 INFO mapred.JobClient:  map 100% reduce 100%
14/01/05 18:52:48 INFO mapred.JobClient: Job complete: job_201401051828_0004
14/01/05 18:52:48 INFO mapred.JobClient: Counters: 18
14/01/05 18:52:48 INFO mapred.JobClient:   Job Counters
14/01/05 18:52:48 INFO mapred.JobClient:     Launched reduce tasks=1
14/01/05 18:52:48 INFO mapred.JobClient:     Launched map tasks=16
14/01/05 18:52:48 INFO mapred.JobClient:     Data-local map tasks=16
14/01/05 18:52:48 INFO mapred.JobClient:   FileSystemCounters
14/01/05 18:52:48 INFO mapred.JobClient:     FILE_BYTES_READ=2382257412
14/01/05 18:52:48 INFO mapred.JobClient:     HDFS_BYTES_READ=1000057358
14/01/05 18:52:48 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=3402255956
14/01/05 18:52:48 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=1000000000
14/01/05 18:52:48 INFO mapred.JobClient:   Map-Reduce Framework
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce input groups=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Combine output records=0
14/01/05 18:52:48 INFO mapred.JobClient:     Map input records=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce shuffle bytes=951549012
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce output records=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Spilled Records=33355441
14/01/05 18:52:48 INFO mapred.JobClient:     Map output bytes=1000000000
14/01/05 18:52:48 INFO mapred.JobClient:     Map input bytes=1000000000
14/01/05 18:52:48 INFO mapred.JobClient:     Combine input records=0
14/01/05 18:52:48 INFO mapred.JobClient:     Map output records=10000000
14/01/05 18:52:48 INFO mapred.JobClient:     Reduce input records=10000000
14/01/05 18:52:48 INFO terasort.TeraSort: done
}}}
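As an optional follow-up to step B, the same examples jar also ships a teravalidate job that scans the output for out-of-order records. This is only a sketch: the report directory name below is arbitrary, and the exact contents of the report vary by Hadoop version (an empty or error-free report indicates the data is sorted).
{{{
# hadoop jar /usr/local/hadoop-0.20.2/hadoop-0.20.2-examples.jar teravalidate sorted.data.1G validate.report
# hadoop fs -cat validate.report/part-00000
}}}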
=== C. Look at the output. ===

You can use Hadoop's cat and/or get functionality to look at the random and sorted files, to confirm their size and that the sort actually worked. Try some or all of these commands. Does the output make sense to you?
{{{
hadoop fs -ls random.data.1G
hadoop fs -ls sorted.data.1G
hadoop fs -cat random.data.1G/part-00000 | less
hadoop fs -cat sorted.data.1G/part-00000 | less
}}}

== 5. Advanced Example ==

Re-do the tutorial with a different number of workers, amount of bandwidth, and/or worker instance types. Warning: be courteous to other users and do not use too many resources.

=== A. Time the performance of runs with different resources. ===
=== B. Observe the largest file you can create with different resources. ===

----
= [.. Introduction] =
= [../TeardownExperiment Next: Teardown Experiment] =