Tutorial: Running the wordcount example on a Hadoop cluster deployed using vhadoop

posted Jul 30, 2014, 12:18 AM by SRGICS UPLB   [ updated Apr 28, 2016, 8:40 PM by Joseph Anthony Hermocilla ]
Apache Hadoop is a software framework for the distributed processing of large data sets using a simple programming model (MapReduce). To use Hadoop in P2C, we provide a set of utilities that deploy clusters through an interface similar to that of vcluster. In this tutorial, we describe how to run the wordcount example from the Apache Hadoop distribution on a cluster deployed on P2C.
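To get a feel for the MapReduce model before deploying anything, the wordcount computation can be imitated locally with standard Unix tools: the map step emits one word per line, sorting plays the role of the shuffle, and uniq -c performs the per-word reduce. This is only an illustrative analogy with made-up input, not part of the Hadoop distribution.

```shell
# Local analogy of MapReduce wordcount (illustrative input, not cluster code):
# map: tr emits one word per line; shuffle: sort groups identical words;
# reduce: uniq -c counts each group, mirroring wordcount's (word, count) pairs.
printf 'ang bata ang bata si juan\n' | tr -s ' ' '\n' | sort | uniq -c
```

Each output line holds a count and a word, analogous to the word/count pairs that the real wordcount job writes to HDFS.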

1. First, connect to the P2C cloud controller using ssh, with p2cuser as both the username and password. (Note: You need to obtain access permission from the P2C administrator.)

$ ssh p2cuser@

2. Once logged in to the system, run the vhadoop command. The first parameter is the name of the cluster and the second is the number of slaves. In the example below, wordcount is the cluster name and 2 is the number of slaves. This operation may take some time.

$ vhadoop wordcount 2

3. Connect to the front-end node as indicated in the instructions. Use hduser as both the username and password.

$ ssh hduser@

4. Run the initialization script.

$ ./rebuild.sh

5. Check that all the nodes are up. In our example, the number of available datanodes should be 3 if the initialization was successful.

$ hdfs dfsadmin -report
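If you only want the datanode count rather than the full report, you can filter the output. In Hadoop 2.x the report contains a line of the form "Datanodes available: N (N total, 0 dead)" (the exact wording is assumed here from that version's format); the snippet below extracts the count from a sample line of that shape, so it can be tried without a cluster.

```shell
# Sample line in the shape of Hadoop 2.x's dfsadmin report (illustrative only);
# on the cluster you would pipe the real report instead:
#   hdfs dfsadmin -report | grep 'Datanodes available'
report_line='Datanodes available: 3 (3 total, 0 dead)'
# Keep only the number that follows "Datanodes available:":
echo "$report_line" | grep -o 'Datanodes available: [0-9]*' | tr -cd '0-9'
```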

6. Create the wc-in directory in the distributed filesystem to hold the input.

$ hdfs dfs -mkdir /wc-in

7. Check if the directory was created.

$ hdfs dfs -ls /

8. Copy the Tagalog Wikipedia text file from the NLP workshop, tagalog.txt, to wc-in. Note that wc-in is in HDFS, not in the local filesystem, so copying it requires an HDFS command as shown below.

$ hdfs dfs -copyFromLocal examples/tagalog.txt /wc-in

9. Check if the file was copied. You should see tagalog.txt.

$ hdfs dfs -ls /wc-in

10. Run the wordcount application as shown below. (Note: the job will fail if the /wc-out output directory already exists; if you are re-running the job, remove it first with hdfs dfs -rm -r /wc-out.)

$ hadoop jar examples/hadoop-mapreduce-examples-2.4.0.jar wordcount /wc-in /wc-out

11. After the job completes, you will see a new directory in HDFS named wc-out.

$ hdfs dfs -ls /

12. Congratulations! You have successfully run a MapReduce application on a three-node Apache Hadoop cluster! View the actual result of the count using the command below.

$ hdfs dfs -cat /wc-out/part-r-00000 | less
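wordcount writes its result as tab-separated word/count lines in part-r-00000. To see the most frequent words first, sort numerically on the count column. The snippet below demonstrates this on a small made-up sample written locally; on the cluster you would pipe hdfs dfs -cat /wc-out/part-r-00000 into the same sort instead.

```shell
# Simulate wordcount's tab-separated "word<TAB>count" output (made-up numbers):
printf 'ang\t120\nbata\t45\nsi\t80\n' > sample-part-r-00000
# Sort by the count column (field 2), largest first, and show the top entries:
sort -t "$(printf '\t')" -k2,2nr sample-part-r-00000 | head -n 3
```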

You can also view the status of the MapReduce jobs through a web interface at http://<ip of master node>:8080.
HDFS status can be viewed at http://<ip of master node>:50070.

For more information, email jchermocilla@up.edu.ph.