Tutorial: Running a Python Spark application on a vhadoop-deployed Hadoop cluster

posted Apr 28, 2016, 8:26 PM by Joseph Anthony Hermocilla [updated Apr 28, 2016, 8:42 PM]
"Apache Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning."

1. First start a vhadoop cluster as described here.
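Once the cluster is up, it helps to confirm that HDFS and YARN are reachable before submitting anything. These checks are not part of the vhadoop guide itself, but the standard Hadoop commands should work from any node with the Hadoop client configured:

$ hdfs dfsadmin -report
$ yarn node -list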

2. Save the following code as CountUP.py

"""CountUP.py"""
from pyspark import SparkConf, SparkContext
wiki = "/wc-in/tagalog.txt" 
conf = (SparkConf()
         .setMaster("yarn-client")
         .setAppName("CountUP")
         .set("spark.executor.memory", "128m"))
sc = SparkContext(conf = conf)
data = sc.textFile(wiki).cache()
us = data.filter(lambda s: 'Unibersidad' in s).count()
ps = data.filter(lambda s: 'Pilipinas' in s).count()
print "Lines with Unibersidad: %i, lines with Pilipinas: %i" % (us, ps)

3. Run the application.

$ spark-submit CountUP.py
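If the job completes successfully, the script's final print statement writes the two counts to the driver's standard output (possibly interleaved with Spark and YARN log messages), in this form, where the actual counts depend on the input file:

Lines with Unibersidad: <count>, lines with Pilipinas: <count>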