"Apache Spark is a fast and general processing engine compatible with Hadoop
data. It can run in Hadoop clusters through YARN or Spark's standalone
mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any
Hadoop InputFormat. It is designed to perform both batch processing
(similar to MapReduce) and new workloads like streaming, interactive
queries, and machine learning."
"""CountUP.py""" from pyspark import SparkConf, SparkContext wiki = "/wc-in/tagalog.txt" conf = (SparkConf() .setMaster("yarn-client") .setAppName("CountUP") .set("spark.executor.memory", "128m")) sc = SparkContext(conf = conf) data = sc.textFile(wiki).cache() us = data.filter(lambda s: 'Unibersidad' in s).count() ps = data.filter(lambda s: 'Pilipinas' in s).count() print "Lines with Unibersidad: %i, lines with Pilipinas: %i" % (us, ps) $spark-submit CountUP.py |