Introduction to Map-Reduce and Spark in Parallel Programming
Explore the concepts of Map-Reduce and Apache Spark for parallel programming. Understand how to transform and aggregate data using functions, and work with Resilient Distributed Datasets (RDDs) in Spark. Learn how to efficiently process data and perform calculations like estimating Pi using Spark's powerful capabilities.
Presentation Transcript
Map-Reduce + Spark
CSCI 476: Parallel Programming
Professor William Killian
Map
Transform one type into another through some function:
map_function(Value) -> Value2
Example: to_int("1234") -> 1234  (string -> int)
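As a plain-Python sketch (not Spark-specific) of the map idea, where to_int simply stands in for the built-in int conversion from the slide's example:

# Apply a transforming function to every value in a collection.
to_int = int                          # to_int is just the built-in int() conversion

strings = ["1234", "42", "7"]
ints = list(map(to_int, strings))     # each string -> int
print(ints)                           # [1234, 42, 7]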
Reduce
Aggregate values of one type through a function:
reduce_function(Value, Value) -> Value
Example: operator.add(123, 456) -> 579  (int, int -> int)
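Again as a plain-Python sketch, reduce can be expressed with functools.reduce and operator.add, combining values pairwise just as the slide's example does:

import operator
from functools import reduce

values = [123, 456, 21]
total = reduce(operator.add, values)  # ((123 + 456) + 21)
print(total)                          # 600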
Apache Spark
Open-source data processing engine
APIs for Java, Scala, and Python
Key concept: the Resilient Distributed Dataset (RDD)
RDDs
Represent data or transformations on data
Created through textFile() or parallelize()
Transformations and actions can be applied to RDDs; actions return values
Lazy evaluation: nothing will be computed until an action needs the data
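A minimal PySpark sketch of both creation paths and of lazy evaluation; the appName, the sample data, and the commented-out data.txt path are illustrative assumptions (in the pyspark shell, sc is already provided):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")      # assumed setup; skip this in the pyspark shell

nums = sc.parallelize([1, 2, 3, 4, 5])     # RDD from an in-memory collection
# lines = sc.textFile("data.txt")          # RDD from a file (hypothetical path)

squares = nums.map(lambda n: n * n)        # transformation: nothing is computed yet (lazy)
print(squares.collect())                   # action: forces computation -> [1, 4, 9, 16, 25]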
Example: Calculating Pi
Given a circle with radius 1, generate a random (x, y) point in the unit square
Do this MANY times
Calculate the ratio of points that fall within the circle
That ratio is approximately Pi / 4
Example: Calculating Pi

import operator
from random import random                 # imports needed to run the slide's code

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

SAMPLES = 100000000  # change?

count = sc.parallelize(range(SAMPLES)) \
          .map(sample) \
          .reduce(operator.add)            # sc is the SparkContext provided by the pyspark shell

print(f"Pi is approximately {4.0 * count / SAMPLES}")
Spark (sample) Transformations
map(func)
filter(func): new dataset formed by selecting the elements for which func returns True
union(otherRDD)
intersection(otherRDD)
distinct([numTasks]): unique elements
join(otherRDD, [numTasks]): an RDD of (k, v) joined with an RDD of (k, w) creates an RDD of (k, (v, w))
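A short sketch of a few of these transformations on small made-up RDDs (sc as in the earlier sketch); the collect() calls are actions, included only to make the lazy results visible:

nums   = sc.parallelize([1, 2, 2, 3, 4, 5])
evens  = nums.filter(lambda n: n % 2 == 0)   # keep elements where func returns True
unique = nums.distinct()                     # unique elements

left   = sc.parallelize([("a", 1), ("b", 2)])
right  = sc.parallelize([("a", "x"), ("b", "y")])
joined = left.join(right)                    # (k, v) join (k, w) -> (k, (v, w))

print(evens.collect())            # [2, 2, 4]
print(sorted(unique.collect()))   # [1, 2, 3, 4, 5]
print(sorted(joined.collect()))   # [('a', (1, 'x')), ('b', (2, 'y'))]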
Spark (sample) Actions
reduce(func)
collect(): return all elements of the dataset as an array
count(): return the number of elements in the dataset
Remember: actions force calculation; transformations are LAZY
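A small sketch contrasting a lazy transformation with the actions that force it to run (again assuming sc from the earlier sketches):

import operator

nums    = sc.parallelize(range(1, 6))    # 1..5
doubled = nums.map(lambda n: 2 * n)      # transformation: no work happens here

print(doubled.count())                   # action -> 5
print(doubled.collect())                 # action -> [2, 4, 6, 8, 10]
print(doubled.reduce(operator.add))      # action -> 30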
Spark: Remembering Information
If there's data you care about repeatedly, you can cache it with .cache()
This is useful if you have data preprocessing without any actions!
RDD Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html
API Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.html
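A brief sketch of caching a preprocessed RDD so that later actions reuse it instead of re-running the transformation chain; the tiny preprocessing pipeline here is purely illustrative:

cleaned = sc.parallelize(["3", "1", "oops", "4"]) \
            .filter(str.isdigit) \
            .map(int) \
            .cache()            # keep the preprocessed data in memory

print(cleaned.count())          # first action computes the chain and caches the result -> 3
print(cleaned.sum())            # later actions reuse the cached data -> 8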
Demo
Demo: count-spark.py
http://cs.millersville.edu/~wkillian/2020/spring/files/csci476/map-reduce/
Available on the Linux Lab: ~wkillian/Public/476/map-reduce