Introduction to Map-Reduce and Spark in Parallel Programming

Explore the concepts of Map-Reduce and Apache Spark for parallel programming. Understand how to transform and aggregate data using functions, and work with Resilient Distributed Datasets (RDDs) in Spark. Learn how to efficiently process data and perform calculations like estimating Pi using Spark's powerful capabilities.


Presentation Transcript


  1. Map-Reduce + Spark CSCI 476: Parallel Programming Professor William Killian

  2. Map
     Transform one type to another through some function: map_function(Value) -> Value2
     Example: to_int("1234") -> 1234  (string in, int out)
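A minimal plain-Python sketch of the map idea (the variable names below are illustrative, not from the slides):

     # Apply a one-argument function to every element, producing a new sequence.
     strings = ["123", "456", "789"]
     ints = list(map(int, strings))   # int acts as map_function(string) -> int
     print(ints)                      # [123, 456, 789]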

  3. Reduce
     Aggregate values of one type together through a function: reduce_function(Value, Value) -> Value
     Example: operator.add(123, 456) -> 579  (int, int -> int)
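A matching plain-Python sketch of reduce, using operator.add as on the slide:

     import operator
     from functools import reduce

     # Repeatedly combine pairs of values with a two-argument function.
     total = reduce(operator.add, [123, 456, 789])
     print(total)   # 1368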

  4. Apache Spark
     Open-source data processing engine
     Java, Scala, Python
     Key concept: Resilient Distributed Dataset (RDD)
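The later slides use an existing SparkContext named sc. As a hedged sketch, one common way to obtain it in a standalone PySpark script (the application name and the local[*] master are assumptions, not from the slides):

     from pyspark import SparkConf, SparkContext

     # In the interactive pyspark shell, `sc` already exists; in a script you create it.
     conf = SparkConf().setAppName("csci476-demo").setMaster("local[*]")
     sc = SparkContext(conf=conf)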

  5. RDDs
     Represent data or transformations on data
     Created through textFile(), parallelize(), or transformations
     Actions can be applied to RDDs; actions return values
     Lazy evaluation: nothing will be computed until an action needs the data
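A small sketch illustrating lazy evaluation with an RDD (assumes the SparkContext sc from the sketch above):

     rdd = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a Python list
     doubled = rdd.map(lambda x: x * 2)      # transformation: lazy, nothing runs yet
     print(doubled.collect())                # action: triggers the computation -> [2, 4, 6, 8, 10]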

  6. Example: Calculating Pi
     Given a circle with radius of 1
     Generate a random (x, y) point in the unit square
     Do this MANY times
     Calculate the ratio of points that fall within the circle
     The ratio is approximately Pi / 4
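Before the Spark version on the next slide, a sequential plain-Python sketch of the same Monte Carlo estimate (the sample count here is chosen arbitrarily):

     import random

     SAMPLES = 1_000_000
     inside = sum(1 for _ in range(SAMPLES)
                  if random.random()**2 + random.random()**2 < 1)
     print(f"Pi is approximately {4.0 * inside / SAMPLES}")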

  7. Example: Calculating Pi
     import operator
     from random import random

     def sample(p):
         x, y = random(), random()
         return 1 if x*x + y*y < 1 else 0

     SAMPLES = 100000000  # change?

     count = sc.parallelize(range(SAMPLES)) \
               .map(sample) \
               .reduce(operator.add)

     print(f"Pi is approximately {4.0 * count / SAMPLES}")

  8. Spark (sample) Transformations
     map(func)
     filter(func): new dataset formed by selecting the elements for which func returns True
     union(otherRDD)
     intersection(otherRDD)
     distinct([numTasks]): unique elements
     join(otherRDD, [numTasks]): an RDD of (k, v) joined with an RDD of (k, w) creates an RDD of (k, (v, w))
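A sketch of a few of these transformations in PySpark (again assuming the SparkContext sc; every call below stays lazy until an action runs):

     nums = sc.parallelize([1, 2, 2, 3, 4, 5])
     evens = nums.filter(lambda x: x % 2 == 0)    # keep elements where func returns True
     unique = nums.distinct()                     # unique elements
     pairs_a = sc.parallelize([("a", 1), ("b", 2)])
     pairs_b = sc.parallelize([("a", 9), ("b", 8)])
     joined = pairs_a.join(pairs_b)               # (k, v) with (k, w) -> (k, (v, w))
     print(evens.collect(), unique.collect(), joined.collect())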

  9. Spark (sample) Actions
     reduce(func)
     collect(): return all elements of the dataset as an array
     count(): return the number of elements in the dataset
     Remember: actions force calculation; transformations are LAZY
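A sketch showing each action forcing a lazy transformation to run (assumes sc):

     rdd = sc.parallelize(range(10)).map(lambda x: x * x)   # still lazy at this point
     print(rdd.count())                      # 10
     print(rdd.collect())                    # [0, 1, 4, ..., 81]
     print(rdd.reduce(lambda a, b: a + b))   # 285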

  10. Spark: Remembering Information
     If there's data you care about repeatedly, you can cache it with .cache()
     This is useful if you have data preprocessing without any actions!
     RDD Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html
     API Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.html
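A hedged sketch of caching a preprocessed RDD that several later actions reuse (the file path "data.txt" is a placeholder, not from the slides):

     cleaned = sc.textFile("data.txt") \
                 .map(lambda line: line.strip().lower()) \
                 .cache()                    # keep the preprocessed data in memory
     print(cleaned.count())                  # first action materializes and caches it
     print(cleaned.filter(lambda s: "spark" in s).count())   # later actions reuse the cached data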

  11. Demo
     Demo: count-spark.py
     http://cs.millersville.edu/~wkillian/2020/spring/files/csci476/map-reduce/
     Available on the Linux Lab: ~wkillian/Public/476/map-reduce
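count-spark.py itself is not reproduced here; a hypothetical word-count sketch in the same map/reduce style might look like the following (the input path is a placeholder, and sc is the SparkContext from earlier):

     import operator

     counts = sc.textFile("input.txt") \
                .flatMap(lambda line: line.split()) \
                .map(lambda word: (word, 1)) \
                .reduceByKey(operator.add)
     print(counts.take(10))   # action: first ten (word, count) pairs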
