Introduction to Map-Reduce and Spark in Parallel Programming
Explore the concepts of Map-Reduce and Apache Spark for parallel programming. Understand how to transform and aggregate data using functions, and work with Resilient Distributed Datasets (RDDs) in Spark. Learn how to efficiently process data and perform calculations like estimating Pi using Spark's powerful capabilities.
Presentation Transcript
Map-Reduce + Spark
CSCI 476: Parallel Programming
Professor William Killian
Map
Transform one type into another through some function:
map_function(Value) -> Value2
Example: to_int("1234") -> 1234  (string -> int)
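As a plain-Python sketch (not Spark-specific) of the map idea, where to_int simply stands in for the built-in int conversion from the slide's example:

# Apply a transforming function to every value in a collection.
to_int = int                          # to_int is just the built-in int() conversion

strings = ["1234", "42", "7"]
ints = list(map(to_int, strings))     # each string -> int
print(ints)                           # [1234, 42, 7]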
Reduce
Aggregate values of one type through a function:
reduce_function(Value, Value) -> Value
Example: operator.add(123, 456) -> 579  (int, int -> int)
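Again as a plain-Python sketch, reduce can be expressed with functools.reduce and operator.add, combining values pairwise just as the slide's example does:

import operator
from functools import reduce

values = [123, 456, 21]
total = reduce(operator.add, values)  # ((123 + 456) + 21)
print(total)                          # 600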
Apache Spark
Open-source data processing engine
APIs for Java, Scala, and Python
Key concept: the Resilient Distributed Dataset (RDD)
RDDs
Represent data or transformations on data
Created through textFile() or parallelize()
Transformations and actions can be applied to RDDs; actions return values
Lazy evaluation: nothing will be computed until an action needs the data
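A minimal PySpark sketch of both creation paths and of lazy evaluation; the appName, the sample data, and the commented-out data.txt path are illustrative assumptions (in the pyspark shell, sc is already provided):

from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")      # assumed setup; skip this in the pyspark shell

nums = sc.parallelize([1, 2, 3, 4, 5])     # RDD from an in-memory collection
# lines = sc.textFile("data.txt")          # RDD from a file (hypothetical path)

squares = nums.map(lambda n: n * n)        # transformation: nothing is computed yet (lazy)
print(squares.collect())                   # action: forces computation -> [1, 4, 9, 16, 25]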
Example: Calculating Pi
Given a circle with radius 1, generate a random (x, y) point in the unit square
Do this MANY times
Calculate the ratio of points that fall within the circle
That ratio is approximately Pi / 4
Example: Calculating Pi

import operator
from random import random                 # imports needed to run the slide's code

def sample(p):
    x, y = random(), random()
    return 1 if x*x + y*y < 1 else 0

SAMPLES = 100000000  # change?

count = sc.parallelize(range(SAMPLES)) \
          .map(sample) \
          .reduce(operator.add)            # sc is the SparkContext provided by the pyspark shell

print(f"Pi is approximately {4.0 * count / SAMPLES}")
Spark (sample) Transformations
map(func)
filter(func): new dataset formed by selecting the elements for which func returns True
union(otherRDD)
intersection(otherRDD)
distinct([numTasks]): unique elements
join(otherRDD, [numTasks]): an RDD of (k, v) joined with an RDD of (k, w) creates an RDD of (k, (v, w))
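A short sketch of a few of these transformations on small made-up RDDs (sc as in the earlier sketch); the collect() calls are actions, included only to make the lazy results visible:

nums   = sc.parallelize([1, 2, 2, 3, 4, 5])
evens  = nums.filter(lambda n: n % 2 == 0)   # keep elements where func returns True
unique = nums.distinct()                     # unique elements

left   = sc.parallelize([("a", 1), ("b", 2)])
right  = sc.parallelize([("a", "x"), ("b", "y")])
joined = left.join(right)                    # (k, v) join (k, w) -> (k, (v, w))

print(evens.collect())            # [2, 2, 4]
print(sorted(unique.collect()))   # [1, 2, 3, 4, 5]
print(sorted(joined.collect()))   # [('a', (1, 'x')), ('b', (2, 'y'))]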
Spark (sample) Actions
reduce(func)
collect(): return all elements of the dataset as an array
count(): return the number of elements in the dataset
Remember: actions force calculation; transformations are LAZY
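A small sketch contrasting a lazy transformation with the actions that force it to run (again assuming sc from the earlier sketches):

import operator

nums    = sc.parallelize(range(1, 6))    # 1..5
doubled = nums.map(lambda n: 2 * n)      # transformation: no work happens here

print(doubled.count())                   # action -> 5
print(doubled.collect())                 # action -> [2, 4, 6, 8, 10]
print(doubled.reduce(operator.add))      # action -> 30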
Spark: Remembering Information
If there's data you care about repeatedly, you can cache it with .cache()
This is useful if you have data preprocessing without any actions!
RDD Programming Guide: https://spark.apache.org/docs/latest/rdd-programming-guide.html
API Documentation: https://spark.apache.org/docs/latest/api/python/pyspark.html
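A brief sketch of caching a preprocessed RDD so that later actions reuse it instead of re-running the transformation chain; the tiny preprocessing pipeline here is purely illustrative:

cleaned = sc.parallelize(["3", "1", "oops", "4"]) \
            .filter(str.isdigit) \
            .map(int) \
            .cache()            # keep the preprocessed data in memory

print(cleaned.count())          # first action computes the chain and caches the result -> 3
print(cleaned.sum())            # later actions reuse the cached data -> 8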
Demo
Demo: count-spark.py
http://cs.millersville.edu/~wkillian/2020/spring/files/csci476/map-reduce/
Available on the Linux Lab: ~wkillian/Public/476/map-reduce