Introduction to Map-Reduce and Spark in Parallel Programming

Map-Reduce + Spark
CSCI 476: Parallel Programming
Professor William Killian
Map
Transform one type to another through some function
map_function(Value) -> Value2
Example: to_int('1234') -> 1234    (string -> int)
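
A quick plain-Python sketch of the same idea, using the built-in int in place of the slide's to_int (no Spark needed):

# Apply int to every element of a list of strings.
strings = ['1', '12', '123', '1234']
ints = list(map(int, strings))
print(ints)   # [1, 12, 123, 1234]
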
Reduce
Aggregate one type together through a function
reduce_function(Value, Value) -> Value
Example: operator.add(123, 456) -> 579    (int, int -> int)
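
A quick plain-Python sketch, folding a list of ints with functools.reduce and operator.add:

# Aggregate a list of ints into a single int.
import operator
from functools import reduce

total = reduce(operator.add, [123, 456, 789])
print(total)   # 1368
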
Apache Spark
Open-Source Data Processing Engine
Java, Scala, Python
Key Concept: Resilient Distributed Dataset (RDD)
RDDs
Represent Data or Transformations on Data
Created through:
textFile()
parallelize()
Transformations and Actions can be applied to RDDs
Actions return values
Lazy evaluation: nothing will be computed until an action needs the data
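
A minimal PySpark sketch of both creation routes and of lazy evaluation, assuming sc is a SparkContext (as in the pyspark shell) and data.txt is a hypothetical input file:

nums  = sc.parallelize([1, 2, 3, 4, 5])    # RDD from an in-memory collection
lines = sc.textFile("data.txt")            # RDD from a text file

squares = nums.map(lambda x: x * x)        # transformation: nothing computed yet
print(squares.collect())                   # action: forces computation -> [1, 4, 9, 16, 25]
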
Example: Calculating Pi
Given a circle with radius 1
Generate a random (x, y) point
Do this MANY times
Calculate the ratio of points that fall within the circle
This ratio is approximately Pi / 4
Example: Calculating Pi

import operator
from random import random

def sample(p):                        # p is the element from range(); its value is unused
    x, y = random(), random()         # random point in the unit square
    return 1 if x*x + y*y < 1 else 0  # 1 if the point falls inside the quarter circle

SAMPLES = 100000000  # change?

# sc is the SparkContext (provided automatically in the pyspark shell)
count = sc.parallelize(range(SAMPLES)) \
          .map(sample) \
          .reduce(operator.add)

print(f"Pi is approximately {4.0 * count / SAMPLES}")
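
Note: sc already exists in the pyspark shell. For a standalone script run with spark-submit, one minimal way to create it first (a sketch, not the only option):

from pyspark import SparkContext
sc = SparkContext(appName="pi")
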
Spark (sample) Transformations
map(func)
filter(func)
  New dataset formed by selecting the elements for which func returns True
union(otherRDD)
intersection(otherRDD)
distinct([numTasks])
  Unique elements
join(otherRDD, [numTasks])
  An RDD of (k, v) joined with an RDD of (k, w) creates an RDD of (k, (v, w))
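
A small sketch chaining a few of these transformations (again assuming sc from the pyspark shell; the results shown ignore ordering):

nums   = sc.parallelize([1, 2, 2, 3, 4, 4, 5])
evens  = nums.filter(lambda x: x % 2 == 0).distinct()   # lazy: [2, 4]

left   = sc.parallelize([("a", 1), ("b", 2)])
right  = sc.parallelize([("a", 9), ("b", 8)])
joined = left.join(right)                               # lazy: [("a", (1, 9)), ("b", (2, 8))]

print(evens.collect(), joined.collect())                # actions force the work
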
Spark (sample) Actions
reduce(func)
collect()
  Return all elements of the dataset as an array
count()
  Return the number of elements in the dataset
Remember: Actions force calculation. Transformations are LAZY
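
A small sketch of these actions forcing an otherwise lazy pipeline (assuming sc from the pyspark shell):

nums    = sc.parallelize(range(10))
doubled = nums.map(lambda x: 2 * x)          # transformation: still lazy

print(doubled.count())                       # 10
print(doubled.reduce(lambda a, b: a + b))    # 90
print(doubled.collect())                     # [0, 2, 4, ..., 18]
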
Spark: Remembering Information
If there's data you reuse repeatedly, you can cache it!
.cache()
This is useful if your data preprocessing is all transformations (no actions yet) and several later actions will reuse the result!
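
A minimal caching sketch (assuming sc from the pyspark shell; big.txt is a hypothetical input file):

# Preprocess with transformations only, then cache so later actions reuse the result.
words = (sc.textFile("big.txt")
           .flatMap(lambda line: line.split())
           .cache())                  # marked for caching; materialized by the first action

print(words.count())                  # first action: reads the file, caches the words
print(words.distinct().count())       # reuses the cached RDD instead of re-reading the file
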
RDD Programming Guide
https://spark.apache.org/docs/latest/rdd-programming-guide.html
API Documentation
https://spark.apache.org/docs/latest/api/python/pyspark.html
Demo: count-spark.py
http://cs.millersville.edu/~wkillian/2020/spring/files/csci476/map-reduce/
Available on Linux Lab:
~wkillian/Public/476/map-reduce

Explore the concepts of Map-Reduce and Apache Spark for parallel programming. Understand how to transform and aggregate data using functions, and work with Resilient Distributed Datasets (RDDs) in Spark. Learn how to efficiently process data and perform calculations like estimating Pi using Spark's powerful capabilities.

  • Map-Reduce
  • Apache Spark
  • Parallel Programming
  • Resilient Distributed Datasets
  • Data Processing
