
Concise Dataflow Operations in Spark Framework
Learn about the concise dataflow operations and transformations in Spark, addressing issues with Hadoop, embedding dataflow operations in APIs, and utilizing RDDs for efficient data processing and error recovery.
Presentation Transcript
Other Map-Reduce(-ish) Frameworks: Spark (William Cohen)
Recap: Last month
- More concise languages for map-reduce pipelines
- Abstractions built on top of map-reduce
  - General comments
  - Specific systems: Cascading, Pipes; PIG, Hive; Spark, Flink
Recap: Issues with Hadoop
- Too much typing: programs are not concise
- Too low level: missing abstractions; hard to specify a workflow
- Not well suited to iterative operations (e.g., EM, k-means clustering): workflow and memory-loading issues
Spark
Spark's answers to those issues:
- Too much typing: a set of concise dataflow operations (transformations)
- Too low level, hard to specify a workflow: dataflow operations are embedded in an API together with actions
- Not well suited to iterative operations: sharded files are replaced by RDDs (resilient distributed datasets); RDDs can be cached in cluster memory and recreated to recover from errors
Spark examples
In the example code, spark is a Spark context object.
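The code images from the original slides are not in the transcript, so here is a minimal sketch, assuming PySpark, of how such a context might be created; the application name, master URL, and file path are placeholders, not values from the lecture.

```python
from pyspark import SparkConf, SparkContext

# Minimal sketch: create the context that the examples call `spark`
conf = SparkConf().setAppName("spark-examples").setMaster("local[*]")
spark = SparkContext(conf=conf)

# RDDs are built from the context; the path below is only a placeholder
lines = spark.textFile("hdfs://...")
```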
Spark examples (continued)
- Everything is sharded, as in Hadoop and GuineaPig.
- errors.filter() is a transformation: errors is a data structure that explains HOW to do something, not the result itself.
- count() and collect() are actions: an action actually executes the plan for errors and returns a value.
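A hedged sketch of what the missing code on this slide likely resembled, continuing from the lines RDD above (the variable names are assumptions):

```python
# Transformation: errors is only a plan until an action runs
errors = lines.filter(lambda x: x.startswith("ERROR"))

n = errors.count()             # action: executes the plan, returns a number
everything = errors.collect()  # action: pulls all matching lines back to the driver
```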
Spark examples (continued)
- Everything is sharded, and the shards are stored in the memory of the worker machines (if possible), not on local disk.
- The example modifies errors to be stored in cluster memory, so subsequent actions will be much faster.
- You can also persist() an RDD on disk, which is like marking it as opts(stored=True) in GuineaPig. Spark is not smart about persisting data.
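A small sketch of the caching step described above, again assuming PySpark; the storage level shown in the comment is one possible option, not necessarily the slide's.

```python
from pyspark import StorageLevel

errors.cache()   # keep the shards of errors in worker memory

# alternatively, roughly like opts(stored=True) in GuineaPig, allow spilling to disk:
# errors.persist(StorageLevel.MEMORY_AND_DISK)

errors.count()   # the first action still pays the full cost of computing errors
errors.count()   # later actions reuse the cached shards
```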
Spark examples: wordcount
The labels on the slide's code point out a transformation on (key, value) pairs, which are treated specially, and the final action.
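A minimal wordcount sketch under the same assumptions (the paths are placeholders); here reduceByKey is the (key, value) transformation and saveAsTextFile is the action.

```python
lines = spark.textFile("hdfs://...")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))   # transformation on (key, value) pairs
counts.saveAsTextFile("hdfs://.../wordcounts")     # the action that triggers the work
```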
Spark examples: batch logistic regression
- p.x and w are vectors from the numpy package; Python overloads operations like * and + for vectors.
- reduce is an action: it produces a numpy vector.
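A sketch of the batch logistic-regression loop the notes describe; the dimensions, iteration count, and synthetic points are assumptions for illustration, not the course's data.

```python
import numpy as np
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])   # x: numpy feature vector, y: label in {-1, +1}
D, ITERATIONS = 10, 20                    # illustrative sizes

data = [Point(np.random.rand(D), 1 if np.random.rand() > 0.5 else -1)
        for _ in range(1000)]             # made-up training points
points = spark.parallelize(data).cache()  # dataset of points cached in cluster memory

w = np.random.rand(D)                     # current weights; shipped to workers in the closure
for _ in range(ITERATIONS):
    # map emits one numpy gradient vector per point; reduce (an action) sums them
    gradient = points.map(
        lambda p: (1.0 / (1.0 + np.exp(-p.y * np.dot(w, p.x))) - 1.0) * p.y * p.x
    ).reduce(lambda a, b: a + b)
    w -= gradient
```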
Spark examples: batch logistic regression (continued)
Important note: numpy vectors/matrices are not just syntactic sugar.
- They are much more compact than something like a list of Python floats.
- numpy operations like dot, *, and + are calls to optimized C code.
- A little Python logic around a lot of numpy calls is pretty efficient.
Spark examples: batch logistic regression (continued)
w is defined outside the lambda function, but used inside it. So Python builds a closure (code including the current value of w) and Spark ships it off to each worker. Hence w is copied, and must be treated as read-only.
Spark examples: batch logistic regression (continued)
The dataset of points is cached in cluster memory to reduce I/O.
Spark details: broadcast
Recall: Python builds a closure (code including the current value of w) and Spark ships it off to each worker, so w is copied and must be read-only.
Spark details: broadcast (continued)
An alternative: create a broadcast variable, e.g., w_broad = spark.broadcast(w), which is accessed by the workers via w_broad.value.
- There is little penalty for distributing something that is not used by all workers: what is sent is a small pointer to w (e.g., the name of a file containing a serialized version of w).
- When the value is accessed, some clever all-reduce-like machinery is used to reduce network load.
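A sketch of the broadcast alternative applied to the gradient step above (in PySpark, broadcast lives on the SparkContext and value is an attribute of the broadcast object):

```python
w_broad = spark.broadcast(w)   # ship w to each worker once, not once per task's closure

gradient = points.map(
    lambda p: (1.0 / (1.0 + np.exp(-p.y * np.dot(w_broad.value, p.x))) - 1.0) * p.y * p.x
).reduce(lambda a, b: a + b)
w -= gradient
```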
Spark details: mapPartitions
Common issue: a map task requires loading some small shared value; more generally, a map task requires some sort of initialization before processing a shard.
- GuineaPig: a special Augment sideview pattern for shared values; you can kludge up any initializer using Augment.
- Raw Hadoop: the mapper.configure() and mapper.close() methods.
Spark details: mapPartitions (continued)
- Spark: rdd.mapPartitions(f) will call f(iteratorOverShard) once per shard, and return an iterator over the mapped values. f() can do any setup/close steps it needs.
- Also: there are transformations to partition an RDD with a user-selected function, as in Hadoop. Usually you partition and then persist/cache.
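A toy sketch of per-shard initialization with mapPartitions; the data and stop-word set are made up for illustration.

```python
def count_nonstop(words_in_shard):
    stop = {"the", "a", "of", "and"}         # setup step, executed once per shard
    yield sum(1 for w in words_in_shard if w not in stop)
    # any close/cleanup step would also go here, once per shard

words = spark.parallelize(["the", "cat", "and", "the", "hat"], numSlices=2)
counts_per_shard = words.mapPartitions(count_nonstop)   # one count per partition
total = counts_per_shard.sum()                          # action; total == 2 here
```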
Spark: from logistic regression to matrix factorization (William Cohen)
Recap: recovering latent factors in a matrix
[Figure: the rating matrix V, with n users as rows, m movies as columns, and entries v_ij, is approximated by the product W H, where W is n x r (a short vector of user factors, e.g. (x_i, y_i), per user) and H is r x m (a short vector of movie factors, e.g. (a_j, b_j), per movie).]
V[i,j] = user i's rating of movie j
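A tiny numpy illustration of the shapes involved (the sizes are arbitrary):

```python
import numpy as np

n, m, r = 4, 6, 2            # users, movies, number of latent factors
W = np.random.rand(n, r)     # one row of user factors per user
H = np.random.rand(r, m)     # one column of movie factors per movie
V_hat = W @ H                # V_hat[i, j] approximates user i's rating of movie j
```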
Recap: strata
[Figure omitted: the blocks of the rating matrix grouped into strata.]
MF HW
How do you define the strata?
- First assign rows and columns to blocks, then assign blocks to strata.
- Stratum i is defined by colBlock - rowBlock = i mod K; one stratum is the diagonal, colBlock = rowBlock (the figure highlights the stratum with colBlock - rowBlock = 1 mod K).
[Figure: a 5 x 5 grid of blocks labeled by stratum:
  1 5 4 3 2
  2 1 5 4 3
  3 2 1 5 4
  4 3 2 1 5
  5 4 3 2 1]
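A small sketch of this bookkeeping in Python (0-indexed, so the numbering differs from the 1-indexed figure; the function names are illustrative):

```python
K = 5   # number of row blocks and column blocks

def block_of(i, j, n_rows, n_cols):
    # map a rating at row i, column j to its (rowBlock, colBlock)
    return (i * K // n_rows, j * K // n_cols)

def stratum_of(row_block, col_block):
    # stratum s contains the blocks with colBlock - rowBlock = s (mod K)
    return (col_block - row_block) % K
```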
MF HW
Their algorithm:
    for epoch t = 1, ..., T, in sequence:
        for stratum s = 1, ..., K, in sequence:
            for block b = 1, ... in stratum s, in parallel:
                for triple (i, j, rating) in block b, in sequence:
                    do an SGD step for (i, j, rating)
MF HW
Our algorithm: like the logistic regression training data, cache the rating matrix into cluster memory.
    for epoch t = 1, ..., T, in sequence:                   (like the outer loop for logistic regression)
        for stratum s = 1, ..., K, in sequence:
            distribute H, W to the workers                  (broadcast numpy matrices)
            for block b = 1, ... in stratum s, in parallel:  (sort of like in logistic regression)
                run SGD and collect the updates that are performed, i.e., the deltas (i, j, deltaH) or (i, j, deltaW)
            aggregate the deltas and apply the updates to H, W
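A rough, illustrative PySpark sketch of this loop. All sizes, the learning rate, the regularization constant, the synthetic ratings, and the helper names are assumptions; this is a sketch of the idea, not the homework's reference solution.

```python
import numpy as np

n, m, r, K = 100, 80, 10, 5        # users, movies, rank, number of blocks/strata
T, lrate, lam = 10, 0.01, 0.05     # epochs, learning rate, regularization (made up)

rng = np.random.default_rng(0)     # synthetic ratings for illustration only
ratings = [(int(i), int(j), float(rng.integers(1, 6)))
           for i, j in zip(rng.integers(0, n, 2000), rng.integers(0, m, 2000))]
W, H = rng.random((n, r)), rng.random((r, m))

def block_of(i, j):
    return (i * K // n, j * K // m)          # (rowBlock, colBlock)

# cache the rating matrix in cluster memory, keyed by block
by_block = spark.parallelize(ratings).map(lambda t: (block_of(t[0], t[1]), t)).cache()

def sgd_on_block(triples, Wv, Hv):
    # run SGD inside one block and return the deltas instead of mutating W, H
    dW, dH = {}, {}
    for (i, j, v) in triples:
        wi, hj = Wv[i] + dW.get(i, 0.0), Hv[:, j] + dH.get(j, 0.0)
        err = v - wi.dot(hj)
        dW[i] = dW.get(i, 0.0) + lrate * (err * hj - lam * wi)
        dH[j] = dH.get(j, 0.0) + lrate * (err * wi - lam * hj)
    return [("W", i, d) for i, d in dW.items()] + [("H", j, d) for j, d in dH.items()]

for t in range(T):                                       # epochs, in sequence
    for s in range(K):                                   # strata, in sequence
        Wb, Hb = spark.broadcast(W), spark.broadcast(H)  # distribute W, H to workers
        deltas = (by_block
                  .filter(lambda kv: (kv[0][1] - kv[0][0]) % K == s)   # blocks in stratum s
                  .groupByKey()                                        # one group per block
                  .flatMap(lambda kv: sgd_on_block(kv[1], Wb.value, Hb.value))
                  .collect())                                          # gather the deltas
        for which, idx, d in deltas:                     # apply the aggregated updates
            if which == "W":
                W[idx] += d
            else:
                H[:, idx] += d
```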