
Big Data Storage Techniques & Systems Overview
Explore the key concepts of storing and processing big data, including techniques like striping and partitioning, systems like Google File System (GFS), and frameworks like Hadoop and MapReduce. Discover how modern distributed analytics frameworks simplify big data processing for programmers.
Presentation Transcript
Distributed Systems (CS 15-440): Hadoop
Lecture 18, October 30, 2019
Mohammad Hammoud
Today
Last Session: MPI
Today's Session: Hadoop Distributed File System and MapReduce
Announcements: P3 is out; it is due on November 20 by midnight. Quiz II is on Wednesday, November 13.
What Do We Do With Big Data?
Store, share, access, process, encrypt... and more! We want to do all of these seamlessly.
Where to Store Big Data?
The underlying storage system is a key component for enabling Big Data querying/mining/analytics.
Typically, the storage system would partition and distribute Big Data using striping (or partitioning) and placement techniques. This allows for concurrent accesses to data as well as improves fault tolerance.
[Figure: a logical file of 16 striping units striped round-robin across 4 servers; Server 1 holds units 0, 4, 8, 12; Server 2 holds 1, 5, 9, 13; Server 3 holds 2, 6, 10, 14; Server 4 holds 3, 7, 11, 15.]
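To make the placement rule concrete, here is a minimal sketch (our own illustration, not from the lecture; the 4-server cluster and 1MB striping unit are assumptions) of how round-robin striping maps a logical file offset to a server:

```java
// Sketch of round-robin striping: logical striping unit i of a file lands on
// server (i % numServers), at local position (i / numServers) on that server.
public class RoundRobinStriping {
    public static void main(String[] args) {
        int numServers = 4;           // assumed cluster size, as in the slide's figure
        int stripeUnitSize = 1 << 20; // assumed 1 MB striping unit

        long byteOffset = 5L * stripeUnitSize + 100; // some logical file offset
        long unit = byteOffset / stripeUnitSize;     // striping unit index (here: 5)
        long server = unit % numServers;             // server holding the unit (here: 1)
        long localUnit = unit / numServers;          // position within that server (here: 1)

        System.out.printf("offset %d -> unit %d on server %d (local unit %d)%n",
                byteOffset, unit, server, localUnit);
    }
}
```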
Example: The Google File System
GFS partitions large files into fixed-size blocks and distributes them randomly across cluster machines.
[Figure: a large file on Server 0 (the writer) is divided into 64MB blocks, Blk 0 at offset 0M through Blk 6 at offset 384M; copies of the blocks are spread across Servers 1-3.]
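A similar back-of-the-envelope sketch (again our own; the 4 servers and single-copy placement are assumptions, since GFS actually replicates each block) shows how a byte offset maps to a block under fixed-size 64MB blocks with random placement:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Sketch: fixed-size 64 MB blocks, randomly placed across servers.
public class FixedSizeBlocks {
    static final long BLOCK_SIZE = 64L * 1024 * 1024; // GFS-style 64 MB blocks

    public static void main(String[] args) {
        long fileSize = 448L * 1024 * 1024; // the 7-block file from the figure
        int numBlocks = (int) ((fileSize + BLOCK_SIZE - 1) / BLOCK_SIZE); // ceil -> 7

        // Assign each block to a (pseudo-)random server, as the slide describes.
        Random rng = new Random(42);
        Map<Integer, Integer> placement = new HashMap<>();
        for (int blk = 0; blk < numBlocks; blk++) {
            placement.put(blk, rng.nextInt(4)); // 4 servers, Server 0..3 (assumed)
        }

        long readOffset = 130L * 1024 * 1024;      // a read at offset 130 MB...
        int blk = (int) (readOffset / BLOCK_SIZE); // ...falls in block 2
        System.out.println("offset maps to Blk " + blk + " on Server " + placement.get(blk));
    }
}
```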
Example: The Google File System
GFS adopts a master-slave architecture.
[Figure: a GFS client sends a file name to the master and receives a contact address; it then sends a chunk id and byte range to the corresponding chunk server and receives the chunk data. Each chunk server stores its chunks on a local Linux file system.]
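The read path in the figure can be sketched with hypothetical interfaces (illustrations only, not the real GFS API): the master serves metadata, and the data itself flows directly from a chunk server to the client:

```java
// Hypothetical interfaces sketching the GFS read path described above.
interface Master {
    // Resolve (file name, chunk index) to a chunk id and a chunk-server address.
    ChunkLocation lookup(String fileName, long chunkIndex);
}

interface ChunkServer {
    // Return 'length' bytes of the chunk starting at 'offset'.
    byte[] read(long chunkId, long offset, int length);
}

record ChunkLocation(long chunkId, ChunkServer server) {}

class GfsClient {
    static final long CHUNK_SIZE = 64L * 1024 * 1024; // 64 MB chunks
    private final Master master;

    GfsClient(Master master) { this.master = master; }

    // Read bytes [fileOffset, fileOffset + length) of a file.
    byte[] read(String fileName, long fileOffset, int length) {
        long chunkIndex = fileOffset / CHUNK_SIZE;               // which chunk holds the offset
        ChunkLocation loc = master.lookup(fileName, chunkIndex); // metadata from the master
        long offsetInChunk = fileOffset % CHUNK_SIZE;
        return loc.server().read(loc.chunkId(), offsetInChunk, length); // data from the slave
    }
}
```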
How to Process Big Data?
One alternative: create a custom distributed system (or program) for each new algorithm. Cumbersome!
Another alternative: utilize modern distributed analytics frameworks, which:
- Relieve programmers from concerns with many of the difficult aspects of developing distributed programs
- Allow programmers to focus on ONLY the sequential parts of their programs
Examples: Hadoop MapReduce, Google's Pregel, and CMU's Distributed GraphLab.
Distributed Analytics Frameworks: Hadoop MapReduce
Outline: Introduction; Programming Model; Execution Model; Architectural & Scheduling Models. Up next: Introduction.
Hadoop
Hadoop is one of the most successful realizations of large-scale, data-parallel, distributed analytics frameworks.
Hadoop MapReduce is an open-source implementation of Google's MapReduce.
Hadoop uses the Hadoop Distributed File System (HDFS) as a storage layer; HDFS is an open-source implementation of GFS.
Hadoop MapReduce: A Bird's-Eye View
Hadoop MapReduce incorporates two phases, the Map phase and the Reduce phase, which encompass multiple Map and Reduce tasks.
[Figure: a dataset stored in HDFS is divided into splits, one per HDFS block; each split feeds one Map task, which writes partitioned intermediate output. In the Reduce phase, a Shuffle stage routes each partition to its Reduce task, a Merge & Sort stage groups the keys, and a Reduce stage writes the final output back to HDFS.]
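Using the standard Hadoop MapReduce API, a job that wires these phases together might look like the following driver sketch (WordCountMapper and WordCountReducer are our own class names, defined after the Word Count slide below):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver sketch: wires the Map and Reduce phases of the figure into one job.
public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);   // Map phase: parse & count
        job.setReducerClass(WordCountReducer.class); // Reduce phase: iterate & sum
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // splits come from HDFS blocks
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results written back to HDFS
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // run Map, shuffle, sort, Reduce
    }
}
```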
Distributed Analytics Frameworks: Hadoop MapReduce
Outline: Introduction; Programming Model; Execution Model; Architectural & Scheduling Models. Up next: Programming Model.
The Programming Model
Hadoop MapReduce employs a shared-based programming model, which entails that:
- Tasks can interact (if needed) by reading from and writing to a shared space; HDFS provides this shared space for all Map and Reduce tasks
- Programmers write only sequential code, without defining functions that send/receive messages between tasks; communication is implicit, provided by the MapReduce engine, and programmers do not write or call any communication routines
[Figure: Map tasks MT1-MT6 and Reduce tasks RT1-RT3 interact only through a shared address space provided by HDFS.]
Example: Word Count
Input text file:
  Mohammad is delivering a lecture to the 15-440 class
  The course name of 15-440 is Distributed Systems
Each Map function parses & counts one chunk of the file, turning (Key1 = byte offset, Value1 = line fragment) pairs into (Key2 = word, Value2 = 1) pairs:
  Chunk 1 -> (Mohammad, 1), (is, 1), (delivering, 1), (a, 1), (lecture, 1), (to, 1), (the, 1), (15-440, 1), (class, 1)
  Chunk 2 -> (The, 1), (course, 1), (name, 1), (of, 1), (15-440, 1), (is, 1), (Distributed, 1), (Systems, 1)
The Reduce function iterates & sums the values for each word:
  (Mohammad, 1), (is, 2), (delivering, 1), (a, 1), (lecture, 1), (to, 1), (the, 2), (15-440, 2), (class, 1), (course, 1), (name, 1), (of, 1), (Distributed, 1), (Systems, 1)
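The slide's Parse & Count and Iterate & Sum steps correspond to the canonical Hadoop WordCount Mapper and Reducer, sketched below with the standard org.apache.hadoop.mapreduce API (class names are ours):

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// "Parse & Count": emit (word, 1) for every word in the input split.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            // A real job would typically lowercase tokens so "The" and "the" merge.
            word.set(tokens.nextToken());
            context.write(word, ONE); // e.g., ("15-440", 1)
        }
    }
}

// "Iterate & Sum": add up all the 1s that arrive for the same word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        context.write(word, new IntWritable(sum)); // e.g., ("15-440", 2)
    }
}
```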
Distributed Analytics Frameworks: Hadoop MapReduce
Outline: Introduction; Programming Model; Execution Model; Architectural & Scheduling Models. Up next: Execution Model.
The Execution Model
Hadoop MapReduce adopts a synchronous execution model.
A distributed program (or system) is said to be synchronous if and only if its constituent tasks operate in lock-step mode; no two tasks can run concurrently under two different iterations.
In MapReduce:
- Each iteration is treated as a MapReduce job
- A job can encompass 1 or many Map tasks and 0 or many Reduce tasks
- Programs with multiple iterations (i.e., iterative programs) are executed using multiple chained MapReduce jobs (see the sketch below)
- Only when all Reduce tasks within job i are committed is a new job i + 1 started (if any); hence, two different tasks cannot run in parallel under two different jobs (or iterations)
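A minimal driver sketch of such chaining (our own illustration; a fixed iteration count is assumed and the Mapper/Reducer classes are omitted): job i + 1 is submitted only after waitForCompletion confirms job i has committed.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch: iteration i reads the output of iteration i-1 from HDFS (the shared space);
// job i+1 is only submitted after job i has fully committed.
public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        int iterations = 3; // assumed fixed iteration count for the sketch
        String input = args[0];
        for (int i = 0; i < iterations; i++) {
            String output = args[1] + "/iter-" + i;
            Job job = Job.getInstance(conf, "iteration " + i);
            job.setJarByClass(IterativeDriver.class);
            // setMapperClass / setReducerClass would go here, per the algorithm.
            FileInputFormat.addInputPath(job, new Path(input));
            FileOutputFormat.setOutputPath(job, new Path(output));
            if (!job.waitForCompletion(true)) { // barrier: all Reduce tasks commit...
                System.exit(1);
            }
            input = output;                     // ...before the next job starts
        }
    }
}
```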
Distributed Analytics Frameworks: Hadoop MapReduce
Outline: Introduction; Programming Model; Execution Model; Architectural & Scheduling Models. Up next: Architectural & Scheduling Models.
The Architectural and Scheduling Models
Hadoop MapReduce employs a master-slave architecture.
[Figure: a cluster with a core switch connecting two rack switches; the JobTracker (the master) runs on one node, while TaskTracker1-5 (the slaves) run Map tasks MT1-MT3. A TaskTracker requests a Map task from the JobTracker, which schedules the task at an empty Map slot on that TaskTracker.]
A pull-based task scheduling strategy is used, whereby:
- Map tasks are scheduled in proximity to their HDFS blocks
- Reduce tasks are scheduled anywhere
The Architectural and Scheduling Models
With the above setup, how many Map tasks can run in parallel?
- Each TaskTracker has, by default, two Map slots and can thus run two Map tasks concurrently
- With 4 TaskTrackers and 2 Map slots on each, 8 Map tasks can be executed in parallel
The maximum number of Map tasks that can run in parallel at once is denoted a Map wave.
The Architectural and Scheduling Models
For a dataset with a size of 1024MB, how many Map waves are needed?
- The size of each HDFS block is 64MB by default, and each split encompasses 1 HDFS block by default
- Hence, there will be a total of 1024/64 = 16 HDFS blocks, or 16 splits
- The input to each Map task is a single split, so there will be a total of 16 Map tasks
- Therefore, 16 tasks / 8 slots = 2 Map waves will be needed (the sketch below writes out this arithmetic)
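The same arithmetic as a tiny self-checking sketch (all values taken from the slide's defaults):

```java
// Reproduces the slide's arithmetic: splits -> Map tasks -> Map waves.
public class MapWaves {
    public static void main(String[] args) {
        long dataSizeMB = 1024;     // dataset size from the slide
        long blockSizeMB = 64;      // default HDFS block size (1 block = 1 split)
        int taskTrackers = 4;       // TaskTrackers in the figure
        int mapSlotsPerTracker = 2; // default Map slots per TaskTracker

        long mapTasks = (dataSizeMB + blockSizeMB - 1) / blockSizeMB; // 16
        long slots = (long) taskTrackers * mapSlotsPerTracker;        // 8
        long waves = (mapTasks + slots - 1) / slots;                  // 2

        System.out.printf("%d Map tasks / %d slots = %d Map wave(s)%n",
                mapTasks, slots, waves);
    }
}
```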
Hadoop MapReduce: Summary
Aspect                  Hadoop MapReduce
Programming Model       Shared-Based
Execution Model         Synchronous
Architectural Model     Master-Slave
Scheduling Model        Pull-Based
Suitable Applications   Loosely-Connected/Embarrassingly-Parallel Applications
Next Class: Pregel and GraphLab