HPC-ABDS: The Case for an Integrating Apache Big Data Stack with HPC

1st JTC 1 SGBD Meeting, SDSC, San Diego, March 19, 2014
Judy Qiu, Shantenu Jha (Rutgers), Geoffrey Fox
gcf@indiana.edu
http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington
Enhanced Apache Big Data Stack (ABDS)
~120 capabilities, more than 40 of them Apache projects. Green layers of the stack figure have strong HPC integration opportunities.
Goal: the functionality of ABDS with the performance of HPC.
Broad Layers in HPC-ABDS
- Workflow-Orchestration
- Application and Analytics
- High-level Programming
- Basic programming model and runtime: SPMD, Streaming, MapReduce, MPI
- Inter-process communication: collectives, point-to-point, publish-subscribe
- In-memory databases/caches
- Object-relational mapping
- SQL and NoSQL, file management
- Data transport
- Cluster resource management (Yarn, Slurm, SGE)
- File systems (HDFS, Lustre, ...)
- DevOps (Puppet, Chef, ...)
- IaaS management from HPC to hypervisors (OpenStack)
- Cross-cutting: message protocols, distributed coordination, security & privacy, monitoring
Getting High Performance on Data Analytics (e.g. Mahout, R, ...)
On the systems side, we have two principles:
- The Apache Big Data Stack, with ~120 projects, offers important broad functionality backed by a vital, large support organization.
- HPC, including MPI, has striking success in delivering high performance, but with a fragile sustainability model.
There are key systems abstractions -- levels in the HPC-ABDS software stack -- where the Apache approach needs careful integration with HPC:
- Resource management
- Storage
- Programming model: horizontally scalable parallelism
- Collective and point-to-point communication
- Support of iteration
- Data interface (not just key-value)
In application areas, we define application abstractions to support graphs/networks, geospatial data, images, etc.
4 Forms of MapReduce
(a) Map Only (pleasingly parallel): BLAST analysis, parametric sweeps, distributed search
(b) Classic MapReduce: High Energy Physics (HEP) histograms
(c) Iterative MapReduce: expectation maximization, clustering (e.g. Kmeans), linear algebra, PageRank
(d) Loosely Synchronous (classic MPI): PDE solvers, particle dynamics
Forms (a)-(c) are the domain of MapReduce and its iterative extensions; (d) is the domain of MPI and Giraph. MPI is Map followed by point-to-point or collective communication, as in style (c) plus (d).
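The control flow of form (c) is the pattern that Twister and Harp implement: a map stage re-executed every iteration, a reduce/collective stage that combines partial results, and a driver loop supplying the iteration that classic MapReduce lacks. Below is a minimal, runnable Java sketch of that structure using Kmeans; it runs sequentially purely to show the shape of the pattern and is not the Twister or Harp API.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch of Iterative MapReduce (form c) via Kmeans: a "map" stage assigns
// points to centers and emits partial sums; a "reduce" stage combines them
// into new centers; the driver loop supplies the iteration.
public class IterativeMapReduceSketch {
    public static void main(String[] args) {
        Random rnd = new Random(42);
        double[][] points = new double[1000][2];
        for (double[] p : points) { p[0] = rnd.nextDouble(); p[1] = rnd.nextDouble(); }
        double[][] centers = { points[0].clone(), points[1].clone(), points[2].clone() };

        for (int iter = 0; iter < 20; iter++) {
            int k = centers.length, d = centers[0].length;
            double[][] sums = new double[k][d]; // partial sums: the "map" output
            int[] counts = new int[k];

            // Map stage: assign each point to its nearest center.
            for (double[] p : points) {
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dist = 0;
                    for (int j = 0; j < d; j++)
                        dist += (p[j] - centers[c][j]) * (p[j] - centers[c][j]);
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                for (int j = 0; j < d; j++) sums[best][j] += p[j];
                counts[best]++;
            }

            // Reduce stage: new centers = combined sums / counts. In Harp or
            // MPI this combine is an in-memory allreduce collective rather
            // than Hadoop's disk-based shuffle, which is why they iterate fast.
            for (int c = 0; c < k; c++)
                if (counts[c] > 0)
                    for (int j = 0; j < d; j++) centers[c][j] = sums[c][j] / counts[c];
        }
        System.out.println("Final centers: " + Arrays.deepToString(centers));
    }
}
```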
HPC-ABDS Hourglass
HPC-ABDS system middleware: 120 software projects.
System abstractions/standards (the narrow waist of the hourglass): data format; storage; HPC Yarn for resource management; a horizontally scalable parallel programming model; collective and point-to-point communication; support of iteration.
Application abstractions/standards: graphs, networks, images, geospatial, ...
High performance applications: SPIDAL (Scalable Parallel Interoperable Data Analytics Library), or high performance Mahout, R, Matlab, ...
Integrating Yarn with HPC
Use Cases we are (in part) working on with HPC-ABDS
- Use Case 10, Internet of Things: Yarn, Storm, ActiveMQ
- Use Cases 19 and 20, Genomics: Hadoop, Iterative MapReduce, MPI; much better analytics than Mahout
- Use Case 26, Deep Learning: high-performance distributed GPUs (optimized collectives) with a Python front end (planned)
- Variant of Use Cases 26 and 27, image classification using Kmeans: Iterative MapReduce
- Use Case 28, Twitter: optimized index for HBase, Hadoop, and Iterative MapReduce
- Use Case 30, Network Science: MPI and Giraph for network structure and dynamics (planned)
- Use Case 39, Particle Physics: Iterative MapReduce (proposal written)
- Use Case 43, Radar Image Analysis: Hadoop for many individual images, moving to Iterative MapReduce for global integration over "all" images
- Use Case 44, Radar Images: running on Amazon
Features of the Harp Hadoop Plug-in
- A Hadoop plugin (works on Hadoop 1.2.1 and Hadoop 2.2.0); a schematic of its map-collective pattern follows this list
- Hierarchical data abstraction over arrays, key-values, and graphs, for easy programming expressiveness
- A collective communication model supporting various communication operations on those data abstractions
- Caching, with buffer management for the memory allocation required by computation and communication
- BSP-style parallelism
- Fault tolerance with checkpointing
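To make the map-collective model concrete, here is a small runnable Java simulation of it: several "mappers" (threads here; Hadoop tasks in Harp) each compute a local partial result, then an allreduce-style collective combines the partials in memory so every mapper sees the global value before the next BSP superstep. This simulates only the communication pattern; it is not the Harp API.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// BSP-style map-collective simulation: local "map" work, then an
// allreduce collective at a barrier, repeated for several supersteps.
public class MapCollectiveSketch {
    static final int WORKERS = 4;
    static final double[] partial = new double[WORKERS]; // one slot per mapper
    static volatile double global;                       // the allreduce result

    public static void main(String[] args) throws InterruptedException {
        // The barrier action plays the role of the collective: it runs once
        // when all workers arrive, combining the partials in memory.
        CyclicBarrier allreduce = new CyclicBarrier(WORKERS, () -> {
            double sum = 0;
            for (double p : partial) sum += p;
            global = sum;                                // visible to everyone
        });

        Thread[] threads = new Thread[WORKERS];
        for (int w = 0; w < WORKERS; w++) {
            final int id = w;
            threads[w] = new Thread(() -> {
                for (int iter = 0; iter < 3; iter++) {
                    partial[id] = (id + 1) * (iter + 1); // local "map" work
                    try {
                        allreduce.await();               // collective step
                    } catch (InterruptedException | BrokenBarrierException e) {
                        return;
                    }
                    System.out.printf("worker %d, iter %d, global sum = %.1f%n",
                            id, iter, global);
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
    }
}
```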
Architecture: MapReduce applications and new Map-Collective applications both run on the Harp framework, which sits on MapReduce V2, with YARN as the resource manager.
Performance on the Madrid Cluster (8 nodes)
Kmeans clustering, Harp vs. Hadoop, on 24, 48, and 96 cores, for problem sizes ranging from 100m points with 500 centers, through 10m points with 5k centers, to 1m points with 50k centers. Note the compute is the same in each case, as the product of centers times points is identical (5 x 10^10); moving across the problem sizes increases communication while keeping computation fixed.
Kmeans implementation comparison (identical computation, increasing communication):
- Mahout and Hadoop MR: slow, due to MapReduce
- Python: slow, as scripting
- Spark: Iterative MapReduce, but non-optimal communication
- Harp: Hadoop plug-in with ~MPI collectives
- MPI: fastest, as C not Java
Performance of MPI Kernel Operations
Benchmarks of MPI send/receive and allreduce (MPI.NET C# on Tempest; FastMPJ Java, OMPI-nightly Java, OMPI-trunk Java, and OMPI-trunk C on FG; OMPI-trunk C and Java on Madrid), over Infiniband and Ethernet, for message sizes from 0B to 4MB. Pure Java, as in FastMPJ, is slower than Java interfacing to the C version of MPI.
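The "Java interfacing to C MPI" configuration above is what Open MPI's Java bindings provide. Below is a minimal allreduce timing sketch in that style; it assumes an Open MPI build with the Java bindings enabled (mpi.jar), and the method names used (getRank, allReduce) should be checked against the installed version.

```java
import mpi.*; // Open MPI Java bindings (build with --enable-mpi-java)

// Times repeated allreduce operations, roughly in the spirit of the
// kernel benchmarks above. Launch with: mpirun -np 4 java AllreduceTiming
public class AllreduceTiming {
    public static void main(String[] args) throws MPIException {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();

        int count = 1 << 14;                 // 16K doubles = a 128 KB message
        double[] send = new double[count];
        double[] recv = new double[count];
        java.util.Arrays.fill(send, rank);

        int reps = 100;
        long start = System.nanoTime();
        for (int i = 0; i < reps; i++) {
            // Each call combines send buffers across all ranks with SUM
            // and leaves the identical result in recv on every rank.
            MPI.COMM_WORLD.allReduce(send, recv, count, MPI.DOUBLE, MPI.SUM);
        }
        long elapsed = System.nanoTime() - start;

        if (rank == 0) {
            System.out.printf("allreduce of %d doubles: %.1f us average%n",
                    count, elapsed / 1e3 / reps);
        }
        MPI.Finalize();
    }
}
```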
Use Case 28, Truthy: information diffusion research from Twitter data
Building blocks:
- Yarn
- Parallel query evaluation using Hadoop MapReduce
- Related-hashtag mining algorithm using Hadoop MapReduce
- Meme daily-frequency generation using MapReduce over index tables (a sketch of this style of job follows)
- Parallel force-directed graph layout algorithm using Twister (Harp) iterative MapReduce
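To illustrate the shape of these building blocks, here is a hedged Hadoop MapReduce sketch of a meme daily-frequency job: a word-count-style pass that emits a (meme, date) composite key per occurrence and sums the counts. The tab-separated input format is an assumption for illustration only; the real Truthy pipeline reads index tables in IndexedHBase.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Word-count-shaped job: for each hypothetical input record of the form
// "<meme>\t<date>\t...", emit (meme|date, 1) and sum per key.
public class MemeDailyFrequency {
    public static class MemeMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length < 2) return;           // skip malformed rows
            outKey.set(fields[0] + "|" + fields[1]); // meme|date composite key
            ctx.write(outKey, ONE);
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
                throws IOException, InterruptedException {
            int total = 0;
            for (IntWritable c : counts) total += c.get();
            ctx.write(key, new IntWritable(total));  // daily frequency per meme
        }
    }
}
```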
Use Case 28, Truthy (continued)
Results: two months' data loading time for varied cluster sizes (compared against Hadoop-FS, which is not indexed), and scalability of the iterative graph layout algorithm on Twister.
Pig Performance
Different Kmeans implementations compared: Hadoop, Harp-Hadoop, Pig + HD1 (Hadoop), and Pig + Yarn. Total execution time is plotted against the number of mappers (24, 48, 96) for the same three problem sizes (10m points/5000 centers, 100m/500, 1m/50000).
Lines of Code (Pig vs. native Java implementations):
- meme-cooccur-count in Pig: 152 lines of Pig plus 10 supporting lines (162 total)
- meme-cooccur-count in IndexedHBase (Java): ~434 lines of Java plus 28 lines of Python/Bash (462 total)
- Kmeans in Pig: ~345 lines of Pig plus ~50 supporting lines (395 total)
- Kmeans in Hadoop (Java): 780 lines of Java (780 total)
DACIDR for Gene Analysis (Use Cases 19, 20)
Deterministic Annealing Clustering and Interpolative Dimension Reduction method (DACIDR). We use Hadoop for the pleasingly parallel applications, and Twister (being replaced by Yarn) for the iterative MapReduce applications.
Simplified flow chart of DACIDR: stages include all-pair sequence alignment, pairwise clustering, multidimensional scaling, streaming interpolation, and visualization; sequences are mapped to cluster centers, and existing data is added to find the phylogenetic tree.
Summarizing a million fungi sequences: spherical phylogram visualization. The RAxML result is visualized in FigTree; the spherical phylogram from the new MDS method is visualized in PlotViz.
Lessons / Insights
- Integrate (don't compete) HPC with "commodity big data" (Google to Amazon to enterprise data analytics); i.e., improve Mahout, don't compete with it.
- Use Hadoop plug-ins rather than replacing Hadoop.
- The Enhanced Apache Big Data Stack HPC-ABDS has 120 members -- please improve it!
- HPC-ABDS integration areas include file systems, cluster resource management, file and object data management, inter-process and thread communication, analytics libraries, workflow, and monitoring.