
Innovative Big Data Analytics Appliance and Solutions
Dive into the world of big data analytics with BlueDBM, an appliance designed for analyzing unprecedented amounts of data to uncover deep insights. Explore how cutting-edge technologies like flash-based solutions are revolutionizing the field, and discover the impact of distributed flash-based analytics on latency profiles. Learn about related works and architectural modifications that enhance efficiency in data processing.
Presentation Transcript
BlueDBM: An Appliance for Big Data Analytics
Sang-Woo Jun*, Ming Liu*, Sungjin Lee*, Jamey Hicks+, John Ankcorn+, Myron King+, Shuotao Xu*, Arvind*
*MIT Computer Science and Artificial Intelligence Laboratory
+Quanta Research Cambridge
ISCA 2015, Portland, OR. June 15, 2015
This work is funded by Quanta, Samsung and Lincoln Laboratory. We also thank Xilinx for their hardware and expertise donations.

Big data analytics
Analysis of previously unimaginable amounts of data can provide deep insight
  Google has predicted flu outbreaks a week earlier than the Centers for Disease Control (CDC)
  Analyzing a personal genome can determine predisposition to diseases
  Social network chatter analysis can identify political revolutions before newspapers
  Scientific datasets can be mined to extract accurate models
Likely to be the biggest economic driver for the IT industry for the next decade

A currently popular solution: RAM Cloud
Cluster of machines with large DRAM capacity and fast interconnect
  + Fastest as long as data fits in DRAM
  - Power hungry and expensive
  - Performance drops when data doesn't fit in DRAM
What if enough DRAM isn't affordable?
Flash-based solutions may be a better alternative
  + Faster than disk, cheaper than DRAM
  + Lower power consumption than both
  - Legacy storage access interface is burdensome
  - Slower than DRAM

Related work
Use of flash
  SSDs, FusionIO, Purestorage
  Zetascale
  SSD for database buffer pool and metadata [SIGMOD 2008], [IJCA 2013]
Networks
  QuickSAN [ISCA 2013]
  Hadoop/Spark on Infiniband RDMA [SC 2012]
Accelerators
  SmartSSD [SIGMOD 2013], Ibex [VLDB 2014]
  Catapult [ISCA 2014]
  GPUs

Latency profile of distributed flash-based analytics
Distributed processing involves many system components
  Flash device access
  Storage software (OS, FTL, ...)
  Network interface (10gE, Infiniband, ...)
  Actual processing
[Figure: access path with per-component latency]
  Flash access: 75 μs (range 50~100 μs)
  Storage software: 100 μs (range 100~1000 μs)
  Network: 20 μs (range 20~1000 μs)
  Processing
Latency is additive

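The "latency is additive" point can be made concrete with a small calculation. This is only a sketch: it assumes the units lost in the transcript are microseconds and that the three ranges on the slide belong to flash access, storage software, and network, in that order. The names `LATENCY_US` and `end_to_end_latency_us` are made up for illustration.

```python
# Per-component latency ranges in microseconds, as read from the slide.
# Assumption: the three ranges map to these components in slide order.
# Processing time is application dependent and is left out.
LATENCY_US = {
    "flash_access": (50, 100),
    "storage_software": (100, 1000),
    "network": (20, 1000),
}


def end_to_end_latency_us(components):
    """Sum the (low, high) latency bounds of every component on the access path."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return low, high


low, high = end_to_end_latency_us(LATENCY_US)
print(f"remote flash access: {low} to {high} us before any processing")
```

Under these assumptions, storage software and network can dominate the raw flash access time at the high end of their ranges.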
Latency profile of distributed flash-based analytics
Architectural modifications can remove unnecessary overhead
  Near-storage processing
  Cross-layer optimization of flash management software*
  Dedicated storage area network
  Accelerator
[Figure: the same access path (Flash access 75 μs, Storage software 100 μs, Network 20 μs, Processing), annotated with 50~100 μs, 100~1000 μs, < 20 μs, 20~1000 μs]
Difficult to explore using flash packaged as off-the-shelf SSDs

Custom flash card had to be built
[Figure: custom flash card with an Artix 7 FPGA, flash chips on Bus 0 through Bus 3 (flash array on both sides), network ports, and an HPC FMC port to the VC707]

BlueDBM: Platform with near-storage processing and inter-controller networks
  20 24-core Xeon servers
  20 BlueDBM storage devices
  1TB flash storage
  x4 20Gbps controller network
  Xilinx VC707
  2GB/s PCIe

BlueDBM: Platform with near-storage processing and inter-controller networks
[Photos: 1 of 2 racks (10 nodes); BlueDBM storage device]

BlueDBM node architecture
Lightweight flash management with very low overhead
  Adds almost no latency
  ECC support
Custom network protocol with low latency/high bandwidth
  x4 20Gbps links at 0.5us latency
  Virtual channels with flow control
Software has very low level access to flash storage
  High level information can be used for low level management
  FTL implemented inside file system
No time to go into gritty details!
[Figure: node block diagram with Flash Device, Flash Controller, Flash Interface, Network Interface, Network, In-Storage Processor, PCIe, Host Server]

BlueDBM software view
[Figure: software stack
  User-space: Hardware-assisted Applications, Connectal Proxy, File System (generated by Connectal*)
  Kernel-space: Block Device Driver, Connectal (by Quanta)
  HW (FPGA): Connectal Wrapper, Network Interface, Flash Ctrl, Accelerator Manager, Accelerator; NAND Flash]
BlueDBM provides a generic file system interface as well as an accelerator-specific interface (aided by Connectal)

Power consumption is low

Component               Power (Watts)
VC707                   30
Flash Board (x2)        10
Storage Device Total    40

Storage device power consumption is a very conservative estimate

Component               Power (Watts)
Storage Device          40
Xeon Server             200+
Node Total              240+

GPU-based accelerator will double the power

Applications
Content-based image search*
  Faster flash with accelerators as replacement for DRAM-based systems
BlueCache: an accelerated memcached*
  Dedicated network and accelerated caching systems with larger capacity
Graph analytics
  Benefits of lower latency access into distributed flash for computation on large graphs
* Results obtained since the paper submission

Content-based image retrieval
Takes a query image and returns similar images in a dataset of tens of millions of pictures
Image similarity is determined by measuring the distance between histograms of each image
  Histogram is generated using RGB, HSV, "edgeness", etc.
  Better algorithms are available!

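A minimal sketch of histogram-based similarity follows. The slide does not say which distance metric or bin layout is used, so the L1 distance, the per-channel RGB bins, and the function names `rgb_histogram`, `l1_distance`, and `most_similar` are all assumptions made for illustration. HSV and edge features are left out.

```python
def rgb_histogram(pixels, bins=4):
    """Normalized per-channel histogram of (r, g, b) pixels with values 0..255."""
    hist = [0.0] * (3 * bins)
    width = 256 // bins
    for pixel in pixels:
        for channel, value in enumerate(pixel):
            hist[channel * bins + min(value // width, bins - 1)] += 1.0
    total = float(len(pixels)) or 1.0  # avoid dividing by zero for empty images
    return [count / total for count in hist]


def l1_distance(a, b):
    """Sum of absolute bin differences (assumed metric, not from the slides)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def most_similar(query_pixels, dataset, k=1):
    """Return the names of the k images whose histograms are closest to the query."""
    query = rgb_histogram(query_pixels)
    ranked = sorted(
        dataset,
        key=lambda name: l1_distance(query, rgb_histogram(dataset[name])),
    )
    return ranked[:k]


# Tiny made-up dataset: each "image" is four identical pixels.
dataset = {
    "red": [(250, 10, 10)] * 4,
    "blue": [(10, 10, 250)] * 4,
    "dark_red": [(120, 0, 0)] * 4,
}
print(most_similar([(255, 0, 0)] * 4, dataset, k=2))
```

Every image in the dataset has to be read to rank it, which is why this workload is bound by storage bandwidth unless sampling is used.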
Image search accelerator
Sang-Woo Jun, Chanwoo Chung
[Figure: block diagram labeled FPGA, Flash, Flash Controller, Sobel Filter, Histogram Generator, Query Histogram, Comparator, Software]

Image query performance without sampling
[Figure: performance chart comparing BlueDBM + FPGA, BlueDBM + CPU, and an off-the-shelf M.2 SSD; one annotation reads "CPU Bottleneck"]
Faster flash with acceleration can perform at DRAM speed

Sampling to improve performance
Intelligent sampling methods (e.g., Locality Sensitive Hashing) improve performance by dramatically reducing the search space
But they introduce a random access pattern
[Figure: locality-sensitive hash table with entries pointing into the data]
Data accesses corresponding to a single hash table entry result in a lot of random accesses

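The sketch below shows one common form of locality-sensitive hashing, random-hyperplane signatures, to illustrate how a hash table narrows the search to one bucket. The slides do not say which LSH scheme BlueDBM's image search uses, so the scheme and the names `make_hyperplanes`, `signature`, `build_table`, and `candidates` are assumptions.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(h * v for h, v in zip(plane, vec)) >= 0.0) for plane in hyperplanes
    )


def build_table(vectors, hyperplanes):
    """Map each signature to the storage indices of the vectors that hash to it."""
    table = defaultdict(list)
    for index, vec in enumerate(vectors):
        table[signature(vec, hyperplanes)].append(index)
    return table


def candidates(query, table, hyperplanes):
    """Only the bucket matching the query's signature needs to be fetched."""
    return table.get(signature(query, hyperplanes), [])


# Made-up feature vectors standing in for image histograms.
rng = random.Random(1)
data = [[rng.random() for _ in range(8)] for _ in range(100)]
planes = make_hyperplanes(6, 8)
table = build_table(data, planes)
print(candidates(data[42], table, planes))
```

The indices in a bucket are generally not contiguous in storage, so fetching them turns a sequential scan into scattered reads. That is the random access pattern the slide refers to.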
Image query performance with sampling
A disk-based system cannot take advantage of the reduced search space

memcached service
A distributed in-memory key-value store
  Caches DB results indexed by query strings
  Accessed via socket communication
  Uses system DRAM for caching (~256GB)
Extensively used by database-driven websites
  Facebook, Flickr, Twitter, Wikipedia, YouTube
[Figure: browser/mobile apps send web requests to application servers, which send memcached requests to memcached servers; the memcached response returns the data]
Networking contributes 90% of the overhead

Bluecache: Accelerated memcached service
Shuotao Xu
[Figure: three nodes, each with 1TB Flash, Flash Controller, Network, Bluecache accelerator, and a PCIe link to a web server, joined by the inter-controller network]
Memcached server implemented in hardware
  Hashing and flash management implemented in FPGA
  1TB hardware managed flash cache per node
  Hardware server accessed via local PCIe
  Direct network between hardware

Effect of architecture modification (no flash, only DRAM)
[Figure: bar chart of throughput (KOps per second) for Get operations, key size = 64 Bytes, value size = 64 Bytes
  Bluecache: 4012
  Local memcached: 357
  Remote memcached: 273
  Annotation: 11X performance]
PCIe DMA and inter-controller network reduce access overhead
FPGA acceleration of memcached is effective

High cache-hit rate outweighs slow flash accesses (small DRAM vs. large flash)
[Figure: line chart of throughput (KOps per second, 0 to 350) over a horizontal axis running from 0 to 50
  Series: Bluecache (0.5TB Flash); Local memcached (50GB DRAM)
  Key size = 64 Bytes, value size = 8K Bytes
  5ms penalty per cache miss
  * Assuming no cache misses for Bluecache]
Bluecache starts performing better at 5% miss
A sweet spot for large flash caches exists

Graph traversal
Very latency-bound problem, because the next node to visit often cannot be predicted
  Beneficial to reduce latency by moving computation closer to data
[Figure: three hosts (Host 1 to Host 3), each with flash (Flash 1 to Flash 3) and an in-store processor]

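Why traversal is latency-bound can be seen from a dependent pointer chase: the next read cannot be issued until the current one returns, so the traversal rate is roughly one over the per-access latency. The sketch below is a toy model only. The graph, the helper names `chase`, `traversal_time_us`, and `nodes_per_second`, and both latency values are hypothetical and are not measurements from the talk.

```python
def chase(graph, start, hops):
    """Pointer-chase: the next node is only known after the current one is read."""
    path = [start]
    for _ in range(hops):
        neighbors = graph[path[-1]]
        if not neighbors:
            break
        path.append(neighbors[0])
    return path


def traversal_time_us(path, per_access_us):
    """Dependent reads cannot overlap, so total time is accesses times latency."""
    return len(path) * per_access_us


def nodes_per_second(per_access_us):
    """Upper bound on traversal rate for a single dependent chain."""
    return 1_000_000 / per_access_us


# Hypothetical per-access latencies in microseconds (illustrative only).
HOST_PATH_US = 200   # read goes through host software and a general network
IN_STORE_US = 100    # read is issued by a processor next to the flash

graph = {0: [1], 1: [2], 2: [3], 3: []}
path = chase(graph, 0, hops=10)
print(path, traversal_time_us(path, HOST_PATH_US), traversal_time_us(path, IN_STORE_US))
print(nodes_per_second(HOST_PATH_US), nodes_per_second(IN_STORE_US))
```

In this model, halving the per-access latency doubles the traversal rate, which is the motivation for moving the computation next to the data.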
Graph traversal performance
[Figure: bar chart of nodes traversed per second (0 to 18000) for Software + DRAM, Software + Separate Network, Software + Controller Network, and Accelerator + Controller Network; legend: DRAM, Flash]
* The fast BlueDBM network was used even for the separate-network case, for fairness
Flash-based system can achieve comparable performance with a much smaller cluster

Other potential applications
  Genomics
  Deep machine learning
  Complex graph analytics
  Platform acceleration: Spark, MATLAB, SciDB, ...
Suggestions and collaboration are welcome!

Conclusion
Fast flash-based distributed storage systems with low-latency random access may be a good platform to support complex queries on Big Data
Reducing access latency for distributed storage requires architectural modifications, including in-storage processors and fast storage networks
Flash-based analytics holds a lot of promise, and we plan to continue demonstrating more application acceleration
Thank you

Near-Data Accelerator is Preferable
[Figure: two node layouts, each drawn as a pair of motherboards with CPU, DRAM, NIC, Flash, and FPGA
  Traditional approach: the FPGA is a separate accelerator; hardware & software latencies are additive
  BlueDBM: Flash, NIC, DRAM, CPU, and FPGA]

[Photo labels: VC707, PCIe, DRAM, Artix 7, Network Cable, Virtex 7, Network Ports, Flash]