GPUfs: Integrating a File System with GPUs (ASPLOS 2013)
This presentation examines integrating a file system with GPUs, addressing challenges and advances in modern system architecture: the widening software-hardware gap, GPU programming frameworks, and the complexity of building systems with GPUs.
Presentation Transcript
ASPLOS 2013. GPUfs: Integrating a file system with GPUs. Mark Silberstein (UT Austin/Technion), Bryan Ford (Yale), Idit Keidar (Technion), Emmett Witchel (UT Austin).
Traditional system architecture: applications run on top of the OS, which manages the CPU.
Modern system architecture: accelerated applications and the OS sit above a heterogeneous mix of hardware: CPUs, GPUs, manycore processors, hybrid CPU-GPU chips, and FPGAs.
The software-hardware gap is widening: accelerated applications reach this hardware through ad-hoc abstractions and management mechanisms rather than through the OS.
On-accelerator OS support closes the programmability gap: native accelerator applications run on top of on-accelerator OS support, which coordinates with the host OS.
GPUfs: file I/O support for GPUs. Outline: motivation, goals, understanding the hardware, design, implementation, evaluation.
Building systems with GPUs is hard. Why?
Goal of GPU programming frameworks: the CPU side handles data transfers, GPU invocation, and memory management; the GPU side runs the parallel algorithm.
Headache for GPU programmers: the data-transfer, invocation, and memory-management code dwarfs the parallel algorithm itself. Half of the CUDA SDK 4.1 samples contain at least 9 lines of CPU code per line of GPU code.
GPU kernels are isolated: they depend entirely on CPU code for data transfers, invocation, and memory management.
Example: accelerating a photo collage (http://www.codeproject.com/Articles/36347/Face-Collage): While(Unhappy()){ Read_next_image_file(); Decide_placement(); Remove_outliers(); }
CPU implementation: the application runs the whole loop on the CPUs: While(Unhappy()){ Read_next_image_file(); Decide_placement(); Remove_outliers(); }
Offloading computations to the GPU: move the loop body to the GPU while the CPUs keep feeding it: While(Unhappy()){ Read_next_image_file(); Decide_placement(); Remove_outliers(); }
Offloading follows the co-processor programming model: the CPU performs a data transfer, starts the kernel, and waits for kernel termination.
Kernel start/stop overheads: invocation latency, cache flushes, and CPU-GPU synchronization.
Hiding the overheads requires asynchronous invocation, manual data reuse management, and double buffering between CPU and GPU.
The result is implementation complexity and management overhead: asynchronous invocation, manual data reuse management, and double buffering. Why should applications have to deal with such low-level system details?
The reason: GPUs are peer processors, and as peers they need I/O and OS services of their own.
GPUfs: application view. GPU1, GPU2, GPU3, and the CPUs all access the host file system through GPUfs: a system-wide shared namespace with a POSIX-like API and persistent storage.
Accelerating the collage app with GPUfs: no CPU management code; the GPU opens and reads files directly through GPUfs.
Accelerating the collage app with GPUfs: read-ahead through the GPUfs buffer cache and the host file system cache overlaps computation with transfers.
Accelerating the collage app with GPUfs: the buffer cache also enables data reuse and random data access.
The challenge: making this work across GPU and CPU hardware.
Massive parallelism: parallelism is essential for performance in deeply multi-threaded, wide-vector hardware. AMD HD5870: 31,000 active threads; NVIDIA Fermi: 23,000 active threads. (From M. Houston, A. Lefohn, and K. Fatahalian, "A trip through the architecture of modern GPUs".)
Heterogeneous memory: GPUs inherently impose high bandwidth demands on memory. GPU memory delivers 288-360 GB/s versus 10-32 GB/s for CPU memory (roughly a 20x gap), while the CPU-GPU link provides only 6-16 GB/s.
How to build an FS layer on this hardware?
GPUfs: a principled redesign of the whole file system stack. Relaxed FS API semantics for parallelism; relaxed FS consistency for heterogeneous memory; GPU-specific implementations of synchronization primitives, lock-free data structures, memory allocation, and more.
GPUfs high-level design. On the GPU, applications use the GPUfs file API (built for massive parallelism) through the GPUfs GPU file I/O library; on the CPU, unchanged applications use the OS file API, with GPUfs hooks in the OS file system interface. A GPUfs distributed buffer cache (page cache) spans the heterogeneous memories, CPU memory and GPU memory, backed by the host file system and disk.
Buffer cache semantics: should the cache provide local file system data consistency, or distributed?
The GPUfs buffer cache adopts a weak data consistency model: close(sync)-to-open semantics, as in AFS. Example: GPU1 does open(), write(1), write(2), fsync(); until the fsync, the writes are not visible to the CPU or to GPU2's read(1). The rationale: the remote-to-local memory performance ratio here is similar to that of a distributed system.
On-GPU file I/O API (CPU call → GPU call): open/close → gopen/gclose; read/write → gread/gwrite; mmap/munmap → gmmap/gmunmap; fsync/msync → gfsync/gmsync; ftrunc → gftrunc (details in the paper). The changes in semantics are crucial.
Implementation bits (details in the paper): paging support; dynamic data structures and memory allocators; a lock-free radix tree; inter-processor communication (IPC); hybrid hardware/software barriers; a consistency module in the OS kernel. Roughly 1.5K GPU LOC and 600 CPU LOC.
Evaluation: all benchmarks are written as GPU kernels, with no CPU-side development.
Matrix-vector product (inputs/outputs in files). Vector of 1x128K elements, page size = 2 MB, GPU = Tesla C2075. The chart compares throughput (MB/s, up to ~3500) of CUDA pipelined, CUDA optimized, and GPU file I/O implementations across input matrix sizes of 280 to 11200 MB.
Word frequency count in text: count the frequency of modern English words in the works of Shakespeare and in the Linux kernel source tree, using an English dictionary of 58,000 words. Challenges: dynamic working set, small files, lots of file I/O (33,000 files of 1-5 KB each), unpredictable output size.
Results (8 CPUs vs GPU-vanilla vs GPU-GPUfs):
Linux source (33,000 files, 524 MB): 8 CPUs: 6h 50m; GPU-vanilla: 7.2X speedup; GPU-GPUfs: 53m (6.8X speedup, about 8% overhead).
Shakespeare (1 file, 6 MB): 8 CPUs: 292s; GPU-vanilla: 40s (7.3X); GPU-GPUfs: 40s (7.3X).
GPUfs also provides unbounded input/output size support.
GPUfs is the first system to provide native access to host OS services from GPU programs. Code is available for download at https://sites.google.com/site/silbersteinmark/Home/gpufs (short link: http://goo.gl/ofJ6J).