Hardware-Assisted Task Scheduler for OS Intensive Applications
SchedTask is a hardware-assisted task scheduler proposed to address instruction cache pollution in OS-intensive applications. It decomposes system execution into units called SuperFunctions, characterizes the code pages each one touches, and uses a specialized scheduler to run SuperFunctions with similar instruction footprints on the same core. The model of execution distinguishes between non-privileged user applications and the privileged operating system. For workloads such as a mail server and the Apache web server, system call handlers show similar instruction and data footprints across threads and applications, so executing them on the same core exploits locality.
- Task Scheduler
- Hardware-Assisted
- OS Intensive Applications
- SuperFunction Characterization
- Execution Model
Presentation Transcript
SchedTask: A Hardware-Assisted Task Scheduler. Prathmesh Kallurkar* (Microarchitecture Research Lab, Intel India) and Smruti R. Sarangi (Department of Computer Science & Engg., Indian Institute of Technology Delhi). *Contributed to this work while I was a student at IIT Delhi.
Outline: Introduction (problem of instruction cache pollution for OS-intensive applications; overview of the proposed technique), SuperFunction Characterization, Scheduler, Results.
Model of Execution. Non-privileged: user applications. Privileged: operating system.
OS Intensive (Web server). Non-privileged: user applications (App1, App2). Privileged: operating system (syscall handler, interrupt handler). [Figure: execution alternating between these tasks over time.] Observations: Different tasks execute different code. The combined size of the instruction footprints of all tasks is larger than the instruction cache.
Linux vs. SchedTask (proposed). [Figure: tasks A, B, C, and D scheduled over time on Core 0 to Core 3 under each scheduler.] Linux: agnostic to the instruction footprint of tasks; may execute dissimilar tasks on the same core. SchedTask: core specialization; executes similar tasks on the same core.
Decomposing System Execution into SuperFunctions. System execution splits into applications and the OS. Applications (entire user process; e.g., Apache, MySQL): distinguished by the code that the application executes at run time. OS, system calls (e.g., Read, Write): distinguished by the system call identifier. OS, interrupts (e.g., Disk, Network): distinguished by the interrupt signal ID. OS, bottom halves (e.g., SCSI): distinguished by the handler's routine.
Insights regarding the SuperFunctions. [Figure: a mail server and an Apache webserver, each with Thread 0 and Thread 1 issuing Read() and Write() calls.] The footprint of system call handlers does not differ much across different applications. There is high similarity between the instruction and data footprints of the system call handler on both threads. Take advantage of locality effects by executing common execution blocks on the same core.
Determining instruction overlap between different types of SuperFunctions. System call handlers: Write syscall and Pwrite syscall (high insn. overlap); Pread syscall and Read syscall (high insn. overlap). Constraint: SchedTask is forced to execute two SuperFunction types on the same core. Desirable: execute SuperFunctions with higher instruction overlap on the same core. How: quantify the overlap between SuperFunctions as the number of common code pages that they access.
Page-heatmap (Bloom Filter). Captures the set of physical code pages accessed by the SuperFunction in a 512-bit page-heatmap register. All bits are set to 0 at the start of an epoch. For the instruction sequence insn x, insn y, the bits at indices hash(pf(x)) and hash(pf(y)) are set (bit numbers 0 and 5 in the slide's example). hash(x) = (x + (x>>9) + (x>>18) + (x>>27) + (x>>36) + (x>>45)) mod 512. The hash function uses all bits of the physical address in calculating the index of the bit to be modified.
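The hash on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `pf` is taken to mean the physical page frame number, 4 KB pages are assumed (`PAGE_SHIFT = 12`), and the names `heatmap_index` and `record_access` are hypothetical. The actual register is hardware, modeled here as a Python integer.

```python
HEATMAP_BITS = 512   # width of the page-heatmap register (from the slide)
PAGE_SHIFT = 12      # assumption: 4 KB pages, so pf(addr) = addr >> 12


def heatmap_index(pfn: int) -> int:
    """Fold the page frame number in 9-bit steps, as in the slide's hash."""
    folded = sum(pfn >> shift for shift in range(0, 54, 9))  # shifts 0, 9, ..., 45
    return folded % HEATMAP_BITS


def record_access(heatmap: int, phys_addr: int) -> int:
    """Return the heatmap with the bit for phys_addr's code page set."""
    pfn = phys_addr >> PAGE_SHIFT
    return heatmap | (1 << heatmap_index(pfn))


# The register starts at 0 at the beginning of an epoch.
heatmap = 0
heatmap = record_access(heatmap, 5 << PAGE_SHIFT)           # page frame 5
heatmap = record_access(heatmap, (5 << PAGE_SHIFT) + 0x40)  # same page, same bit
```

Because the structure behaves like a Bloom filter, two different pages can map to the same bit, so the heatmap may over-approximate the set of pages accessed.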
Calculating Page Overlap. A's page-heatmap: 0 1 0 0 0 0 0 1 0 0 0 0 0 1. B's page-heatmap: 1 0 0 0 0 1 1. Overlap between A and B = number of common pages accessed = 2, obtained as the Hamming weight of the bitwise AND of the bit vectors representing their page-heatmaps.
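A minimal sketch of the overlap computation, assuming the overlap is the number of bit positions set in both page-heatmaps (the popcount of their bitwise AND). The function name `page_overlap` and the 7-bit example vectors are hypothetical and are not the values from the slide.

```python
def page_overlap(heatmap_a: int, heatmap_b: int) -> int:
    """Count the bit positions set in both page-heatmaps."""
    return bin(heatmap_a & heatmap_b).count("1")


# Hypothetical 7-bit heatmaps sharing two set bits.
a = 0b0100011
b = 0b1000011
common = page_overlap(a, b)
```

Since heatmap bits can alias, the result estimates the number of shared code pages and is not an exact count.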
Benchmarks. Network: 1) Apache: web server (Apache); 2) ISCP: secure copy of a file from a remote machine to the local machine (SCP); 3) OSCP: secure copy of a file from the local machine to a remote machine (SCP). Database: 4) DSS: Decision Support System (TPCH: Query 2); 5) OLTP: Online Transaction Processing (Sysbench benchmark suite). File system: 6) Find: filesystem browsing (Linux utility application Find); 7) FileSrv: random reads and writes to a file (Sysbench benchmark suite); 8) MailSrvIO: imitates the file operations of a mail server (Filebench benchmark suite).
Instruction Breakup. [Figure annotations: 90% for FileSrv; 20% for ISCP and DSS.] Execution block: 1 billion instructions per core.
Instruction Breakup: Similarity Across Epochs. Collect the execution profile of SuperFunctions in one epoch and use it to decide a schedule for the next epoch. Example (type of SuperFunction: fraction of instructions). Epoch n (3 ms): AppX: 50%; Read syscall: 25%; Write syscall: 25%. Epoch n+1 (3 ms): AppX: 50%; Read syscall: 25%; Write syscall: 25%. The instruction breakup remains mostly similar.
Hierarchical Scheduler. Executed at the start of the epoch; performs resource allocation for the current epoch. Execution statistics of the last epoch (SuperFunction type: time): ApplicationX: 1.5 ms; ReadSys: 1.5 ms. Allocation table for the current epoch (SuperFunction type: core): ApplicationX: Core 0; ReadSys: Core 1. [Figure: steering logic directing work to Core 0 and Core 1 from the start of an epoch.]
TAlloc. Last epoch's execution profile (SuperFuncType: time, frequency, page-heatmap): A: 2 ms, 10, PHA; B: 1 ms, 5, PHB; C: 1 ms, 5, PHC. Overlap table (SuperFuncType: overlap values): A: (B, 3), (C, 1); B: (A, 3), (C, 2); C: (B, 2), (A, 1). Allocation table, based on the execution of the last epoch (SuperFuncType: cores): A: 0, 1; B: 2; C: 3.
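The slide does not spell out how TAlloc derives the allocation table from the profile. One plausible reading is that cores are handed out roughly in proportion to each type's execution time in the last epoch. The sketch below illustrates that reading only; it is not the authors' algorithm. The function name `allocate_cores` and the greedy rule are assumptions, and it covers only the case where there are at least as many cores as SuperFunction types.

```python
def allocate_cores(exec_time_ms: dict, num_cores: int) -> dict:
    """Assign cores to SuperFunction types roughly in proportion to time.

    Assumption: every type gets one core first; each remaining core goes
    to the type with the most execution time per core already assigned.
    """
    if len(exec_time_ms) > num_cores:
        raise ValueError("sketch assumes at least one core per type")
    counts = {t: 1 for t in exec_time_ms}
    for _ in range(num_cores - len(counts)):
        busiest = max(counts, key=lambda t: exec_time_ms[t] / counts[t])
        counts[busiest] += 1
    # Turn per-type core counts into contiguous core ID lists.
    table, next_core = {}, 0
    for t, n in counts.items():
        table[t] = list(range(next_core, next_core + n))
        next_core += n
    return table


# Profile from the slide: A ran for 2 ms, B and C for 1 ms each.
table = allocate_cores({"A": 2.0, "B": 1.0, "C": 1.0}, num_cores=4)
```

With the slide's profile and four cores, this sketch reproduces the slide's allocation table (A on cores 0 and 1, B on core 2, C on core 3). When types outnumber cores, the overlap table would be needed to decide which types share a core, which this sketch does not attempt.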
TMigrate: Work Stealing. What to do when a core has no SuperFunction to execute? Steal the same type of work from other cores. [Figure: Core 1 (type A) has nothing to execute; Core 2 runs A, Core 3 runs B, and Core 4 runs C; the list of pending SuperFunctions holds A1 to A4, B1 to B4, and C1 to C4.]
TMigrate: Work Stealing. Steal similar work from other cores. Refer to the page overlap table: A and B have a higher overlap than A and C. While A and B are similar, B may access some cache lines that are not accessed by A, so amortize the effort by stealing multiple SuperFunctions of type B. [Figure: Core 1 (type A) has nothing to execute; Cores 2 and 3 run B, and Core 4 runs C; the list of pending SuperFunctions holds B1 to B8 and C1 to C4.]
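The two work-stealing slides can be summarized in a short sketch: an idle core first looks for pending work of its own type, and otherwise steals a batch from the pending type with the highest page overlap. The function name `steal_work`, the data layout, and the batch size of 2 are assumptions for illustration; the slides say only that multiple SuperFunctions are stolen.

```python
def steal_work(own_type: str, pending: dict, overlap: dict, batch: int = 2) -> list:
    """Pick SuperFunctions for an idle core specialized for own_type.

    pending maps a type to its list of waiting SuperFunctions;
    overlap[a][b] is the page overlap between types a and b.
    """
    if pending.get(own_type):
        victim, count = own_type, 1      # same-type work: take one item
    else:
        candidates = [t for t, queue in pending.items() if queue]
        if not candidates:
            return []                    # nothing to steal anywhere
        # Otherwise prefer the type whose code pages overlap most with ours.
        victim = max(candidates, key=lambda t: overlap[own_type].get(t, 0))
        count = batch                    # amortize the switch to new code
    stolen = pending[victim][:count]
    del pending[victim][:count]
    return stolen


# Overlap values from the TAlloc slide: A-B = 3, A-C = 1.
overlap = {"A": {"B": 3, "C": 1}}
pending = {"B": ["B1", "B2", "B3"], "C": ["C1"]}
stolen = steal_work("A", pending, overlap)
```

In this example no type-A work is pending, so the idle core takes two SuperFunctions of type B and leaves `B3` and `C1` in the queues.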
Evaluated Techniques (technique: high-level approach). SelectiveOffload [Tech Report '09]: proposes a system with 2n cores; n reserved for the application and n reserved for the OS. FlexSC [OSDI '10]: segregates execution of application and system call handlers. RegionSched [ASPLOS '12]: segregates execution of application and system call handlers. SLICC [MICRO '13]: hardware scheduler. SchedTask [proposed technique]: segregates execution of SuperFunctions; also reduces core idleness using work stealing.
Baseline System (parameter: value). Cores: 32. Pipeline: out-of-order pipeline. Private caches: i-cache and d-cache (4-way, 32 KB); L2 cache (4-way, 256 KB). Shared cache: L3 cache (8-way, 4 MB). OS: Linux 2.6.32 (Debian 6.0).
Performance. SchedTask outperforms state-of-the-art schedulers by around 11.4%. Reasons: 1. High i-cache hit rates due to fine-grained scheduling. 2. Low core idleness due to work stealing.
I-Cache Hit-Rate. Reason: high i-cache hit rates due to fine-grained scheduling.
D-Cache Hit-Rate. Core specialization increases the d-cache hit rate. 1. Intuition: SuperFunctions that execute the same code typically access common data structures. 2. Fewer cache line bounces.
[Figure: threads t1 to t4 each call read(). Baseline: the calls run on Cores 1 to 4 and the cache line holding the file sys. lock bounces between cores. SchedTask: the read() calls are steered so that the file sys. lock cache line does not bounce.] By reducing cache line bounces, we improve the data locality.
Summary. Decomposed the execution of OS-intensive applications into sequences of instructions called SuperFunctions. Proposed a hierarchical scheduler that executes SuperFunctions with higher instruction overlap on the same core. Demonstrated an average increase in instruction throughput of 11.4% over state-of-the-art OS schedulers for a suite of 8 OS-intensive applications.