Hardware-Assisted Task Scheduler for OS Intensive Applications

SchedTask: A Hardware-Assisted Task Scheduler

Prathmesh Kallurkar* (Microarchitecture Research Lab, Intel India)
Smruti R. Sarangi (Department of Computer Science & Engg., Indian Institute of Technology Delhi)

*Contributed to this work while I was a student at IIT Delhi







SchedTask, a hardware-assisted task scheduler, is proposed to address instruction cache pollution in OS-intensive applications. It decomposes system execution into SuperFunctions and uses a specialized scheduler to run similar tasks on the same core. The execution model distinguishes non-privileged user applications from the privileged operating system. Examples from an Apache web server and a mail server show that system call handlers have similar instruction and data footprints across threads and across applications, which the scheduler exploits for locality.

Uploaded by benjimen on Sep 21, 2024


Presentation Transcript


Outline
• Introduction
  • Problem of instruction cache pollution for OS-intensive applications
  • Overview of the proposed technique
• SuperFunction Characterization
• Scheduler
• Results

Model of Execution
• Non-privileged: User Applications
• Privileged: Operating System

OS Intensive (Web server)
[Figure: execution over time alternates between non-privileged user applications (App1, App2) and the privileged operating system (syscall handler, interrupt handler)]
Observations:
• Different tasks execute different code.
• The combined size of the instruction footprints of all tasks is larger than the instruction cache.

Linux vs. SchedTask (proposed)
[Figure: tasks A, B, C, and D scheduled over time on Core 0 to Core 3, once under Linux and once under SchedTask]
• Linux: agnostic to the instruction footprint of tasks; may execute dissimilar tasks on the same core.
• SchedTask (proposed): core specialization; executes similar tasks on the same core.


Decomposing System Execution into SuperFunctions
System Execution
• Applications (entire user process), e.g. Apache, MySQL: distinguished by the code that the application executes at run time
• OS
  • System call (e.g. Read, Write): distinguished by the system call identifier
  • Interrupt (e.g. Disk, Network): distinguished by the interrupt signal ID
  • Bottom half (e.g. SCSI): distinguished by the handler's routine

Insights regarding the SuperFunctions
[Figure: an Apache webserver and a mail server, each with Thread 0 and Thread 1 issuing Read() and Write() calls between stretches of application code]
• High similarity between the instruction and data footprints of the system call handler on both threads.
• The footprint of system call handlers does not differ much across different applications.
• Take advantage of locality effects by executing common execution blocks on the same core.

Determining instruction overlap between different types of SuperFunctions
[Figure: system call handlers (Read, Pread, Write, and Pwrite syscalls), with pairs of handlers marked "High insn. overlap"]
• Constraint: SchedTask is forced to execute two SuperFunctionTypes on the same core.
• Desirable: Execute SuperFunctions with higher instruction overlap on the same core.
• How: Quantify the overlap between SuperFunctions as the number of common code pages that they access.

Page-heatmap (Bloom Filter)
• Captures the set of physical code pages accessed by the SuperFunction.
• Page-heatmap register: 512 bits. All bits are set to 0 at the start of an epoch.
• For an instruction sequence insn x, insn y, the values hash(pf(x)) and hash(pf(y)) select the bits to set (bit number 0 and bit number 5 in the slide's example).
• Hash function:

$$\mathrm{hash}(x) = \big( x + (x \gg 9) + (x \gg 18) + (x \gg 27) + (x \gg 36) + (x \gg 45) \big) \bmod 512$$

• The hash function uses all bits of the physical address in calculating the index of the bit to be modified.
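The register update described on this slide can be sketched in software. This is only an illustration, not the hardware design: the 512-bit width and the hash come from the slide, while the 4 KB page size, the reading of pf() as "page frame number", and the names `page_frame`, `heatmap_hash`, and `record_fetch` are assumptions made for the example.

```python
HEATMAP_BITS = 512   # width of the Page-heatmap register (from the slide)
PAGE_SHIFT = 12      # assumption: 4 KB pages, so pf(x) = x >> 12


def page_frame(phys_addr: int) -> int:
    """pf(x): physical page frame number of an instruction address (assumed 4 KB pages)."""
    return phys_addr >> PAGE_SHIFT


def heatmap_hash(x: int) -> int:
    """Add x to copies of itself shifted right in 9-bit steps, then reduce modulo 512."""
    folded = x + (x >> 9) + (x >> 18) + (x >> 27) + (x >> 36) + (x >> 45)
    return folded % HEATMAP_BITS


def record_fetch(heatmap: int, phys_addr: int) -> int:
    """Return the heatmap with the bit for the fetched instruction's code page set."""
    return heatmap | (1 << heatmap_hash(page_frame(phys_addr)))


# Start of an epoch: all bits are 0. Two fetches from the same page set one bit.
heatmap = 0
heatmap = record_fetch(heatmap, 0x5000)
heatmap = record_fetch(heatmap, 0x5ABC)
```

Because 512 = 2^9, each shifted term contributes a different 9-bit slice of the input to the low bits of the sum, which is how every address bit can influence the selected index.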

Calculating Page Overlap
[Figure: example bit vectors for A's Page-heatmap and B's Page-heatmap]
• Overlap between A and B = number of common pages accessed = 2 in the slide's example.
• The overlap is the Hamming weight of the bitwise AND of the bit vectors representing the two Page-heatmaps.
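As a minimal sketch of this calculation, the overlap can be computed as a population count over the AND of two heatmaps. The function name `page_overlap` and the 7-bit example vectors are hypothetical and are not the values shown on the slide.

```python
def page_overlap(heatmap_a: int, heatmap_b: int) -> int:
    """Count the bits set in both heatmaps (Hamming weight of their bitwise AND)."""
    return bin(heatmap_a & heatmap_b).count("1")


# Hypothetical heatmaps: both have bits 0 and 1 set, and each has one private bit.
heatmap_a = 0b0100011
heatmap_b = 0b1000011
overlap = page_overlap(heatmap_a, heatmap_b)
```

Since the Page-heatmap is a Bloom filter, two different pages can hash to the same bit, so the count is an estimate of the number of common pages and can overstate it.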


Benchmarks
• Network:
  1) Apache: Web server (Apache)
  2) ISCP: Securely copy a file from a remote machine to the local machine (SCP)
  3) OSCP: Securely copy a file from the local machine to a remote machine (SCP)
• Database:
  4) DSS: Decision Support System (TPCH: Query 2)
  5) OLTP: Online Transaction Processing (Sysbench benchmark suite)
• File system:
  6) Find: Filesystem browsing (Linux utility application Find)
  7) FileSrv: Random reads and writes to a file (Sysbench benchmark suite)
  8) MailSrvIO: Imitates the file operations of a mail server (Filebench benchmark suite)

Instruction Breakup
[Figure: instruction breakup per benchmark; annotations: 20% for ISCP and DSS, 90% for FileSrv]
• Execution block: 1 billion instructions per core

Instruction Breakup: Similarity Across Epochs

| Type of SuperFunction | Epoch n (3 ms): Fraction (%) | Epoch n+1 (3 ms): Fraction (%) |
|---|---|---|
| AppX | 50 | 50 |
| Read syscall | 25 | 25 |
| Write syscall | 25 | 25 |

• The instruction breakup remains mostly similar across consecutive epochs.
• Collect the execution profile of SuperFunctions in one epoch.
• Use it to decide a schedule for the next epoch.


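Hierarchical Scheduler

Execution statistics of the last epoch:

| SuperFunctionType | Time |
|---|---|
| ApplicationX | 1.5 ms |
| ReadSys | 1.5 ms |

Allocation Table for the current epoch:

| SuperFunctionType | Core |
|---|---|
| ApplicationX | 0 |
| ReadSys | 1 |

[Figure: timeline for Core 0 and Core 1; TAlloc runs at the start of an epoch, after which ApplicationX and ReadSys SuperFunctions execute, interleaved with TMigrate; the Steering Logic sits in front of the cores]
• TAlloc is executed at the start of the epoch.
• It performs resource allocation for the current epoch.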

TAlloc

Last Epoch's Execution Profile:

| SuperFuncType | Time | Freq. | Page-heatmap |
|---|---|---|---|
| A | 2 ms | 10 | PHA |
| B | 1 ms | 5 | PHB |
| C | 1 ms | 5 | PHC |

Allocation Table:

| SuperFuncType | Cores |
|---|---|
| A | 0, 1 |
| B | 2 |
| C | 3 |

Overlap Table (based on the execution of the last epoch):

| SuperFuncType | Overlap Value |
|---|---|
| A | (B, 3), (C, 1) |
| B | (A, 3), (C, 2) |
| C | (B, 2), (A, 1) |
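The slide does not spell out how TAlloc turns the last epoch's profile into the Allocation Table. The sketch below assumes one plausible policy, namely handing out cores in proportion to each SuperFunction type's execution time, with largest-remainder rounding. The function `allocate_cores` and its arguments are hypothetical names, and the policy itself is an assumption rather than the authors' stated algorithm.

```python
def allocate_cores(profile_ms: dict, num_cores: int) -> dict:
    """Map each SuperFunction type to a list of core ids.

    Assumed policy: cores are split in proportion to last-epoch execution time.
    """
    total = sum(profile_ms.values())
    shares = {t: ms * num_cores / total for t, ms in profile_ms.items()}

    # Whole cores first, then give leftover cores to the largest fractional parts.
    counts = {t: int(s) for t, s in shares.items()}
    leftover = num_cores - sum(counts.values())
    by_remainder = sorted(shares, key=lambda t: shares[t] - counts[t], reverse=True)
    for t in by_remainder[:leftover]:
        counts[t] += 1

    # Assign consecutive core ids in the order the types appear in the profile.
    table, next_core = {}, 0
    for t in profile_ms:
        table[t] = list(range(next_core, next_core + counts[t]))
        next_core += counts[t]
    return table


# Profile from the slide: A ran for 2 ms, B and C for 1 ms each, on 4 cores.
allocation = allocate_cores({"A": 2.0, "B": 1.0, "C": 1.0}, 4)
```

With the slide's numbers this assumed policy happens to reproduce the Allocation Table shown (A on cores 0 and 1, B on core 2, C on core 3). A type whose share rounds down to zero gets an empty list here; in that case it would have to share a core with another type, which is where the overlap table becomes relevant.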

TMigrate: Work Stealing
[Figure: Core 1 (type A) has no work; Core 2 (type A), Core 3 (type B), and Core 4 (type C) hold lists of pending SuperFunctions A1 to A4, B1 to B4, and C1 to C4]
• What should a core do when it has no SuperFunction to execute?
• Steal the same type of work from other cores.

TMigrate: Work Stealing
[Figure: Core 1 (type A) has no work and no other core holds pending type-A work; Core 2 and Core 3 (type B) hold pending SuperFunctions B1 to B8, and Core 4 (type C) holds C1 to C4]
• Steal similar work from other cores.
• Refer to the Page overlap table: A and B have a higher overlap than A and C.
• While A and B are similar, A ≠ B: B may access some cache lines that are not accessed by A.
• Amortize the effort by stealing multiple SuperFunctions of type B.
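The two work-stealing slides can be read as a two-step decision: prefer pending work of the idle core's own type, and otherwise fall back to the type with the highest page overlap and take several items at once. The sketch below illustrates that reading only. The function `pick_steal`, the flat `pending` dictionary (pending work summed over the other cores), and the batch size of 2 are hypothetical and are not taken from the slides.

```python
def pick_steal(idle_type: str, pending: dict, overlap: dict, batch: int = 2):
    """Choose SuperFunctions for an idle core to steal.

    pending: type -> list of pending SuperFunctions on other cores (mutated in place).
    overlap: type -> {other type: page overlap value}.
    Returns (type stolen from, list of stolen SuperFunctions).
    """
    if pending.get(idle_type):
        # Same type available: one item suffices, the code is already cached.
        victim_type, count = idle_type, 1
    else:
        candidates = [t for t, queue in pending.items() if queue and t != idle_type]
        if not candidates:
            return None, []
        # Different type: pick the most similar one and steal a batch to amortize
        # the cost of fetching the cache lines that the idle core does not hold.
        scores = overlap.get(idle_type, {})
        victim_type = max(candidates, key=lambda t: scores.get(t, 0))
        count = batch
    stolen = pending[victim_type][:count]
    del pending[victim_type][:count]
    return victim_type, stolen


# Second slide's situation: no pending A work, A overlaps more with B than with C.
pending = {"B": ["B1", "B2", "B3", "B4", "B5", "B6", "B7", "B8"],
           "C": ["C1", "C2", "C3", "C4"]}
overlap = {"A": {"B": 3, "C": 1}}
victim, stolen = pick_steal("A", pending, overlap)
```

The overlap values in the usage example are the ones from the TAlloc slide's Overlap Table, so type B is chosen over type C.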


Evaluated Techniques

| Technique | High-level approach |
|---|---|
| SelectiveOffload [Tech Report 09] | Proposes a system with 2n cores; n reserved for the application and n reserved for the OS |
| FlexSC [OSDI 10] | Segregate execution of application and system call handlers |
| RegionSched [ASPLOS 12] | Segregate execution of application and system call handlers |
| SLICC [MICRO 13] | Hardware scheduler |
| SchedTask [proposed technique] | Segregate execution of SuperFunctions. Also reduce core idleness using work stealing |

Baseline System

| Parameter | Value |
|---|---|
| Cores | 32 |
| Pipeline | Out-of-order pipeline |
| Private caches | i-cache and d-cache (4-way, 32 KB); L2 cache (4-way, 256 KB) |
| Shared cache | L3 cache (8-way, 4 MB) |
| OS | Linux 2.6.32 (Debian 6.0) |

Performance
SchedTask outperforms state-of-the-art schedulers by around 11.4%.
Reasons:
1. High i-cache hit rates due to fine-grained scheduling
2. Low core idleness due to work stealing

I-Cache Hit-Rate
Reason: high i-cache hit rates due to fine-grained scheduling

D-Cache Hit-Rate
Core specialization increases the d-cache hit rate.
1. Intuition: SuperFunctions that execute the same code typically access common data structures.
2. Fewer cache line bounces.

Cache line bouncing
[Figure: threads t1 to t4 each call read()]
• Baseline: the read() calls execute on Core 1 to Core 4, so the cache line holding the file system lock bounces between cores.
• SchedTask: the read() calls execute on the same core, so the file system lock stays in that core's cache.
• By reducing cache line bounces, we improve the data locality.

Summary
• Decomposed the execution of OS-intensive applications into sequences of instructions called SuperFunctions
• Proposed a hierarchical scheduler that executes SuperFunctions with higher instruction overlap on the same core
• Demonstrated an average increase in instruction throughput of 11.4% over state-of-the-art OS schedulers for a suite of 8 OS-intensive applications
