CILK: An Efficient Multithreaded Runtime System

People

Project at MIT & now at UT Austin

– Bobby Blumofe (now UT Austin, Akamai)
– Chris Joerg
– Brad Kuszmaul (now Yale)
– Charles Leiserson (MIT, Akamai)
– Keith Randall (Bell Labs)
– Yuli Zhou (Bell Labs)

Outline

Introduction

Programming environment

The work-stealing thread scheduler

Performance of applications

Modeling performance

Proven Properties

Conclusions

Introduction

Why multithreading?

To implement dynamic, asynchronous, concurrent programs.

The Cilk programmer optimizes:

– total work
– critical path

A Cilk computation is viewed as a dynamic directed acyclic graph (dag).

Introduction ...

Introduction ...

A Cilk program is a set of procedures.

A procedure is a sequence of threads.

Cilk threads are:

– represented by nodes in the dag
– non-blocking: they run to completion, with no waiting or suspension; they are atomic units of execution

Introduction ...

Threads can spawn child threads.

– Downward edges connect a parent to its children.

A child & its parent can run concurrently.

– Because threads are non-blocking, a child cannot return a value to its parent.
– Instead, the parent spawns a successor that receives values from its children.

Introduction ...

A thread & its successor are parts of the same Cilk procedure.

– They are connected by horizontal arcs.

Children’s returned values are received before their successor begins:

– They constitute data dependencies.
– They are connected by curved arcs.

Introduction ...

Introduction: Execution Time

The execution time of a Cilk program using P processors depends on:

– Work ($T_1$): the time for the Cilk program to complete on 1 processor.
– Critical path ($T_\infty$): the time to execute the longest directed path in the dag.
– $T_P \ge T_1 / P$ (not true for some searches)
– $T_P \ge T_\infty$
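As a minimal sketch of these two quantities, the snippet below computes the work and the critical path for a small, made-up dag and evaluates the two lower bounds. The dag, its thread names, and the per-thread costs are assumptions chosen for illustration; the sketch also assumes the dag is the same for every P, which is exactly what fails for the searches mentioned above.

```python
from functools import lru_cache

# Hypothetical dag: thread name -> (cost in time units, successor threads).
DAG = {
    "a": (1, ["b", "c"]),
    "b": (2, ["d"]),
    "c": (4, ["d"]),
    "d": (1, []),
}

def work(dag):
    """T_1: total cost of all threads, i.e. the time on one processor."""
    return sum(cost for cost, _ in dag.values())

def critical_path(dag):
    """T_inf: cost of the longest directed path in the dag."""
    @lru_cache(maxsize=None)
    def longest_from(node):
        cost, successors = dag[node]
        return cost + max((longest_from(s) for s in successors), default=0)
    return max(longest_from(node) for node in dag)

def lower_bound(dag, p):
    """max(T_1 / P, T_inf): a lower bound on T_P for this fixed dag."""
    return max(work(dag) / p, critical_path(dag))

T1 = work(DAG)              # 1 + 2 + 4 + 1
T_INF = critical_path(DAG)  # path a -> c -> d
```

With 2 processors the critical path dominates the bound for this dag; with 1 processor the work does.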



Introduction: Scheduling

Cilk uses run-time scheduling called work stealing.

It works well on dynamic, asynchronous, MIMD-style programs.

For “fully strict” programs, Cilk achieves asymptotic optimality for space, time, & communication.

Introduction: language

Cilk is an extension of C

Cilk programs are:

– preprocessed to C
– linked with a runtime library

Programming Environment

Declaring a thread:

thread T ( <args> ) { <stmts> }

T is preprocessed into a C function of 1 argument and return type void.

The 1 argument is a pointer to a closure.

Environment: Closure

A closure is a data structure that has:

– a pointer to the C function for T
– a slot for each argument (inputs & continuations)
– a join counter: the count of the missing argument values

A closure is ready when its join counter == 0.

A closure is waiting otherwise.

Closures are allocated from a runtime heap.
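The closure layout described above can be sketched in Python as follows. This is only an illustration of the three fields and the ready/waiting distinction, not the runtime's actual C structure; the names `Closure`, `MISSING`, `create`, and `add` are assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List

MISSING = object()  # sentinel marking an argument slot that has no value yet

@dataclass
class Closure:
    func: Callable[..., Any]                        # stands in for the C function pointer
    slots: List[Any] = field(default_factory=list)  # one slot per argument
    join_counter: int = 0                           # number of slots still missing

    @classmethod
    def create(cls, func, *args):
        """Build a closure; pass MISSING for arguments that arrive later."""
        missing = sum(1 for a in args if a is MISSING)
        return cls(func, list(args), missing)

    @property
    def ready(self):
        """A closure is ready when its join counter is 0, waiting otherwise."""
        return self.join_counter == 0

def add(x, y):
    return x + y

full = Closure.create(add, 1, 2)           # all arguments present
partial = Closure.create(add, 1, MISSING)  # one argument still missing
```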

Environment: Continuation

A Cilk continuation is a data type, denoted by the keyword cont:

    cont int x;

It is a global reference to an empty slot of a closure.

It is implemented as 2 items:

– a pointer to the closure (what thread)
– an int value: the slot number (what input)

Environment: Closure

Environment: spawn

To spawn a child, a thread creates its closure:

    spawn T ( <args> )

– creates the child’s closure
– sets the available arguments
– sets the join counter

To specify a missing argument, prefix it with a “?”:

    spawn T (k, ?x);

Environment: spawn_next

A successor thread is spawned the same way as a child, except the keyword spawn_next is used:

    spawn_next T (k, ?x)

Children typically have no missing arguments; successors do.

Explicit continuation passing

Because threads are non-blocking, a parent cannot block on its children’s results.

Instead, it spawns a successor thread.

This communication paradigm is called explicit continuation passing.

Cilk provides a primitive to send a value from one closure to another.

send_argument

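Cilk provides the primitive

    send_argument( k, value )

which sends value to the argument slot of a waiting closure specified by continuation k.

[Figure: the parent creates its child with spawn and its successor with spawn_next; the child delivers its result to the successor with send_argument.]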
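A rough sketch of what send_argument does to a waiting closure, under the assumptions that a continuation is a (closure, slot number) pair as described earlier and that a closure becoming ready is simply appended to a list. The names `Closure`, `Continuation`, `ready_queue`, and `total` are hypothetical.

```python
MISSING = object()  # sentinel for an empty argument slot

class Closure:
    def __init__(self, func, *args):
        self.func = func
        self.slots = list(args)
        self.join_counter = sum(1 for a in args if a is MISSING)

class Continuation:
    """Reference to one empty slot: which closure, and which slot number."""
    def __init__(self, closure, slot):
        self.closure = closure
        self.slot = slot

ready_queue = []  # closures whose join counter has reached 0

def send_argument(k, value):
    """Fill the slot named by continuation k; post the closure once it is ready."""
    closure = k.closure
    assert closure.slots[k.slot] is MISSING, "slot already filled"
    closure.slots[k.slot] = value
    closure.join_counter -= 1
    if closure.join_counter == 0:
        ready_queue.append(closure)

def total(x, y):
    return x + y

waiting = Closure(total, MISSING, MISSING)
send_argument(Continuation(waiting, 0), 3)
still_waiting = len(ready_queue) == 0      # one slot is still empty
send_argument(Continuation(waiting, 1), 4)
result = waiting.func(*waiting.slots)      # the closure can now run
```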

Cilk Procedure for computing a Fibonacci number

    thread fib ( cont int k, int n ) {
        if ( n < 2 )
            send_argument( k, n );
        else {
            cont int x, y;
            spawn_next sum ( k, ?x, ?y );
            spawn fib ( x, n - 1 );
            spawn fib ( y, n - 2 );
        }
    }

    thread sum ( cont int k, int x, int y ) {
        send_argument( k, x + y );
    }
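To see the continuation-passing structure of fib execute, here is a small single-processor simulation in Python. It is a sketch under several assumptions, not the Cilk runtime: continuations are modeled as (closure, slot number) tuples, one deque stands in for the ready deque, and `run_fib` with its root closure is a hypothetical driver added to collect the final value.

```python
from collections import deque

MISSING = object()  # marks an argument slot that is still empty ("?x" in Cilk)

class Closure:
    def __init__(self, func, *args):
        self.func = func
        self.slots = list(args)
        self.join_counter = sum(1 for a in args if a is MISSING)

ready = deque()  # single-processor stand-in for the ready deque

def spawn(func, *args):
    """Create a closure and queue it if no argument is missing."""
    closure = Closure(func, *args)
    if closure.join_counter == 0:
        ready.append(closure)
    return closure

def send_argument(k, value):
    """k is a (closure, slot number) pair, mirroring the 2-item continuation."""
    closure, slot = k
    closure.slots[slot] = value
    closure.join_counter -= 1
    if closure.join_counter == 0:
        ready.append(closure)

def fib(k, n):
    if n < 2:
        send_argument(k, n)
    else:
        successor = spawn(sum_thread, k, MISSING, MISSING)  # spawn_next sum(k, ?x, ?y)
        spawn(fib, (successor, 1), n - 1)                   # spawn fib(x, n - 1)
        spawn(fib, (successor, 2), n - 2)                   # spawn fib(y, n - 2)

def sum_thread(k, x, y):
    send_argument(k, x + y)

def run_fib(n):
    # Hypothetical root closure whose only job is to catch the final value.
    root = Closure(lambda value: value, MISSING)
    spawn(fib, (root, 0), n)
    while ready:
        closure = ready.pop()           # each thread runs to completion
        closure.func(*closure.slots)
    return root.slots[0]
```

Calling `run_fib(10)` drives the threads until the ready deque is empty and returns the value delivered to the root closure. No thread ever waits: a parent finishes immediately after spawning, and the sum successor runs only once both of its slots are filled.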

Nonblocking Threads: Advantages

Shallow call stack.

Simpler runtime system: completed threads leave the C runtime stack empty.

Portable runtime implementation.

Nonblocking Threads: Disadvantages

Burdens the programmer with explicit continuation passing.

Work-Stealing Scheduler

The concept of work stealing goes back at least as far as 1981.

Work stealing:

– A process with no work selects a victim from which to get work.
– It gets the shallowest thread in the victim’s spawn tree.

In Cilk, thieves choose victims randomly.

Thread Level

Stealing Work: The Ready Deque

Each closure has a level:

– level( child ) = level( parent ) + 1
– level( successor ) = level( parent )

Each processor maintains a ready deque:

– It contains ready closures.
– The Lth element contains the list of all ready closures whose level is L.

Ready deque

    if ( ! readyDeque.isEmpty() )
        take deepest thread
    else
        steal shallowest thread from readyDeque of randomly selected victim
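The scheduling rule above can be sketched as a level-indexed structure with one scheduling step per idle processor. This is an illustrative model only: `ReadyDeque`, `next_closure`, and the string "closures" are assumptions, and the order in which closures are taken within a single level is an arbitrary choice of this sketch.

```python
import random

class ReadyDeque:
    """Ready closures grouped by level; index L holds the closures at level L."""
    def __init__(self):
        self.levels = []

    def post(self, closure, level):
        while len(self.levels) <= level:
            self.levels.append([])
        self.levels[level].append(closure)

    def is_empty(self):
        return not any(self.levels)

    def take_deepest(self):
        for bucket in reversed(self.levels):
            if bucket:
                return bucket.pop()
        return None

    def take_shallowest(self):
        for bucket in self.levels:
            if bucket:
                return bucket.pop(0)
        return None

def next_closure(me, deques, rng=random):
    """One scheduling step for processor `me`: local work first, else steal."""
    mine = deques[me]
    if not mine.is_empty():
        return mine.take_deepest()
    others = [i for i in range(len(deques)) if i != me]
    if not others:
        return None
    victim = rng.choice(others)              # thieves choose victims randomly
    return deques[victim].take_shallowest()  # None if the victim has no work

# Two processors; only processor 0 has work, so processor 1 must steal from it.
deques = [ReadyDeque(), ReadyDeque()]
deques[0].post("root-successor", 0)
deques[0].post("child", 1)
deques[0].post("grandchild", 2)

local = next_closure(0, deques)   # processor 0 works on its deepest closure
stolen = next_closure(1, deques)  # processor 1 steals the shallowest closure
```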

Why Steal Shallowest closure?

Shallow threads probably produce more work and therefore reduce communication.

Shallow threads are more likely to be on the critical path.

Readying a Remote Closure

If a send_argument makes a remote closure ready, put the closure on the sending processor’s readyDeque.

– This costs extra communication.
– It is done to make the scheduler provably good.
– Putting the closure on the readyDeque local to it works well in practice.

Performance of Application

$T_{\text{serial}}$ = time for the C program

$T_1$ = time for the 1-processor Cilk program

$T_{\text{serial}} / T_1$ = efficiency of the Cilk program

– Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.

Performance of Applications

$T_1 / T_P$ = speedup

$T_1 / T_\infty$ = average parallelism

If the average parallelism is large, then the speedup is nearly perfect.

If the average parallelism is small, then the speedup is much smaller.

Performance Data

Performance of Applications

Application speedup = efficiency × speedup

$= ( T_{\text{serial}} / T_1 ) \times ( T_1 / T_P ) = T_{\text{serial}} / T_P$
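The ratios defined on these slides are simple to compute once the four times are measured. The sketch below does so for made-up timings; the numbers are placeholders for illustration, not data from the Cilk experiments, and the function names are assumptions.

```python
def efficiency(t_serial, t1):
    """T_serial / T_1: how much the Cilk overhead costs on one processor."""
    return t_serial / t1

def speedup(t1, tp):
    """T_1 / T_P: gain from running on P processors."""
    return t1 / tp

def average_parallelism(t1, t_inf):
    """T_1 / T_inf: work divided by critical path."""
    return t1 / t_inf

def application_speedup(t_serial, t1, tp):
    """efficiency x speedup, which simplifies to T_serial / T_P."""
    return efficiency(t_serial, t1) * speedup(t1, tp)

# Made-up timings in seconds, for illustration only.
T_SERIAL, T1, TP, T_INF = 90.0, 100.0, 12.5, 2.0
```

For these placeholder values, `application_speedup(T_SERIAL, T1, TP)` agrees with `T_SERIAL / TP`, which is the cancellation shown in the formula above.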

Modeling Performance

$T_P \ge \max( T_\infty, T_1 / P )$

A good scheduler should come close to

these lower bounds.

Modeling Performance

Empirical data suggests that for Cilk:

$T_P \approx c_1 T_1 / P + c_\infty T_\infty$,

where $c_1 \approx 1.067$ & $c_\infty \approx 1.042$.

If $T_1 / T_\infty > 10P$, then the critical path does not affect $T_P$.
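The empirical model can be evaluated directly to see why the critical-path term stops mattering when the average parallelism exceeds 10P. The sketch below uses the two constants quoted on the slide; the function names and the sample inputs are assumptions for illustration.

```python
C1 = 1.067     # fitted coefficient on the work term, from the slide
C_INF = 1.042  # fitted coefficient on the critical-path term, from the slide

def predicted_time(t1, t_inf, p, c1=C1, c_inf=C_INF):
    """T_P ~ c1 * T_1 / P + c_inf * T_inf."""
    return c1 * t1 / p + c_inf * t_inf

def critical_path_share(t1, t_inf, p):
    """Fraction of the predicted time contributed by the critical-path term."""
    return C_INF * t_inf / predicted_time(t1, t_inf, p)

# Sample inputs (made up): T_1 = 1000, T_inf = 1, P = 10, so T_1 / T_inf = 100 * P.
sample = predicted_time(1000, 1, 10)
```

At the threshold $T_1 / T_\infty = 10P$ the critical-path term contributes a bit under 9% of the predicted time under this model, and the share shrinks further as the average parallelism grows.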

Proven Property: Time

Time: including overhead, $T_P = O( T_1 / P + T_\infty )$, which is asymptotically optimal.

Conclusions

We can predict the performance of a Cilk program by observing machine-independent characteristics:

– Work
– Critical path

when the program is fully strict.

Cilk’s usefulness is unclear for other kinds of programs (e.g., iterative programs).

Conclusions ...

Explicit continuation passing is a nuisance.

It was subsequently removed (with more clever preprocessing).

Conclusions ...

Great systems research has a theoretical underpinning.

Such research identifies important properties

– of the systems themselves, or
– of our ability to reason about them formally.

Cilk identified 3 significant system properties:

– Fully strict programs
– Non-blocking threads
– Randomly choosing a victim

END

The Cost of Spawns

A spawn is about an order of magnitude more costly than a C function call.

Spawned threads running on the parent’s processor can be implemented more efficiently than remote spawns.

– This usually is the case.

Compiler techniques can exploit this distinction.

Communication Efficiency

A request is an attempt to steal work (the victim may not have work).

Requests/processor & steals/processor both grow as the critical path grows.

Proven Properties: Space

In a fully strict program, each thread sends arguments only to its parent’s successors.

For such programs, space, time, & communication bounds are proven.

Space: $S_P \le S_1 P$

– There exists a P-processor execution for which this is asymptotically optimal.

Proven Properties: Communication

Communication: the expected # of bits communicated in a P-processor execution is

$O( T_\infty P S_{\max} )$,

where $S_{\max}$ denotes the size of the program’s largest closure.

There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where $k > c \, T_\infty P S_{\max}$ for some constant c.


