CILK: An Efficient Multithreaded Runtime System

 
C
I
L
K
:
 
A
n
 
E
f
f
i
c
i
e
n
t
 
M
u
l
t
i
t
h
r
e
a
d
e
d
R
u
n
t
i
m
e
 
S
y
s
t
e
m
 
 
People
 
n
Project at MIT & now at UT Austin
Bobby Blumofe (now UT Austin, Akamai)
Chris Joerg
Brad Kuszmaul (now Yale)
Charles Leiserson (MIT, Akamai)
Keith Randall (Bell Labs)
Yuli Zhou (Bell Labs)
 
Outline
 
n
Introduction
n
Programming environment
n
The work-stealing thread scheduler
n
Performance of applications
n
Modeling performance
n
Proven Properties
n
Conclusions
 
Introduction
 
n
Why multithreading?
To implement dynamic, asynchronous,
concurrent programs.
n
 Cilk programmer optimizes:
total work
critical path
n
A Cilk computation is viewed as a
dynamic
,
 directed acyclic graph (dag)
 
Introduction ...
 
Introduction ...
 
n
Cilk 
program
 is a set of 
procedures
n
A 
procedure
 
is a 
sequence
 of 
threads
n
Cilk 
threads
 are:
represented by nodes in the dag
Non-blocking
: run to completion: 
no
 
waiting or
suspension: 
atomic
 units of execution
 
Introduction ...
 
n
Threads can 
spawn
 child threads
downward
 edges connect a parent to its children
n
A child & parent can run 
concurrently
.
Non-blocking threads 
 a child 
cannot
 return a
value to its parent.
The parent spawns a 
successor
  that receives
values from its children
 
Introduction ...
 
n
A thread & its successor are parts of the
same Cilk procedure.
connected by 
horizontal
 arcs
n
Children’s 
returned values
 are received
before their successor begins:
They constitute data dependencies.
Connected by 
curved
 arcs
 
Introduction ...
 
Introduction: Execution Time
 
n
Execution time of a Cilk program using
P processors 
depends on
:
Work (T
1
):
 time for Cilk program with 1
processor to complete.
Critical path (T
):
 the time to execute the
longest directed path in the dag.
T
P
 >= T
1 
/ P
 
(not true for some searches)
T
P
 >= T
 
Introduction: Scheduling
 
n
Cilk uses 
run time scheduling
 called
work stealing
.
n
Works well on 
dynamic
, asynchronous,
MIMD-style programs.
n
For “fully strict” programs, Cilk achieves
asymptotic
 optimality for:
space, time, & communication
 
Introduction: language
 
n
Cilk is an extension of C
n
Cilk programs are:
preprocessed to C
linked with a runtime library
 
Programming Environment
 
n
Declaring a thread:
thread T ( <args> ) { <stmts> }
n
T is preprocessed into a C function of 1
argument and return type 
void
.
n
The 1 argument is a pointer to a 
closure
 
Environment: Closure
 
n
A 
closure
 
is a data structure that has:
a pointer to the C function for T
a slot for each argument 
(inputs & continuations)
a 
join counter
: count of the missing argument values
n
A closure is 
ready
 when join 
counter == 0
.
n
A closure is 
waiting
 otherwise.
n
They are allocated from a runtime heap
 
Environment: Continuation
 
n
A Cilk 
continuation
 
is a data type,
denoted by the keyword 
cont
.
cont int x;
n
It is a global reference to an 
empty slot
of a closure
.
n
It is implemented as 2 items:
a 
pointer
 to the closure; 
(what thread)
an 
int
 value: the slot number.  
(what input)
 
Environment: Closure
 
Environment: spawn
 
n
To 
spawn
 a 
child
, a thread creates its closure:
spawn T (<args> )
creates child’s closure
sets available arguments
sets join counter
n
To specify a missing argument, prefix with a “?”
spawn T (k, ?x);
 
Environment: spawn_next
 
n
A 
successor
 thread is spawned the
same way as a child, except the
keyword 
spawn_next
 is used:
spawn_next T(k, ?x)
n
Children typically have no missing
arguments; successors do.
 
Explicit continuation passing
 
n
Nonblocking threads 
 a parent cannot
block on children’s results.
n
It spawns a 
successor
 thread.
n
This communication paradigm is called
explicit continuation passing
.
n
Cilk provides a primitive to 
send a value
from one closure to another.
 
send_argument
 
n
Cilk provides the primitive
send_argument( k, value )
sends 
value
 to the argument slot of a
waiting closure specified by continuation 
k
.
 
spawn
 
spawn_next
 
send_argument
 
parent
 
child
 
successor
 
Cilk Procedure for computing
a Fibonacci number
 
thread
 
int fib
 
( 
cont
 
int k, int n ) {
   if ( n < 2 )
 
send_argument
( k, n );
   
else {
 
cont
 
int x, y;
             
spawn_next
 
sum ( k, ?x, ?y );
             
spawn
 
fib ( x, n  - 1 );
             
spawn
 
fib ( y, n - 2 );
 
   }
}
thread 
sum ( 
cont
 
int k, int x, int y ) {
  
send_argument
 
( k, x + y );
}
 
Nonblocking Threads: Advantages
 
n
Shallow
 call stack
.
n
Simplify
 runtime system:
Completed threads leave C runtime stack empty.
n
Portable
 runtime implementation
 
Nonblocking Threads:
Disdvantages
 
Burdens programmer with explicit
continuation passing.
 
Work-Stealing Scheduler
 
n
The concept of work-stealing goes at
least as far back as 1981.
n
Work-stealing:
a process with no work selects a 
victim
 from
which to get work.
it gets the 
shallowest
 thread in the victim’s
spawn tree.
n
In Cilk, thieves choose victims 
randomly
.
 
Thread Level
 
Stealing Work: The Ready Deque
 
n
Each closure has a level:
level( child ) = level( parent ) + 1
level( successor ) = level( parent )
n
Each processor maintains a 
ready
 
deque
:
Contains ready closures
The 
L
th
 element contains the list of all ready
closures whose level is 
L
.
 
Ready deque
 
if
 ( ! readyDeque .isEmpty()  )
 
take 
deepest
 thread
else
 
steal
 
shallowest
 thread from
readyDeque of 
randomly
selected
  victim
 
Why Steal Shallowest closure?
 
n
Shallow threads 
probably
  produce 
more work
,
therefore, 
reduce
 
communication
.
n
Shallow threads 
more likely to be
 on 
critical path
.
 
Readying a Remote Closure
 
n
If a 
send_argument
 makes a 
remote
 closure
ready
,
 
put closure on 
sending
 processor’s readyDeque
 
extra
 communication.
Done to make scheduler 
provably
 good
Putting on local readyDeque works well in practice.
 
Performance of Application
 
n
T
serial
 = time for C program
n
T
1
 = time for 1-processor Cilk program
n
T
serial 
/T
1
 = 
efficiency
 of the Cilk program
Efficiency
 
is
 
close to 1 for programs with
moderately long
 threads: Cilk overhead is small.
 
Performance of Applications
 
n
T
1
/T
P 
= 
speedup
n
T
1
/ T
 
= 
average parallelism
n
If average parallelism is 
large
 
then speedup is nearly perfect.
n
If average parallelism is 
small
 
then speedup is much smaller.
 
Performance Data
 
Performance of Applications
 
n
Application speedup
 = efficiency X speedup
n
= ( T
serial 
/T
1
 ) X ( T
1
/T
P  
) = T
serial
 / T
P
 
Modeling Performance
 
n
T
P 
>= max( T
 
, T
1 
/ P )
n
A good scheduler should come close to
these lower bounds.
 
Modeling Performance
 
Empirical data suggests that for Cilk:
T
P
 
 
c
1
 T
1
 / P + 
c 
 T
 ,
where c
1
 
 1.067  & c 
 
 1.042
If     T
1 
/ T
 
> 10P
 
then critical path does not affect T
P
.
 
Proven Property: Time
 
Time
: Including overhead,
 
T
P
 = O( T
1
/P + T
 
),
 
which is 
asymptotically
 optimal
 
Conclusions
 
n
We can predict the performance of a Cilk
program by observing 
machine-independent
characteristics:
Work
Critical path
 
when the program is 
fully-strict
.
n
Cilk’s usefulness is unclear for other kinds of
programs (e.g., iterative programs).
 
Conclusions ...
 
Explicit continuation passing a nuisance.
It subsequently was removed (with more clever
pre-processing).
 
Conclusions ...
 
n
Great system research has a theoretical
underpinning.
n
Such research identifies important properties
of the systems themselves, or
of our ability to reason about them formally.
n
Cilk identified 3 significant system properties:
Fully strict programs
Non-blocking threads
Randomly choosing a victim.
 
END
 
The Cost of Spawns
 
n
A spawn is about 
an order of magnitude
 more
costly than a C function call.
n
Spawned threads running on parent’s processor
can be
 implemented more efficiently than
remote spawns.
This usually is the case.
n
Compiler techniques can exploit this distinction.
 
Communication Efficiency
 
n
A 
request
 is an 
attempt
 to steal work
(the victim may not have work).
n
Requests/processor & steals/processor
both 
grow as the critical path grows
.
 
Proven Properties: Space
 
n
A 
fully strict
 program’s threads send
arguments only to its parent’s successors.
n
For such programs, space, time, &
communication bounds are proven.
n
Space
: 
S
P
 <= S
1
 P
.
There 
exists
 a P-processor execution for
which this is asymptotically optimal.
 
Proven Properties:
Communication
 
Communication
: The expected # of bits
communicated in a P-processor execution is:
 
O( T
 P S
MAX 
)
  
where 
S
MAX
 denotes its largest closure.
 
There exists a program such that, for all P, there
exists
 a 
P
-processor execution that communicates 
k
bits, where 
k > c T
 P S
MAX
, for some constant, 
c
.
Slide Note
Embed
Share

CILK is a multithreaded runtime system designed to develop dynamic, asynchronous, and concurrent programs efficiently. It utilizes a work-stealing thread scheduler and relies on a directed acyclic graph (DAG) model for computations. With a focus on optimizing critical paths and total work, CILK enables non-blocking, concurrent execution of threads within procedures, enhancing performance and scalability.

  • CILK
  • Multithreading
  • Runtime System
  • Concurrency
  • Optimization

Uploaded on Oct 08, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. CILK: An Efficient Multithreaded Runtime System

  2. People Project at MIT & now at UT Austin Bobby Blumofe (now UT Austin, Akamai) Chris Joerg Brad Kuszmaul (now Yale) Charles Leiserson (MIT, Akamai) Keith Randall (Bell Labs) Yuli Zhou (Bell Labs)

  3. Outline Introduction Programming environment The work-stealing thread scheduler Performance of applications Modeling performance Proven Properties Conclusions

  4. Introduction Why multithreading? To implement dynamic, asynchronous, concurrent programs. Cilk programmer optimizes: total work critical path A Cilk computation is viewed as a dynamic, directed acyclic graph (dag)

  5. Introduction ...

  6. Introduction ... Cilk program is a set of procedures A procedure is a sequence of threads Cilk threads are: represented by nodes in the dag Non-blocking: run to completion: no waiting or suspension: atomic units of execution

  7. Introduction ... Threads can spawn child threads downward edges connect a parent to its children A child & parent can run concurrently. Non-blocking threads a child cannot return a value to its parent. The parent spawns a successor that receives values from its children

  8. Introduction ... A thread & its successor are parts of the same Cilk procedure. connected by horizontal arcs Children s returned values are received before their successor begins: They constitute data dependencies. Connected by curved arcs

  9. Introduction ...

  10. Introduction: Execution Time Execution time of a Cilk program using P processors depends on: Work (T1): time for Cilk program with 1 processor to complete. Critical path (T ): the time to execute the longest directed path in the dag. TP >= T1 / P (not true for some searches) TP >= T

  11. Introduction: Scheduling Cilk uses run time scheduling called work stealing. Works well on dynamic, asynchronous, MIMD-style programs. For fully strict programs, Cilk achieves asymptotic optimality for: space, time, & communication

  12. Introduction: language Cilk is an extension of C Cilk programs are: preprocessed to C linked with a runtime library

  13. Programming Environment Declaring a thread: thread T ( <args> ) { <stmts> } T is preprocessed into a C function of 1 argument and return type void. The 1 argument is a pointer to a closure

  14. Environment: Closure A closure is a data structure that has: a pointer to the C function for T a slot for each argument (inputs & continuations) a join counter: count of the missing argument values A closure is ready when join counter == 0. A closure is waiting otherwise. They are allocated from a runtime heap

  15. Environment: Continuation A Cilk continuation is a data type, denoted by the keyword cont. cont int x; It is a global reference to an empty slot of a closure. It is implemented as 2 items: a pointer to the closure; (what thread) an int value: the slot number. (what input)

  16. Environment: Closure

  17. Environment: spawn To spawn a child, a thread creates its closure: spawn T (<args> ) creates child s closure sets available arguments sets join counter To specify a missing argument, prefix with a ? spawn T (k, ?x);

  18. Environment: spawn_next A successor thread is spawned the same way as a child, except the keyword spawn_next is used: spawn_next T(k, ?x) Children typically have no missing arguments; successors do.

  19. Explicit continuation passing Nonblocking threads a parent cannot block on children s results. It spawns a successor thread. This communication paradigm is called explicit continuation passing. Cilk provides a primitive to send a value from one closure to another.

  20. send_argument Cilk provides the primitive send_argument( k, value ) sends value to the argument slot of a waiting closure specified by continuation k. spawn_next successor parent spawn send_argument child

  21. Cilk Procedure for computing a Fibonacci number thread int fib ( cont int k, int n ) { if ( n < 2 ) send_argument( k, n ); else { cont int x, y; spawn_next sum ( k, ?x, ?y ); spawn fib ( x, n - 1 ); spawn fib ( y, n - 2 ); } } thread sum ( cont int k, int x, int y ) { send_argument ( k, x + y ); }

  22. Nonblocking Threads: Advantages Shallow call stack. Simplify runtime system: Completed threads leave C runtime stack empty. Portable runtime implementation

  23. Nonblocking Threads: Disdvantages Burdens programmer with explicit continuation passing.

  24. Work-Stealing Scheduler The concept of work-stealing goes at least as far back as 1981. Work-stealing: a process with no work selects a victim from which to get work. it gets the shallowest thread in the victim s spawn tree. In Cilk, thieves choose victims randomly.

  25. Thread Level

  26. Stealing Work: The Ready Deque Each closure has a level: level( child ) = level( parent ) + 1 level( successor ) = level( parent ) Each processor maintains a ready deque: Contains ready closures The Lth element contains the list of all ready closures whose level is L.

  27. Ready deque if ( ! readyDeque .isEmpty() ) take deepest thread else steal shallowest thread from readyDeque of randomly selected victim

  28. Why Steal Shallowest closure? Shallow threads probably produce more work, therefore, reduce communication. Shallow threads more likely to be on critical path.

  29. Readying a Remote Closure If a send_argument makes a remote closure ready, put closure on sending processor s readyDeque extra communication. Done to make scheduler provably good Putting on local readyDeque works well in practice.

  30. Performance of Application Tserial = time for C program T1 = time for 1-processor Cilk program Tserial /T1 = efficiency of the Cilk program Efficiency is close to 1 for programs with moderately long threads: Cilk overhead is small.

  31. Performance of Applications T1/TP = speedup T1/ T = average parallelism If average parallelism is large then speedup is nearly perfect. If average parallelism is small then speedup is much smaller.

  32. Performance Data

  33. Performance of Applications Application speedup = efficiency X speedup = ( Tserial /T1 ) X ( T1/TP ) = Tserial / TP

  34. Modeling Performance TP >= max( T , T1 / P ) A good scheduler should come close to these lower bounds.

  35. Modeling Performance Empirical data suggests that for Cilk: TP c1 T1 / P + c T , where c1 1.067 & c 1.042 If T1 / T > 10P then critical path does not affect TP.

  36. Proven Property: Time Time: Including overhead, TP = O( T1/P + T ), which is asymptotically optimal

  37. Conclusions We can predict the performance of a Cilk program by observing machine-independent characteristics: Work Critical path when the program is fully-strict. Cilk s usefulness is unclear for other kinds of programs (e.g., iterative programs).

  38. Conclusions ... Explicit continuation passing a nuisance. It subsequently was removed (with more clever pre-processing).

  39. Conclusions ... Great system research has a theoretical underpinning. Such research identifies important properties of the systems themselves, or of our ability to reason about them formally. Cilk identified 3 significant system properties: Fully strict programs Non-blocking threads Randomly choosing a victim.

  40. END

  41. The Cost of Spawns A spawn is about an order of magnitude more costly than a C function call. Spawned threads running on parent s processor can be implemented more efficiently than remote spawns. This usually is the case. Compiler techniques can exploit this distinction.

  42. Communication Efficiency A request is an attempt to steal work (the victim may not have work). Requests/processor & steals/processor both grow as the critical path grows.

  43. Proven Properties: Space A fully strict program s threads send arguments only to its parent s successors. For such programs, space, time, & communication bounds are proven. Space: SP <= S1 P. There exists a P-processor execution for which this is asymptotically optimal.

  44. Proven Properties: Communication Communication: The expected # of bits communicated in a P-processor execution is: O( T P SMAX ) where SMAX denotes its largest closure. There exists a program such that, for all P, there exists a P-processor execution that communicates k bits, where k > c T P SMAX, for some constant, c.

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#