Computer Architecture: Shared Memory Systems

CSE 502: Computer Architecture
Shared-Memory Multi-Processors

Shared-Memory Multiprocessors
• Multiple threads use shared memory (address space)
  - "SysV Shared Memory" or "Threads" in software
• Communication is implicit, via loads and stores
  - Opposite of explicit message-passing multiprocessors
• Theoretical foundation: the PRAM model
[Diagram: processors P1-P4 connected to a single shared memory system]
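As a concrete illustration (a minimal sketch using POSIX threads; the code is not from the slides), two threads communicate simply by storing to and loading from the same address, with no explicit message passing:

    #include <pthread.h>
    #include <stdio.h>

    int shared_value = 0;              /* lives in the one shared address space */

    void *producer(void *arg) {
        (void)arg;
        shared_value = 42;             /* ordinary store; no explicit "send"    */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(&t, NULL);        /* join orders the store before the load */
        printf("%d\n", shared_value);  /* ordinary load observes the store      */
        return 0;
    }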
Why Shared Memory?
• Pluses
  - App sees a multitasking uniprocessor
  - OS needs only evolutionary extensions
  - Communication happens without the OS
• Minuses
  - Synchronization is complex
  - Communication is implicit (hard to optimize)
  - Hard to implement (in hardware)
• Result
  - SMPs and CMPs are the most successful parallel machines to date
  - First with multi-billion-dollar markets
Paired vs. Separate Processor/Memory?
• Separate CPU/memory: uniform memory access (UMA)
  - Equal latency to all of memory
  - Low peak performance
• Paired CPU/memory: non-uniform memory access (NUMA)
  - Faster local memory
  - Data placement matters
  - High peak performance
[Diagrams: UMA, with CPUs (and caches) and memories on a common interconnect, vs. NUMA, with a memory paired to each CPU and nodes connected through routers (R)]
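Because data placement matters on NUMA machines, software can place memory explicitly. A hedged sketch using Linux's libnuma (assuming the library is available; link with -lnuma):

    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {          /* NUMA support present? */
            fprintf(stderr, "NUMA not available\n");
            return 1;
        }
        size_t len = 1 << 20;
        /* Back this 1 MiB buffer with memory on node 0, so threads running
           near node 0 get fast local accesses instead of remote ones. */
        double *buf = numa_alloc_onnode(len, 0);
        if (buf == NULL) return 1;
        buf[0] = 1.0;                        /* touch the locally placed memory */
        numa_free(buf, len);
        return 0;
    }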
Shared vs. Point-to-Point Networks
• Shared network (example: bus)
  - Low latency
  - Low bandwidth
  - Doesn't scale beyond ~16 cores
  - Simple cache coherence
• Point-to-point network (examples: mesh, ring)
  - High latency (many "hops")
  - Higher bandwidth
  - Scales to 1000s of cores
  - Complex cache coherence
[Diagrams: CPUs and memories on one shared bus vs. CPU/memory nodes each attached to a router (R) in a point-to-point fabric]
Organizing Point-To-Point Networks
• Network topology: the organization of the network
  - Trades off performance (connectivity, latency, bandwidth) against cost
• Router chips
  - Networks with separate router chips are indirect
  - Networks with processor/memory/router integrated on one chip are direct
  - Fewer components: "glueless MP"
[Diagrams: an indirect network with standalone router chips vs. a direct network with a router (R) built into every CPU/memory node]
Issues for Shared Memory Systems
• Two big ones
  - Cache coherence
  - Memory consistency model
• Closely related, and often confused
Cache Coherence: The Problem (1/2)
• Variable A initially has value 0
• P1 stores value 1 into A
• P2 loads A from memory and sees the old value 0
[Diagram: P1 and P2 each have an L1 cache on a shared bus to main memory (A: 0); at t1 P1's Store A=1 updates only P1's L1 copy (A: 0 → 1), so at t2 P2's Load A from memory still returns 0]
• Need to do something to keep P2's cache coherent
Cache Coherence: The Problem (2/2)
• P1 and P2 both have variable A (value 0) in their caches
• P1 stores value 1 into A
• P2 loads A from its own cache and sees the old value 0
[Diagram: both L1 caches start with A: 0; at t1 P1's Store A=1 updates only P1's copy (A: 0 → 1); at t2 P2's Load A hits its stale copy and returns 0]
• Need to do something to keep P2's cache coherent
Approaches to Cache Coherence
• Software-based solutions
  - Mechanisms: mark cache blocks/memory pages as cacheable/non-cacheable; add "Flush" and "Invalidate" instructions
  - Could be done by the compiler or the run-time system
  - Difficult to get perfect (e.g., what about memory aliasing?)
• Hardware solutions are far more common
  - System ensures everyone always sees the latest value
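A sketch of how software-managed coherence might look. The cache_flush and cache_invalidate hooks are hypothetical stand-ins for machine-specific flush/invalidate instructions, and the non-cacheable flag is an assumption; none of this code is from the slides:

    /* Hypothetical cache-control hooks standing in for machine-specific
       "Flush" and "Invalidate" instructions (stubbed out so this compiles). */
    static void cache_flush(void *addr, unsigned long len)      { (void)addr; (void)len; }
    static void cache_invalidate(void *addr, unsigned long len) { (void)addr; (void)len; }

    int buffer[1024];
    volatile int ready = 0;        /* assumed to be mapped non-cacheable */

    void producer(void) {
        for (int i = 0; i < 1024; i++) buffer[i] = i;
        cache_flush(buffer, sizeof buffer);       /* write dirty lines back to memory */
        ready = 1;
    }

    void consumer(void) {
        while (ready == 0) ;                      /* spin on the flag                 */
        cache_invalidate(buffer, sizeof buffer);  /* discard any stale cached copies  */
        /* reads of buffer[] now refetch from memory and see the producer's writes */
    }

Getting this right for every shared structure (including aliased pointers) is exactly what makes the software approach hard in practice.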
Coherence with Write-through Caches
• Allows multiple readers, but every write goes through to the bus
  - Requires a write-through, no-write-allocate cache
• All caches must monitor (aka "snoop") all bus traffic
• Simple state machine for each cache frame
[Diagram: P1 and P2 both hold A [V]: 0; at t1 P1's Store A=1 updates its copy and at t2 appears on the bus as BusWr A=1, updating memory (A: 0 → 1); at t3 P2 snoops the BusWr and invalidates its copy (A [V → I])]
Valid-Invalid Snooping Protocol
• Processor actions: Ld, St, BusRd, BusWr
• Bus messages: BusRd, BusWr
• Track 1 bit per cache frame: Valid/Invalid
• Transitions:
  - Invalid: Load / BusRd → Valid; Store / BusWr (no-write-allocate, stays Invalid)
  - Valid: Load / -- ; Store / BusWr (write-through, stays Valid); snooped BusWr / -- → Invalid
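A minimal sketch of this per-frame state machine in C (the type and function names are illustrative, not from the slides):

    #include <stdbool.h>

    typedef enum { VI_INVALID, VI_VALID } vi_state_t;
    typedef enum { VI_NONE, VI_BUS_RD, VI_BUS_WR } vi_bus_t;

    /* Processor-side transition for one cache frame: returns the new state
       and reports which bus message, if any, must be issued. */
    vi_state_t vi_on_processor(vi_state_t s, bool is_store, vi_bus_t *bus) {
        if (is_store) {                /* write-through: every store hits the bus  */
            *bus = VI_BUS_WR;
            return s;                  /* no-write-allocate: Invalid stays Invalid */
        }
        if (s == VI_INVALID) {         /* load miss: fetch the block               */
            *bus = VI_BUS_RD;
            return VI_VALID;
        }
        *bus = VI_NONE;                /* load hit                                 */
        return VI_VALID;
    }

    /* Snoop-side transition: another cache's write invalidates our copy. */
    vi_state_t vi_on_snoop(vi_state_t s, vi_bus_t observed) {
        return (observed == VI_BUS_WR) ? VI_INVALID : s;
    }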
Supporting Write-Back Caches
• Write-back caches are good
  - Drastically reduce bus write bandwidth
• Add a notion of "ownership" to Valid-Invalid
  - When the "owner" has the only replica of a cache block, it can update it freely
  - Multiple readers are OK, but no one may write without first gaining ownership
  - On a read, the system must check whether there is an owner; if so, take ownership away
Modified-Shared-Invalid (MSI) States
• Processor actions: Load, Store, Evict
• Bus messages: BusRd, BusRdX, BusInv, BusWB, BusReply
  - (For simplicity; some messages can be combined)
• Track 3 states per cache frame
  - Invalid: cache does not have a copy
  - Shared: cache has a read-only copy; clean (memory, or later caches, is up to date)
  - Modified: cache has the only valid copy; writable; dirty (memory, or later caches, is out of date)
Simple MSI Protocol (1/9)
• New transition: from Invalid, Load / BusRd → Shared
• Example (A is 0 in memory; P1 and P2 both start with A in state I)
  1: P1 executes Load A
  2: BusRd A goes on the bus
  3: BusReply A returns the data; P1's frame goes A [I → S]: 0
Simple MSI Protocol (2/9)
• New transitions in Shared: Load / -- (hit); snooped BusRd / [BusReply]
  - Brackets mark an optional action: a sharer may supply the data, or memory can
• Example (P1 already holds A [S]: 0)
  1: P2 executes Load A
  2: BusRd A goes on the bus
  3: BusReply A (from memory or from P1); P2's frame goes A [I → S]: 0
Simple MSI Protocol (3/9)
• New transition: from Shared, Evict / -- → Invalid (silent; clean data needs no write-back)
• Example (P1 and P2 both hold A [S]: 0)
  - P1 evicts A: its frame goes A [S → I] with no bus traffic; P2 keeps A [S]: 0
Simple MSI Protocol (4/9)
• New transitions: from Invalid, Store / BusRdX → Modified; from Shared, snooped BusRdX / [BusReply] → Invalid; in Modified, Load, Store / --
• Example (P1 has A Invalid, P2 holds A [S]: 0)
  1: P1 executes Store A
  2: BusRdX A goes on the bus
  3: BusReply A; P1's frame goes A [I → M]: 0 → 1, and P2, snooping the BusRdX, goes A [S → I]
Simple MSI Protocol (5/9)
• New transition: from Modified, snooped BusRd / BusReply → Shared
• Example (P1 holds A [M]: 1, P2 has A Invalid)
  1: P2 executes Load A
  2: BusRd A goes on the bus
  3: BusReply A is supplied by P1, which goes A [M → S]: 1; P2's frame goes A [I → S]: 1
  4: Memory "snarfs" the reply and updates A: 0 → 1
Simple MSI Protocol (6/9)
• New transitions: from Shared, Store / BusInv → Modified (aka an "Upgrade"); from Shared, snooped BusRdX or BusInv / [BusReply] → Invalid
• Example (P1 and P2 both hold A [S]: 1)
  1: P1 executes Store A (writing 2): an upgrade, since the data is already present
  2: BusInv A goes on the bus; P1's frame goes A [S → M]: 2 and P2 goes A [S → I]
Simple MSI Protocol (7/9)
• New transition: from Modified, snooped BusRdX / BusReply → Invalid
• Example (P1 holds A [M]: 2, P2 has A Invalid)
  1: P2 executes Store A (writing 3)
  2: BusRdX A goes on the bus
  3: BusReply A comes from P1, which goes A [M → I]; P2's frame goes A [I → M]: 3
Simple MSI Protocol (8/9)
• New transition: from Modified, Evict / BusWB → Invalid
• Example (P2 holds A [M]: 3; memory still has A: 1)
  1: P2 evicts A
  2: BusWB A writes the dirty data back; memory goes A: 1 → 3 and P2's frame goes A [M → I]
Simple MSI Protocol (9/9)
• Cache actions: Load, Store, Evict
• Bus actions: BusRd, BusRdX, BusInv, BusWB, BusReply
• Complete transition set:
  - Invalid: Load / BusRd → Shared; Store / BusRdX → Modified
  - Shared: Load / -- ; Evict / -- → Invalid; Store / BusInv → Modified; snooped BusRd / [BusReply]; snooped BusRdX or BusInv / [BusReply] → Invalid
  - Modified: Load, Store / -- ; Evict / BusWB → Invalid; snooped BusRd / BusReply → Shared; snooped BusRdX / BusReply → Invalid
• This is a usable coherence protocol
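The same transitions written out as a compact C sketch (the enums and function names are illustrative, not from the slides); the processor side and the snoop side are split into two functions:

    typedef enum { MSI_I, MSI_S, MSI_M } msi_state_t;
    typedef enum { MSI_NONE, MSI_BUS_RD, MSI_BUS_RDX, MSI_BUS_INV,
                   MSI_BUS_WB, MSI_BUS_REPLY } msi_bus_t;

    /* Processor-side transitions for one cache frame (Evict is omitted:
       Shared evicts silently, Modified evicts with a BusWB). */
    msi_state_t msi_on_processor(msi_state_t s, int is_store, msi_bus_t *bus) {
        *bus = MSI_NONE;
        switch (s) {
        case MSI_I:                              /* miss: fetch shared or exclusive */
            *bus = is_store ? MSI_BUS_RDX : MSI_BUS_RD;
            return is_store ? MSI_M : MSI_S;
        case MSI_S:                              /* read hits are silent            */
            if (is_store) { *bus = MSI_BUS_INV; return MSI_M; }   /* upgrade        */
            return MSI_S;
        case MSI_M:                              /* owner: loads and stores hit     */
            return MSI_M;
        }
        return s;
    }

    /* Snoop-side transitions: what a frame does when it observes the bus. */
    msi_state_t msi_on_snoop(msi_state_t s, msi_bus_t observed, msi_bus_t *bus) {
        *bus = MSI_NONE;
        if (s == MSI_M && observed == MSI_BUS_RD)  { *bus = MSI_BUS_REPLY; return MSI_S; }
        if (s == MSI_M && observed == MSI_BUS_RDX) { *bus = MSI_BUS_REPLY; return MSI_I; }
        if (s == MSI_S && (observed == MSI_BUS_RDX || observed == MSI_BUS_INV)) return MSI_I;
        return s;
    }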
Scalable Cache Coherence
• Part I: bus bandwidth
  - Replace the non-scalable bandwidth substrate (bus)...
  - ...with a scalable-bandwidth one (e.g., mesh)
• Part II: processor snooping bandwidth
  - Most snoops result in no action
  - Replace the non-scalable broadcast protocol (spam everyone)...
  - ...with a scalable directory protocol (spam only the cores that care)
  - Requires a "directory" to keep track of the "sharers"
Directory Coherence Protocols
• Extend memory to track caching information
• For each physical cache line, a home directory tracks:
  - Owner: the core that has a dirty copy (i.e., M state)
  - Sharers: the cores that have clean copies (i.e., S state)
• Cores send coherence events to the home directory
  - The home directory only sends events to the cores that care
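A sketch of what a home-directory entry might hold and how the home node could update it on a read; the struct layout, the bit-vector sharer encoding, and the names are illustrative assumptions, not from the slides:

    #include <stdint.h>

    #define MAX_CORES 64

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

    typedef struct {                 /* one entry per physical cache line      */
        dir_state_t state;
        uint64_t    sharers;         /* bit i set => core i has a clean copy   */
        int         owner;           /* valid only when state == DIR_MODIFIED  */
    } dir_entry_t;

    /* Home-node bookkeeping when core 'c' requests a read of the line.
       (Recalling or forwarding the data from the owner happens first.) */
    void dir_on_read(dir_entry_t *e, int c) {
        if (e->state == DIR_MODIFIED) {
            e->sharers = (1ull << e->owner) | (1ull << c);  /* old owner keeps an S copy */
            e->owner   = -1;
        } else {
            e->sharers |= 1ull << c;
        }
        e->state = DIR_SHARED;
    }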
Read Transaction
• L has a cache miss on a load instruction
  1: L sends a Read Req to the home node H
  2: H sends a Read Reply (with the data) back to L
4-hop Read Transaction
• L has a cache miss on a load instruction
• The block was previously in Modified state at R (directory: State M, Owner R)
  1: L sends a Read Req to the home node H
  2: H sends a Recall Req to R
  3: R sends a Recall Reply (with the data) back to H
  4: H sends a Read Reply to L
3-hop Read Transaction
• L has a cache miss on a load instruction
• The block was previously in Modified state at R (directory: State M, Owner R)
  1: L sends a Read Req to the home node H
  2: H forwards the Read Req to R
  3: R sends the Read Reply directly to L, and a Fwd'd Read Ack back to H
An Example Race: Writeback & Read
• L has a dirty copy and wants to write it back to H
• R concurrently sends a read to H (directory: State M, Owner L)
  1: L sends a WB Req to H
  2: R sends a Read Req to H
  3: H forwards the Read Req toward L
  4: Race! The writeback and the forwarded read cross; no need to ack
  5: H sends the Read Reply to R
  6: Race! Final state is S; no need to ack
• Races require complex intermediate states
Basic Operation: Read
• Core #1 misses on Read A and sends Read A to the directory
• The directory responds with Fill A and records the line as "A: Shared, #1"
• Typical way to reason about directories
Basic Operation: Write
• Core #1 has already read A, so the directory records "A: Shared, #1"
• Core #2 now wants to write A
  - #2 sends ReadExclusive A to the directory
  - The directory sends Invalidate A to sharer #1
  - #1 responds with Inv Ack A
  - The directory sends Fill A to #2 and records "A: Mod., #2"
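Continuing the illustrative dir_entry_t sketch above (again an assumption, not the slides' design), the home node's bookkeeping for this write transaction might look like:

    static void send_invalidate(int core)     { (void)core; /* would send Invalidate A   */ }
    static void send_fill_exclusive(int core) { (void)core; /* would send Fill A to core */ }

    /* Home-node handling of ReadExclusive from core 'c' when the line is Shared.
       (If another core owned the line in M, a recall would be needed first.) */
    void dir_on_read_exclusive(dir_entry_t *e, int c) {
        for (int i = 0; i < MAX_CORES; i++)
            if (((e->sharers >> i) & 1ull) && i != c)
                send_invalidate(i);            /* requester waits for these acks */
        e->sharers = 0;
        e->owner   = c;
        e->state   = DIR_MODIFIED;
        send_fill_exclusive(c);
    }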
 
 
Coherence vs. Consistency
• Coherence concerns only one memory location
• Consistency concerns ordering across all locations
• A memory system is coherent if
  - All operations to a given location can be serialized
  - Operations performed by any core appear in program order
  - A read returns the value written by the last store to that location
• A memory system is consistent if
  - It follows the rules of its memory model
  - Operations on memory locations appear in some defined order
Why Coherence != Consistency

    /* initial A = B = flag = 0 */
    P1:                      P2:
    A = 1;                   while (flag == 0); /* spin */
    B = 1;                   print A;
    flag = 1;                print B;

• Intuition says we should see "1" printed twice (for A and for B)
• Coherence doesn't say anything: these are different memory locations
• Uniprocessor ordering (the LSQ) won't help
• Consistency defines what "correct" behavior is
Sequential Consistency (SC)
• Processors issue memory ops in program order
• Conceptually, a switch in front of memory is set randomly after each memory op, picking which processor's next operation is performed
[Diagram: P1, P2, P3 feeding a single switch in front of Memory]
• Defines a single sequential order among all operations
Sufficient Conditions for SC
  "A multiprocessor is sequentially consistent if the result of any execution
   is the same as if the operations of all the processors were executed in
   some sequential order, and the operations of each individual processor
   appear in this sequence in the order specified by its program."
                                                            - Lamport, 1979
• Every processor issues memory ops in program order
• Memory ops happen (start and end) atomically
  - On a Store, wait for it to commit before issuing the next memory op
  - On a Load, wait for it to write back before issuing the next op
• Easy to reason about; very slow (without ugly tricks)
Mutual Exclusion Example
• Mutually exclusive access to a critical region
• Works as advertised under Sequential Consistency
• Fails if P1 and P2 see different Load/Store orders
  - OoO execution allows P1 to read B before its write of A commits

    P1:                              P2:
    lockA: A = 1;                    lockB: B = 1;
    if (B != 0)                      if (A != 0)
       { A = 0; goto lockA; }           { B = 0; goto lockB; }
    /* critical section */           /* critical section */
    A = 0;                           B = 0;
Problems with SC Memory Model
• Difficult to implement efficiently in hardware
• Straightforward implementations:
  - No concurrency among memory accesses
  - Strict ordering of memory accesses at each node
  - Essentially precludes out-of-order CPUs
• Unnecessarily restrictive
  - Most parallel programs won't notice out-of-order accesses
• Conflicts with latency-hiding techniques
Mutex Example w/ Store Buffer

    P1:                              P2:
    lockA: A = 1;                    lockB: B = 1;
    if (B != 0)                      if (A != 0)
       { A = 0; goto lockA; }           { B = 0; goto lockB; }
    /* critical section */           /* critical section */
    A = 0;                           B = 0;

• With store buffers (A and B both 0 in memory), the writes can be delayed past the reads:
  t1: P1 reads B (sees 0)   t2: P2 reads A (sees 0)   t3: P2's write of B reaches the bus   t4: P1's write of A reaches the bus
• Both processors enter the critical section: does not work
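A sketch of the usual fix (using C11 atomics; this particular code is illustrative, not from the slides): force the store to become globally visible before the following load, here with a sequentially consistent fence.

    #include <stdatomic.h>

    atomic_int A, B;   /* both initially 0 */

    /* P1's side of the Dekker-style lock, with an explicit fence so the
       store to A drains from the store buffer before B is read. */
    int p1_try_lock(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* enforce Store -> Load order */
        if (atomic_load_explicit(&B, memory_order_relaxed) != 0) {
            atomic_store_explicit(&A, 0, memory_order_relaxed);
            return 0;                                /* back off and retry (lockA)  */
        }
        return 1;                                    /* safe to enter the section   */
    }

P2's side is symmetric, with the roles of A and B swapped.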
Relaxed Consistency Models
• Sequential Consistency (SC) enforces
  - R → W, R → R, W → R, W → W   (X → Y means X must complete before Y)
• Total Store Ordering (TSO) relaxes W → R
  - R → W, R → R, W → W
• Partial Store Ordering also relaxes W → W (coalescing write buffers)
  - R → W, R → R
• Weak Ordering or Release Consistency (RC)
  - All ordering is explicitly declared
  - Use fences to define ordering boundaries
  - Use acquire and release to force flushing of values
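Under such a relaxed model, the earlier flag example needs its ordering declared explicitly. A sketch with C11 acquire/release (illustrative; the slides do not give this code):

    #include <stdatomic.h>
    #include <stdio.h>

    int A, B;                       /* ordinary data, initially 0            */
    atomic_int flag;                /* synchronization variable, initially 0 */

    void p1(void) {                 /* producer */
        A = 1;
        B = 1;
        /* release: all earlier stores are visible before flag becomes 1 */
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void p2(void) {                 /* consumer */
        /* acquire: later loads cannot be reordered before this load */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                       /* spin */
        printf("%d %d\n", A, B);    /* guaranteed to print "1 1" */
    }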
Atomic Operations & Synchronization
• Atomic operations perform multiple actions together
  - Each of these can implement the others
• Compare-and-Swap (CAS)
  - Compare the memory value to arg1; write arg2 on a match
• Test-and-Set
  - Overwrite the memory value with arg1 and return the old value
• Fetch-and-Increment
  - Increment the value in memory and return the old value
• Load-Linked/Store-Conditional (LL/SC)
  - Two operations, but the Store succeeds iff the value is unchanged
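For example, here is a sketch (using C11 atomics; not from the slides) of fetch-and-increment built from compare-and-swap, illustrating that one primitive can implement another:

    #include <stdatomic.h>

    /* Fetch-and-increment implemented with CAS: retry until no other core
       changed *p between our read and our compare-and-swap. */
    int fetch_and_increment(atomic_int *p) {
        int old = atomic_load(p);
        while (!atomic_compare_exchange_weak(p, &old, old + 1))
            ;   /* on failure, 'old' is refreshed with the current value */
        return old;
    }

The same retry-loop structure works for building test-and-set or a simple ticket lock on top of CAS.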