Efficient Support for Synchronization Without Invalidations

Addressing data races in complex software and inefficiency in hardware synchronization support, DeNovoSync provides efficient support for arbitrary synchronization without writer-initiated invalidations. Built on disciplined shared memory with structured synchronization and explicit memory side effects, the approach simplifies software, allows safe non-determinism, and makes data accesses more efficient. Where earlier DeNovo designs restricted synchronization to focus on data accesses, DeNovoSync targets the unstructured synchronization common in runtimes and operating systems.


Presentation Transcript


  1. DeNovoSync: Efficient Support for Arbitrary Synchronization without Writer-Initiated Invalidations. Hyojin Sung and Sarita Adve, Department of Computer Science, University of Illinois, EPFL.

  2-7. Motivation (progressive build of one slide). Complex software (data races, non-determinism, implicit communication, ...) meets complex, inefficient hardware (complex coherence, consistency, unnecessary traffic, ...) across a WILD shared memory. Disciplining that shared memory with structured synchronization plus explicit memory side effects makes the software simpler: no data races, safe non-determinism, explicit sharing.

  8. Motivation (cont.). Disciplined shared memory also makes the hardware simpler and more efficient: DeNovo [PACT 11], DeNovoND [ASPLOS 13, Top Picks 14]. BUT these designs focus on data accesses and restrict synchronization, while much software (runtimes, OS, ...) uses unstructured synchronization.

  9. Motivation (cont.). DeNovoSync: support arbitrary synchronization with the advantages of DeNovo.

  10. Supporting Arbitrary Synchronization: The Challenge. MESI: the writer sends invalidations to cached copies to avoid stale data, at the cost of directory storage, inv/ack messages, transient states, ... Prior DeNovo instead has readers self-invalidate stale data, but assumes race-freedom and restricted synchronization with special hardware support. BUT what about arbitrary synchronization? The naive answer: don't cache synchronization variables.

  11. Contributions of DeNovoSync. DeNovoSync caches arbitrary synchronization without writer invalidations, keeping the simplicity, performance, and energy advantages of DeNovo without software restrictions. Evaluated against MESI on 24 kernels (16 and 64 cores) and 13 apps. Kernels: 22% lower execution time and 58% lower traffic for 44 of 48 cases. Apps: 4% lower execution time and 24% lower traffic for 12 of 13 cases.

  12. Outline: Motivation; Background: DeNovo Coherence for Data; DeNovoSync Design; Experiments; Conclusions.

  13. DeNovo Coherence for Data (1 of 2). Original DeNovo software assumptions [PACT 11]: programs are data-race-free; synchronization is barriers demarcating parallel phases; writeable data regions in a parallel phase are explicit. Coherence: a read hit must not return stale data, so before the next parallel phase each cache selectively self-invalidates (it need not invalidate data it accessed in the previous phase). A read miss must find the one up-to-date copy: a write miss registers at the directory, and the shared LLC data arrays double as a registry, keeping either valid data or the registered core's id.

  14. DeNovo Coherence for Data (2 of 2). Three stable states and no transient states: a read takes Invalid to Valid, a write takes any state to Registered, and reads and writes hit in Registered. No invalidation traffic, no directory storage overhead, and no false sharing (word-granularity coherence): more complexity-, performance-, and energy-efficient than MESI. DeNovoND adds structured locks [ASPLOS 13, Top Picks 14]: self-invalidate at lock acquire, determine what data to self-invalidate from dynamically collected modified-data signatures, with special hardware support for locks. But how to handle arbitrary synchronization?
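
To make the transitions on this slide concrete, here is a minimal C++ sketch of the three stable states for one word in a private cache. The structure and names (Word, the fetch/register callbacks, the accessed_last_phase flag) are illustrative assumptions, not the authors' hardware.

      // Sketch: three stable DeNovo states for one word in a private cache.
      // No transient states, no invalidation messages: writes register at
      // the shared LLC, and stale copies are removed only by reader
      // self-invalidation at phase boundaries.
      #include <cstdint>
      #include <functional>

      enum class State { Invalid, Valid, Registered };

      struct Word {
          State    state = State::Invalid;
          uint32_t data  = 0;
      };

      // Read: hits in Valid or Registered; on Invalid, fetch the one
      // up-to-date copy via the LLC registry.
      uint32_t read(Word& w, const std::function<uint32_t()>& fetch_via_registry) {
          if (w.state == State::Invalid) {
              w.data  = fetch_via_registry();
              w.state = State::Valid;
          }
          return w.data;
      }

      // Write: registers at the LLC if not already the owner; no
      // invalidations are sent to other caches.
      void write(Word& w, uint32_t v, const std::function<void()>& register_at_llc) {
          if (w.state != State::Registered) register_at_llc();
          w.data  = v;
          w.state = State::Registered;
      }

      // Before the next parallel phase: selectively self-invalidate words
      // that may be stale; data accessed in the previous phase is kept.
      void self_invalidate(Word& w, bool accessed_last_phase) {
          if (w.state == State::Valid && !accessed_last_phase)
              w.state = State::Invalid;
      }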

  15. Outline: Motivation; Background: DeNovo Coherence for Data; DeNovoSync Design; Experiments; Conclusions.

  16. Unstructured Synchronization. Michael-Scott non-blocking queue, enqueueing a new node w:

      void queue::enqueue(value v):
          node *w := new node(v, null)       // new node to be inserted
          ptr t, n
          loop:
              t := tail
              n := t->next
              if t == tail:                  // t and n still consistent?
                  if n == null:
                      if CAS(&t->next, n, w): break   // link w after the last node
                  else:
                      CAS(&tail, t, n)       // tail is lagging: help advance it
          CAS(&tail, t, w)                   // swing tail to the new node

  Data accesses are ordered by synchronization: self-invalidate at synch points using static regions or dynamic signatures. But what about the synchronization accesses themselves?

  17. DeNovoSync: Software Requirements. Software requirement: data-race-freedom, with synchronization distinguished from data accesses for the hardware; this is obeyed by C++, C, Java, ... Semantics: sequential consistency. Optional software information about data can further improve performance.
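
As a concrete illustration of this requirement (an assumption about how it maps to source code, not the paper's example), C++ already provides the needed distinction: std::atomic operations are synchronization accesses, everything else is data, and the default seq_cst ordering matches the sequential-consistency semantics named above.

      #include <atomic>

      std::atomic<bool> flag{false};  // synchronization variable, visible to hardware as synch accesses
      int data = 0;                   // ordinary data access: must be race-free

      void producer() {
          data = 42;          // data write, ordered before the synch write
          flag.store(true);   // synch write (seq_cst by default)
      }

      void consumer() {
          while (!flag.load()) { }  // synch read: must not return stale data
          int v = data;             // data-race-free: the write to data is visible here
          (void)v;
      }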

  18. DeNovoSync0 Protocol. Key: a synch read should not return stale data. When should a synch location be self-invalidated? On every synch read? On every synch read to non-registered state? [Figure: two cores running a test&test&set lock; Core 1 acquires (spinning LDs, then CAS) and releases (ST) while Core 2's LDs spin.] DeNovoSync0 instead registers (serializes) synch reads: successive reads hit, and updates propagate to readers.
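
For reference, the test&test&set acquire traced on this slide, rendered in C++ atomics (an illustrative rendering, not the paper's code): the spinning loads are the LDs, the compare-and-swap is the CAS, and the release store is the ST.

      #include <atomic>

      std::atomic<int> lock{0};

      void acquire() {
          for (;;) {
              while (lock.load() != 0) { }   // "test": spinning synch reads (LD)
              int expected = 0;
              if (lock.compare_exchange_strong(expected, 1))
                  return;                    // "test&set" succeeded (CAS)
          }
      }

      void release() {
          lock.store(0);                     // REL (ST)
      }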

  19. DeNovoSync0 Protocol (cont.). [Figure: three cores; Cores 2 and 3 spin with synch LDs while Core 1 acquires and releases.] Registering synch reads makes successive reads hit and propagates updates to readers, BUT it causes many registration transfers for Read-Read races.
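
A minimal sketch of how the shared-LLC registry could serialize synch reads as just described; the entry layout and the fetch_from_core callback are assumptions for illustration. Because each synch read miss transfers registration to the reader, contending spinners hand the registration back and forth, which is exactly the Read-Read race cost noted above.

      #include <cstdint>
      #include <functional>
      #include <variant>

      struct RegistryEntry {
          // The LLC data array doubles as the registry: it holds either
          // up-to-date data or the id of the registered core.
          std::variant<uint32_t, int> v{uint32_t{0}};  // data or core id

          // Synch read miss from `core`: obtain the up-to-date value
          // (from the previous owner if a core is registered), then
          // register the reader so its successive reads hit.
          uint32_t synch_read(int core,
                              const std::function<uint32_t(int)>& fetch_from_core) {
              uint32_t data = std::holds_alternative<uint32_t>(v)
                                  ? std::get<uint32_t>(v)
                                  : fetch_from_core(std::get<int>(v));
              v = core;  // registration transfers to the reader
              return data;
          }
      };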

  20. DeNovoSync = DeNovoSync0 + Hardware Backoff. Hardware backoff reduces Read-Read races: remote synch read requests are a hint for contention, so the next (local) synch read miss is delayed for a number of backoff cycles. Two-level adaptive counters compute that number: B, a per-core backoff counter, is updated on each remote synch read request as B ← B + I; I, a per-core increment counter, is updated on every Nth remote synch read request as I ← I + D, where D is a default increment value and N is determined by the system configuration. Intuition: Read-Read races mean contention, so back off; more Read-Read races mean more contention, so back off longer.
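
A minimal sketch of the two-level adaptive counters, assuming per-core state and placeholder values for N and D (the slide leaves N to the system configuration and does not give D):

      #include <cstdint>

      struct BackoffCounters {
          uint32_t B = 0;        // per-core backoff counter (cycles)
          uint32_t I = 0;        // per-core increment counter
          uint32_t remote = 0;   // remote synch read requests seen

          static constexpr uint32_t N = 4;  // assumed configuration value
          static constexpr uint32_t D = 8;  // assumed default increment

          // A remote synch read request is a hint for contention.
          void on_remote_synch_read() {
              B += I;                        // back off
              if (++remote % N == 0)
                  I += D;                    // more contention: back off longer
          }

          // Delay the next local synch read miss by this many cycles.
          uint32_t backoff_cycles() const { return B; }
      };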

  21-26. Example: Test&Test&Set Lock (animation across six slides). Three threads on Cores 1-3 contend for a test&test&set lock. Core 1's acquire registers the lock word at the shared-cache registry, so its subsequent spinning load hits locally. As Cores 2 and 3 begin spinning, their synch read misses transfer the registration between cores (C1, C2, C3, ...), and each remote request grows the other cores' backoff counters (the animation shows a counter reaching 16), so a contending core delays its next synch read miss rather than immediately re-registering. Meanwhile the release store and the winning CAS proceed, and a backed-off core's later load hits on the up-to-date value. Takeaway: hardware backoff reduces cache misses from Read-Read races.

  27. Outline: Motivation; Background: DeNovo Coherence for Data; DeNovoSync Design; Experiments (Methodology, Qualitative Analysis, Results); Conclusions.

  28. Methodology. Compared MESI vs. DeNovoSync0 vs. DeNovoSync using the Simics full-system simulator with GEMS and Garnet for memory and network simulation, on 16 and 64 in-order cores. Metrics: execution time and network traffic. Workloads: 24 synchronization kernels (lock-based: Test&Test&Set and array locks; non-blocking data structures; barriers: centralized and tree, balanced and unbalanced) and 13 application benchmarks from SPLASH-2 and PARSEC 3.1, with data sharing annotated statically (a choice orthogonal to this paper).

  29. Qualitative Analysis. Analyze costs (execution time, traffic) in two parts. [Figure: an acquire as a stream of spinning synch LDs ending in a CAS.] Linearization point: the ordering of the linearization instruction is the ordering of the method; usually on the critical path. Pre-linearization points: non-linearization instructions that do not determine ordering; usually checks, not on the critical path.

  30-36. Qualitative Analysis Example (1 of 2, progressive build). Multiple readers, one succeeds: Test&Test&Set locks.

                     Linearization                        Pre-linearization
       MESI          Release has high inv overhead,       Local spinning
                     on critical path to next acquire
       DeNovoSync0   No inv overhead                      Read-Read races, but not on critical path
       DeNovoSync    No inv overhead                      Backoff mitigates Read-Read races

  DeNovo is expected to be better than MESI; a similar analysis holds for non-blocking constructs.

  37. Qualitative Analysis Example (2 of 2). Many readers, all succeed (centralized barriers): MESI has high linearization cost due to invalidations; DeNovo has high linearization cost due to serialized read registrations. One writer, one reader (tree barriers, array locks): DeNovo and MESI are comparable to first order. This qualitative analysis considers only synchronization; data effects (self-invalidation, coherence granularity, ...) are orthogonal to this work but affect the experimental results.

  38. Synchronization Kernels: Execution Time (64 cores). [Bar chart: execution time for MESI (M), DeNovoSync0 (DS0), and DeNovoSync (DS), normalized to MESI, across non-blocking structures (M-S queue, Herlihy stack, Herlihy heap), barriers (tree and centralized, unbalanced), and Test&Test&Set and array locks (single, queue, stack, heap variants).] For 44 of 48 cases: 22% lower execution time and 58% lower traffic (traffic not shown). The remaining 4 cases: centralized unbalanced barriers (where tree barriers are better for MESI too) and heap with array locks (which needs dynamic data signatures for self-invalidation).

  39. Applications (64 cores). [Bar charts: execution time and network traffic for MESI (M) and DeNovoSync (DS), normalized to MESI, for FFT, LU, blackscholes, swaptions, radix, bodytrack, barnes, water, ocean, fluidanimate, canneal, ferret, and x264.] For 12 of 13 cases: 4% lower execution time and 24% lower traffic. Memory time is dominated by data (rather than synchronization) accesses.

  40. Conclusions. DeNovoSync is the first design to cache arbitrary synchronization without writer-initiated invalidations, using registered reads plus hardware backoff, while keeping the simplicity, performance, and energy advantages of DeNovo: no transient states, no directory storage, no inv/acks, no false sharing, ... DeNovoSync vs. MESI: for kernels, 22% lower execution time and 58% lower traffic in 44 of 48 cases; for apps, 4% lower execution time and 24% lower traffic in 12 of 13 cases. Complexity-, performance-, and energy-efficiency without software restrictions. Future work: DeNovo with heterogeneity [ISCA 15], dynamic data signatures.
