Efficient Hardware Support for Disciplined Non-Determinism in Shared Memory Systems

Slide Note
Embed
Share

DeNovoND presents a holistic approach to rethinking memory hierarchy, aiming to enhance hardware efficiency for disciplined non-determinism in shared memory systems. The work addresses limitations in deterministic programming, offering strong safety properties and simplified coherence and consistency mechanisms. Challenges in supporting nondeterministic programs are also discussed, highlighting the need for managing conflicting concurrent accesses and unknown side-effects in critical sections.


Uploaded on Sep 27, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. DeNovoND: Efficient Hardware Support for Disciplined Non-Determinism Hyojin Sung, Rakesh Komuravelli, and Sarita V. Adve Department of Computer Science University of Illinois at Urbana-Champaign Illinois-Intel Parallelism Center

  2. Motivation Shared memory is de-facto model for multicore SW and HW BUT Complex SW: data races, unstructured parallelism, memory model, Inefficient HW: complex coherence/consistency, unnecessary traffic, Recent work on disciplined shared memory SW: Easier programming model HW: If SW is more disciplined, can we build more efficient HW? DeNovo: Holistic rethinking of entire memory hierarchy Illinois-Intel Parallelism Center

  3. Disciplined Shared Memory Disciplined Shared-Memory = Global address space + Implicit, anywhere communication, synchronization Explicit, structured side-effects Illinois-Intel Parallelism Center

  4. Disciplined Shared Memory Deterministic Parallel Java (DPJ) strong safety properties Determinism-by-default, simple semantics OOPSLA 09 structured parallel control Disciplined Shared Memory explicit effects DeNovo performance, complexity and power efficient Simplify coherence and consistency PACT 11 Illinois-Intel Parallelism Center

  5. Limitation DeNovo for deterministic programs Important assumptions 1. No conflicting concurrent accesses, only barrier synchronization 2. Known side-effects Allowed DeNovo to eliminate design complexity and inefficiency Challenges for nondeterministic programs The assumptions do not hold any more 1. Can have conflicting concurrent accesses, support lock synchronization 2. Side-effects unknown in critical sections Applications with lock-based non-determinism are common Illinois-Intel Parallelism Center

  6. Contribution Deterministic Parallel Java (DPJ) strong safety properties Determinism-by-default, simple semantics Explicit & safe non-determinism POPL 11 structured parallel control Disciplined Shared Memory explicit effects DeNovoND: Non-deterministic codes with benefits of DeNovo Minimal additional HW for non-determinism Comparable performance to MESI 30% lower network traffic than MESI PLUS all advantages of DeNovo for deterministic codes Illinois-Intel Parallelism Center

  7. Outline Motivation Background DPJ/DeNovo for deterministic codes DPJ support for disciplined non-determinism DeNovoND Design DeNovoND Implementation Evaluation Conclusion and Future Work Illinois-Intel Parallelism Center

  8. DPJ for Deterministic Codes ... Structured parallel control Fork-join parallelism Explicit region and effect Regions divide heap Read or write effects on regions Data-race freedom guarantee Simple, modular type checking LD ST ST ST ST ... write effect heap Illinois-Intel Parallelism Center

  9. DPJ for Deterministic Codes ... Java-compatible type system Structured parallel control Fork-join parallelism Explicit region and effect Regions divide heap Read or write effects on regions Data-race freedom guarantee Simple, modular type checking LD ST ST ST ST Hardware simplify coherence problems! ... write effect heap Illinois-Intel Parallelism Center

  10. DeNovo for Deterministic Codes Coherence Enforcement 1. Invalidate stale copies in private cache 2. Track up-to-date copy Explicit effects Compiler knows all writeable regions in this parallel phase Cache can self-invalidate before next parallel phase Registration Directory keeps track of one up-to-date copy Writer registers itself before next parallel phase Illinois-Intel Parallelism Center

  11. DeNovo for Deterministic Codes No space overhead Keep valid data or registered core id LLC data arrays double as directory No transient states No invalidation traffic No false sharing registry Read Invalid Valid Write Write Registered Illinois-Intel Parallelism Center

  12. Example Run L1 of Core 1 L1 of Core 2 X in DeNovo-region Y in DeNovo-region VX V Y R X V Y R X I Y VX R Y VX V Y I X R Y ST ST .. Registration Registration Shared L2 Ack Ack R C1 R C2 VX V Y self-invalidate( ) Registered Valid Invalid Illinois-Intel Parallelism Center

  13. DPJ Support for Safe Non-Determinism Nondeterminism comes from conflicting concurrent accesses ... Isolate these accesses as atomic Enclosed in atomic sections Atomic regions and effects Disciplined non-determinism - Race freedom, strong isolation - Determinism-by-default semantics LD ST ... DeNovoND converts atomic statements into locks Illinois-Intel Parallelism Center

  14. Outline Motivation Background DeNovoND Design Memory Consistency Model Distributed Queue-based Lock DeNovoND Implementation Evaluation Conclusion and Future Work Illinois-Intel Parallelism Center

  15. Memory Consistency Model Deterministic accesses 1. Same task in this parallel phase 2. Or before this parallel phase ... DeNovo Coherence Mechanism ST 0xa Parallel Phase LD 0xa .. Illinois-Intel Parallelism Center

  16. Memory Consistency Model Non-deterministic accesses 1. Same task in this parallel phase 2. Or before this parallel phase 3. Or in preceding critical sections ... ST 0xa ST 0xa Parallel Phase Critical Section LD 0xa .. Illinois-Intel Parallelism Center

  17. Coherence for non-deterministic data Coherence Enforcement 1. Invalidate stale copies in private cache 2. Track up-to-date copy When to invalidate? Between the start of critical section and any read What to invalidate? Entire cache? regions with atomic effect? Track atomic writes in a signature, transfer with lock Registration Writer updates before next critical section Illinois-Intel Parallelism Center

  18. Distributed Queue-based Lock Lock primitive that works on DeNovoND No directory, no write invalidation No spinning for lock Modeled after QOSB Lock Lock requests form a distributed queue But much simpler Details in the paper Illinois-Intel Parallelism Center

  19. Outline Motivation Background DeNovoND Design DeNovoND Implementation Evaluation Conclusion and Future Work Illinois-Intel Parallelism Center

  20. Access Signatures Simple and small hardware Bloom filter per core Track accesses with atomic effects only Only 256 bits suffice Operations on Bloom filter On write: insert address On read: query filter for address for self-invalidation Illinois-Intel Parallelism Center

  21. Example Run X in DeNovo-region Y in DeNovo-region Z in atomic DeNovo-region W in atomic DeNovo-region L1 of Core 1 L1 of Core 2 lock transfer lock transfer Z W Z W R X V R R Z R Z RZ R X R X R X R X Y Y Y Y Y V X V X V X V X Y Y Y Y Y V X V I V V I V V W V W I W V W I W V V V V R R R R R V W V W V W R W R W ST Z Z Z Z Z Z Z LD ST .. LD Read miss Registration Registration Read miss Shared L2 Ack Ack R C1 R C2 V Z R C1 V W R C1 R C2 R C1 R C2 R C1 R C2 self-invalidate( ) self-invalidate( ) reset filter V W Illinois-Intel Parallelism Center

  22. Optimization to reduce self-invalidation X in DeNovo-region Y in DeNovo-region Z in atomic DeNovo-region W in atomic DeNovo-region 1. loads in Registered state 2. Touched-atomic bit Set on first atomic load Subsequent load don t self-invalidate More in the paper ST LD .. LD LD self-invalidate( ) Illinois-Intel Parallelism Center

  23. Overheads Hardware Bloom filter 256 bits per core Storage overhead One additional state, but no storage overhead (2 bits) Touched-atomic bit per word in L1 Communication overhead Bloom filter piggybacked on lock transfer message Writeback messages for locks Lock writebacks carry more info Illinois-Intel Parallelism Center

  24. Evaluation Methodology Simulator: Simics + GEMS + Garnet System Parameters 16 in-order cores Workloads SPLASH-2, PARSEC and STAMP Unchanged except region/effect and self-invalidation Protocols MESI and DeNovoND With idealized locks and realistic locks Illinois-Intel Parallelism Center

  25. MESI vs. DeNovoND: Idealized lock barnes ocean water fluidanimate streamcluster tsp kmeans ssca2 DeNovoND performs comparable to MESI for all apps For both DIL-INF and DIL-256 Illinois-Intel Parallelism Center

  26. MESI vs. DeNovoND: Realistic lock barnes ocean water fluidanimate streamcluster tsp kmeans ssca2 pthread lock vs. distributed queue-based lock DeNovoND performs comparable or better than MESI Illinois-Intel Parallelism Center

  27. Network Traffic (Realistic lock) barnes ocean water fluidanimate streamcluster tsp kmeans ssca2 DeNovoND has 33% less traffic than MESI (67% max) No invalidation traffic Reduced load misses due to lack of false sharing Illinois-Intel Parallelism Center

  28. Conclusions and Future Work DeNovoND: Efficient HW support for non-determinism Minimal additional HW for safe non-determinism Comparable performance to MESI 30% lower network traffic than MESI PLUS all advantages of DeNovo for deterministic codes Future work: broaden the application space further Pipeline parallelism, lock-free data structures, OS, legacy codes Illinois-Intel Parallelism Center

Related