Practical Data Race Detection for Production Use

Slide Note
Embed
Share

ProRace is a tool designed for detecting data races in multithreaded programs, highlighting the severe problems caused by race conditions like power outages and financial losses. It addresses the limitations of existing race detectors by focusing on soundness, complete detection, and lower overhead. The key problem of sampling unsoundness is discussed, with the goal of creating a production-ready, lightweight sampling tool using Performance Monitor Unit (PMU) support in commodity processors.


Uploaded on Sep 18, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. ProRace: Practical Data Race Detection for Production Use Tong Zhang, Changhee Jung, Dongyoon Lee Department of Computer Science - 1 -

  2. Data Races in Multithreaded Programs Two threads access the same shared location (one is write) Not ordered by synchronizations Thread 1 Thread 2 p = NULL if (p) { crash fput(p, ) } MySQL bug #3596 - 2 -

  3. Race Conditions Caused Severe Problems 50+ million people lost power Cost an estimated $6 billion Northeast Blackout of 2003 About 30 million shares worth of trading were affected Cost an estimated $13 million Stock Price Mismatch in 2012 - 3 -

  4. Limitations of Existing Dynamic Race Detectors Sound and Complete Data Race Detectors FastTrack [PLDI 09] Intel s Inspector XE Google s Thread Sanitizer High overhead (10-100x) Not suitable for production run Can not be left always-on - 4 -

  5. Limitations of Existing Dynamic Race Detectors Sampling-based (Unsound) Data Race Detectors SW sampling LiteRace [PLDI 09] Pacer [PLDI 10] Still expensive (up to 1.86x) HW sampling DataCollider [OSDI 10] RaceZ [ICSE 11] Low detection coverage (high false negatives) - 5 -

  6. Key Problem: Sampling is Unsound Thread 1 Thread 2 A = A + 1 C = C*C Race Detected! Missed! Race p = null B = A * A fput (p, ) D = D-C Sampled instructions - 6 -

  7. Our Goal: Production Runnable Lightweight Sampling Use Performance Monitor Unit (PMU) support in commodity processors: e.g., Intel s PEBS and PT Use custom PMU driver High Detection Capability Reconstruct unsampled memory accesses via best-effort forward/backward replay Detect data races with extended memory traces - 7 -

  8. Outline Motivation Background: Hardware Assisted Program Tracing ProRace: Design and Implementation Experiments Conclusion - 8 -

  9. Intel Processors Supports for Program Tracing PEBS (Precise Event Based Sampling) Configurable sampling frequency and events e.g., retired loads/stores Provides the sampled event and its architectural execution context at the sample time e.g., register values, time stamp counter (TSC), etc., but not memory states PT (Process Tracing) Complete control-flow trace e.g., branch decisions, target IP HW compression -> Requires post-processing - 9 -

  10. Outline Motivation Background: Hardware Assisted Program Tracing ProRace: Design and Implementation Experiments Conclusion - 10 -

  11. Overview of ProRace Online Offline Data Race Report Data Race Detector Library Sync Ops. low overhead High detection Low detection coverage coverage Mem. Ops & Architecture State PEBS Reconstruct Memory Accesses Extended Extended Memory Memory Trace Memory Trace Memory Trace Trace Extended Branch Information PT Custom Driver - See the paper Focus of this talk - 11 -

  12. Program Path Recording via Process Trace BB1 BB2 BB4 ld R1 M[R2-24] ld R3 M[R1] cmp $0, R3 je BB2 ld R4 M[R2-1] cmp R4, R3 je BB4 ld R1 M[R2-28] add $1, R1 st M[R2-28] R1 ldR1 M[R2-24] add $1, R1 st M[R2-24] R1 BB0 Process Trace (PT) BB1: T (to BB2) BB2: F (to BB3) BB3: T (to BB1) BB1: T (to BB2) BB2: T (to BB4) BB4: T (to BB3) BB1 T F BB2 BB5 T F BB4 BB3 - 12 - PEBS Sample

  13. Reconstructing Unsampled Memory Accesses via Forward Replay R1 R2 R3 R4 Memory Access BB1 BB2 BB4 ld R1 M[R2-24] ld R3 M[R1] cmp $0, R3 je BB2 ld R4 M[R2-1] cmp R4, R3 je BB4 ld R1 M[R2-28] add $1, R1 st M[R2-28] R1 ldR1 M[R2-24] add $1, R1 st M[R2-24] R1 A A A A - - - - - B B B B B B B B B C C C C C C C C C D - - - - - - - - Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] - 13 - PEBS Sample - Unknown Value

  14. Reconstructing Unsampled Memory Accesses via Backward Replay R1 R2 R3 R4 Memory Access - B - D BB1 BB2 BB4 ld R1 M[R2-24] ld R3 M[R1] cmp $0, R3 je BB2 ld R4 M[R2-1] cmp R4, R3 je BB4 ld R1 M[R2-28] add $1, R1 st M[R2-28] R1 ldR1 M[R2-24] add $1, R1 st M[R2-24] R1 Read [B-24] A B - D Read [A] A B C D A B C D A A A A - - - - - B B B B B B B B B C C C C C C C C C D - - - - - - - - Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] - 14 - PEBS Sample - Unknown Value

  15. Other Techniques Used by ProRace To reduce online sample overhead Custom PMU(PEBS+PT) driver More efficient than off-the-shelf Linux Driver To reconstruct more unsampled memory accesses Memory emulation Reverse execution Reconstruct PC relative memory access Please refer to the paper - 15 -

  16. Implementation Implementation Online: Linux Kernel 4.x Offline: Intel Pin + FastTrack [PLDI 09] Evaluation Performance Evaluation PARSEC benchmark suites with simlarge input Real world applications (Apache, MySQL, Memcached, etc.) Detection Capability 12 real bugs from the previous literatures - 16 -

  17. Performance Overhead: PARSEC 10 100 1000 10000 100000 Trace every N samples: 7% overhead 5 4 Normalized Overhead 3 2 1 0 - 17 -

  18. Performance Overhead: Real App 2.6% overhead 10 100 1000 10000 100000 Trace every N samples: 9.13 5 4 Normalized Overhead 3 2 1 0 - 18 -

  19. Detection Capability Detection Capability at the Sampling Period of 1000 RaceZ ProRace 100% detection Detection Capability (%) 50 25 0 - 19 -

  20. Conclusion Detection Capability ProRace Sampling Based Detector Overhead ProRace lowers the runtime cost of sampling and introduced new technique to reconstruct un-sampled memory accesses, which enhanced data race detection coverage. - 20 -

  21. Q&A Thank you! - 21 -

  22. 1. Performance Overhead Vanilla ProRace 7.52 49.92 7.8 5 Normalized Overhead 4 3 2 1 0 Real Applications PARSEC - 22 -

  23. 3. Offline Analysis Cost Prototype Using Pin Tool Can be optimized and parallelized PT Decoding Trace Reconstruction Data Race Detection Overhead per One Second 10000 Program Execution 100 1 apache mysql cherokee aget pfscan pbzip2 - 23 -

  24. Memory Recovery Rate Basicblock Replay Forward Replay Forward/Backward Relpay 1000 Normalized Memory Recovery Rate 100 10 1 apache mysql cherokee aget pfscan pbzip2 - 24 -

  25. Recover unsampled memory accesses Forward replay Thread 2 Trace Thread 1 Trace p = malloc(); if ( p ) { t0 p = NULL Sample p Sample r t1 r = &p; t2 PEBS SAMPLE **r++; } crash Sample p t3 Recovered Forward replay Backward replay Time - 25 -

  26. Recover unsampled memory accesses Backward replay Thread 2 Trace Thread 1 Trace p = malloc(); if ( p ) { Sample p t0 p = NULL Sample p Sample r t1 r = &p; t2 PEBS SAMPLE **r++; } crash t3 Recovered Forward replay Backward replay Time - 26 -

  27. Recover unsampled memory accesses Replay Across BasicBlock Thread 2 Trace Thread 1 Trace Sample p p = malloc(); if ( p ) { Taken Branch Sample p t0 p = NULL Sample p Sample r t1 r = &p; t2 PT SAMPLE PEBS SAMPLE **r++; } crash t3 Recovered Forward replay Backward replay Time - 27 -

  28. Recover unsampled memory accesses Replay with memory emulation enabled BasicBlock 2(iteration n) %a %b %c Memory Access Read [%b-20] mov MEM[%b-20] %a %a Read [%b-16] mov MEM[%b-16] %c %c add MEM[%c], %a Read [%c] Memory [%b] %c %a [%b-16] [%b-20] available unavailable - 28 -

  29. Recover unsampled memory accesses Replay with memory emulation enabled(continued) BasicBlock 2(iteration n) %a %b %c Memory Access mov MEM[%b-16] %c Read [%b-16] Read [%c] add MEM[%c], %a Memory %c %c [%b-16] BasicBlock 2(iteration n+1) %a %b %c Memory Access Read [%b-16] Read [%c] NOT RECOVERED Recovered through memory emulation mov MEM[%b-16] %c add MEM[%c], %a available unavailable - 29 -

  30. Recover Unsampled Memory Accesses BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge void foo(int x, int *s) { for (int i=0;i<x;i++) { (*s) += i; } } BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp - 30 -

  31. Offline Forward Replay BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp R1 R2 R3 Memory Access Read [C] A B C - - B B C C Write [C] Read [B-20] - B C - B C Write [B-20] - B C - 31 - PEBS Sample

  32. Offline Backward Replay BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp R1 R2 R3 Memory Access Read [B2-20] - B - Read [B2-16] A B - Read [C] A B C - B C Write [C] - B C Read [B-20] - B C - B C Write [B-20] - B C - 32 - PEBS Sample

  33. Replay Across Basic Blocks [Using PT Info] R1 R2 R3 Memory Access BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge Read [B2-20] - B - Read [B2-14] - B - - B - BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp Enable PT to Play Across BB Read [B2-20] - B - Read [B2-16] A B - Read [C] A B C - B C Write [C] - B C Read [B-20] - B C - B C Write [B-20] - B C - 33 - PEBS Sample

  34. Intel Processors Supports for Program Tracing PT (Process Tracing) Complete control-flow trace PEBS(Precise Event Based Sampling) Architecture State Configurable Frequency/Event Process Trace (PT) BB1: T (to BB2) BB2: F (to BB3) BB3: T (to BB1) BB1: T (to BB2) BB2: T (to BB4) BB4: T (to BB3) BB0 BB1 T F BB2 mov R4 M[R2-1] BB2 BB5 T F Architecture State R1 R2 R3 R4 A B BB4 C D BB3 - 34 -

  35. Program Path Recording via Process Trace BB1 BB2 BB4 mov R1 M[R2-24] mov R3 M[R1] cmp $0, R3 je BB2 mov R4 M[R2-1] cmp R4, R3 je BB4 mov R1 M[R2-28] add $1, R1 mov M[R2-28] R1 movR1 M[R2-24] add $1, R1 mov M[R2-24] R1 BB0 Process Trace (PT) BB1: T (to BB2) BB2: F (to BB3) BB3: T (to BB1) BB1: T (to BB2) BB2: T (to BB4) BB4: T (to BB3) BB1 T F BB2 BB5 T F BB4 BB3 - 35 - PEBS Sample

  36. Reconstructing Unsampled Memory Accesses via Forward Replay BB1 BB2 BB4 mov R1 M[R2-24] mov R3 M[R1] cmp $0, R3 je BB2 mov R4 M[R2-1] cmp R4, R3 je BB4 mov R1 M[R2-28] add $1, R1 mov M[R2-28] R1 movR1 M[R2-24] add $1, R1 mov M[R2-24] R1 R1 R2 R3 R4 A B A B A B A B - B - B - B - B - B C C C C C C C C C D - - - - - - - - Memory Access Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] - 36 - PEBS Sample - Unknown Value

  37. Reconstructing Unsampled Memory Accesses via Backward Replay BB1 BB2 BB4 mov R1 M[R2-24] mov R3 M[R1] cmp $0, R3 je BB2 mov R4 M[R2-1] cmp R4, R3 je BB4 mov R1 M[R2-28] add $1, R1 mov M[R2-28] R1 movR1 M[R2-24] add $1, R1 mov M[R2-24] R1 R1 R2 R3 R4 A B A B A B A B - B C C C - - D D D D D Memory Access Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] Read [A] Read [B-24] - 37 - PEBS Sample - Unknown Value

Related


More Related Content