Practical Data Race Detection for Production Use

undefined
Tong Zhang
, Changhee Jung, Dongyoon Lee
Department of Computer Science
ProRace: Practical Data Race Detection
for Production Use
Data Races in Multithreaded Programs
Two threads access the same shared location (one is write)
Not ordered by synchronizations
 
Thread 1
 
Thread 2
 
MySQL bug #3596
Race Conditions Caused Severe Problems
 
• 50+ million people lost power
• Cost an estimated $6 billion
 
• About 30 million shares’ worth
   of trading were affected
• Cost an estimated $13 million
Limitations of Existing Dynamic Race Detectors
 
Sound and Complete Data Race Detectors
FastTrack 
[PLDI’09]
Intel’s Inspector XE
Google’s Thread Sanitizer
 
High overhead
(10-100x)
 
Not suitable for production run
Can not be left always-on
Sampling-based (Unsound) Data Race Detectors
SW sampling
LiteRace 
[PLDI’09]
Pacer 
[PLDI’10]
HW sampling
DataCollider 
[OSDI’10]
RaceZ 
[ICSE’11]
 
Still expensive
(up to 1.86x)
 
Low detection coverage
(high false negatives)
Limitations of Existing Dynamic Race Detectors
Key Problem: Sampling is Unsound
fput (p, 
)
p 
= null
Thread 1
Thread 2
A
 = A + 1
B = A * A
D = D-C
C = C*C
 
Race
Detected!
 
Race
Missed!
Our Goal: Production Runnable
 
Lightweight Sampling
Use Performance Monitor Unit (
PMU
) support in commodity
processors: e.g., Intel’s PEBS and PT
Use custom PMU driver
 
High Detection Capability
Reconstruct 
unsampled memory accesses 
via best-effort
forward/backward replay
Detect data races with extended memory traces
Outline
Motivation
Background: Hardware Assisted Program Tracing
ProRace: Design and Implementation
Experiments
Conclusion
Intel Processor’s Supports for Program Tracing
 
PEBS (Precise Event Based Sampling)
Configurable sampling frequency and events
e.g., retired loads/stores
Provides the sampled event and its architectural execution
context at the sample time
e.g., register values, time stamp counter (TSC), etc., but not memory
states
 
PT (Process Tracing)
Complete control-flow trace
e.g., branch decisions, target IP
HW compression -> Requires post-processing
Outline
Motivation
Background: Hardware Assisted Program Tracing
ProRace: Design and Implementation
Experiments
Conclusion
Memory
Trace
Overview of ProRace
Online
Offline
Data
Race
Detector
 
Focus of this talk
 
Custom Driver - See the paper
 
Low detection
coverage
Program Path Recording via Process Trace
 
Process Trace (PT)
 
BB1: T (to BB2)
 
BB2: F (to BB3)
 
BB3: T (to BB1)
 
BB1: T (to BB2)
 
BB2: T (to BB4)
 
BB4: T (to BB3)
Reconstructing Unsampled Memory Accesses
via Forward Replay
 
- Unknown Value
Reconstructing Unsampled Memory Accesses
via Backward Replay
- Unknown Value
Other Techniques Used by ProRace
To reduce online sample overhead
Custom PMU(PEBS+PT) driver
More efficient than off-the-shelf Linux Driver
To reconstruct more unsampled memory accesses
Memory emulation
Reverse execution
Reconstruct PC relative memory access
Please refer to the paper
Implementation
 
Implementation
Online: Linux Kernel 4.x
Offline: Intel Pin + FastTrack 
[PLDI’09]
 
 
 
Evaluation
Performance Evaluation
PARSEC benchmark suites with simlarge input
Real world applications (Apache, MySQL, Memcached, etc.)
Detection Capability
12 real bugs from the previous literatures
Performance Overhead: PARSEC
Trace every N samples:
 
7
%
 
o
v
e
r
h
e
a
d
Performance Overhead: Real App
 
2
.
6
%
 
o
v
e
r
h
e
a
d
Trace every N samples:
Detection Capability
100% detection
Conclusion
ProRace lowers the runtime cost of sampling and
introduced new technique to reconstruct un-sampled
memory accesses, which enhanced data race detection
coverage.
Q&A
Thank you!
1. Performance Overhead
3. Offline Analysis Cost
Prototype
Using Pin Tool
Can be optimized and parallelized
Memory Recovery Rate
Recover unsampled memory accesses
Trace
Thread 1
Thread 2
p = malloc();
if ( p ) {
    r = &p;
   **r++;
}
p = NULL
crash
Time
t0
t1
t2
t3
Trace
Sample p
Forward replay
Sample r
Sample p
Recover unsampled memory accesses
Trace
Thread 1
Thread 2
p = malloc();
if ( p ) {
    r = &p;
   **r++;
}
p = NULL
crash
Time
t0
t1
t2
t3
Trace
Sample p
Backward replay
Sample r
Sample p
Recover unsampled memory accesses
Trace
Thread 1
Thread 2
p = malloc();
if ( p ) {
    r = &p;
   **r++;
}
p = NULL
crash
Time
t0
t1
t2
t3
Trace
Sample p
Replay Across BasicBlock
Sample r
Sample p
Sample p
Taken
Branch
PT SAMPLE
Recover unsampled memory accesses
Replay with memory emulation enabled
Memory
[%b]
%c
%a
%c
Recover unsampled memory accesses
Replay with memory emulation enabled(continued)
• Recovered
through memory
emulation
Recover Unsampled Memory Accesses
 
void foo(int x, int *s)
{
    for (int i=0;i<x;i++)
    {
        (*s) += i;
    }
}
Offline Forward Replay
Offline Backward Replay
Replay Across Basic Blocks [Using PT Info]
Enable PT to Play Across BB
Intel Processor’s Supports for Program Tracing
 
PEBS(Precise Event Based Sampling)
    Architecture State
    Configurable
 
Frequency/Event
 
PT (Process Tracing)
    Complete control-flow trace
Program Path Recording via Process Trace
 
Process Trace (PT)
 
BB1: T (to BB2)
 
BB2: F (to BB3)
 
BB3: T (to BB1)
 
BB1: T (to BB2)
 
BB2: T (to BB4)
 
BB4: T (to BB3)
Reconstructing Unsampled Memory Accesses
via Forward Replay
 
Read [B-1]
 
 
 
Read [B-28]
 
 
Write [B-28]
 
Read [B-24]
 
 
Write [B-24]
 
- Unknown Value
Reconstructing Unsampled Memory Accesses
via Backward Replay
 
 
 
Read [A]
 
Read [B-24]
- Unknown Value
Read [B-1]
Read [B-28]
Write [B-28]
Read [B-24]
Write [B-24]
Slide Note

Good afternoon, my name is Tong Zhang, from VirginiaTech. My advisors are Dr. Changhee Jung and Dr. Dongyoon Lee. Today I’m going to present: ProRace, Practical Data Race Detection for Production Use.

Embed
Share

ProRace is a tool designed for detecting data races in multithreaded programs, highlighting the severe problems caused by race conditions like power outages and financial losses. It addresses the limitations of existing race detectors by focusing on soundness, complete detection, and lower overhead. The key problem of sampling unsoundness is discussed, with the goal of creating a production-ready, lightweight sampling tool using Performance Monitor Unit (PMU) support in commodity processors.

  • Data race detection
  • Multithreaded programs
  • Race conditions
  • Production use
  • Performance monitor

Uploaded on Sep 18, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. ProRace: Practical Data Race Detection for Production Use Tong Zhang, Changhee Jung, Dongyoon Lee Department of Computer Science - 1 -

  2. Data Races in Multithreaded Programs Two threads access the same shared location (one is write) Not ordered by synchronizations Thread 1 Thread 2 p = NULL if (p) { crash fput(p, ) } MySQL bug #3596 - 2 -

  3. Race Conditions Caused Severe Problems 50+ million people lost power Cost an estimated $6 billion Northeast Blackout of 2003 About 30 million shares worth of trading were affected Cost an estimated $13 million Stock Price Mismatch in 2012 - 3 -

  4. Limitations of Existing Dynamic Race Detectors Sound and Complete Data Race Detectors FastTrack [PLDI 09] Intel s Inspector XE Google s Thread Sanitizer High overhead (10-100x) Not suitable for production run Can not be left always-on - 4 -

  5. Limitations of Existing Dynamic Race Detectors Sampling-based (Unsound) Data Race Detectors SW sampling LiteRace [PLDI 09] Pacer [PLDI 10] Still expensive (up to 1.86x) HW sampling DataCollider [OSDI 10] RaceZ [ICSE 11] Low detection coverage (high false negatives) - 5 -

  6. Key Problem: Sampling is Unsound Thread 1 Thread 2 A = A + 1 C = C*C Race Detected! Missed! Race p = null B = A * A fput (p, ) D = D-C Sampled instructions - 6 -

  7. Our Goal: Production Runnable Lightweight Sampling Use Performance Monitor Unit (PMU) support in commodity processors: e.g., Intel s PEBS and PT Use custom PMU driver High Detection Capability Reconstruct unsampled memory accesses via best-effort forward/backward replay Detect data races with extended memory traces - 7 -

  8. Outline Motivation Background: Hardware Assisted Program Tracing ProRace: Design and Implementation Experiments Conclusion - 8 -

  9. Intel Processors Supports for Program Tracing PEBS (Precise Event Based Sampling) Configurable sampling frequency and events e.g., retired loads/stores Provides the sampled event and its architectural execution context at the sample time e.g., register values, time stamp counter (TSC), etc., but not memory states PT (Process Tracing) Complete control-flow trace e.g., branch decisions, target IP HW compression -> Requires post-processing - 9 -

  10. Outline Motivation Background: Hardware Assisted Program Tracing ProRace: Design and Implementation Experiments Conclusion - 10 -

  11. Overview of ProRace Online Offline Data Race Report Data Race Detector Library Sync Ops. low overhead High detection Low detection coverage coverage Mem. Ops & Architecture State PEBS Reconstruct Memory Accesses Extended Extended Memory Memory Trace Memory Trace Memory Trace Trace Extended Branch Information PT Custom Driver - See the paper Focus of this talk - 11 -

  12. Program Path Recording via Process Trace BB1 BB2 BB4 ld R1 M[R2-24] ld R3 M[R1] cmp $0, R3 je BB2 ld R4 M[R2-1] cmp R4, R3 je BB4 ld R1 M[R2-28] add $1, R1 st M[R2-28] R1 ldR1 M[R2-24] add $1, R1 st M[R2-24] R1 BB0 Process Trace (PT) BB1: T (to BB2) BB2: F (to BB3) BB3: T (to BB1) BB1: T (to BB2) BB2: T (to BB4) BB4: T (to BB3) BB1 T F BB2 BB5 T F BB4 BB3 - 12 - PEBS Sample

  13. Reconstructing Unsampled Memory Accesses via Forward Replay R1 R2 R3 R4 Memory Access BB1 BB2 BB4 ld R1 M[R2-24] ld R3 M[R1] cmp $0, R3 je BB2 ld R4 M[R2-1] cmp R4, R3 je BB4 ld R1 M[R2-28] add $1, R1 st M[R2-28] R1 ldR1 M[R2-24] add $1, R1 st M[R2-24] R1 A A A A - - - - - B B B B B B B B B C C C C C C C C C D - - - - - - - - Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] - 13 - PEBS Sample - Unknown Value

  14. Reconstructing Unsampled Memory Accesses via Backward Replay R1 R2 R3 R4 Memory Access - B - D BB1 BB2 BB4 ld R1 M[R2-24] ld R3 M[R1] cmp $0, R3 je BB2 ld R4 M[R2-1] cmp R4, R3 je BB4 ld R1 M[R2-28] add $1, R1 st M[R2-28] R1 ldR1 M[R2-24] add $1, R1 st M[R2-24] R1 Read [B-24] A B - D Read [A] A B C D A B C D A A A A - - - - - B B B B B B B B B C C C C C C C C C D - - - - - - - - Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] - 14 - PEBS Sample - Unknown Value

  15. Other Techniques Used by ProRace To reduce online sample overhead Custom PMU(PEBS+PT) driver More efficient than off-the-shelf Linux Driver To reconstruct more unsampled memory accesses Memory emulation Reverse execution Reconstruct PC relative memory access Please refer to the paper - 15 -

  16. Implementation Implementation Online: Linux Kernel 4.x Offline: Intel Pin + FastTrack [PLDI 09] Evaluation Performance Evaluation PARSEC benchmark suites with simlarge input Real world applications (Apache, MySQL, Memcached, etc.) Detection Capability 12 real bugs from the previous literatures - 16 -

  17. Performance Overhead: PARSEC 10 100 1000 10000 100000 Trace every N samples: 7% overhead 5 4 Normalized Overhead 3 2 1 0 - 17 -

  18. Performance Overhead: Real App 2.6% overhead 10 100 1000 10000 100000 Trace every N samples: 9.13 5 4 Normalized Overhead 3 2 1 0 - 18 -

  19. Detection Capability Detection Capability at the Sampling Period of 1000 RaceZ ProRace 100% detection Detection Capability (%) 50 25 0 - 19 -

  20. Conclusion Detection Capability ProRace Sampling Based Detector Overhead ProRace lowers the runtime cost of sampling and introduced new technique to reconstruct un-sampled memory accesses, which enhanced data race detection coverage. - 20 -

  21. Q&A Thank you! - 21 -

  22. 1. Performance Overhead Vanilla ProRace 7.52 49.92 7.8 5 Normalized Overhead 4 3 2 1 0 Real Applications PARSEC - 22 -

  23. 3. Offline Analysis Cost Prototype Using Pin Tool Can be optimized and parallelized PT Decoding Trace Reconstruction Data Race Detection Overhead per One Second 10000 Program Execution 100 1 apache mysql cherokee aget pfscan pbzip2 - 23 -

  24. Memory Recovery Rate Basicblock Replay Forward Replay Forward/Backward Relpay 1000 Normalized Memory Recovery Rate 100 10 1 apache mysql cherokee aget pfscan pbzip2 - 24 -

  25. Recover unsampled memory accesses Forward replay Thread 2 Trace Thread 1 Trace p = malloc(); if ( p ) { t0 p = NULL Sample p Sample r t1 r = &p; t2 PEBS SAMPLE **r++; } crash Sample p t3 Recovered Forward replay Backward replay Time - 25 -

  26. Recover unsampled memory accesses Backward replay Thread 2 Trace Thread 1 Trace p = malloc(); if ( p ) { Sample p t0 p = NULL Sample p Sample r t1 r = &p; t2 PEBS SAMPLE **r++; } crash t3 Recovered Forward replay Backward replay Time - 26 -

  27. Recover unsampled memory accesses Replay Across BasicBlock Thread 2 Trace Thread 1 Trace Sample p p = malloc(); if ( p ) { Taken Branch Sample p t0 p = NULL Sample p Sample r t1 r = &p; t2 PT SAMPLE PEBS SAMPLE **r++; } crash t3 Recovered Forward replay Backward replay Time - 27 -

  28. Recover unsampled memory accesses Replay with memory emulation enabled BasicBlock 2(iteration n) %a %b %c Memory Access Read [%b-20] mov MEM[%b-20] %a %a Read [%b-16] mov MEM[%b-16] %c %c add MEM[%c], %a Read [%c] Memory [%b] %c %a [%b-16] [%b-20] available unavailable - 28 -

  29. Recover unsampled memory accesses Replay with memory emulation enabled(continued) BasicBlock 2(iteration n) %a %b %c Memory Access mov MEM[%b-16] %c Read [%b-16] Read [%c] add MEM[%c], %a Memory %c %c [%b-16] BasicBlock 2(iteration n+1) %a %b %c Memory Access Read [%b-16] Read [%c] NOT RECOVERED Recovered through memory emulation mov MEM[%b-16] %c add MEM[%c], %a available unavailable - 29 -

  30. Recover Unsampled Memory Accesses BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge void foo(int x, int *s) { for (int i=0;i<x;i++) { (*s) += i; } } BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp - 30 -

  31. Offline Forward Replay BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp R1 R2 R3 Memory Access Read [C] A B C - - B B C C Write [C] Read [B-20] - B C - B C Write [B-20] - B C - 31 - PEBS Sample

  32. Offline Backward Replay BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp R1 R2 R3 Memory Access Read [B2-20] - B - Read [B2-16] A B - Read [C] A B C - B C Write [C] - B C Read [B-20] - B C - B C Write [B-20] - B C - 32 - PEBS Sample

  33. Replay Across Basic Blocks [Using PT Info] R1 R2 R3 Memory Access BB1 mov R1 M[R2-20] cmp M[R2-4], R1 jge Read [B2-20] - B - Read [B2-14] - B - - B - BB2 mov R1 M[R2-20] mov R3 M[R2-16] add R1 M[R3]+R1 mov M[R3] R1 mov R1 M[R2-20] add R1 1 + R1 mov M[R2-20] R1 jmp Enable PT to Play Across BB Read [B2-20] - B - Read [B2-16] A B - Read [C] A B C - B C Write [C] - B C Read [B-20] - B C - B C Write [B-20] - B C - 33 - PEBS Sample

  34. Intel Processors Supports for Program Tracing PT (Process Tracing) Complete control-flow trace PEBS(Precise Event Based Sampling) Architecture State Configurable Frequency/Event Process Trace (PT) BB1: T (to BB2) BB2: F (to BB3) BB3: T (to BB1) BB1: T (to BB2) BB2: T (to BB4) BB4: T (to BB3) BB0 BB1 T F BB2 mov R4 M[R2-1] BB2 BB5 T F Architecture State R1 R2 R3 R4 A B BB4 C D BB3 - 34 -

  35. Program Path Recording via Process Trace BB1 BB2 BB4 mov R1 M[R2-24] mov R3 M[R1] cmp $0, R3 je BB2 mov R4 M[R2-1] cmp R4, R3 je BB4 mov R1 M[R2-28] add $1, R1 mov M[R2-28] R1 movR1 M[R2-24] add $1, R1 mov M[R2-24] R1 BB0 Process Trace (PT) BB1: T (to BB2) BB2: F (to BB3) BB3: T (to BB1) BB1: T (to BB2) BB2: T (to BB4) BB4: T (to BB3) BB1 T F BB2 BB5 T F BB4 BB3 - 35 - PEBS Sample

  36. Reconstructing Unsampled Memory Accesses via Forward Replay BB1 BB2 BB4 mov R1 M[R2-24] mov R3 M[R1] cmp $0, R3 je BB2 mov R4 M[R2-1] cmp R4, R3 je BB4 mov R1 M[R2-28] add $1, R1 mov M[R2-28] R1 movR1 M[R2-24] add $1, R1 mov M[R2-24] R1 R1 R2 R3 R4 A B A B A B A B - B - B - B - B - B C C C C C C C C C D - - - - - - - - Memory Access Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] - 36 - PEBS Sample - Unknown Value

  37. Reconstructing Unsampled Memory Accesses via Backward Replay BB1 BB2 BB4 mov R1 M[R2-24] mov R3 M[R1] cmp $0, R3 je BB2 mov R4 M[R2-1] cmp R4, R3 je BB4 mov R1 M[R2-28] add $1, R1 mov M[R2-28] R1 movR1 M[R2-24] add $1, R1 mov M[R2-24] R1 R1 R2 R3 R4 A B A B A B A B - B C C C - - D D D D D Memory Access Read [B-1] Read [B-28] Write [B-28] Read [B-24] Write [B-24] Read [A] Read [B-24] - 37 - PEBS Sample - Unknown Value

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#