Consistency at Facebook: A Study on Existential Consistency

Existential Consistency:
Measuring and Understanding
Consistency at Facebook
Haonan Lu*
,
 
Kaushik Veeraraghavan
,
 Philippe Ajoux
, Jim Hunt
,
Yee Jiun Song
, Wendy Tobagus
,
Sanjeev Kumar
, Wyatt Lloyd*
*
University of Southern California, 
Facebook
1
2
3
4
Performance
Fundamental Tension
5
Consistency
Performance
 
Eliminates anomalies
(Oculus example)
 
Lower latency
First study of consistency in a large-scale,
production system – Facebook TAO
 
Difficult to quantify
 
Simple to quantify
 
Makes systems
easier to program
 
Higher throughput
Anomaly: Unexpected Behavior
 
6
 
Post Example
Anomaly: Unexpected Behavior
 
7
Oculus Example
Does Facebook have
consistency anomalies?
How many?
What type?
8
 
TAO: Eventually Consistent Cache
9
M
Vulnerability window
: time during asynchronous
replication when anomalies can happen
Quantifying Anomalies
How often do anomalies occur?
Collect trace of requests to TAO
What consistency would prevent them?
Run anomaly checkers on the trace
10
Trace Collection
 
Collect trace on web servers
 
Challenges in tracing production system
Volume of requests
Time skew between web servers
Missing requests
11
Challenge: Volume of Requests
 
12
Billions of requests per second [ATC ’13]
Too many to log
Sample on objects
Object: vertex in social graph
Log all requests to objects in sample
Sufficient for local consistency models
Local Property Enables Sampling
 
13
 
“… the system as a whole satisfies P whenever
each individual object satisfies P.”
[1]
 
 
 
Local
Linearizability
Per-Object Sequential
Read-After-Write
 
 
 
Local consistency models can be
checked on a per object basis
[1] M. P. Herlihy and J. M. Wing “Linearizability: A Correctness Condition for Concurrent Objects.” ACM TOPLAS, 1990 
Challenge: Time Skew
Time skew across web servers
99.9 percentile for 1 week: 35ms
Add time skew to request’s duration
More overlapped requests
Eliminates false positives
14
Start time
Finish time
Read or write
Value: match read with write
Logging Details
 
15
 
Logged information:
Start time
Finish time
Read or write
Value: match read with write
 
Sampling rate: 1 out of 1 million objects
~
 
100% of requests to sampled objects
 
Post
 
(new)
Trace Statistics
 
16
12 days (8/20 – 8/31)
17 
million
 objects
3 
billion
 requests
Check Trace for Anomalies
 
17
 
Linearizability checker
Paxos provides
 
Per-Object Sequential checker
PNUTS provides
 
Read-After-Write checker
TAO provides within a cluster
 
Linearizability
 
18
 
Strongest non-transactional consistency
Real-time constraint
Post example
 
 
 
 
Total order constraint
Oculus example!
 
Should return
“new”
Linearizability Checker
 
19
 
Graph captures state transitions
Vertex: write operations
Edge: real-time order
 
Merge read with its write
Captures state transitions seen by users
 
Anomaly if merge causes a cycle
Cycle indicates user’s view ≠ system view
 
 
 
Linearizability Checker
 
20
Captures real-time constraint
Read should return new post instead
Should return
new post
 
21
More Complex Cases
http://tinyurl.com/sosp15-demo
w(0)
r(1)
w(1)
w(2)
w(3)
r(2)
r(3)
r(3)
r(2)
r(1)
Result Overview
 
Linearizability
 
Per-Object Sequential
 
Read-After-Write
 
Bounds on non-local consistency models
22
Anomalies found for all consistency models
– adopting them would have benefits
Linearizability Results
 
23
 
5 anomalies per million reads
Prevented by Paxos-based implementation
 
Upper bound on TAO anomalies
Strongest consistency we checked
 
 
TAO is highly consistent
Linearizability Results
Real-Time Constraint Violations
 
24
4 per million reads
 
Replica A:
 
Master M:
 
Replica B:
 
25
1 per million reads
 
Replica A:
Master M:
Replica B:
Linearizability Results
Total Order Constraint Violations
Per-Object Sequential Results
 
26
1 anomaly per million reads
Total order constraint
User session constraint (1 per 10 million)
Users should see their writes
Old
Infer Bounds on Causal
27
Linearizability
5 per million reads
Causal
Per-Object Sequential
 1 per million reads
 
≤ 5 per million reads
 
≥ 1 per million reads
 
Subset of causal anomalies
 
Superset of causal anomalies
Lower Bounds on Transactions
28
Linearizability
5 per million reads
Causal
Per-Object Sequential
1 per million reads
Strict Serializability
Causal with Transactions
Future research should
provide transactions
Real-Time Consistency Monitor
Checkers cannot run in real-time
Φ-consistency
Measure convergence of replicas
A real-time health monitor
Alarms when a replica falls behind
29
Conclusion
30
Benefits of consistency are hard to quantify
First study of a large-scale production system
Measure Facebook’s TAO system
Collect trace and run anomaly checkers
Real-world challenges
Results
TAO is highly consistent
Benefits of adopting stronger consistency exist
Research should provide transactions
Slide Note
Embed
Share

This study explores the measurement and comprehension of consistency at Facebook, focusing on existential consistency. Key topics covered include consistency performance, fundamental tension between consistency and performance, anomalies in Facebook systems, and strategies for quantifying and preventing anomalies. The study delves into TAO, Facebook's large-scale production system, to investigate consistency issues and the impact on system behavior.

  • Facebook
  • Consistency
  • Anomalies
  • TAO
  • Performance

Uploaded on Sep 11, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Existential Consistency: Measuring and Understanding Consistency at Facebook Haonan Lu* , Kaushik Veeraraghavan , Philippe Ajoux , Jim Hunt , Yee Jiun Song , Wendy Tobagus , Sanjeev Kumar , Wyatt Lloyd* *University of Southern California, Facebook 1

  2. 2

  3. 3

  4. Consistency Performance 4

  5. Fundamental Tension Consistency Performance Eliminates anomalies (Oculus example) Makes systems easier to program Lower latency Higher throughput Difficult to quantify Simple to quantify First study of consistency in a large-scale, production system Facebook TAO 5

  6. Anomaly: Unexpected Behavior Post Example Hey, I mentioned you in a post New post @Wyatt, you should check out this game! Read friend s timeline Old posts 6

  7. Anomaly: Unexpected Behavior Oculus Example 1. Mine! yeah~ lucky! 1. I wouldn t mind 1. I wouldn t mind 2. Mine! yeah~ lucky! 7

  8. Does Facebook have consistency anomalies? How many? What type? 8

  9. TAO: Eventually Consistent Cache Vulnerability window: time during asynchronous replication when anomalies can happen new post read B done value A C old post M 9

  10. Quantifying Anomalies How often do anomalies occur? Collect trace of requests to TAO What consistency would prevent them? Run anomaly checkers on the trace 10

  11. Trace Collection Collect trace on web servers Challenges in tracing production system Volume of requests Time skew between web servers Missing requests 11

  12. Challenge: Volume of Requests Billions of requests per second [ATC 13] Too many to log Sample on objects Object: vertex in social graph Log all requests to objects in sample Sufficient for local consistency models 12

  13. Local Property Enables Sampling the system as a whole satisfies P whenever each individual object satisfies P. [1] Local consistency models can be checked on a per object basis Local Linearizability Per-Object Sequential Read-After-Write [1] M. P. Herlihy and J. M. Wing Linearizability: A Correctness Condition for Concurrent Objects. ACM TOPLAS, 1990 13

  14. Challenge: Time Skew Time skew across web servers 99.9 percentile for 1 week: 35ms Add time skew to request s duration More overlapped requests Eliminates false positives 14

  15. Logging Details Logged information: Start time Finish time Read or write Start time Finish time Read or write Value: match read with write Value: match read with write Determine real time ordering of requests Post (new) Sampling rate: 1 out of 1 million objects ~100% of requests to sampled objects 15

  16. Trace Statistics 12 days (8/20 8/31) 17 million objects 3 billion requests 16

  17. Check Trace for Anomalies Linearizability checker Paxos provides Per-Object Sequential checker PNUTS provides Read-After-Write checker TAO provides within a cluster 17

  18. Linearizability Strongest non-transactional consistency Real-time constraint Post example Should return new Post (old) Post (new) Read (old) Haonan Haonan Wyatt Total order constraint Oculus example! 18

  19. Linearizability Checker Graph captures state transitions Vertex: write operations Edge: real-time order Merge read with its write Captures state transitions seen by users Anomaly if merge causes a cycle Cycle indicates user s view system view 19

  20. Linearizability Checker Captures real-time constraint Read should return new post instead Post (old) Post (new) Read (old) Haonan Haonan Wyatt Should return new post Post (new) Post (old) Read (old) Anomaly 20

  21. More Complex Cases http://tinyurl.com/sosp15-demo w(0) r(1) w(1) w(2) w(3) 2 r(2) Anomalies r(3) r(3) r(2) r(1) 21

  22. Result Overview Linearizability Per-Object Sequential Read-After-Write Bounds on non-local consistency models Anomalies found for all consistency models adopting them would have benefits 22

  23. Linearizability Results 5 anomalies per million reads Prevented by Paxos-based implementation Upper bound on TAO anomalies Strongest consistency we checked TAO is highly consistent 23

  24. Linearizability Results Real-Time Constraint Violations 4 per million reads Read Post (new) Post (new) starts Post (new) finishes B A M Replica A: Master M: Replica B: Read (old) 24

  25. Linearizability Results Total Order Constraint Violations 1 per million reads Comment(W) Comment(H) B A Read (H) H finishes H starts M Replica A: Master M: W H Replica B: Read (W) W finishes W starts 25

  26. Per-Object Sequential Results 1 anomaly per million reads Total order constraint User session constraint (1 per 10 million) Users should see their writes Read Post(new) B A Old M 26

  27. Infer Bounds on Causal Linearizability 5 per million reads Superset of causal anomalies 5 per million reads Causal 1 per million reads Per-Object Sequential 1 per million reads Subset of causal anomalies 27

  28. Lower Bounds on Transactions Strict Serializability > 5 per million reads Future research should provide transactions Linearizability 5 per million reads Causal with Transactions > 1 per million reads Causal Per-Object Sequential 1 per million reads 28

  29. Real-Time Consistency Monitor Checkers cannot run in real-time -consistency Measure convergence of replicas A real-time health monitor Alarms when a replica falls behind 29

  30. Conclusion Benefits of consistency are hard to quantify First study of a large-scale production system Measure Facebook s TAO system Collect trace and run anomaly checkers Real-world challenges Results TAO is highly consistent Benefits of adopting stronger consistency exist Research should provide transactions 30

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#