Fault-Tolerant Engineered Networks Overview
Explore the design and challenges of fault-tolerant engineered networks such as FatTrees, with a focus on recovery strategies, topology innovations, and failure detection mechanisms. Learn about the co-design of topology, routing protocols, and failure detectors for optimal performance in data centers.
Presentation Transcript
F10: A Fault-Tolerant Engineered Network. Vincent Liu, Daniel Halperin, Arvind Krishnamurthy, Thomas Anderson (University of Washington)
Today's Data Centers: Today's data centers are built using multi-rooted trees of commodity switches, chosen for cost, bisection bandwidth, and resilience to failures. (Figure from Al-Fares et al., SIGCOMM '08)
FatTree Example: PortLand. Heartbeats detect failures, a centralized controller installs updated routes, and the design exploits path redundancy.
Unsolved Issues with FatTrees. Slow detection: commodity switches fail often, and it is not always clear that they have failed (gray/partial failures). Slow recovery: failure recovery is not local, and the topology does not support local reroutes. Suboptimal flow assignment: failures leave an unbalanced tree that loses its load-balancing properties.
F10: A co-design of the topology, routing protocols, and failure detector. A novel topology enables local, fast recovery; cascaded protocols provide optimal recovery; a fine-grained failure detector provides fast detection. F10 uses the same number of switches and links as a FatTree.
Outline: Motivation & Approach; Topology: AB FatTree; Cascaded Failover Protocols; Failure Detection; Evaluation; Conclusion.
Why Is FatTree Recovery Slow? On the upward path there is plenty of redundancy, so connectivity can be restored immediately at the point of failure.
Why Is FatTree Recovery Slow? On the way down there is no redundancy: the alternative paths are many hops away.
Type A Subtree: children connect to consecutive parents.
Type B Subtree: children connect to strided parents.
AB FatTree: a FatTree wired with both type A and type B subtrees.
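The contrast between the two wirings can be made concrete. A minimal sketch in Python, where the index arithmetic is my own illustration rather than the paper's exact construction (p stands for the number of parents each child connects to):

```python
# Sketch of the two subtree wiring patterns (illustrative indexing,
# not the paper's exact construction). A pod has p children and there
# are p*p parent switches above it; each child gets p uplinks.

def type_a_uplinks(child: int, p: int) -> set[int]:
    """Type A: child j connects to a consecutive block of parents."""
    return {child * p + i for i in range(p)}

def type_b_uplinks(child: int, p: int) -> set[int]:
    """Type B: child j connects to a strided set of parents."""
    return {child + i * p for i in range(p)}

p = 2
all_parents = set(range(p * p))
# Both patterns cover every parent exactly once across the pod:
assert set().union(*(type_a_uplinks(j, p) for j in range(p))) == all_parents
assert set().union(*(type_b_uplinks(j, p) for j in range(p))) == all_parents
# Key property: the parents of one A-pod child fan out to *different*
# B-pod children, which is what puts an alternate path one hop away.
assert {parent % p for parent in type_a_uplinks(0, p)} == set(range(p))
```

The point of alternating the two patterns is exactly this last assertion: parents that share a child in a type A subtree do not share a child in a type B subtree, so a failed downlink always has a nearby sibling with a different parent set.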
Alternatives in AB FatTrees: more nodes have alternative, direct paths, and every node is at most one hop away from a node with an alternative.
Cascaded Failover Protocols: a local rerouting mechanism provides immediate restoration (microseconds); a pushback notification scheme restores direct paths (milliseconds); an epoch-based centralized scheduler globally re-optimizes traffic (seconds).
Local Rerouting: route to a sibling in an opposite-type subtree, giving immediate, local rerouting around the failure.
Local Rerouting with Multiple Failures: the scheme is resilient to multiple failures (see the paper), at the cost of increased load and path dilation.
Pushback Notification: the detecting switch broadcasts a notification, which restores direct paths; traffic is not yet rebalanced at this point, though.
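The broadcast step can be sketched as a one-shot flood. This is a simplification: the sketch notifies every reachable switch once, whereas a real scheme would scope the notice to switches whose routes could traverse the failed link.

```python
from collections import deque

def pushback_flood(detector: str, dead_link: tuple, adj: dict) -> set:
    """Flood a dead-link notice hop by hop from the detecting switch.
    Each switch processes the notice once (and, in a real switch, would
    record dead_link and steer traffic onto a still-direct path instead
    of relying on the local detour). Returns the set of notified switches."""
    notified, queue = set(), deque([detector])
    while queue:
        sw = queue.popleft()
        if sw in notified:
            continue                     # already saw this notice
        notified.add(sw)
        queue.extend(adj.get(sw, []))    # re-forward to neighbors once
    return notified

adj = {"core0": ["agg0", "agg1"], "agg0": ["core0"], "agg1": ["core0"]}
assert pushback_flood("core0", ("core0", "agg2"), adj) == {"core0", "agg0", "agg1"}
```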
Centralized Scheduler (related to existing work such as Hedera and MicroTE): gather traffic matrices, place long-lived flows based on their size, and place shorter flows with weighted ECMP.
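Short-flow placement with weighted ECMP can be sketched as a weighted random next-hop choice; the weight values and function names here are my own assumptions:

```python
import random

def pick_next_hop(uplinks: list, weights: list, rng: random.Random):
    """Weighted ECMP: choose an uplink with probability proportional
    to its weight (e.g., remaining downstream capacity). After a
    failure, the scheduler lowers weights on degraded uplinks so
    short flows spread in proportion to what each path can carry."""
    return rng.choices(uplinks, weights=weights, k=1)[0]

rng = random.Random(0)
uplinks = ["u1", "u2", "u3"]
weights = [1.0, 1.0, 0.0]      # u3 degraded by a downstream failure
picks = [pick_next_hop(uplinks, weights, rng) for _ in range(1000)]
assert "u3" not in picks       # zero-weight uplink is never chosen
```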
Outline: Motivation & Approach; Topology: AB FatTree; Cascaded Failover Protocols; Failure Detection; Evaluation; Conclusion.
Why Are Today's Detectors Slow? They rely on the loss of multiple heartbeats, and the detector is separated from the failure. Detection is slow because of congestion and gray failures, and because we don't want to waste too many resources on probing.
F10 Failure Detector: look at the link itself. Send traffic to physical neighbors when idle, and monitor incoming bit transitions and packets; on a suspected failure, stop sending and reroute the very next packet. Detection can be fast because rerouting is cheap.
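The detection rule can be sketched as an idle-link timer; the 100 µs threshold below is a made-up placeholder, since the slide only says detection can be aggressive because a false positive merely triggers a cheap reroute:

```python
def link_suspect(last_activity_us: float, now_us: float,
                 idle_threshold_us: float = 100.0) -> bool:
    """Flag a link if no bit transitions or packets have been seen
    within the threshold. Because the switch sends filler traffic to
    its neighbor whenever it is idle, a healthy link is never quiet
    for long; a quiet link is therefore immediately suspect."""
    return now_us - last_activity_us > idle_threshold_us

assert not link_suspect(0.0, 50.0)    # recent activity: link healthy
assert link_suspect(0.0, 500.0)       # quiet too long: reroute now
```

An aggressive threshold like this would be far too trigger-happy for a heartbeat-based design, but here the cost of a false positive is one locally rerouted packet, not a controller-driven route recomputation.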
Outline: Motivation & Approach; Topology: AB FatTree; Cascaded Failover Protocols; Failure Detection; Evaluation; Conclusion.
Evaluation: 1. Can F10 reroute quickly? 2. Can F10 avoid the congestion loss that results from failures? 3. How much does this affect application performance?
Methodology. Testbed: Emulab with a Click implementation, using smaller packets to account for the slower speed. Packet-level simulator: 24-port 10 GbE switches, 3 levels; traffic model from Benson et al. (IMC 2010); failure model from Gill et al. (SIGCOMM 2011); validated against the testbed.
F10 Can Reroute Quickly. [Figure: TCP congestion window vs. time (ms), with and without a failure.] F10 can recover from failures in under a millisecond, far less than a TCP timeout.
F10 Can Avoid Congestion Loss. [Figure: CDFs over time intervals of normalized congestion loss for F10 and PortLand.] PortLand has 7.6x the congestion loss of F10 under realistic traffic and failure conditions.
F10 Improves App Performance. [Figure: CDF over trials of MapReduce job completion time with PortLand relative to F10, i.e., speedup.] The median speedup is 1.3x.
Conclusion. F10 is a co-design of topology, routing protocols, and failure detector: AB FatTrees allow local recovery and increase path diversity; pushback and global re-optimization restore congestion-free operation; and the result is a significant benefit to application performance under typical workloads and failure conditions. Thanks!