Building Hyperscale IoT Services with TLA+
Explore the potential of TLA+ in building hyper-scale IoT services, gaining insights, taking actions, and addressing exponential growth challenges. Learn about IoT examples, model checking in the product lifecycle, design costs, and project scope coverage. Discover how TLA+ can enhance project efficiency, reduce errors, and ensure robustness throughout the design process.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
BUILDING HYPERSCALE IOT SERVICES WITH TLA+ 1
Insights Actions Things IoT Hub Insights Insights Actions Actions
IOT EXAMPLE : INTRUDER DETECTION Data Source: Camera Enrichment : Add Location AI: Identify Objects Rule: Detect Intruder Action: Warn Intruder Send Command Back 3
EXPONENTIAL GROWTH IoT is Growing Rare is Relative Billions to devices connect to cloud and IoT is still growing exponentially. Gradually, we start hitting rare race conditions and failure sequences that were unlikely. Cellular IoT connections are expected to reach 3.5B in 2023. [Forbes] Can t test for every permutation of failure and race conditions. TLA+ to detect and eliminated them during design. 4
MODEL CHECKING A DESIGN IN THE PRODUCT LIFECYCLE Collect and Prioritize Requirements Define Abstract Solution Implement, Test, Deploy Cycles TLA+ Model Check Work Estimate 5
MODEL CHECKING A DESIGN COST, SKILL AND PROCESS Project Overhead Skill Set Less than 5% of overall project time One engineer proficient in TLA+ in a team of five. Others including managers review it. Prevent expensive postproduction issues We model what we change, not the whole system 6
MODEL CHECKING A DESIGN SCOPE OF MODEL TLA+ covers Traditional Tests Cover The algorithm and cross component interactions. Implementation of each component. Async interactions, states and state transitions. Mostly synchronous operations modeled as actions. 7
EXAMPLE PROJECTS Cross Region Failover Message Routing Device Cache Azure IoT hub allows cross region failovers. The design is model checked to ensure RPO, RTO guarantees. IoT telemetry message routing system enriches and transforms data.. New routing features are model checked. Azure in-memory device store is backed by Azure storage as the durable store. Every aspect of the design is model checked. 8
EXAMPLE PROJECT * SOME DETAILS ARE EXCLUDED TO MAINTAIN CONFIDENTIALITY Device State Notifications 9
DEVICE STATE NOTIFICATIONS Core problem Scale Devices (like security cameras) can connect and disconnect from the Azure IoT cloud. Millions of devices can connect or disconnect at any time. Allow customers (on a cellphone app or a backend service) to receive notifications when device status changes. Thousands of servers can failover at the same time. Notification system as well and the backend database wouldn t operate at this scale. Duplicates are allowed, ordering is not guaranteed. 10
RUDIMENTARY APPROACH Device Connect 1 Connect 1 Disconnect 2 Disconnect 2 Primary Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnected 2 11
RUDIMENTARY APPROACH RESEND Device Connect 1 Disconnect 2 Primary Disconnect 2 Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnected 2 12
RUDIMENTARY APPROACH RESEND Device Connect 1 Disconnect 2 Disconnect 2 Disconnect 2 Primary Disconnect 2 Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnected 2 13
RUDIMENTARY APPROACH FAILURE Device Disconnect 1 Disconnect 1 Connect 2 Connect 2 Primary Standby (New) Primary Failed Client Topic Connected 0 Database Connected 0 Disconnected 1 14
DEVICE STATE NOTIFICATIONS Components States Device Connected/Disconnected Service/In-memory Notified/Pending Database/Persistence Last notification Client Notifications received 15
Components States Device Connected/Disconnected Service/In-memory Notified/Pending Database/Persistence Last notification Client/Notified-State Notifications received TLA+ 16
INVARIANT 17
FIX Generate Sequence Numbers Checkpoint Failover Service assigns a sequence number and stores the event in the DB A background process sends the notification and tracks the sequence number processed On failover the new service resumes from where the last one left off We use algorithm for in-memory sequence numbers generation. 20
SOPHISTICATE SOLUTION SIMPLE Device Connect 1 Connect 1 Disconnect 2 Disconnect 2 Primary Standby (New) Primary Failed Client Topic Database 1 2 Disconnected 0 Connected 1 Disconnected 2 21
SOPHISTICATE SOLUTION - FAILOVER Device Connect 1 Disconnect 2 Disconnect 2 Primary Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnect 2 1 2 22
Components States Device Connected/Disconnected Service/In-memory Notified/Pending Database/Persistence Last notification Client/Notified-State Notifications received TLA+ 23
24 INVARIANT
TAKEAWAYS Correctness and Optimizations Cost and Savings We can process millions of concurrent device connects and disconnects correctly. We invested less than a week on the TLA+ model. No design changes were needed at the development and testing stages. Database writes and Sending notifications are not in the critical path. Optimizations reduced operational cost by an order of magnitude. in-memory sequence numbers instead of hitting database. 26
CONCLUSION Operating at Scale Agility and Clarity Reduces post-production issues in count and complexity. It s easier to read, review and iterate over 50 lines of TLA+ than 10 pages of design doc. Eliminates design reworks that previously took 20-40% of project time. Makes cross-component integrations simpler as everyone codes against same specification. Allows aggressive optimizations that are necessary at IOT scale. Code reviews use the TLA+ specifications as reference and primarily focus on code quality. 27
QUESTIONS ? 28