Building Hyperscale IoT Services with TLA+

 
BUILDING HYPERSCALE
 IOT SERVICES
WITH TLA+
 
1
Things
Insights
Actions
IoT Hub
 
IOT EXAMPLE : INTRUDER DETECTION
 
 
3
 
EXPONENTIAL GROWTH
 
 
4
 
MODEL CHECKING A DESIGN
 
IN THE PRODUCT LIFECYCLE
 
5
 
MODEL CHECKING A DESIGN
 
COST, SKILL AND PROCESS
 
 
6
 
MODEL CHECKING A DESIGN
 
SCOPE OF MODEL
 
7
EXAMPLE PROJECTS
 
 
8
 
EXAMPLE PROJECT
 
* SOME DETAILS ARE EXCLUDED TO MAINTAIN CONFIDENTIALITY
 
Device State Notifications
 
9
 
DEVICE STATE NOTIFICATIONS
 
 
10
RUDIMENTARY APPROACH
 
                          Database
Primary
Standby
Topic
Client
Disconnected 0
Disconnected 0
Connect 1
Connect 1
Connect 1
Connect 1
Connected 1
Connected 1
Disconnect 2
Disconnect 2
Disconnect 2
Disconnect 2
Disconnected 2
Disconnected 2
Device
Failed
Primary
(New)
11
RUDIMENTARY APPROACH – RESEND
 
                          Database
Primary
Standby
Topic
Client
Disconnected 0
Disconnected 0
Connect 1
Connect 1
Connected 1
Connected 1
Device
Failed
Primary
(New)
Disconnect 2
Disconnect 2
Disconnect 2
Disconnect 2
Disconnected 2
Disconnected 2
12
RUDIMENTARY APPROACH – RESEND
 
                          Database
Primary
Standby
Topic
Client
Disconnected 0
Disconnected 0
Connect 1
Connect 1
Connected 1
Connected 1
Device
Failed
Primary
(New)
Disconnect 2
Disconnect 2
Disconnect 2
Disconnect 2
Disconnected 2
Disconnected 2
13
Disconnect 2
Disconnect 2
Disconnect 2
Disconnect 2
RUDIMENTARY APPROACH – FAILURE
 
                          Database
Primary
Standby
Topic
Client
Connected 0
Connected 0
Connect 2
Connect 2
Connect 2
Connect 2
Disconnected 1
Disconnected 1
Disconnect 1
Disconnect 1
Disconnect 1
Disconnect 1
Device
Failed
Primary
(New)
Connected 0
Connected 0
14
 
DEVICE STATE NOTIFICATIONS
 
 
15
 
16
 
TLA+
 
INVARIANT
 
 
17
 
18
 
 
INVARIANT FAILURE
 
 
19
 
FIX
 
 
20
SOPHISTICATE SOLUTION – SIMPLE
 
                          Database
Primary
Standby
Topic
Client
Disconnected 0
Disconnected 0
Connect 1
Connect 1
Connect 1
Connect 1
Connected 1
Connected 1
Disconnect 2
Disconnect 2
Disconnect 2
Disconnect 2
Disconnected 2
Disconnected 2
Device
Failed
Primary
(New)
1
1
2
2
21
SOPHISTICATE SOLUTION - FAILOVER
 
                          Database
Primary
Standby
Topic
Client
Disconnected 0
Disconnected 0
Connect 1
Connect 1
Connected 1
Connected 1
Device
Failed
Primary
(New)
Disconnect 2
Disconnect 2
1
1
Disconnect 2
Disconnect 2
Disconnect 2
Disconnect 2
2
2
22
 
TLA+
 
23
 
INVARIANT
 
24
 
25
 
 
TAKEAWAYS
 
 
26
 
CONCLUSION
 
 
27
 
QUESTIONS ?
 
28
AZURE IOT
CONNECTING DEVICES TO
CLOUD
29
Devices
(Things)
Enrichment and Transformation
and Storage
Actions
Slide Note

Hi, I am Vaibhav from Azure IoT

Today I want to talk about how we use TLA+ to build cloud solutions that operate at IoT scale.

Embed
Share

Explore the potential of TLA+ in building hyper-scale IoT services, gaining insights, taking actions, and addressing exponential growth challenges. Learn about IoT examples, model checking in the product lifecycle, design costs, and project scope coverage. Discover how TLA+ can enhance project efficiency, reduce errors, and ensure robustness throughout the design process.

  • IoT Services
  • TLA+
  • Exponential Growth
  • Model Checking
  • Project Scope

Uploaded on Sep 13, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. BUILDING HYPERSCALE IOT SERVICES WITH TLA+ 1

  2. Insights Actions Things IoT Hub Insights Insights Actions Actions

  3. IOT EXAMPLE : INTRUDER DETECTION Data Source: Camera Enrichment : Add Location AI: Identify Objects Rule: Detect Intruder Action: Warn Intruder Send Command Back 3

  4. EXPONENTIAL GROWTH IoT is Growing Rare is Relative Billions to devices connect to cloud and IoT is still growing exponentially. Gradually, we start hitting rare race conditions and failure sequences that were unlikely. Cellular IoT connections are expected to reach 3.5B in 2023. [Forbes] Can t test for every permutation of failure and race conditions. TLA+ to detect and eliminated them during design. 4

  5. MODEL CHECKING A DESIGN IN THE PRODUCT LIFECYCLE Collect and Prioritize Requirements Define Abstract Solution Implement, Test, Deploy Cycles TLA+ Model Check Work Estimate 5

  6. MODEL CHECKING A DESIGN COST, SKILL AND PROCESS Project Overhead Skill Set Less than 5% of overall project time One engineer proficient in TLA+ in a team of five. Others including managers review it. Prevent expensive postproduction issues We model what we change, not the whole system 6

  7. MODEL CHECKING A DESIGN SCOPE OF MODEL TLA+ covers Traditional Tests Cover The algorithm and cross component interactions. Implementation of each component. Async interactions, states and state transitions. Mostly synchronous operations modeled as actions. 7

  8. EXAMPLE PROJECTS Cross Region Failover Message Routing Device Cache Azure IoT hub allows cross region failovers. The design is model checked to ensure RPO, RTO guarantees. IoT telemetry message routing system enriches and transforms data.. New routing features are model checked. Azure in-memory device store is backed by Azure storage as the durable store. Every aspect of the design is model checked. 8

  9. EXAMPLE PROJECT * SOME DETAILS ARE EXCLUDED TO MAINTAIN CONFIDENTIALITY Device State Notifications 9

  10. DEVICE STATE NOTIFICATIONS Core problem Scale Devices (like security cameras) can connect and disconnect from the Azure IoT cloud. Millions of devices can connect or disconnect at any time. Allow customers (on a cellphone app or a backend service) to receive notifications when device status changes. Thousands of servers can failover at the same time. Notification system as well and the backend database wouldn t operate at this scale. Duplicates are allowed, ordering is not guaranteed. 10

  11. RUDIMENTARY APPROACH Device Connect 1 Connect 1 Disconnect 2 Disconnect 2 Primary Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnected 2 11

  12. RUDIMENTARY APPROACH RESEND Device Connect 1 Disconnect 2 Primary Disconnect 2 Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnected 2 12

  13. RUDIMENTARY APPROACH RESEND Device Connect 1 Disconnect 2 Disconnect 2 Disconnect 2 Primary Disconnect 2 Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnected 2 13

  14. RUDIMENTARY APPROACH FAILURE Device Disconnect 1 Disconnect 1 Connect 2 Connect 2 Primary Standby (New) Primary Failed Client Topic Connected 0 Database Connected 0 Disconnected 1 14

  15. DEVICE STATE NOTIFICATIONS Components States Device Connected/Disconnected Service/In-memory Notified/Pending Database/Persistence Last notification Client Notifications received 15

  16. Components States Device Connected/Disconnected Service/In-memory Notified/Pending Database/Persistence Last notification Client/Notified-State Notifications received TLA+ 16

  17. INVARIANT 17

  18. 18

  19. INVARIANT FAILURE 19

  20. FIX Generate Sequence Numbers Checkpoint Failover Service assigns a sequence number and stores the event in the DB A background process sends the notification and tracks the sequence number processed On failover the new service resumes from where the last one left off We use algorithm for in-memory sequence numbers generation. 20

  21. SOPHISTICATE SOLUTION SIMPLE Device Connect 1 Connect 1 Disconnect 2 Disconnect 2 Primary Standby (New) Primary Failed Client Topic Database 1 2 Disconnected 0 Connected 1 Disconnected 2 21

  22. SOPHISTICATE SOLUTION - FAILOVER Device Connect 1 Disconnect 2 Disconnect 2 Primary Standby (New) Primary Failed Client Topic Database Disconnected 0 Connected 1 Disconnect 2 1 2 22

  23. Components States Device Connected/Disconnected Service/In-memory Notified/Pending Database/Persistence Last notification Client/Notified-State Notifications received TLA+ 23

  24. 24 INVARIANT

  25. 25

  26. TAKEAWAYS Correctness and Optimizations Cost and Savings We can process millions of concurrent device connects and disconnects correctly. We invested less than a week on the TLA+ model. No design changes were needed at the development and testing stages. Database writes and Sending notifications are not in the critical path. Optimizations reduced operational cost by an order of magnitude. in-memory sequence numbers instead of hitting database. 26

  27. CONCLUSION Operating at Scale Agility and Clarity Reduces post-production issues in count and complexity. It s easier to read, review and iterate over 50 lines of TLA+ than 10 pages of design doc. Eliminates design reworks that previously took 20-40% of project time. Makes cross-component integrations simpler as everyone codes against same specification. Allows aggressive optimizations that are necessary at IOT scale. Code reviews use the TLA+ specifications as reference and primarily focus on code quality. 27

  28. QUESTIONS ? 28

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#