Real-Time Data Insights with Azure Databricks

 
Real-Time Structured Streaming in Azure Databricks
 
Brian Steele - Principal Consultant
bsteele@pragmaticworks.com
 
Your Current Situation

You currently have high-volume data that you are processing in a batch format.
You are trying to get real-time insights from your data.
You have great knowledge of your data, but limited knowledge of Azure Databricks or other Spark systems.
 
Prior Architecture

Batch processing: a daily file extract from the source system, loaded into Azure through Azure Data Factory.

New Architecture

Structured Streaming: messages stream in real time to Event Hubs for real-time transaction processing, bypassing the source system's batch extract.
 
Why Azure Databricks?

Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service.
 
For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed in near real-time using Kafka, Event Hubs, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse, and turn it into breakthrough insights using Spark. Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.
 
Advantages of Structured Streaming

Structured Streaming is the Apache Spark API that lets you express computation on streaming data in the same way you express a batch computation on static data. The Spark SQL engine performs the computation incrementally and continuously updates the result as streaming data arrives. Databricks maintains the current checkpoint of the data processed, making restart after failure nearly seamless. It can bring impactful insights to users in near real-time.
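
To see that batch/streaming symmetry, here is a minimal sketch, assuming a Databricks notebook where spark is already defined and a hypothetical Delta table of transactions; the only difference between the two queries is read vs. readStream.

    from pyspark.sql import functions as F

    # Batch computation on static data
    batch_totals = (spark.read.format("delta").load("/mnt/delta/transactions")
                    .groupBy("customerId")
                    .agg(F.sum("amount").alias("total")))

    # The identical computation expressed on a stream; the Spark SQL engine
    # runs it incrementally and updates the result as data arrives
    stream_totals = (spark.readStream.format("delta").load("/mnt/delta/transactions")
                     .groupBy("customerId")
                     .agg(F.sum("amount").alias("total")))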
 
Streaming Data Sources/Sinks

Sources: Azure Event Hubs/IoT Hubs, Azure Data Lake Gen2 (Auto Loader), Apache Kafka, Amazon Kinesis, Amazon S3 with Amazon SQS, Databricks Delta tables.
Sinks: Databricks Delta tables, plus almost any sink using foreachBatch.
 
Structured Streaming

Source parameters: source format/location; batch/file size.
Transformations: streaming data can be transformed in the same ways as static data.
Output parameters: output format/location; checkpoint location.
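
A rough sketch of those three pieces, assuming a Databricks notebook; the storage paths, column names, and option values are hypothetical.

    from pyspark.sql import functions as F

    # Source parameters: format/location plus a cap on batch/file size
    raw = (spark.readStream
           .format("cloudFiles")                          # Databricks Auto Loader
           .option("cloudFiles.format", "json")           # source format
           .option("cloudFiles.maxFilesPerTrigger", 100)  # batch/file size
           .load("/mnt/landing/transactions"))            # source location

    # Transformations: the same operations you would apply to static data
    transformed = (raw
                   .withColumn("amount", F.col("amount").cast("double"))
                   .filter(F.col("amount") > 0))

    # Output parameters: format/location plus the checkpoint location
    query = (transformed.writeStream
             .format("delta")
             .option("checkpointLocation", "/mnt/checkpoints/transactions")
             .start("/mnt/delta/transactions"))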
 
DEMO
 
Join Operations
 
Stream-Static Joins

Join types: Inner, Left.
Not stateful by default.
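
A minimal sketch of a stream-static join; the products table and transaction stream are hypothetical.

    # Static side: read once as an ordinary DataFrame
    products = spark.read.format("delta").load("/mnt/delta/products")

    # Streaming side
    transactions = spark.readStream.format("delta").load("/mnt/delta/transactions")

    # Left stream-static join; not stateful by default
    enriched = transactions.join(products, on="itemId", how="left")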
 
DEMO
 
Stream-Stream Joins

Join types:
Inner (watermark and time constraint optional)
Left outer (watermark and time constraint required)
Right outer (watermark and time constraint required)
You can also join static tables/files into your stream-stream join (see the sketch after the next section).
 
Watermark vs. Time Constraint

Watermark – how late a record can arrive, and after what time it can be removed from state.
Time constraint – how long the records will be kept in state in relation to the other stream.
Both are only used in stateful operations; they are ignored in non-stateful streaming queries and batch queries.
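
A sketch of the example from the deck: transactions carry a 10-minute watermark, views carry a 5-minute watermark, and the time constraint keeps only views that occur within 5 minutes after the matching transaction. The sources and column names are assumptions.

    from pyspark.sql import functions as F

    transactions = (spark.readStream.format("delta").load("/mnt/delta/transactions")
                    .withWatermark("transactionTime", "10 minutes")
                    .alias("t"))

    views = (spark.readStream.format("delta").load("/mnt/delta/views")
             .withWatermark("viewTime", "5 minutes")
             .alias("v"))

    # Left outer stream-stream join: watermark and time constraint required
    joined = transactions.join(
        views,
        F.expr("""
            t.customerId = v.customerId AND
            v.viewTime >= t.transactionTime AND
            v.viewTime <= t.transactionTime + interval 5 minutes
        """),
        "leftOuter")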
 
DEMO
 
foreachBatch

Allows batch-type processing to be performed on streaming data:
Perform processes without adding to state (dropDuplicates, aggregating data)
Perform a merge/upsert with existing static data
Write data to multiple sinks/destinations
Write data to sinks not supported in Structured Streaming
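
A hedged sketch of the merge/upsert pattern with foreachBatch; the Delta target, key column, and source stream are assumptions.

    from delta.tables import DeltaTable

    def upsert_batch(micro_batch_df, batch_id):
        # Deduplicate within the micro-batch without adding to streaming state
        batch = micro_batch_df.dropDuplicates(["transactionId"])
        target = DeltaTable.forPath(spark, "/mnt/delta/transactions_current")
        (target.alias("t")
               .merge(batch.alias("s"), "t.transactionId = s.transactionId")
               .whenMatchedUpdateAll()
               .whenNotMatchedInsertAll()
               .execute())

    stream = spark.readStream.format("delta").load("/mnt/delta/transactions")
    query = (stream.writeStream
             .foreachBatch(upsert_batch)
             .option("checkpointLocation", "/mnt/checkpoints/upsert")
             .start())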
 
DEMO
 
Going to Production

Spark shuffle partitions – set equal to the number of cores on the cluster.
Maximum records per micro-batch:
  File source/Delta Lake – maxFilesPerTrigger, maxBytesPerTrigger
  Event Hubs – maxEventsPerTrigger
Limit stateful operations – limits state size and avoids memory errors:
  Watermarking
  MERGE/join/aggregation
  Broadcast joins
Output tables – influence downstream streams:
  Manually re-partition
  Delta Lake – Auto Optimize
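
A sketch of a few of these settings; the values are illustrative, not recommendations.

    # Match shuffle partitions to the cluster's core count (8 here is arbitrary)
    spark.conf.set("spark.sql.shuffle.partitions", "8")

    # Cap each micro-batch from a file/Delta Lake source
    stream = (spark.readStream
              .format("delta")
              .option("maxFilesPerTrigger", 50)
              .option("maxBytesPerTrigger", "1g")
              .load("/mnt/delta/transactions"))
    # For Event Hubs sources, the analogous option is maxEventsPerTrigger.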
 
 
Conclusion

Have any questions?

Processing high-volume data in real-time can be achieved efficiently using Azure Databricks, a powerful Apache Spark-based analytics platform integrated with Microsoft Azure. By transitioning from batch processing to structured streaming, you can gain valuable real-time insights from your data, enabling collaborative analytics workflows and seamless data processing. Azure Databricks offers advantages such as streamlined workflows, interactive workspaces, and enterprise-grade security for optimizing data processing and analysis in a cloud environment.

  • Real-Time Data Insights
  • Azure Databricks
  • Structured Streaming
  • Apache Spark
  • Big Data Analytics



