Real-Time Data Insights with Azure Databricks
Processing high-volume data in real-time can be achieved efficiently using Azure Databricks, a powerful Apache Spark-based analytics platform integrated with Microsoft Azure. By transitioning from batch processing to structured streaming, you can gain valuable real-time insights from your data, enabling collaborative analytics workflows and seamless data processing. Azure Databricks offers advantages such as streamlined workflows, interactive workspaces, and enterprise-grade security for optimizing data processing and analysis in a cloud environment.
Presentation Transcript
Realtime Structured Streaming in Azure Databricks
Brian Steele - Principal Consultant - bsteele@pragmaticworks.com
Your Current Situation
- You currently have high-volume data that you are processing in batch format.
- You are trying to get real-time insights from your data.
- You have great knowledge of your data, but limited knowledge of Azure Databricks or other Spark systems.
Prior Architecture: Batch Processing
[Diagram: batch pipeline with a Source System, a Daily File Extract, and Azure Data Factory]
New Architecture: Structured Streaming
[Diagram: real-time messages stream to Event Hubs for real-time transaction processing, bypassing the source system's daily extract]
Why Azure Databricks? Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service.
For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using Kafka, Event Hub, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into breakthrough insights using Spark. Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.
Advantages of Structured Streaming
- Structured Streaming is the Apache Spark API that lets you express computation on streaming data in the same way you express a batch computation on static data.
- The Spark SQL engine performs the computation incrementally and continuously updates the result as streaming data arrives.
- Databricks maintains the current checkpoint of the data processed, making restart after failure nearly seamless.
- It can bring impactful insights to users in near real-time.
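As a concrete illustration of that batch/streaming parity, here is a minimal PySpark sketch. The paths and table names are hypothetical; `spark` is the session a Databricks notebook provides.

```python
from pyspark.sql import functions as F

# Batch: read static data once and aggregate it.
batch_totals = (spark.read.format("delta").load("/mnt/data/sales")
                .groupBy("itemId")
                .agg(F.sum("amount").alias("total")))

# Streaming: the same logic expressed against readStream; the Spark SQL
# engine updates the aggregate incrementally as new records arrive.
stream_totals = (spark.readStream.format("delta").load("/mnt/data/sales")
                 .groupBy("itemId")
                 .agg(F.sum("amount").alias("total")))

# The checkpoint location records progress and state, so a restart after
# a failure resumes where the query left off.
query = (stream_totals.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/mnt/checkpoints/sales_totals")
         .toTable("sales_totals"))
```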
Streaming Data Sources/Sinks
Sources:
- Azure Event Hubs/IoT Hubs
- Azure Data Lake Gen2 (Auto Loader)
- Apache Kafka
- Amazon Kinesis
- Amazon S3 with Amazon SQS
- Databricks Delta Tables
Sinks:
- Databricks Delta Tables
- Almost any sink using foreachBatch
Structured Streaming
Source parameters:
- Source format/location
- Batch/file size
Transformations:
- Streaming data can be transformed in the same ways as static data
Output parameters:
- Output format/location
- Checkpoint location
[Diagram: Event Hub feeding a Structured Streaming query]
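A sketch of those parameters end to end, assuming the Azure Event Hubs Spark connector (azure-event-hubs-spark) is installed on the cluster; the connection string, schema, and paths are placeholders.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# Source parameters: format, location (connection string), and batch size.
conn_str = "<your Event Hubs connection string>"
eh_conf = {
    "eventhubs.connectionString": (spark.sparkContext._jvm
        .org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)),
    "maxEventsPerTrigger": "5000",  # cap the size of each micro-batch
}
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Transformations: the same DataFrame operations you would use on static
# data, here parsing the JSON message body.
schema = (StructType()
          .add("transactionId", StringType())
          .add("customerId", StringType())
          .add("itemId", StringType())
          .add("amount", DoubleType())
          .add("timeStamp", TimestampType()))
transactions = (raw
                .select(F.from_json(F.col("body").cast("string"), schema).alias("t"))
                .select("t.*"))

# Output parameters: format, location, and the checkpoint location.
query = (transactions.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/checkpoints/transactions")
         .start("/mnt/delta/transactions"))
```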
Stream-Static Joins
Join types:
- Inner
- Left
Not stateful by default.
[Diagram: Event Hub stream joined with a static file in a Structured Streaming query]
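A minimal sketch of a stream-static join, reusing the hypothetical `transactions` stream from the previous example plus a static item dimension in Delta:

```python
# Static side: re-read on each micro-batch, so the join always sees the
# latest snapshot of the table without keeping any streaming state.
items = spark.read.format("delta").load("/mnt/delta/items")

# Stream on the left; inner and left outer joins are supported.
enriched = transactions.join(items, on="itemId", how="left")
```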
Stream-Stream Joins
Join types:
- Inner (watermark and time constraint optional)
- Left Outer (watermark and time constraint required)
- Right Outer (watermark and time constraint required)
You can also join static tables/files into your stream-stream join.
[Diagram: two Event Hub streams joined per micro-batch in a Structured Streaming query, with an optional static file]
Watermark vs. Time Constraint
Watermark: how late a record can arrive, and after what time it can be removed from the state.
Time constraint: how long the records will be kept in state in relation to the other stream.
Both are only used in stateful operations and are ignored in non-stateful streaming queries and batch queries.
Structured Streaming Watermark Example
[Diagram: two Event Hub streams feeding a Structured Streaming join]
- Transaction stream, watermark 10 minutes: Transaction 1/Customer 1/Item 1; Transaction 2/Customer 2/Item 1; Transaction 3/Customer 1/Item 2
- View stream, watermark 5 minutes: View 1/Customer 1/Item 1; View 2/Customer 2/Item 2; View 3/Customer 3/Item 3; View 4/Customer 1/Item 2
- Time constraint: View.timeStamp >= Transaction.timeStamp and View.timeStamp <= Transaction.timeStamp + interval 5 minutes
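A sketch of that transactions/views example, assuming both streams are available as Delta paths with customerId, itemId, and timeStamp columns (the paths and names are hypothetical):

```python
from pyspark.sql import functions as F

transactions = spark.readStream.format("delta").load("/mnt/delta/transactions")
views = spark.readStream.format("delta").load("/mnt/delta/views")

# Watermarks bound how late each stream's records may arrive before they
# are dropped from state.
t = transactions.withWatermark("timeStamp", "10 minutes").alias("t")
v = views.withWatermark("timeStamp", "5 minutes").alias("v")

# The time constraint from the slide bounds how long matching state is
# kept: a view matches only within 5 minutes after its transaction.
joined = t.join(
    v,
    F.expr("""
        t.customerId = v.customerId AND
        t.itemId = v.itemId AND
        v.timeStamp >= t.timeStamp AND
        v.timeStamp <= t.timeStamp + interval 5 minutes
    """),
    "inner",
)
```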
[Timeline: Transaction 1 occurs and is received at 10:00; its watermark window runs from 10:00 to 10:10 and its time-constraint window from 10:00 to 10:05. Views 1-6 occur and are received at various times between 10:00 and roughly 10:12 (View 3, for example, occurs at 10:04 but is not received until 10:08), illustrating which views still match the transaction once the watermark and time constraint are applied.]
foreachBatch
Allows batch-type processing to be performed on streaming data. Use it to:
- Perform processes without adding to state (dropDuplicates, aggregating data)
- Perform a merge/upsert with existing static data
- Write data to multiple sinks/destinations
- Write data to sinks not supported in Structured Streaming
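A sketch of a foreachBatch upsert into Delta, reusing the hypothetical `transactions` stream; the target path and key column are also hypothetical:

```python
from delta.tables import DeltaTable

def upsert_to_delta(batch_df, batch_id):
    # Inside foreachBatch each micro-batch is a plain static DataFrame,
    # so batch-only operations work without adding to streaming state.
    deduped = batch_df.dropDuplicates(["transactionId"])
    target = DeltaTable.forPath(spark, "/mnt/delta/transactions_current")
    (target.alias("c")
           .merge(deduped.alias("u"), "c.transactionId = u.transactionId")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

query = (transactions.writeStream
         .foreachBatch(upsert_to_delta)
         .option("checkpointLocation", "/mnt/checkpoints/upsert")
         .start())
```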
Going to Production
- Spark shuffle partitions: set equal to the number of cores on the cluster.
- Maximum records per micro-batch: maxFilesPerTrigger/maxBytesPerTrigger for the file source and Delta Lake; maxEventsPerTrigger for Event Hubs.
- Limit stateful operations to limit state size and memory errors: watermarking; MERGE/join/aggregation; broadcast joins.
- Output tables influence downstream streams: manually re-partition, or use Delta Lake Auto-Optimize.
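A sketch of those production settings; all values are illustrative and depend on your cluster and data volume:

```python
# Shuffle partitions sized to the cluster's core count instead of the
# default 200 (assume an 8-core cluster here).
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Cap micro-batch size at the source: files/bytes for file and Delta
# sources (maxEventsPerTrigger plays the same role for Event Hubs).
limited = (spark.readStream.format("delta")
           .option("maxFilesPerTrigger", "100")
           .option("maxBytesPerTrigger", "1g")
           .load("/mnt/delta/transactions"))

# Let Delta auto-optimize the output table so downstream streams read
# fewer, larger files instead of requiring manual re-partitioning.
spark.sql("""
    ALTER TABLE sales_totals SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```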