Real-Time Data Insights with Azure Databricks
Processing high-volume data in real-time can be achieved efficiently using Azure Databricks, a powerful Apache Spark-based analytics platform integrated with Microsoft Azure. By transitioning from batch processing to structured streaming, you can gain valuable real-time insights from your data, enabling collaborative analytics workflows and seamless data processing. Azure Databricks offers advantages such as streamlined workflows, interactive workspaces, and enterprise-grade security for optimizing data processing and analysis in a cloud environment.
Presentation Transcript
Realtime Structured Streaming in Azure Databricks
Brian Steele - Principal Consultant - bsteele@pragmaticworks.com
Your Current Situation
- You currently have high-volume data that you are processing in batch format.
- You are trying to get real-time insights from your data.
- You have great knowledge of your data, but limited knowledge of Azure Databricks or other Spark systems.
Prior Architecture: Batch Processing
[Diagram: batch pipeline with a Source System, a Daily File Extract, and Azure Data Factory]
New Architecture: Structured Streaming
[Diagram: real-time messages stream to Event Hubs for real-time transaction processing, bypassing the source system's daily extract]
Why Azure Databricks? Azure Databricks is an Apache Spark-based analytics platform optimized for the Microsoft Azure cloud services platform. Designed with the founders of Apache Spark, Databricks is integrated with Azure to provide one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts. Azure Databricks is a fast, easy, and collaborative Apache Spark-based analytics service.
For a big data pipeline, the data (raw or structured) is ingested into Azure through Azure Data Factory in batches, or streamed near real-time using Kafka, Event Hub, or IoT Hub. This data lands in a data lake for long-term persisted storage, in Azure Blob Storage or Azure Data Lake Storage. As part of your analytics workflow, use Azure Databricks to read data from multiple data sources such as Azure Blob Storage, Azure Data Lake Storage, Azure Cosmos DB, or Azure SQL Data Warehouse and turn it into breakthrough insights using Spark. Azure Databricks provides enterprise-grade Azure security, including Azure Active Directory integration, role-based controls, and SLAs that protect your data and your business.
Advantages of Structured Streaming
- Structured Streaming is the Apache Spark API that lets you express computation on streaming data in the same way you express a batch computation on static data.
- The Spark SQL engine performs the computation incrementally and continuously updates the result as streaming data arrives.
- Databricks maintains the current checkpoint of the data processed, making restart after failure nearly seamless.
- It can bring impactful insights to users in near real-time.
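As a concrete illustration of that batch/streaming parity, here is a minimal PySpark sketch. The paths and table names are hypothetical; `spark` is the session a Databricks notebook provides.

```python
from pyspark.sql import functions as F

# Batch: read static data once and aggregate it.
batch_totals = (spark.read.format("delta").load("/mnt/data/sales")
                .groupBy("itemId")
                .agg(F.sum("amount").alias("total")))

# Streaming: the same logic expressed against readStream; the Spark SQL
# engine updates the aggregate incrementally as new records arrive.
stream_totals = (spark.readStream.format("delta").load("/mnt/data/sales")
                 .groupBy("itemId")
                 .agg(F.sum("amount").alias("total")))

# The checkpoint location records progress and state, so a restart after
# a failure resumes where the query left off.
query = (stream_totals.writeStream
         .format("delta")
         .outputMode("complete")
         .option("checkpointLocation", "/mnt/checkpoints/sales_totals")
         .toTable("sales_totals"))
```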
Streaming Data Sources/Sinks
Sources:
- Azure Event Hubs/IoT Hubs
- Azure Data Lake Gen2 (Auto Loader)
- Apache Kafka
- Amazon Kinesis
- Amazon S3 with Amazon SQS
- Databricks Delta Tables
Sinks:
- Databricks Delta Tables
- Almost any sink using foreachBatch
Structured Streaming
Source parameters:
- Source format/location
- Batch/file size
Transformations:
- Streaming data can be transformed in the same ways as static data
Output parameters:
- Output format/location
- Checkpoint location
[Diagram: Event Hub feeding a Structured Streaming query]
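A sketch of those parameters end to end, assuming the Azure Event Hubs Spark connector (azure-event-hubs-spark) is installed on the cluster; the connection string, schema, and paths are placeholders.

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType

# Source parameters: format, location (connection string), and batch size.
conn_str = "<your Event Hubs connection string>"
eh_conf = {
    "eventhubs.connectionString": (spark.sparkContext._jvm
        .org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str)),
    "maxEventsPerTrigger": "5000",  # cap the size of each micro-batch
}
raw = spark.readStream.format("eventhubs").options(**eh_conf).load()

# Transformations: the same DataFrame operations you would use on static
# data, here parsing the JSON message body.
schema = (StructType()
          .add("transactionId", StringType())
          .add("customerId", StringType())
          .add("itemId", StringType())
          .add("amount", DoubleType())
          .add("timeStamp", TimestampType()))
transactions = (raw
                .select(F.from_json(F.col("body").cast("string"), schema).alias("t"))
                .select("t.*"))

# Output parameters: format, location, and the checkpoint location.
query = (transactions.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "/mnt/checkpoints/transactions")
         .start("/mnt/delta/transactions"))
```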
Stream-Static Joins
Join types:
- Inner
- Left
Not stateful by default.
[Diagram: Event Hub stream joined with a static file in a Structured Streaming query]
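A minimal sketch of a stream-static join, reusing the hypothetical `transactions` stream from the previous example plus a static item dimension in Delta:

```python
# Static side: re-read on each micro-batch, so the join always sees the
# latest snapshot of the table without keeping any streaming state.
items = spark.read.format("delta").load("/mnt/delta/items")

# Stream on the left; inner and left outer joins are supported.
enriched = transactions.join(items, on="itemId", how="left")
```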
Stream-Stream Joins
Join types:
- Inner (watermark and time constraint optional)
- Left Outer (watermark and time constraint required)
- Right Outer (watermark and time constraint required)
You can also join static tables/files into your stream-stream join.
[Diagram: two Event Hub streams joined per micro-batch in a Structured Streaming query, with an optional static file]
Watermark vs. Time Constraint
Watermark: how late a record can arrive, and after what time it can be removed from the state.
Time constraint: how long the records will be kept in state in relation to the other stream.
Both are only used in stateful operations and are ignored in non-stateful streaming queries and batch queries.
Structured Streaming Watermark Example
[Diagram: two Event Hub streams feeding a Structured Streaming join]
- Transaction stream, watermark 10 minutes: Transaction 1/Customer 1/Item 1; Transaction 2/Customer 2/Item 1; Transaction 3/Customer 1/Item 2
- View stream, watermark 5 minutes: View 1/Customer 1/Item 1; View 2/Customer 2/Item 2; View 3/Customer 3/Item 3; View 4/Customer 1/Item 2
- Time constraint: View.timeStamp >= Transaction.timeStamp and View.timeStamp <= Transaction.timeStamp + interval 5 minutes
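A sketch of that transactions/views example, assuming both streams are available as Delta paths with customerId, itemId, and timeStamp columns (the paths and names are hypothetical):

```python
from pyspark.sql import functions as F

transactions = spark.readStream.format("delta").load("/mnt/delta/transactions")
views = spark.readStream.format("delta").load("/mnt/delta/views")

# Watermarks bound how late each stream's records may arrive before they
# are dropped from state.
t = transactions.withWatermark("timeStamp", "10 minutes").alias("t")
v = views.withWatermark("timeStamp", "5 minutes").alias("v")

# The time constraint from the slide bounds how long matching state is
# kept: a view matches only within 5 minutes after its transaction.
joined = t.join(
    v,
    F.expr("""
        t.customerId = v.customerId AND
        t.itemId = v.itemId AND
        v.timeStamp >= t.timeStamp AND
        v.timeStamp <= t.timeStamp + interval 5 minutes
    """),
    "inner",
)
```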
[Timeline: Transaction 1 occurs and is received at 10:00; its watermark window runs from 10:00 to 10:10 and its time-constraint window from 10:00 to 10:05. Views 1-6 occur and are received at various times between 10:00 and roughly 10:12 (View 3, for example, occurs at 10:04 but is not received until 10:08), illustrating which views still match the transaction once the watermark and time constraint are applied.]
foreachBatch
Allows batch-type processing to be performed on streaming data. Use it to:
- Perform processes without adding to state (dropDuplicates, aggregating data)
- Perform a merge/upsert with existing static data
- Write data to multiple sinks/destinations
- Write data to sinks not supported in Structured Streaming
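A sketch of a foreachBatch upsert into Delta, reusing the hypothetical `transactions` stream; the target path and key column are also hypothetical:

```python
from delta.tables import DeltaTable

def upsert_to_delta(batch_df, batch_id):
    # Inside foreachBatch each micro-batch is a plain static DataFrame,
    # so batch-only operations work without adding to streaming state.
    deduped = batch_df.dropDuplicates(["transactionId"])
    target = DeltaTable.forPath(spark, "/mnt/delta/transactions_current")
    (target.alias("c")
           .merge(deduped.alias("u"), "c.transactionId = u.transactionId")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

query = (transactions.writeStream
         .foreachBatch(upsert_to_delta)
         .option("checkpointLocation", "/mnt/checkpoints/upsert")
         .start())
```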
Going to Production
- Spark shuffle partitions: set equal to the number of cores on the cluster.
- Maximum records per micro-batch: maxFilesPerTrigger/maxBytesPerTrigger for the file source and Delta Lake; maxEventsPerTrigger for Event Hubs.
- Limit stateful operations to limit state size and memory errors: watermarking; MERGE/join/aggregation; broadcast joins.
- Output tables influence downstream streams: manually re-partition, or use Delta Lake Auto-Optimize.
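A sketch of those production settings; all values are illustrative and depend on your cluster and data volume:

```python
# Shuffle partitions sized to the cluster's core count instead of the
# default 200 (assume an 8-core cluster here).
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Cap micro-batch size at the source: files/bytes for file and Delta
# sources (maxEventsPerTrigger plays the same role for Event Hubs).
limited = (spark.readStream.format("delta")
           .option("maxFilesPerTrigger", "100")
           .option("maxBytesPerTrigger", "1g")
           .load("/mnt/delta/transactions"))

# Let Delta auto-optimize the output table so downstream streams read
# fewer, larger files instead of requiring manual re-partitioning.
spark.sql("""
    ALTER TABLE sales_totals SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```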