Introduction to Spark Streaming for Real-Time Data Analysis
Explore Spark Streaming for real-time data analysis in this informative session by Ellen Kraffmiller, Technical Lead, and Robert Treacy, Senior Software Architect at The Institute for Quantitative Social Science at Harvard University. Discover key concepts, components, and Apache Spark's capabilities for large-scale data processing.
Presentation Transcript
Introduction to Spark Streaming for Real-Time Data Analysis. Ellen Kraffmiller - Technical Lead; Robert Treacy - Senior Software Architect. The Institute for Quantitative Social Science at Harvard University.
Agenda: Overview of Spark; Spark Components and Architecture; Streaming (Storm, Kafka, Flume, Kinesis); Spark Streaming; Structured Streaming; Demo.
Introductions: who we are, who is the audience.
Related Presentations: BOF5810 - Distinguish Pop Music from Heavy Metal with Apache Spark MLlib; CON3165 - Introduction to Machine Learning with Apache Spark MLlib; CON5189 - Getting Started with Spark; CON4219 - Analyzing Streaming Video with Apache Spark; BOF1337 - Big Data Processing with Apache Spark: Scala or Java?; CON1495 - Turning Relational Database Tables into Spark Data Sources; CON4998 - Java EE 7 with Apache Spark for the World's Largest Credit Card Core Systems.
More Related Presentations: CON7867 - Data Pipeline for Hyperscale, Real-Time Applications Using Kafka and Cassandra; CON2234 - Interactive Data Analytics and Visualization with Collaborative Documents; CON7682 - Adventures in Big Data: Scaling Bottleneck Hunt; CON1191 - Kafka Streams + TensorFlow + H2O.ai: Highly Scalable Deep Learning; CON7368 - The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners.
Apache Spark: a fast and general engine for large-scale data processing. Latest version: 2.2. Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk. Resilient Distributed Dataset (RDD). Directed Acyclic Graph (DAG) - lineage.
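As a minimal sketch of the RDD and DAG ideas above (an assumed example, not code from the talk): each transformation adds a node to the lineage graph, and nothing executes until an action is called.

import org.apache.spark.sql.SparkSession

object RddLineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-lineage-sketch")
      .master("local[*]")              // local mode, just for the sketch
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("spark streaming demo", "spark sql demo"))
    val counts = lines
      .flatMap(_.split(" "))           // transformation: new RDD recorded in the lineage
      .map(word => (word, 1))          // transformation
      .reduceByKey(_ + _)              // transformation: adds a shuffle stage to the DAG

    println(counts.toDebugString)      // prints the lineage (DAG) of this RDD
    counts.collect().foreach(println)  // action: triggers actual execution

    spark.stop()
  }
}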
Spark Components: Spark SQL, Spark Streaming, MLlib, and GraphX, built on Spark Core. Cluster managers: Standalone Cluster, YARN, Mesos.
Streaming Technologies: Storm, Kafka, Flume, Kinesis.
Apache Storm: low latency, discrete events. Real-time processing, not a queue. Events may be processed more than once. Trident offers a higher-level, micro-batch abstraction on top of Storm.
Apache Kafka: scalable, fault-tolerant message queue. Exactly-once semantics. Pull model.
Apache Flume: message passing, push model. Part of the Hadoop ecosystem. No event replication, so it can lose messages.
Kinesis: AWS, pull model. Simpler than Kafka, but also slower, partly because of replication.
Spark Streaming: stream processing and batch processing use the same API. Two types of streams: DStreams and Structured Streams (Datasets/DataFrames).
DStreams: a series of RDDs. Basic sources: file systems, socket connections. Advanced sources: Kafka, Flume, Kinesis, etc.
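A minimal DStream sketch (an assumed example, not the demo code) using the basic socket source mentioned above: each micro-batch arrives as one RDD in the DStream. You can feed it by running nc -lk 9999 in a terminal and typing lines of text.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    // Basic source: a socket connection; each batch interval becomes one RDD.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()          // print a sample of each batch's results

    ssc.start()
    ssc.awaitTermination()
  }
}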
Structured Streaming: built on the Spark SQL engine, with the Dataset/DataFrame API. Late events are aggregated correctly. Exactly-once semantics: the stream source must track progress (e.g. Kafka offsets) and the stream sink must be idempotent.
Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
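As a hedged sketch of the Structured Streaming model (an assumed example using the socket source and console sink, not the demo code): the same Dataset/DataFrame API used for batch queries drives a running word count over an unbounded input.

import org.apache.spark.sql.SparkSession

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Streaming DataFrame: one "value" column per line read from the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Same Dataset operations as a batch query.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")   // keep the full aggregate table on each trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}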
Spark ML Pipelines: a Pipeline consists of a sequence of PipelineStages, which are either Transformers or Estimators. A Transformer's transform() converts one DataFrame to another DataFrame. An Estimator's fit() accepts a DataFrame and produces a Model; a Model is itself a Transformer.
Bayes Model: very fast, simple. Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
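A minimal Pipeline sketch tying the two previous slides together (hypothetical data and column names, not the demo's code): Tokenizer and HashingTF are Transformers, NaiveBayes is an Estimator, and fitting the Pipeline yields a PipelineModel that is itself a Transformer.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()

    // Tiny hypothetical training set: (text, label)
    val training = spark.createDataFrame(Seq(
      ("spark streaming is great", 1.0),
      ("this movie was terrible", 0.0)
    )).toDF("text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val nb        = new NaiveBayes()   // Estimator: fit() produces a NaiveBayesModel

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, nb))
    val model    = pipeline.fit(training)   // Estimator -> Model (a Transformer)

    val test = spark.createDataFrame(Seq(
      Tuple1("streaming with spark is great")
    )).toDF("text")
    model.transform(test).select("text", "prediction").show()

    spark.stop()
  }
}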
Demo: Analyze Twitter Streams with Spark ML and Spark Streaming. Part 1: create a Naïve Bayes model with training data. Part 2: sentiment analysis of a live Twitter stream using the Naïve Bayes model and Spark DStreams. Part 3: Structured Streaming example, getting Twitter trending hashtags in 10-minute windows using the Twitter timestamp. https://github.com/ekraffmiller/SparkStreamingDemo
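For Part 3, the actual code lives in the GitHub repository above; the fragment below is only a hedged sketch of the windowing step, assuming a streaming DataFrame with hashtag and timestamp (Twitter event-time) columns produced elsewhere from the Twitter source.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object TrendingHashtagsSketch {
  // Group hashtags into 10-minute event-time windows; the watermark lets Spark
  // aggregate late tweets correctly and eventually drop state for old windows.
  def trending(tweets: DataFrame): DataFrame =
    tweets
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "10 minutes"), col("hashtag"))
      .count()
}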
Thank you! Questions?