Introduction to Spark Streaming for Real-Time Data Analysis
Explore Spark Streaming for real-time data analysis in this informative session by Ellen Kraffmiller, Technical Lead, and Robert Treacy, Senior Software Architect at The Institute for Quantitative Social Science at Harvard University. Discover key concepts, components, and Apache Spark's capabilities for large-scale data processing.
Presentation Transcript
Introduction to Spark Streaming for Real-Time Data Analysis. Ellen Kraffmiller - Technical Lead; Robert Treacy - Senior Software Architect. The Institute for Quantitative Social Science at Harvard University.
Agenda: Overview of Spark; Spark Components and Architecture; Streaming (Storm, Kafka, Flume, Kinesis); Spark Streaming; Structured Streaming; Demo.
Introductions: who we are, who is the audience.
Related Presentations: BOF5810 - Distinguish Pop Music from Heavy Metal with Apache Spark MLlib; CON3165 - Introduction to Machine Learning with Apache Spark MLlib; CON5189 - Getting Started with Spark; CON4219 - Analyzing Streaming Video with Apache Spark; BOF1337 - Big Data Processing with Apache Spark: Scala or Java?; CON1495 - Turning Relational Database Tables into Spark Data Sources; CON4998 - Java EE 7 with Apache Spark for the World's Largest Credit Card Core Systems.
More Related Presentations: CON7867 - Data Pipeline for Hyperscale, Real-Time Applications Using Kafka and Cassandra; CON2234 - Interactive Data Analytics and Visualization with Collaborative Documents; CON7682 - Adventures in Big Data: Scaling Bottleneck Hunt; CON1191 - Kafka Streams + TensorFlow + H2O.ai: Highly Scalable Deep Learning; CON7368 - The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners.
Apache Spark: a fast and general engine for large-scale data processing. Latest version: 2.2. Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk. Resilient Distributed Dataset (RDD). Directed Acyclic Graph (DAG) - lineage.
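As a minimal sketch of the RDD and DAG ideas above (an assumed example, not code from the talk): each transformation adds a node to the lineage graph, and nothing executes until an action is called.

import org.apache.spark.sql.SparkSession

object RddLineageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-lineage-sketch")
      .master("local[*]")              // local mode, just for the sketch
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("spark streaming demo", "spark sql demo"))
    val counts = lines
      .flatMap(_.split(" "))           // transformation: new RDD recorded in the lineage
      .map(word => (word, 1))          // transformation
      .reduceByKey(_ + _)              // transformation: adds a shuffle stage to the DAG

    println(counts.toDebugString)      // prints the lineage (DAG) of this RDD
    counts.collect().foreach(println)  // action: triggers actual execution

    spark.stop()
  }
}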
Spark Components: Spark SQL, Spark Streaming, MLlib, and GraphX, built on Spark Core. Cluster managers: Standalone Cluster, YARN, Mesos.
Streaming Technologies: Storm, Kafka, Flume, Kinesis.
Apache Storm: low latency, discrete events. Real-time processing, not a queue. Events may be processed more than once. Trident offers a higher-level, micro-batch abstraction on top of Storm.
Apache Kafka: scalable, fault-tolerant message queue. Exactly-once semantics. Pull model.
Apache Flume: message passing, push model. Part of the Hadoop ecosystem. No event replication, so it can lose messages.
Kinesis: AWS, pull model. Simpler than Kafka, but also slower, partly because of replication.
Spark Streaming: stream processing and batch processing use the same API. Two types of streams: DStreams and Structured Streams (Datasets/DataFrames).
DStreams: a series of RDDs. Basic sources: file systems, socket connections. Advanced sources: Kafka, Flume, Kinesis, etc.
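A minimal DStream sketch (an assumed example, not the demo code) using the basic socket source mentioned above: each micro-batch arrives as one RDD in the DStream. You can feed it by running nc -lk 9999 in a terminal and typing lines of text.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamWordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch").setMaster("local[2]")
    val ssc  = new StreamingContext(conf, Seconds(5))   // 5-second micro-batches

    // Basic source: a socket connection; each batch interval becomes one RDD.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.print()          // print a sample of each batch's results

    ssc.start()
    ssc.awaitTermination()
  }
}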
Structured Streaming: built on the Spark SQL engine, with the Dataset/DataFrame API. Late events are aggregated correctly. Exactly-once semantics: the stream source must track progress (e.g. Kafka offsets) and the stream sink must be idempotent.
Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html
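As a hedged sketch of the Structured Streaming model (an assumed example using the socket source and console sink, not the demo code): the same Dataset/DataFrame API used for batch queries drives a running word count over an unbounded input.

import org.apache.spark.sql.SparkSession

object StructuredStreamingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("structured-streaming-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Streaming DataFrame: one "value" column per line read from the socket.
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Same Dataset operations as a batch query.
    val counts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    val query = counts.writeStream
      .outputMode("complete")   // keep the full aggregate table on each trigger
      .format("console")
      .start()

    query.awaitTermination()
  }
}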
Spark ML Pipelines: a Pipeline consists of a sequence of PipelineStages, which are either Transformers or Estimators. A Transformer's transform() converts one DataFrame to another DataFrame. An Estimator's fit() accepts a DataFrame and produces a Model; a Model is itself a Transformer.
Bayes Model: very fast, simple. Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
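A minimal Pipeline sketch tying the two previous slides together (hypothetical data and column names, not the demo's code): Tokenizer and HashingTF are Transformers, NaiveBayes is an Estimator, and fitting the Pipeline yields a PipelineModel that is itself a Transformer.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pipeline-sketch").master("local[*]").getOrCreate()

    // Tiny hypothetical training set: (text, label)
    val training = spark.createDataFrame(Seq(
      ("spark streaming is great", 1.0),
      ("this movie was terrible", 0.0)
    )).toDF("text", "label")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val nb        = new NaiveBayes()   // Estimator: fit() produces a NaiveBayesModel

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, nb))
    val model    = pipeline.fit(training)   // Estimator -> Model (a Transformer)

    val test = spark.createDataFrame(Seq(
      Tuple1("streaming with spark is great")
    )).toDF("text")
    model.transform(test).select("text", "prediction").show()

    spark.stop()
  }
}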
Demo: Analyze Twitter Streams with Spark ML and Spark Streaming. Part 1: create a Naïve Bayes model with training data. Part 2: sentiment analysis of a live Twitter stream using the Naïve Bayes model and Spark DStreams. Part 3: Structured Streaming example, getting Twitter trending hashtags in 10-minute windows using the Twitter timestamp. https://github.com/ekraffmiller/SparkStreamingDemo
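For Part 3, the actual code lives in the GitHub repository above; the fragment below is only a hedged sketch of the windowing step, assuming a streaming DataFrame with hashtag and timestamp (Twitter event-time) columns produced elsewhere from the Twitter source.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, window}

object TrendingHashtagsSketch {
  // Group hashtags into 10-minute event-time windows; the watermark lets Spark
  // aggregate late tweets correctly and eventually drop state for old windows.
  def trending(tweets: DataFrame): DataFrame =
    tweets
      .withWatermark("timestamp", "10 minutes")
      .groupBy(window(col("timestamp"), "10 minutes"), col("hashtag"))
      .count()
}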
Thank you! Questions?