Introduction to Spark Streaming for Real-Time data analysis

Explore Spark Streaming for real-time data analysis in this informative session by Ellen Kraffmiller, Technical Lead, and Robert Treacy, Senior Software Architect at The Institute for Quantitative Social Science at Harvard University. Discover key concepts, components, and Apache Spark's capabilities for large-scale data processing.

  • Spark Streaming
  • Real-Time Data Analysis
  • Apache Spark
  • Big Data Processing
  • Data Streaming

Uploaded on Feb 26, 2025 | 0 Views


Presentation Transcript


  1. Introduction to Spark Streaming for Real-Time Data Analysis. Ellen Kraffmiller, Technical Lead; Robert Treacy, Senior Software Architect. The Institute for Quantitative Social Science at Harvard University.

  2. Agenda: Overview of Spark; Spark Components and Architecture; Streaming (Storm, Kafka, Flume, Kinesis); Spark Streaming; Structured Streaming; Demo.

  3. Introductions: Who we are; who is the audience.

  4. Related Presentations: BOF5810 - Distinguish Pop Music from Heavy Metal with Apache Spark MLlib; CON3165 - Introduction to Machine Learning with Apache Spark MLlib; CON5189 - Getting Started with Spark; CON4219 - Analyzing Streaming Video with Apache Spark; BOF1337 - Big Data Processing with Apache Spark: Scala or Java?; CON1495 - Turning Relational Database Tables into Spark Data Sources; CON4998 - Java EE 7 with Apache Spark for the World's Largest Credit Card Core Systems.

  5. More Related Presentations: CON7867 - Data Pipeline for Hyperscale, Real-Time Applications Using Kafka and Cassandra; CON2234 - Interactive Data Analytics and Visualization with Collaborative Documents; CON7682 - Adventures in Big Data: Scaling Bottleneck Hunt; CON1191 - Kafka Streams + TensorFlow + H2O.ai: Highly Scalable Deep Learning; CON7368 - The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners.

  6. Apache Spark: A fast and general engine for large-scale data processing. Latest version at the time of the talk: 2.2. 100x faster than Hadoop MapReduce in memory, 10x faster on disk. Resilient Distributed Dataset (RDD). Directed Acyclic Graph (DAG) - lineage.
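
To make the lineage idea on slide 6 concrete: an RDD records the chain of transformations that produced it (a DAG) instead of eagerly storing results, so the data can be recomputed from its source when needed. The sketch below is plain Python and does not use the Spark API; `ToyRDD` and its methods are hypothetical names chosen only for illustration.

```python
from typing import Callable, Iterable, List, Optional


class ToyRDD:
    """Hypothetical stand-in for an RDD: stores how to compute data, not the data."""

    def __init__(self, source: Optional[List[int]] = None,
                 parent: Optional["ToyRDD"] = None,
                 op: Optional[Callable[[Iterable], Iterable]] = None,
                 name: str = "source") -> None:
        self.source = source
        self.parent = parent
        self.op = op
        self.name = name

    def map(self, f: Callable) -> "ToyRDD":
        # Lazy: only records the step, nothing is computed yet.
        return ToyRDD(parent=self, op=lambda it: (f(x) for x in it), name="map")

    def filter(self, pred: Callable) -> "ToyRDD":
        return ToyRDD(parent=self, op=lambda it: (x for x in it if pred(x)), name="filter")

    def lineage(self) -> List[str]:
        """Names of the steps from the original source down to this dataset."""
        chain = [] if self.parent is None else self.parent.lineage()
        return chain + [self.name]

    def collect(self) -> list:
        """Recompute the result by replaying the lineage from the source."""
        if self.parent is None:
            return list(self.source)
        return list(self.op(self.parent.collect()))


numbers = ToyRDD(source=[1, 2, 3, 4, 5])
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(evens_squared.lineage())  # ['source', 'filter', 'map']
print(evens_squared.collect())  # [4, 16]
```

Because every `collect()` replays the recorded steps, calling it again after a simulated loss of results gives the same answer, which is the property lineage provides for fault tolerance.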

  7. Spark Components: Spark SQL, Spark Streaming, MLlib, GraphX. Spark Core. Runs on: Standalone Cluster, YARN, Mesos.

  8. Streaming Technologies: Storm, Kafka, Flume, Kinesis.

  9. Apache Storm: Low latency, discrete events. Real-time processing, not a queue. Events may be processed more than once. Trident.

  10. Apache Kafka: Scalable, fault-tolerant message queue. Exactly once. Pull.

  11. Apache Flume: Message passing, push. Hadoop ecosystem. No event replication, so it can lose messages.

  12. Kinesis: AWS, pull. Simpler than Kafka, but also slower, partly because of replication.

  13. Spark Streaming: Stream processing and batch processing use the same API. Two types of streams: DStreams and Structured Streams (Datasets/DataFrames).

  14. DStreams: Series of RDDs. Basic sources: file systems, socket connections. Advanced sources: Kafka, Flume, Kinesis, ...
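
Slide 14 describes a DStream as a series of RDDs, one per batch interval. A minimal sketch of that micro-batching idea follows, in plain Python rather than the Spark API; `to_micro_batches`, the sample events, and the 2-second interval are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def to_micro_batches(events: List[Tuple[float, str]],
                     batch_seconds: float) -> List[List[str]]:
    """Group (arrival_time, payload) events into consecutive fixed-length batches."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for arrival_time, payload in events:
        # Integer index of the interval the event arrived in.
        buckets[int(arrival_time // batch_seconds)].append(payload)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # Emit one batch per interval, including empty ones, as a stream would.
    return [buckets.get(i, []) for i in range(first, last + 1)]


events = [(0.2, "a"), (1.7, "b"), (2.1, "c"), (6.5, "d")]
batches = to_micro_batches(events, batch_seconds=2.0)
print(batches)  # [['a', 'b'], ['c'], [], ['d']]
```

Each inner list plays the role of one RDD in the series; the same batch-style function can then be applied to every interval in turn.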

  15. Structured Streaming: Built on the Spark SQL engine. Dataset/DataFrame API. Late events aggregated correctly. Exactly-once semantics: the stream source must track progress (e.g. Kafka offsets) and the stream sink must be idempotent.
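
The two requirements on slide 15 (a source that tracks progress and an idempotent sink) combine to give exactly-once results even when a record is processed twice. The following is a simplified plain-Python sketch of that interaction, not Spark's actual mechanism; `IdempotentSink`, `process`, and the `fail_at` crash simulation are hypothetical.

```python
from typing import Dict, List


class IdempotentSink:
    """Hypothetical sink keyed by source offset: rewriting an offset changes nothing."""

    def __init__(self) -> None:
        self.rows: Dict[int, str] = {}

    def write(self, offset: int, value: str) -> None:
        self.rows[offset] = value


def process(log: List[str], start_offset: int, sink: IdempotentSink,
            fail_at: int = -1) -> int:
    """Process records from start_offset and return the next offset to read.

    The offset is committed only after the sink write, so a crash between the
    two steps causes that record to be replayed on restart.
    """
    committed = start_offset
    for offset in range(start_offset, len(log)):
        sink.write(offset, log[offset].upper())
        if offset == fail_at:
            return committed  # simulated crash: write happened, commit did not
        committed = offset + 1
    return committed


log = ["a", "b", "c", "d"]
sink = IdempotentSink()
checkpoint = process(log, 0, sink, fail_at=2)  # crashes after writing offset 2
checkpoint = process(log, checkpoint, sink)    # restart replays offset 2
print(checkpoint)  # 4
print(sink.rows)   # {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
```

Offset 2 is written twice, but because the sink is keyed by offset the output contains each record exactly once. With an append-only sink, the same replay would produce a duplicate row.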

  16. Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

  17. Source: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html

  18. Spark ML Pipelines: A Pipeline consists of a sequence of PipelineStages, which are either Transformers or Estimators. Transformer: transform() converts one DataFrame to another DataFrame. Estimator: fit() accepts a DataFrame and produces a Model; a Model is a Transformer.
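
The Transformer/Estimator contract on slide 18 can be sketched in a few lines. This is plain Python that mimics the `fit()`/`transform()` roles only; it is not the Spark ML API, it uses lists in place of DataFrames, and `Tokenizer`, `VocabularyEstimator`, `VocabularyModel`, and `fit_pipeline` are hypothetical names.

```python
from collections import Counter
from typing import List


class Tokenizer:
    """Transformer: transform() maps one dataset to another."""

    def transform(self, rows: List[str]) -> List[List[str]]:
        return [row.lower().split() for row in rows]


class VocabularyModel:
    """Model produced by an Estimator; it is itself a Transformer."""

    def __init__(self, vocabulary: List[str]) -> None:
        self.vocabulary = vocabulary

    def transform(self, rows: List[List[str]]) -> List[List[int]]:
        # One count per vocabulary word, in vocabulary order.
        return [[Counter(tokens)[word] for word in self.vocabulary] for tokens in rows]


class VocabularyEstimator:
    """Estimator: fit() learns from a dataset and returns a Model."""

    def fit(self, rows: List[List[str]]) -> VocabularyModel:
        return VocabularyModel(sorted({t for tokens in rows for t in tokens}))


def fit_pipeline(stages: list, rows: list) -> list:
    """Fit stages in order; each Estimator is replaced by its fitted Model."""
    fitted = []
    for stage in stages:
        if hasattr(stage, "fit"):
            stage = stage.fit(rows)
        rows = stage.transform(rows)
        fitted.append(stage)
    return fitted


training = ["Spark is fast", "Streaming is fun"]
model_stages = fit_pipeline([Tokenizer(), VocabularyEstimator()], training)

rows = ["fast fast spark"]
for stage in model_stages:
    rows = stage.transform(rows)
print(model_stages[1].vocabulary)  # ['fast', 'fun', 'is', 'spark', 'streaming']
print(rows)                        # [[2, 0, 0, 1, 0]]
```

After fitting, every stage exposes only `transform()`, so the fitted pipeline can be applied to new data in a single pass.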

  19. Bayes Model: Very fast, simple. $P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$
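
A small worked example of Bayes' theorem, P(A|B) = P(B|A) P(A) / P(B), in the sentiment setting used by the demo. The probabilities below are invented purely for illustration and do not come from the presentation; `posterior` is a hypothetical helper name.

```python
from fractions import Fraction


def posterior(prior: Fraction, likelihood: Fraction, evidence: Fraction) -> Fraction:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


# Assumed, made-up probabilities for a single word in a tweet.
p_pos = Fraction(1, 2)              # P(positive)
p_word_given_pos = Fraction(1, 5)   # P(word | positive)
p_word_given_neg = Fraction(1, 20)  # P(word | negative)

# Total probability of seeing the word at all.
p_word = p_word_given_pos * p_pos + p_word_given_neg * (1 - p_pos)

p_pos_given_word = posterior(p_pos, p_word_given_pos, p_word)
print(p_word)            # 1/8
print(p_pos_given_word)  # 4/5
```

Under these assumed numbers, seeing the word raises the probability that the tweet is positive from 1/2 to 4/5. A Naïve Bayes classifier multiplies such per-word likelihoods together, treating the words as independent given the class.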

  20. Demo: Analyze Twitter Streams with Spark ML and Spark Streaming. Part 1: Create a Naïve Bayes model with training data. Part 2: Sentiment analysis of a live Twitter stream using the Naïve Bayes model and Spark DStreams. Part 3: Structured Streaming example: get Twitter trending hashtags in 10-minute windows using the Twitter timestamp. https://github.com/ekraffmiller/SparkStreamingDemo
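
Part 3 of the demo counts hashtags in 10-minute windows keyed on the tweet's own timestamp rather than its arrival time. The sketch below shows that windowing logic in plain Python; it is not the demo's code and not the Spark API, and `window_start`, `trending`, and the sample tweets and dates are made up for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Tuple

WINDOW = timedelta(minutes=10)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def window_start(ts: datetime) -> datetime:
    """Floor an event timestamp to the start of its 10-minute tumbling window."""
    return EPOCH + ((ts - EPOCH) // WINDOW) * WINDOW


def trending(tweets: List[Tuple[datetime, str]],
             top_n: int = 1) -> Dict[datetime, List[Tuple[str, int]]]:
    """Count hashtags per event-time window and keep the top_n of each window."""
    counts: Dict[datetime, Counter] = defaultdict(Counter)
    for ts, text in tweets:
        tags = [word.lower() for word in text.split() if word.startswith("#")]
        counts[window_start(ts)].update(tags)
    return {start: c.most_common(top_n) for start, c in sorted(counts.items())}


t0 = datetime(2017, 10, 2, 9, 0, tzinfo=timezone.utc)  # arbitrary sample date
tweets = [
    (t0 + timedelta(minutes=1), "#spark is fast"),
    (t0 + timedelta(minutes=12), "#kafka and #spark"),
    (t0 + timedelta(minutes=4), "learning #Spark"),  # arrives late, out of order
    (t0 + timedelta(minutes=15), "#kafka rocks"),
]
result = trending(tweets)
for start, top in result.items():
    print(start.strftime("%H:%M"), top)
# 09:00 [('#spark', 2)]
# 09:10 [('#kafka', 2)]
```

The third tweet arrives after one from the next window but is still counted in the 09:00 window, because grouping uses the event timestamp. This is the behavior the slides refer to as late events being aggregated correctly.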

  21. Thank you! Questions?
