Programming Models for IoT and Streaming Data

Judy Qiu

Indiana University

•

Query Based

–

Complex Event processing

–

SQL like languages

•

Programming APIs

•

Queries or the Programs run on a continuous stream, unlike Hadoop

where your data is static for the Batch processor

•

Need to address diverse streams – Unbounded sequence of events

•

Examples



Video Camera frames



Tweets



Laser scans from a robot



Log data

•

Aurora – Early Research System

•

Borealis – Early Research System

•

Apache Storm

•

Apache S4

•

Apache Samza

•

Google MillWheel

•

Amazon Kinesis

•

LinkedIn Databus

•

Facebook Puma/Ptail/Scribe/ODS

•

Azure Stream Analytics

•

Will discuss 2 Apache Storm projects at

Indiana University

•

Framework to connect devices to cloud services

•

IoTCloud consists of

–

a set of distributed nodes running close to the devices to gather

data

–

a set of publish-subscribe brokers to relay the information to the

cloud services

–

a distributed stream processing framework (DSPF) coupled with

batch processing frameworks in the Cloud

•

Uses OpenStack environment

•

Improving fault-tolerance and quality of service for especially

guarantees on maximum response time

Built on Apache Storm,

RabbitMQ, Hbase ………

•

Particle Filtering Based SLAM

•

N-Body Collision Avoidance

•

Using parallel algorithms inside

Storm for performance

performance

Map Built from Robot data

Robots need to avoid collisions when they move

Storage

substrate

Batch

analysis

module

Streaming

analysis

module



Non-trivial parallel stream processing algorithm with novel global

synchronization and cluster-delta data transfer to achieve scalability



Clustering of social media streams: real-time processing of 10% Twitter

(“Gardenhose”)





High-dimensional vectors: textual and network information



Expensive similarity computation: 43.4 hours to cluster 1 hour’s data with

sequential algorithm



Online K-Means with sliding time window and outlier detection



Xiaoming Gao, Emilio Ferrara, Judy Qiu. Parallel Clustering of High-Dimensional Social Media Data Streams. To

appear at 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2015).

–

•

Final step statistics for a sequential run over 6 minutes data:

120 clusters, time window length: 6 steps, outlier: 2 standard deviation

Data point 1:

Content_Vector: [“step”:1, “time”:1, “nation”: 1,

                                “ram”:1]

Diffusion_Vector: …

…

Data point 2:

Content_Vector: [“lovin”:1, “support”:1, “vcu”:1,

                                “ram”:1]

Diffusion_Vector: …

…

Centroid:

Content_Vector: [“step”:0.5, “time”:0.5, “nation”: 0.5, “ram”:1.0, “lovin”:0.5,

                                “support”:0.5, “vcu”:0.5]

Diffusion_Vector: …

…

Cluster



Sparsity of high-dimensional vectors make any synchronization expensive

Cluster-delta synchronization strategy reduces message

traffic and synchronization overhead



DAG organization of parallel workers: hard to synchronize cluster information

–

Protomeme

Generator

Spout

Synchronization

Coordinator Bolt

ActiveMQ

Broker

SYNCINIT

CDELTAS

…

Sequential or Parallel Batch Clustering Algorithm

Bootstrap

Information

Worker Process

Clustering Bolt

Clustering Bolt

…

Worker Process

Clustering Bolt

Clustering Bolt

…

PMADD

OUTLIER

SYNCREQ

tweet

stream



1 hour’s data for testing, first 10 mins for bootstrap



33 mins to process 50 mins’ data (better than real time) with

Cluster-delta method due to decreased message sizes

compared to full-centroid approach

Slide Note

Embed Share

Download

Explore programming models for IoT and streaming data with insights from Judy Qiu of Indiana University. Dive into Internet of Things (IoT) and learn about cutting-edge approaches in handling streaming data for innovative solutions.

taio933 Follow

Uploaded on Feb 22, 2025 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Programming Models for IoT and Streaming Data IC2E Internet of Things Panel Judy Qiu Indiana University

Event Processing Programming Models Query Based Complex Event processing SQL like languages Programming APIs Queries or the Programs run on a continuous stream, unlike Hadoop where your data is static for the Batch processor Need to address diverse streams Unbounded sequence of events Examples Video Camera frames Tweets Laser scans from a robot Log data

Distributed Stream Processing Frameworks (DSPF) Aurora Early Research System Borealis Early Research System Apache Storm Apache S4 Apache Samza Google MillWheel Amazon Kinesis LinkedIn Databus Facebook Puma/Ptail/Scribe/ODS Azure Stream Analytics Will discuss 2 Apache Storm projects at Indiana University

I: IoTCloud Framework to connect devices to cloud services IoTCloud consists of a set of distributed nodes running close to the devices to gather data a set of publish-subscribe brokers to relay the information to the cloud services a distributed stream processing framework (DSPF) coupled with batch processing frameworks in the Cloud Uses OpenStack environment Improving fault-tolerance and quality of service for especially guarantees on maximum response time

IoTCloud Architecture Built on Apache Storm, RabbitMQ, Hbase

IoTCloud Applications Particle Filtering Based SLAM N-Body Collision Avoidance Using parallel algorithms inside Storm for performance performance Response Time better with RabbitMQ Map Built from Robot data Robots need to avoid collisions when they move

II: Batch and Streaming Analysis for Social Media Data Batch analysis module Streaming analysis module Storage substrate

Streaming Analysis Non-trivial parallel stream processing algorithm with novel global synchronization and cluster-delta data transfer to achieve scalability Clustering of social media streams: real-time processing of 10% Twitter ( Gardenhose ) Recent progress in learning data representations and similarity metrics High-dimensional vectors: textual and network information Expensive similarity computation: 43.4 hours to cluster 1 hour s data with sequential algorithm Online K-Means with sliding time window and outlier detection Group tweets as protomemes: hashtags, mentions, URLs, and phrases Xiaoming Gao, Emilio Ferrara, Judy Qiu. Parallel Clustering of High-Dimensional Social Media Data Streams. To appear at 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2015).

Social media data an example data record 9

Sequential clustering algorithm Final step statistics for a sequential run over 6 minutes data: Total Length of Centroids Content Vector Time Step Length (s) Similarity Compute time (s) Centroids Update Time (s) 10 47749 33.305 0.068 20 76146 78.778 0.113 30 128521 209.013 0.213 120 clusters, time window length: 6 steps, outlier: 2 standard deviation

Parallelization with Storm - challenges DAG organization of parallel workers: hard to synchronize cluster information Sparsity of high-dimensional vectors make any synchronization expensive Data point 2: Data point 1: Content_Vector: [ step :1, time :1, nation : 1, ram :1] Diffusion_Vector: Content_Vector: [ lovin :1, support :1, vcu :1, ram :1] Diffusion_Vector: Centroid: Content_Vector: [ step :0.5, time :0.5, nation : 0.5, ram :1.0, lovin :0.5, support :0.5, vcu :0.5] Diffusion_Vector: Cluster - Cluster-delta synchronization strategy reduces message traffic and synchronization overhead

Solution enhanced Apache Storm topology ActiveMQ Broker Worker Process Clustering Bolt SYNCINIT CDELTAS Clustering Bolt PMADD OUTLIER SYNCREQ Synchronization Coordinator Bolt Protomeme Generator Spout tweet stream Worker Process Clustering Bolt Clustering Bolt Bootstrap Information Sequential or Parallel Batch Clustering Algorithm

Scalability comparison 1 hour s data for testing, first 10 mins for bootstrap 33 mins to process 50 mins data (better than real time) with Cluster-delta method due to decreased message sizes compared to full-centroid approach

Programming Models for IoT and Streaming Data

Download Presentation

Presentation Transcript

Related

More Related Content