Programming Models for IoT and Streaming Data

 
P
r
o
g
r
a
m
m
i
n
g
 
M
o
d
e
l
s
 
f
o
r
 
I
o
T
 
a
n
d
S
t
r
e
a
m
i
n
g
 
D
a
t
a
I
C
2
E
 
I
n
t
e
r
n
e
t
 
o
f
 
T
h
i
n
g
s
 
P
a
n
e
l
 
Judy Qiu
Indiana University
 
E
v
e
n
t
 
P
r
o
c
e
s
s
i
n
g
 
P
r
o
g
r
a
m
m
i
n
g
 
M
o
d
e
l
s
 
Query Based
Complex Event processing
SQL like languages
Programming APIs
Queries or the Programs run on a continuous stream, unlike Hadoop
where your data is static for the Batch processor
Need to address diverse streams – Unbounded sequence of events
Examples
Video Camera frames
Tweets
Laser scans from a robot
Log data
 
 
 
 
D
i
s
t
r
i
b
u
t
e
d
 
S
t
r
e
a
m
 
P
r
o
c
e
s
s
i
n
g
F
r
a
m
e
w
o
r
k
s
 
(
D
S
P
F
)
 
Aurora – Early Research System
Borealis – Early Research System
Apache Storm
Apache S4
Apache Samza
Google MillWheel
Amazon Kinesis
LinkedIn Databus
Facebook Puma/Ptail/Scribe/ODS
Azure Stream Analytics
 
Will discuss 2 Apache Storm projects at
Indiana University
 
I
:
 
I
o
T
C
l
o
u
d
 
Framework to connect devices to cloud services
IoTCloud consists of
a set of distributed nodes running close to the devices to gather
data
a set of publish-subscribe brokers to relay the information to the
cloud services
a distributed stream processing framework (DSPF) coupled with
batch processing frameworks in the Cloud
Uses OpenStack environment
Improving fault-tolerance and quality of service for especially
guarantees on maximum response time
 
I
o
T
C
l
o
u
d
 
A
r
c
h
i
t
e
c
t
u
r
e
 
Built on Apache Storm,
RabbitMQ, Hbase ………
 
I
o
T
C
l
o
u
d
 
A
p
p
l
i
c
a
t
i
o
n
s
 
Particle Filtering Based SLAM
N-Body Collision Avoidance
Using parallel algorithms inside
Storm for performance
performance
 
Map Built from Robot data
 
Robots need to avoid collisions when they move
I
I
:
 
B
a
t
c
h
 
a
n
d
 
S
t
r
e
a
m
i
n
g
 
A
n
a
l
y
s
i
s
 
f
o
r
 
S
o
c
i
a
l
 
M
e
d
i
a
 
D
a
t
a
 
Storage
substrate
 
Batch
analysis
module
 
Streaming
analysis
module
S
t
r
e
a
m
i
n
g
 
A
n
a
l
y
s
i
s
 
Non-trivial parallel stream processing algorithm with novel global
synchronization and cluster-delta data transfer to achieve scalability
 
Clustering of social media streams: real-time processing of 10% Twitter
(“Gardenhose”)
R
e
c
e
n
t
 
p
r
o
g
r
e
s
s
 
i
n
 
l
e
a
r
n
i
n
g
 
d
a
t
a
 
r
e
p
r
e
s
e
n
t
a
t
i
o
n
s
 
a
n
d
 
s
i
m
i
l
a
r
i
t
y
m
e
t
r
i
c
s
High-dimensional vectors: textual and network information
Expensive similarity computation: 43.4 hours to cluster 1 hour’s data with
sequential algorithm
Online K-Means with sliding time window and outlier detection
G
r
o
u
p
 
t
w
e
e
t
s
 
a
s
 
p
r
o
t
o
m
e
m
e
s
:
 
h
a
s
h
t
a
g
s
,
 
m
e
n
t
i
o
n
s
,
 
U
R
L
s
,
 
a
n
d
 
p
h
r
a
s
e
s
 
Xiaoming Gao, Emilio Ferrara, Judy Qiu. Parallel Clustering of High-Dimensional Social Media Data Streams. To
appear at 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2015).
 
S
o
c
i
a
l
 
m
e
d
i
a
 
d
a
t
a
 
 
a
n
 
e
x
a
m
p
l
e
 
d
a
t
a
 
r
e
c
o
r
d
 
9
S
e
q
u
e
n
t
i
a
l
 
c
l
u
s
t
e
r
i
n
g
 
a
l
g
o
r
i
t
h
m
Final step statistics for a sequential run over 6 minutes data:
120 clusters, time window length: 6 steps, outlier: 2 standard deviation
P
a
r
a
l
l
e
l
i
z
a
t
i
o
n
 
w
i
t
h
 
S
t
o
r
m
 
-
 
c
h
a
l
l
e
n
g
e
s
 
Data point 1:
Content_Vector: [“step”:1, “time”:1, “nation”: 1,
                                “ram”:1]
Diffusion_Vector: …
 
Data point 2:
Content_Vector: [“lovin”:1, “support”:1, “vcu”:1,
                                “ram”:1]
Diffusion_Vector: …
 
Centroid:
Content_Vector: [“step”:0.5, “time”:0.5, “nation”: 0.5, “ram”:1.0, “lovin”:0.5,
                                “support”:0.5, “vcu”:0.5]
Diffusion_Vector: …
 
Cluster
Sparsity of high-dimensional vectors make any synchronization expensive
 
-
Cluster-delta synchronization strategy reduces message
traffic and synchronization overhead
DAG organization of parallel workers: hard to synchronize cluster information
S
o
l
u
t
i
o
n
 
 
e
n
h
a
n
c
e
d
 
A
p
a
c
h
e
 
S
t
o
r
m
 
t
o
p
o
l
o
g
y
Protomeme
Generator
Spout
Synchronization
Coordinator Bolt
ActiveMQ
Broker
 
SYNCINIT
CDELTAS
Sequential or Parallel Batch Clustering Algorithm
Bootstrap
Information
Worker Process
Clustering Bolt
Clustering Bolt
Worker Process
Clustering Bolt
Clustering Bolt
 
PMADD
OUTLIER
SYNCREQ
 
tweet
stream
 
S
c
a
l
a
b
i
l
i
t
y
 
c
o
m
p
a
r
i
s
o
n
 
1 hour’s data for testing, first 10 mins for bootstrap
33 mins to process 50 mins’ data (better than real time) with
Cluster-delta method due to decreased message sizes
compared to full-centroid approach
Slide Note
Embed
Share

Explore programming models for IoT and streaming data with insights from Judy Qiu of Indiana University. Dive into Internet of Things (IoT) and learn about cutting-edge approaches in handling streaming data for innovative solutions.

  • IoT Programming
  • Streaming Data
  • Judy Qiu
  • Indiana University
  • Innovative Solutions

Uploaded on Feb 22, 2025 | 1 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript


  1. Programming Models for IoT and Streaming Data IC2E Internet of Things Panel Judy Qiu Indiana University

  2. Event Processing Programming Models Query Based Complex Event processing SQL like languages Programming APIs Queries or the Programs run on a continuous stream, unlike Hadoop where your data is static for the Batch processor Need to address diverse streams Unbounded sequence of events Examples Video Camera frames Tweets Laser scans from a robot Log data

  3. Distributed Stream Processing Frameworks (DSPF) Aurora Early Research System Borealis Early Research System Apache Storm Apache S4 Apache Samza Google MillWheel Amazon Kinesis LinkedIn Databus Facebook Puma/Ptail/Scribe/ODS Azure Stream Analytics Will discuss 2 Apache Storm projects at Indiana University

  4. I: IoTCloud Framework to connect devices to cloud services IoTCloud consists of a set of distributed nodes running close to the devices to gather data a set of publish-subscribe brokers to relay the information to the cloud services a distributed stream processing framework (DSPF) coupled with batch processing frameworks in the Cloud Uses OpenStack environment Improving fault-tolerance and quality of service for especially guarantees on maximum response time

  5. IoTCloud Architecture Built on Apache Storm, RabbitMQ, Hbase

  6. IoTCloud Applications Particle Filtering Based SLAM N-Body Collision Avoidance Using parallel algorithms inside Storm for performance performance Response Time better with RabbitMQ Map Built from Robot data Robots need to avoid collisions when they move

  7. II: Batch and Streaming Analysis for Social Media Data Batch analysis module Streaming analysis module Storage substrate

  8. Streaming Analysis Non-trivial parallel stream processing algorithm with novel global synchronization and cluster-delta data transfer to achieve scalability Clustering of social media streams: real-time processing of 10% Twitter ( Gardenhose ) Recent progress in learning data representations and similarity metrics High-dimensional vectors: textual and network information Expensive similarity computation: 43.4 hours to cluster 1 hour s data with sequential algorithm Online K-Means with sliding time window and outlier detection Group tweets as protomemes: hashtags, mentions, URLs, and phrases Xiaoming Gao, Emilio Ferrara, Judy Qiu. Parallel Clustering of High-Dimensional Social Media Data Streams. To appear at 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID 2015).

  9. Social media data an example data record 9

  10. Sequential clustering algorithm Final step statistics for a sequential run over 6 minutes data: Total Length of Centroids Content Vector Time Step Length (s) Similarity Compute time (s) Centroids Update Time (s) 10 47749 33.305 0.068 20 76146 78.778 0.113 30 128521 209.013 0.213 120 clusters, time window length: 6 steps, outlier: 2 standard deviation

  11. Parallelization with Storm - challenges DAG organization of parallel workers: hard to synchronize cluster information Sparsity of high-dimensional vectors make any synchronization expensive Data point 2: Data point 1: Content_Vector: [ step :1, time :1, nation : 1, ram :1] Diffusion_Vector: Content_Vector: [ lovin :1, support :1, vcu :1, ram :1] Diffusion_Vector: Centroid: Content_Vector: [ step :0.5, time :0.5, nation : 0.5, ram :1.0, lovin :0.5, support :0.5, vcu :0.5] Diffusion_Vector: Cluster - Cluster-delta synchronization strategy reduces message traffic and synchronization overhead

  12. Solution enhanced Apache Storm topology ActiveMQ Broker Worker Process Clustering Bolt SYNCINIT CDELTAS Clustering Bolt PMADD OUTLIER SYNCREQ Synchronization Coordinator Bolt Protomeme Generator Spout tweet stream Worker Process Clustering Bolt Clustering Bolt Bootstrap Information Sequential or Parallel Batch Clustering Algorithm

  13. Scalability comparison 1 hour s data for testing, first 10 mins for bootstrap 33 mins to process 50 mins data (better than real time) with Cluster-delta method due to decreased message sizes compared to full-centroid approach

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#