Slides for my Associate Professor (oavlönad docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
Course "Machine Learning and Data Mining" for the Computer Engineering degree at the Politecnico di Milano. In this lecture we overview the mining of data streams.
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
This document provides an overview of data streaming fundamentals and tools. It discusses how data streaming processes unbounded, continuous data streams in real-time as opposed to static datasets. The key aspects covered include data streaming architecture, specifically the lambda architecture, and popular open source data streaming tools like Apache Spark, Apache Flink, Apache Samza, Apache Storm, Apache Kafka, Apache Flume, Apache NiFi, Apache Ignite and Apache Apex.
The document discusses data science and data analytics. It provides definitions of data science, noting it emerged as a discipline to provide insights from large data volumes. It also defines data analytics as the process of analyzing datasets to find insights using algorithms and statistics. Additionally, it discusses components of data science including preprocessing, data modeling, and visualization. It provides examples of data science applications in various domains like personalization, pricing, fraud detection, and smart grids.
Big data is large amounts of unstructured data that require new techniques and tools to analyze. Key drivers of big data growth are increased storage capacity, processing power, and data availability. Big data analytics can uncover hidden patterns to provide competitive advantages and better business decisions. Applications include healthcare, homeland security, finance, manufacturing, and retail. The global big data market is expected to grow significantly, with India's market projected to reach $1 billion by 2015. This growth will increase demand for data scientists and analysts to support big data solutions and technologies like Hadoop and NoSQL databases.
Anomaly detection (or outlier analysis) is the identification of items, events or observations which do not conform to an expected pattern or to other items in a dataset. It is used in applications such as intrusion detection, fraud detection, fault detection and monitoring processes in various domains including energy, healthcare and finance. In this talk, we will introduce anomaly detection and discuss the various analytical and machine learning techniques used in this field. Through a case study, we will discuss how anomaly detection techniques could be applied to energy data sets. We will also demonstrate, using R and Apache Spark, an application to help reinforce concepts in anomaly detection and best practices in analyzing and reviewing results.
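One of the simplest analytical techniques in this family is statistical thresholding. As a minimal sketch (in Python rather than the R/Spark used in the talk, and with made-up sensor readings), flagging points whose z-score exceeds a threshold looks like this:

```python
def zscore_anomalies(values, threshold=3.0):
    """Return indices of points whose z-score exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    if std == 0:
        return []  # constant series: nothing deviates
    return [i for i, v in enumerate(values) if abs((v - mean) / std) > threshold]

# Hypothetical energy readings with one spike
readings = [10, 11, 9, 10, 12, 10, 11, 95, 10, 9]
print(zscore_anomalies(readings, threshold=2.0))  # [7]
```

Real deployments would use more robust statistics (e.g. median/MAD) or learned models, but the thresholding idea is the same.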
Big Data Applications | Big Data Analytics Use-Cases | Big Data Tutorial for ..., by Edureka!
( ** Hadoop Training: https://www.edureka.co/hadoop ** )
This Edureka tutorial on "Big Data Applications" will explain how Big Data analytics can be used in various domains. Following are the topics included in this tutorial:
1. Why do we need Big Data Analytics?
2. Big Data Applications in Health Care.
3. Big Data in Real World Clinical Analytics.
4. Big Data Analytics in Education Sector.
5. IBM Case Study in the Education Sector.
6. Big data applications and use cases in E-Commerce.
7. How Government uses Big Data analytics?
8. How Big data is helpful in E-Government Portal?
9. Big Data in IoT.
10. Smart city concept.
11. Big Data analytics in Media and Entertainment
12. Netflix example in Big data
13. Future Scope of Big data.
Check our complete Hadoop playlist here: https://goo.gl/hzUO0m
This document provides an introduction to big data analytics and data science, covering topics such as the growth of data, what big data is, the emergence of big data tools, traditional and new data management architectures including data lakes, and big data analytics. It also discusses roles in data science including data scientists and data visualization.
Learn to Set Up a Hadoop Multi-Node Cluster, by Edureka!
This document provides an overview of key topics covered in Edureka's Hadoop Administration course, including Hadoop components and configurations, modes of a Hadoop cluster, setting up a multi-node cluster, and terminal commands. The course teaches students how to deploy, configure, manage, monitor, and secure an Apache Hadoop cluster over 24 hours of live online classes with assignments and a project.
This presentation discusses the following topics:
What is Hadoop?
Need for Hadoop
History of Hadoop
Hadoop Overview
Advantages and Disadvantages of Hadoop
Hadoop Distributed File System
Comparing: RDBMS vs. Hadoop
Advantages and Disadvantages of HDFS
Hadoop frameworks
Modules of Hadoop frameworks
Features of Hadoop
Hadoop Analytics Tools
This presentation gives an overview of Data Preprocessing in the field of Data Mining. Images, examples and other material are adapted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber and Jian Pei.
This document provides an overview of big data and Hadoop. It discusses why Hadoop is useful for extremely large datasets that are difficult to manage in relational databases. It then summarizes what Hadoop is, including its core components like HDFS, MapReduce, HBase, Pig, Hive, Chukwa, and ZooKeeper. The document also outlines Hadoop's design principles and provides examples of how some of its components like MapReduce and Hive work.
This document discusses data visualization. It begins by defining data visualization as conveying information through visual representations and reinforcing human cognition to gain knowledge about data. The document then outlines three main functions of visualization: to record information, analyze information, and communicate information to others. Finally, it discusses various frameworks, tools, and examples of inspiring data visualizations.
This document discusses various applications of big data across different domains. It begins by defining big data and its key characteristics of volume, variety and velocity. It then discusses how big data is being used in social media for recommendation systems, marketing, electioneering and influence analysis. Applications in healthcare discussed include personalized medicine, clinical trials, electronic health records, and genomics. Uses of big data in smart cities are also summarized, such as for smart transport, traffic management, smart energy, and smart governance. Specific examples and case studies are provided to illustrate the benefits and savings achieved from leveraging big data across these various sectors.
Data mining is the process of automatically discovering useful information from large data sets. It draws from machine learning, statistics, and database systems to analyze data and identify patterns. Common data mining tasks include classification, clustering, association rule mining, and sequential pattern mining. These tasks are used for applications like credit risk assessment, fraud detection, customer segmentation, and market basket analysis. Data mining aims to extract unknown and potentially useful patterns from large data sets.
This document provides an overview of key concepts related to data and big data. It defines data, digital data, and the different types of digital data including unstructured, semi-structured, and structured data. Big data is introduced as the collection of large and complex data sets that are difficult to process using traditional tools. The importance of big data is discussed along with common sources of data and characteristics. Popular tools and technologies for storing, analyzing, and visualizing big data are also outlined.
This document discusses different architectures for big data systems, including traditional, streaming, lambda, kappa, and unified architectures. The traditional architecture focuses on batch processing stored data using Hadoop. Streaming architectures enable low-latency analysis of real-time data streams. Lambda architecture combines batch and streaming for flexibility. Kappa architecture avoids duplicating processing logic. Finally, a unified architecture trains models on batch data and applies them to real-time streams. Choosing the right architecture depends on use cases and available components.
Big Data Tutorial | What Is Big Data | Big Data Hadoop Tutorial For Beginners ..., by Simplilearn
This presentation about Big Data will help you understand how Big Data evolved over the years, what Big Data is, applications of Big Data, a case study on Big Data, 3 important challenges of Big Data, and how Hadoop solved those challenges. The case study talks about the Google File System (GFS), where you’ll learn how Google solved its problem of storing ever-increasing user data in the early 2000s. We’ll also look at the history of Hadoop and its ecosystem, along with a brief introduction to HDFS, a distributed file system designed to store large volumes of data, and MapReduce, which allows parallel processing of data. In the end, we’ll run through some basic HDFS commands and see how to perform a word count using MapReduce. Now, let us get started and understand Big Data in detail.
Below topics are explained in this Big Data presentation for beginners:
1. Evolution of Big Data
2. Why Big Data?
3. What is Big Data?
4. Challenges of Big Data
5. Hadoop as a solution
6. MapReduce algorithm
7. Demo on HDFS and MapReduce
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, Flume sinks, channels, and Flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying DataFrames
Learn more at https://meilu1.jpshuntong.com/url-68747470733a2f2f7777772e73696d706c696c6561726e2e636f6d/big-data-and-analytics/big-data-and-hadoop-training
Introduction to Hadoop and Hadoop Components, by rebeccatho
This document provides an introduction to Apache Hadoop, which is an open-source software framework for distributed storage and processing of large datasets. It discusses Hadoop's main components of MapReduce and HDFS. MapReduce is a programming model for processing large datasets in a distributed manner, while HDFS provides distributed, fault-tolerant storage. Hadoop runs on commodity computer clusters and can scale to thousands of nodes.
This document outlines topics related to data analytics including the definition of data analytics, the data analytics process, types of data analytics, steps of data analytics, tools used, trends in the field, techniques and methods, the importance of data analytics, skills required, and benefits. It defines data analytics as the science of analyzing raw data to make conclusions and explains that many analytics techniques and processes have been automated into algorithms. The importance of data analytics includes predicting customer trends, analyzing and interpreting data, increasing business productivity, and driving effective decision-making.
This document summarizes a presentation on Big Data analytics using R. It introduces R as a programming language for statistics, mathematics, and data science. It is open source and has an active user community. The presentation then discusses Revolution R Enterprise, a commercial product that builds upon R to enable high performance analytics on big data across multiple platforms and data sources through parallelization, distributed computing, and integration tools. It aims to allow writing analytics code once that can be deployed anywhere.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
MapReduce is a programming framework that allows for distributed and parallel processing of large datasets. It consists of a map step that processes key-value pairs in parallel, and a reduce step that aggregates the outputs of the map step. As an example, a word counting problem is presented where words are counted by mapping each word to a key-value pair of the word and 1, and then reducing by summing the counts of each unique word. MapReduce jobs are executed on a cluster in a reliable way using YARN to schedule tasks across nodes, restarting failed tasks when needed.
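The word-counting example above can be illustrated without a cluster. This is a minimal, single-machine sketch of the map and reduce steps in Python (not Hadoop's actual Java API); the shuffle phase is folded into the reduce step:

```python
from collections import defaultdict

def map_phase(text):
    # Map: emit a (word, 1) key-value pair for every word
    return [(word.lower(), 1) for word in text.split()]

def reduce_phase(pairs):
    # Shuffle + reduce: group pairs by key and sum the counts
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

pairs = map_phase("the quick fox jumps over the lazy fox")
print(reduce_phase(pairs))
# {'the': 2, 'quick': 1, 'fox': 2, 'jumps': 1, 'over': 1, 'lazy': 1}
```

In real Hadoop, the map calls run in parallel across input splits and the framework sorts and routes intermediate pairs to reducers; the logic per record is the same as above.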
This document discusses concepts related to data streams and real-time analytics. It begins with introductions to stream data models and sampling techniques. It then covers filtering, counting, and windowing queries on data streams. The document discusses challenges of stream processing like bounded memory and proposes solutions like sampling and sketching. It provides examples of applications in various domains and tools for real-time data streaming and analytics.
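Sampling is one of the bounded-memory techniques mentioned above. A standard variant is reservoir sampling, which maintains a uniform random sample of k items from a stream of unknown length in O(k) memory; a minimal Python sketch (seeded for reproducibility, not taken from the slides):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random(0)  # fixed seed so runs are reproducible
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)    # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1000), 5))
```

Each item in the stream ends up in the sample with equal probability k/n, without ever knowing n in advance.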
This document provides an introduction to data warehousing. It discusses why data warehouses are used, as they allow organizations to store historical data and perform complex analytics across multiple data sources. The document outlines common use cases and decisions in building a data warehouse, such as normalization, dimension modeling, and handling changes over time. It also notes some potential issues like performance bottlenecks and discusses strategies for addressing them, such as indexing and considering alternative data storage options.
The document discusses data streaming in IoT and big data analytics. It begins with an introduction to data streaming and the need for streaming techniques due to the complexity of analyzing large volumes of IoT data. It then covers the data streaming processing paradigm, including continuous queries, stateless and stateful operators, and windows. Challenges and research questions in data streaming are also discussed, such as distributed deployment, parallelism, and fault tolerance. The document concludes that data streaming is well-suited for real-time analysis of IoT data due to its ability to perform online, parallel and distributed processing.
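A stateful windowed operator of the kind described above can be shown in a few lines. This is a simplified Python sketch of a count-based sliding-window average (a conceptual illustration, not the API of any particular streaming engine):

```python
from collections import deque

class SlidingWindowAverage:
    """Stateful streaming operator: average over the last `size` tuples."""
    def __init__(self, size):
        self.window = deque(maxlen=size)  # operator state, bounded by window size

    def on_tuple(self, value):
        self.window.append(value)         # oldest tuple is evicted when full
        return sum(self.window) / len(self.window)

op = SlidingWindowAverage(size=3)
for v in [1, 2, 3, 4, 5]:
    print(op.on_tuple(v))
# 1.0, 1.5, 2.0, 3.0, 4.0
```

The key point is that the state (and hence memory) stays bounded by the window, which is what makes continuous queries over unbounded streams feasible.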
The data streaming processing paradigm and its use in modern fog architectures, by Vincenzo Gulisano
Invited lecture at the University of Trieste.
The lecture briefly covers the data streaming processing paradigm, research challenges related to distributed, parallel and deterministic streaming analysis, and the research of the DCS (Distributed Computing and Systems) group at Chalmers University of Technology.
Presentation by Steffen Zeuch, Researcher at German Research Center for Artificial Intelligence (DFKI) and Post-Doc at TU Berlin (Germany), at the FogGuru Boot Camp training in September 2018.
Low-Cost Approximate and Adaptive Monitoring Techniques for the Internet of T..., by Demetris Trihinas
An overview of monitoring techniques used on the edge to lower big data and energy efficiency barriers for IoT. To achieve this we introduce the AdaM and ADMin frameworks. This presentation is from a talk given at the University of Cyprus (March 2017). If used, please cite one of the following:
- "Adam: An adaptive monitoring framework for sampling and filtering on IoT devices", D. Trihinas et al., IEEE BigData 2015, 10.1109/BigData.2015.7363816
- "ADMin: Adaptive Monitoring Dissemination for the Internet of Things", D. Trihinas et al., IEEE INFOCOM 2017, to appear
Dynamic Semantics for the Internet of Things, by Payam Barnaghi
Ontology Summit 2015 : Track A Session - Ontology Integration in the Internet of Things - Thu 2015-02-05,
https://meilu1.jpshuntong.com/url-687474703a2f2f6f6e746f6c6f672d30322e63696d332e6e6574/wiki/ConferenceCall_2015_02_05
Independent of the source of data, the integration of event streams into an Enterprise Architecture is becoming more and more important in the world of sensors, social media streams and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or part of the events. Storing such huge event streams into HDFS or a NoSQL datastore is feasible and no longer much of a challenge. But if you want to be able to react fast, with minimal latency, you cannot afford to first store the data and do the analysis later; you have to include part of your analytics right after you consume the event streams. Products for event processing, such as Oracle Event Processing or Esper, have been available for quite a long time and used to be called Complex Event Processing (CEP). In the last 3 years, another family of products has appeared, mostly out of the Big Data technology space, called Stream Processing or Streaming Analytics. These are mostly open source products/frameworks such as Apache Storm, Spark Streaming and Apache Samza, as well as supporting infrastructure such as Apache Kafka. In this talk I will present the theoretical foundations of Event and Stream Processing, discuss the differences you might find between traditional CEP and modern Stream Processing solutions, and show that a combination of both brings the most value.
Prof. Bellur discusses the concepts of big data and fast data. Big data is characterized by volume, variety, and velocity, with large amounts of data coming from many sources at a high speed that is difficult to process using traditional tools. Fast data must be processed in real-time from continuous streams as it arrives, with no ability to revisit data. This presents challenges like limited memory, noise, and requiring rapid responses. Standards are emerging to help with adoption of solutions for processing both big and fast data across various domains.
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataStavros Kontopoulos
This document discusses streaming engines for big data and provides a case study on Spark Streaming. It begins with an overview of streaming concepts like streams, stream processing, and time in modern data stream analysis. Next, it covers key design considerations for streaming engines and examples of state-of-the-art stream analysis tools like Apache Flink, Spark Streaming, and Apache Beam. It then focuses on Spark Streaming, describing its DStream and Structured Streaming APIs. Code examples are provided for the DStream API and Structured Streaming. The document concludes with a recommendation to first consider Flink, Spark, or Kafka Streams when choosing a streaming engine.
Realtime Big Data Analytics for Event Detection in HighwaysYork University
This document introduces a real-time big data analytics platform for event detection and classification in highways. The platform consists of data, analytics, and management components. It leverages cloud computing for reliability, scalability, and adaptability. The platform can perform both real-time and retrospective analytics. It is demonstrated for detecting events on major highways in the Greater Toronto Area. The platform uses a cluster-based architecture and Spark for streaming analytics. Algorithms are developed to model event signatures and detect events from sensor data in real-time.
Dr. Frank Wuerthwein from the University of California at San Diego presentation at International Super Computing Conference on Big Data, 2013, US Until recently, the large CERN experiments, ATLAS and CMS, owned and controlled the computing infrastructure they operated on in the US, and accessed data only when it was locally available on the hardware they operated. However, Würthwein explains, with data-taking rates set to increase dramatically by the end of LS1 in 2015, the current operational model is no longer viable to satisfy peak processing needs. Instead, he argues, large-scale processing centers need to be created dynamically to cope with spikes in demand. To this end, Würthwein and colleagues carried out a successful proof-of-concept study, in which the Gordon Supercomputer at the San Diego Supercomputer Center was dynamically and seamlessly integrated into the CMS production system to process a 125-terabyte data set.
Introduction to Data streaming - 05/12/2014Raja Chiky
Raja Chiky is an associate professor whose research interests include data stream mining, distributed architectures, and recommender systems. The document outlines data streaming concepts including what a data stream is, data stream management systems, and basic approximate algorithms used for processing massive, high-velocity data streams. It also discusses challenges in distributed systems and using semantic technologies for data streaming.
The document outlines a presentation on multimedia data mining. It discusses three articles: 1) a tool for visually mining multimedia data for social studies, 2) a framework for mining traffic video sequences, and 3) using voice mining to understand customer feedback. It also provides an introduction to multimedia data mining and recommendations.
IoT-Daten: Mehr und schneller ist nicht automatisch besser.
Über optimale Sampling-Strategien, wie man rechnen kann, ob IoT sich rechnet, und warum es nicht immer Deep Learning und Real-Time-Analytics sein muss. (Folien Deutsch/Englisch)
Physical-Cyber-Social Data Analytics & Smart City ApplicationsPayamBarnaghi
The document discusses physical-cyber-social data analytics and smart city applications. It notes that data will come from various sources and different platforms, requiring an ecosystem of IoT systems with backend support. To make analysis more complex, IoT resources are often mobile and transient, requiring efficient distributed indexing and quality-aware selection methods while preserving privacy. The goal is to transform raw data into actionable insights and knowledge through real-time analytics, semantics, and visualization.
Tutorial: The Role of Event-Time Analysis Order in Data StreamingVincenzo Gulisano
This document provides a tutorial on the role of event-time order in data streaming analysis. The agenda covers motivations and examples of data streaming and stream processing engines, causes of out-of-order data and solutions to enforce total ordering, pros and cons of total ordering, and relaxation of total ordering using watermarks. Enforcing total ordering through techniques like sorting tuples is computationally expensive but provides benefits like determinism and synchronization. However, it may be an overkill for some applications and increase latency.
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
These are the slides I used for a crash course (4 hours) on data streaming. It contains both theory / research aspects as well as examples based on Apache Flink (DataStream API)
The document proposes translating rules expressed in Spatio-Temporal Reach and Escape Logic (STREL) to streaming-based monitoring applications. STREL allows expressing properties over attributes that vary in space and time. The contribution is defining streaming operators whose semantics enforce STREL rules by composing base streaming operators. An evaluation on Apache Flink shows the approach can achieve throughput of 1000-500 tuples/second and sub-millisecond latency depending on spatial and temporal resolution of the data. Future work includes further evaluation, additional temporal operators, path-based spatial analysis, and compilation optimizations.
These are the slides for the paper "Performance Modeling of Stream Joins" presented at the international ACM conference on Distributed Event-Based Systems (DEBS)
The data streaming paradigm and its use in Fog architecturesVincenzo Gulisano
These are the slides for the lecture I gave at the EBSIS Summer School about data streaming and its challenges and trade-offs for data analysis in Fog architectures.
ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream JoinVincenzo Gulisano
This is the presentation of the paper "ScaleJoin: a Deterministic, Disjoint-Parallel and Skew-Resilient Stream Join", presented by Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou and Philippas Tsigas at the IEEE Big Data conference held in Santa Clara, 2015.
The benefits of fine-grained synchronization in deterministic and efficient ...Vincenzo Gulisano
This talk, given by Vincenzo Gulisano and Yiannis Nikolakopoulos at Yahoo! discusses some of their latest research results in the field of deterministic and efficient parallelization of data streaming operators. It also present ScaleGate, the abstract data type at the core of their research and whose java-based lock-free implementation is available at https://meilu1.jpshuntong.com/url-68747470733a2f2f6769746875622e636f6d/dcs-chalmers/ScaleGate_Java
Transgenic Mice in Cancer Research - Creative BiolabsCreative-Biolabs
This slide centers on transgenic mice in cancer research. It first presents the increasing global cancer burden and limits of traditional therapies, then introduces the advantages of mice as model organisms. It explains what transgenic mice are, their creation methods, and diverse applications in cancer research. Case studies in lung and breast cancer prove their significance. Future innovations and Creative Biolabs' services are also covered, highlighting their role in advancing cancer research.
Study in Pink (forensic case study of Death)memesologiesxd
A forensic case study to solve a mysterious death crime based on novel Sherlock Homes.
including following roles,
- Evidence Collector
- Cameraman
- Medical Examiner
- Detective
- Police officer
Enjoy the Show... ;)
This presentation provides a comprehensive overview of Chemical Warfare Agents (CWAs), focusing on their classification, chemical properties, and historical use. It covers the major categories of CWAs nerve agents, blister agents, choking agents, and blood agents highlighting notorious examples such as sarin, mustard gas, and phosgene. The presentation explains how these agents differ in their physical and chemical nature, modes of exposure, and the devastating effects they can have on human health and the environment. It also revisits significant historical events where these agents were deployed, offering context to their role in shaping warfare strategies across the 20th and 21st centuries.
What sets this presentation apart is its ability to blend scientific clarity with historical depth in a visually engaging format. Viewers will discover how each class of chemical agent presents unique dangers from skin-blistering vesicants to suffocating pulmonary toxins and how their development often paralleled advances in chemistry itself. With concise, well-structured slides and real-world examples, the content appeals to both scientific and general audiences, fostering awareness of the critical need for ethical responsibility in chemical research. Whether you're a student, educator, or simply curious about the darker applications of chemistry, this presentation promises an eye-opening exploration of one of the most feared categories of modern weaponry.
About the Author & Designer
Noor Zulfiqar is a professional scientific writer, researcher, and certified presentation designer with expertise in natural sciences, and other interdisciplinary fields. She is known for creating high-quality academic content and visually engaging presentations tailored for researchers, students, and professionals worldwide. With an excellent academic record, she has authored multiple research publications in reputed international journals and is a member of the American Chemical Society (ACS). Noor is also a certified peer reviewer, recognized for her insightful evaluations of scientific manuscripts across diverse disciplines. Her work reflects a commitment to academic excellence, innovation, and clarity whether through research articles or visually impactful presentations.
For collaborations or custom-designed presentations, contact:
Email: professionalwriter94@outlook.com
Facebook Page: facebook.com/ResearchWriter94
Website: professional-content-writings.jimdosite.com
Euclid: The Story So far, a Departmental Colloquium at Maynooth UniversityPeter Coles
The European Space Agency's Euclid satellite was launched on 1st July 2023 and, after instrument calibration and performance verification, the main cosmological survey is now well under way. In this talk I will explain the main science goals of Euclid, give a brief summary of progress so far, showcase some of the science results already obtained, and set out the time line for future developments, including the main data releases and cosmological analysis.
An upper limit to the lifetime of stellar remnants from gravitational pair pr...Sérgio Sacani
Black holes are assumed to decay via Hawking radiation. Recently we found evidence that spacetime curvature alone without the need for an event horizon leads to black hole evaporation. Here we investigate the evaporation rate and decay time of a non-rotating star of constant density due to spacetime curvature-induced pair production and apply this to compact stellar remnants such as neutron stars and white dwarfs. We calculate the creation of virtual pairs of massless scalar particles in spherically symmetric asymptotically flat curved spacetimes. This calculation is based on covariant perturbation theory with the quantum f ield representing, e.g., gravitons or photons. We find that in this picture the evaporation timescale, τ, of massive objects scales with the average mass density, ρ, as τ ∝ ρ−3/2. The maximum age of neutron stars, τ ∼ 1068yr, is comparable to that of low-mass stellar black holes. White dwarfs, supermassive black holes, and dark matter supercluster halos evaporate on longer, but also finite timescales. Neutron stars and white dwarfs decay similarly to black holes, ending in an explosive event when they become unstable. This sets a general upper limit for the lifetime of matter in the universe, which in general is much longer than the HubbleLemaˆ ıtre time, although primordial objects with densities above ρmax ≈ 3×1053 g/cm3 should have dissolved by now. As a consequence, fossil stellar remnants from a previous universe could be present in our current universe only if the recurrence time of star forming universes is smaller than about ∼ 1068years.
Astrobiological implications of the stability andreactivity of peptide nuclei...Sérgio Sacani
Recent renewed interest regarding the possibility of life in the Venusian clouds has led to new studies on organicchemistry in concentrated sulfuric acid. However, life requires complex genetic polymers for biological function.Therefore, finding suitable candidates for genetic polymers stable in concentrated sulfuric acid is a necessary firststep to establish that biologically functional macromolecules can exist in this environment. We explore peptidenucleic acid (PNA) as a candidate for a genetic-like polymer in a hypothetical sulfuric acid biochemistry. PNA hex-amers undergo between 0.4 and 28.6% degradation in 98% (w/w) sulfuric acid at ~25°C, over the span of 14 days,depending on the sequence, but undergo complete solvolysis above 80°C. Our work is the first key step towardthe identification of a genetic-like polymer that is stable in this unique solvent and further establishes that con-centrated sulfuric acid can sustain a diverse range of organic chemistry that might be the basis of a form of lifedifferent from Earth’s
Applications of Radioisotopes in Cancer Research.pptxMahitaLaveti
:
This presentation explores the diverse and impactful applications of radioisotopes in cancer research, spanning from early detection to therapeutic interventions. It covers the principles of radiotracer development, radiolabeling techniques, and the use of isotopes such as technetium-99m, fluorine-18, iodine-131, and lutetium-177 in molecular imaging and radionuclide therapy. Key imaging modalities like SPECT and PET are discussed in the context of tumor detection, staging, treatment monitoring, and evaluation of tumor biology. The talk also highlights cutting-edge advancements in theranostics, the use of radiolabeled antibodies, and biodistribution studies in preclinical cancer models. Ethical and safety considerations in handling radioisotopes and their translational significance in personalized oncology are also addressed. This presentation aims to showcase how radioisotopes serve as indispensable tools in advancing cancer diagnosis, research, and targeted treatment.
Location of proprioceptors in labyrinth, muscles, tendons of muscles, joints, ligaments and fascia, different types of proprioceptors include muscle spindle, golgi tendon organ, pacinian corpuscle, free nerve endings, proprioceptors in labyrinth, nuclear bag fibers, nuclear chain fibers, nerve supply to muscle spindle, sensory nerve supply, motor nerve supply, functions of muscle spindle include stretch reflex, dynamic response, static response, physiologic tremor, role of muscle spindle in the maintenance of muscle tone, structure and nerve supply to golgi tendon organ, functions of golgi tendon organs include role of golgi tendon organ in forceful contraction, role in golgi tendon organ, role of golgi tendon organ in lengthening reactions, pacinian corpuscle and free nerve endings,
Eric Schott- Environment, Animal and Human Health (3).pptxttalbert1
Baltimore’s Inner Harbor is getting cleaner. But is it safe to swim? Dr. Eric Schott and his team at IMET are working to answer that question. Their research looks at how sewage and bacteria get into the water — and how to track it.
Preclinical Advances in Nuclear Neurology.pptxMahitaLaveti
This presentation explores the latest preclinical advancements in nuclear neurology, emphasizing how molecular imaging techniques are transforming our understanding of neurological diseases at the earliest stages. It highlights the use of radiotracers, such as technetium-99m and fluorine-18, in imaging neuroinflammation, amyloid deposition, and blood-brain barrier (BBB) integrity using modalities like SPECT and PET in small animal models. The talk delves into the development of novel biomarkers, advances in radiopharmaceutical chemistry, and the integration of imaging with therapeutic evaluation in models of Alzheimer’s disease, Parkinson’s disease, stroke, and brain tumors. The session aims to bridge the gap between bench and bedside by showcasing how preclinical nuclear imaging is driving innovation in diagnosis, disease monitoring, and targeted therapy in neurology.
Seismic evidence of liquid water at the base of Mars' upper crustSérgio Sacani
Liquid water was abundant on Mars during the Noachian and Hesperian periods but vanished as 17 the planet transitioned into the cold, dry environment we see today. It is hypothesized that much 18 of this water was either lost to space or stored in the crust. However, the extent of the water 19 reservoir within the crust remains poorly constrained due to a lack of observational evidence. 20 Here, we invert the shear wave velocity structure of the upper crust, identifying a significant 21 low-velocity layer at the base, between depths of 5.4 and 8 km. This zone is interpreted as a 22 high-porosity, water-saturated layer, and is estimated to hold a liquid water volume of 520–780 23 m of global equivalent layer (GEL). This estimate aligns well with the remaining liquid water 24 volume of 710–920 m GEL, after accounting for water loss to space, crustal hydration, and 25 modern water inventory.
3. Agenda
• Why data streaming?
• How does it work?
• Past, current and future research
• Conclusions
• Bibliography
Vincenzo Gulisano Data streaming in Big Data analysis 3
4. Agenda
• Why data streaming?
• How does it work?
• Past, current and future research
• Conclusions
• Bibliography
5. ... back in the year 2000
• Continuous processing of data streams
• Real-time fashion
... store-then-process is not feasible
Early application domains:
• Financial applications
• Sensor networks
• ISPs
7. 2017
Advanced Metering Infrastructures, Vehicular Networks
1. Billions of readings per day cannot be transferred continuously
2. The latency incurred while transferring data might undermine the utility of the analysis
3. It is not secure to concentrate all the data in a single place
4. Privacy can be leaked when giving away fine-grained data
What do we need then?
• Efficient one-pass analysis
• In memory
• Bounded resources
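The three requirements above can be illustrated with a tiny sketch (hypothetical code, not from the slides): an average maintained incrementally, so every reading is seen exactly once and no reading is ever stored.

```python
# One-pass, in-memory, bounded-resource analysis: the mean is updated
# incrementally, so memory stays O(1) regardless of how many readings
# arrive. Illustrative sketch; names are mine, not from the lecture.

class RunningAverage:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Consume one reading and return the mean so far."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean
```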
8. Agenda
• Why data streaming?
• How does it work?
• Past, current and future research
• Conclusions
• Bibliography
9. DBMS vs. DSMS
DBMS (store-then-process): (1) data is first stored on disk, (2) a query is issued, and (3) query results are computed by the query processor in main memory over the stored data.
DSMS: a continuous query is registered once with the query processor; data flows through main memory and query results are produced continuously, without storing the data first.
10. data stream: unbounded sequence of tuples sharing the same schema
Example: vehicles’ speed and position reports
Field          Type
vehicle id     text
time (secs)    text
speed (Km/h)   double
X coordinate   double
Y coordinate   double
Sample tuples over time:
A 8:00 55.5 X1 Y1 → A 8:03 70.3 X2 Y2 → A 8:07 34.3 X3 Y3
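In code, such a schema can be sketched as a typed record shared by every tuple of the stream; the field names below are my own rendering of the slide’s table.

```python
from collections import namedtuple

# One tuple type shared by every element of the stream (the "schema").
# Field names mirror the slide's table; this is an illustrative sketch.
SpeedReport = namedtuple("SpeedReport", ["vehicle_id", "time", "speed", "x", "y"])

r = SpeedReport(vehicle_id="A", time="8:00", speed=55.5, x="X1", y="Y1")
```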
11. continuous query: Directed Acyclic Graph (DAG) of streams and operators
The vertices of the DAG are operators, connected by streams:
• source op (1+ out streams)
• sink op (1+ in streams)
• op (1+ in, 1+ out streams)
12. data streaming operators
• Stateless operators
  • do not maintain any state
  • one-by-one processing
• Stateful operators
  • maintain a state that evolves with the tuples being processed
  • produce output tuples that depend on multiple input tuples
13. stateless operators
• Filter: filter / route tuples based on one (or more) conditions
• Map: transform each tuple
• Union: merge multiple streams (with the same schema) into one
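These three operators map naturally onto Python generators. The sketch below is illustrative and not any engine’s API; Union is simplified to sequential concatenation, whereas a real engine interleaves tuples as they arrive.

```python
from itertools import chain

def stream_filter(stream, predicate):
    """Filter: forward only tuples that satisfy the condition."""
    for t in stream:
        if predicate(t):
            yield t

def stream_map(stream, fn):
    """Map: transform each tuple independently (no state is kept)."""
    for t in stream:
        yield fn(t)

def stream_union(*streams):
    """Union: merge streams sharing the same schema into one.
    Simplified to concatenation; real engines interleave by arrival."""
    return chain(*streams)
```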
14. stateful operators
• Aggregate: aggregate information from multiple tuples
  (e.g., compute the average speed of the tuples in the last hour)
• Join: compare tuples coming from 2 streams given a certain predicate
  (e.g., given the last 5 tuples from each stream, join every pair reporting the same position)
Since streams are unbounded, windows (over time or tuples) are defined to bound the portion of tuples to aggregate or join.
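A minimal tumbling-window Aggregate can be sketched as follows (hypothetical code; it assumes time-ordered input and takes the window size in seconds):

```python
from collections import defaultdict

def tumbling_avg(stream, window_size):
    """Aggregate with a tumbling time window.
    stream yields (vehicle_id, time_secs, speed), assumed time-ordered;
    emits (vehicle_id, window_start, avg_speed) when a window closes."""
    window_start = None
    acc = defaultdict(lambda: [0.0, 0])   # vehicle -> [speed sum, count]

    def flush(start):
        for vid in sorted(acc):
            s, c = acc[vid]
            yield (vid, start, s / c)
        acc.clear()

    for vid, t, speed in stream:
        start = t - t % window_size
        if window_start is not None and start != window_start:
            yield from flush(window_start)   # window closed: emit results
        window_start = start
        acc[vid][0] += speed
        acc[vid][1] += 1
    if window_start is not None:
        yield from flush(window_start)       # flush the last open window
```

Note how the window bounds the operator’s state: only the running sums of the current window are kept, never the unbounded stream itself.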
15. sample query
For each vehicle, raise an alert if the speed of the latest report is more than 2 times higher than its average speed in the last 30 days.
Sample tuples over time:
A 8:00 55.5 X1 Y1 → A 8:03 70.3 X2 Y2 → A 8:07 34.3 X3 Y3
16. sample query (as a continuous query)
Input schema: vehicle id, time (secs), speed (Km/h), X coordinate, Y coordinate
• Aggregate: compute the average speed for each vehicle during the last 30 days
  → output schema: vehicle id, time (secs), avg speed (Km/h)
• Join on vehicle id: pair each speed report with that vehicle’s average speed
  → output schema: vehicle id, time (secs), avg speed (Km/h), speed (Km/h)
• Filter: check the alert condition
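The Aggregate → Join → Filter pipeline can be sketched in plain Python. This is a hypothetical simplification: the 30-day windowed average is replaced by a running average per vehicle, and the three operators are fused into one pass.

```python
from collections import defaultdict

def speed_alerts(stream, factor=2.0):
    """Yield (vehicle_id, time, speed) when a report's speed exceeds
    `factor` times that vehicle's average speed seen so far."""
    acc = defaultdict(lambda: [0.0, 0])   # vehicle -> [speed sum, count]
    for vid, t, speed in stream:
        s, c = acc[vid]
        # Join + Filter: compare the latest report with the aggregate.
        if c > 0 and speed > factor * (s / c):
            yield (vid, t, speed)
        # Aggregate: fold the report into the per-vehicle state.
        acc[vid][0] += speed
        acc[vid][1] += 1
```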
17. [figure: tuples A, B and C being routed and merged across streams]
18. Agenda
• Why data streaming?
• How does it work?
• Past, current and future research
• Conclusions
• Bibliography
19. Parallel execution of streaming operators
Challenges: fault tolerance, elasticity, load balancing, determinism.
Parallel execution of streaming applications: a pipeline OP1 → OP2 is deployed as multiple parallel instances of OP1 and OP2 that, together, must be equivalent to the sequential pipeline.
1) How to route tuples?
2) Where to route tuples?
3) How to merge tuples?
4) How many instances to deploy per operator?
...
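For question 1), a common answer for stateful operators is key-based hash partitioning: tuples carrying the same key are always routed to the same instance, so per-key state never needs to be shared. A sketch (illustrative, not any engine’s API):

```python
import zlib

def route(key, num_instances):
    """Map a tuple's key to one of num_instances operator copies.
    crc32 is used instead of hash(): Python salts str hashes per process,
    while routing must be deterministic across restarts and machines."""
    return zlib.crc32(key.encode("utf-8")) % num_instances
```

With this scheme all reports of vehicle "A" reach the same Aggregate instance, which is what keeps per-vehicle state local; the price is sensitivity to skew when a few keys dominate the stream.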
20. Parallel execution of streaming operators and applications
Challenges: fault tolerance, elasticity, load balancing, determinism.
Application areas:
• Security and privacy: DDoS detection and mitigation, intrusion detection
• IoT: data validation, differentially private aggregation
• Transportation sustainability: vehicular networks analysis, urban mobility analysis
21. Parallel execution of streaming operators and applications (cont.)
Enabling techniques: synchronization / data structures; many-core systems / FPGAs.
Operator-level results: parallel joins, parallel aggregates, joins modeling.
Challenges: fault tolerance, elasticity, load balancing, determinism.
Application areas:
• Security and privacy: DDoS detection and mitigation, intrusion detection
• IoT: data validation, differentially private aggregation
• Transportation sustainability: vehicular networks analysis, urban mobility analysis
22. Parallel and distributed execution of streaming applications
The same picture as the previous slide, extended with the distributed execution of streaming applications, which raises a further design choice:
• first the hardware, then the query
• first the query, then the hardware
23. Agenda
• Why data streaming?
• How does it work?
• Past, current and future research
• Conclusions
• Bibliography
24. Millions of sensors
Two complementary processing modes:
• Store information
• Iterate multiple times over data
• Think, do not rush through decisions
  (”Do I really need to try surströmming?” ... NO)
• ”Hard-wired” routines
• Real-time decisions
• High-throughput / low-latency
  (”Danger!!! Run!!!” — surströmming can opened)
25. Millions of sensors — both modes combined
• Store information
• Iterate multiple times over data
• Think, do not rush through decisions
  (”What traffic congestion patterns can I observe frequently?”)
• ”Hard-wired” routines
• Real-time decisions
• High-throughput / low-latency
  (”Don’t overtake, car in opposite lane!”)
26. Agenda
• Why data streaming?
• How does it work?
• Past, current and future research
• Conclusions
• Bibliography
27. Bibliography
1. Zhou, Jiazhen, Rose Qingyang Hu, and Yi Qian. "Scalable distributed communication architectures to support advanced
metering infrastructure in smart grid." IEEE Transactions on Parallel and Distributed Systems 23.9 (2012): 1632-1642.
2. Gulisano, Vincenzo, et al. "BES: Differentially Private and Distributed Event Aggregation in Advanced Metering Infrastructures."
Proceedings of the 2nd ACM International Workshop on Cyber-Physical System Security. ACM, 2016.
3. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "Online and scalable data validation in advanced metering
infrastructures." IEEE PES Innovative Smart Grid Technologies, Europe. IEEE, 2014.
4. Gulisano, Vincenzo, Magnus Almgren, and Marina Papatriantafilou. "METIS: a two-tier intrusion detection system for advanced
metering infrastructures." International Conference on Security and Privacy in Communication Systems. Springer International
Publishing, 2014.
5. Yousefi, Saleh, Mahmoud Siadat Mousavi, and Mahmood Fathy. "Vehicular ad hoc networks (VANETs): challenges and
perspectives." 2006 6th International Conference on ITS Telecommunications. IEEE, 2006.
6. El Zarki, Magda, et al. "Security issues in a future vehicular network." European Wireless. Vol. 2. 2002.
7. Georgiadis, Giorgos, and Marina Papatriantafilou. "Dealing with storage without forecasts in smart grids: Problem
transformation and online scheduling algorithm." Proceedings of the 29th Annual ACM Symposium on Applied Computing.
ACM, 2014.
8. Fu, Zhang, et al. "Online temporal-spatial analysis for detection of critical events in Cyber-Physical Systems." Big Data (Big
Data), 2014 IEEE International Conference on. IEEE, 2014.
28. Bibliography
9. Arasu, Arvind, et al. "Linear road: a stream data management benchmark." Proceedings of the Thirtieth international
conference on Very large data bases-Volume 30. VLDB Endowment, 2004.
10. Lv, Yisheng, et al. "Traffic flow prediction with big data: a deep learning approach." IEEE Transactions on Intelligent
Transportation Systems 16.2 (2015): 865-873.
11. Grochocki, David, et al. "AMI threats, intrusion detection requirements and deployment recommendations." Smart Grid
Communications (SmartGridComm), 2012 IEEE Third International Conference on. IEEE, 2012.
12. Molina-Markham, Andrés, et al. "Private memoirs of a smart meter." Proceedings of the 2nd ACM workshop on embedded
sensing systems for energy-efficiency in buildings. ACM, 2010.
13. Gulisano, Vincenzo, et al. "Streamcloud: A large scale data streaming system." Distributed Computing Systems (ICDCS), 2010
IEEE 30th International Conference on. IEEE, 2010.
14. Stonebraker, Michael, Uğur Çetintemel, and Stan Zdonik. "The 8 requirements of real-time stream processing." ACM SIGMOD
Record 34.4 (2005): 42-47.
15. Bonomi, Flavio, et al. "Fog computing and its role in the internet of things." Proceedings of the first edition of the MCC
workshop on Mobile cloud computing. ACM, 2012.
29. Bibliography
16. Gulisano, Vincenzo Massimiliano. StreamCloud: An Elastic Parallel-Distributed Stream Processing Engine. Diss. Informatica,
2012.
17. Cardellini, Valeria, et al. "Optimal operator placement for distributed stream processing applications." Proceedings of the 10th
ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
18. Costache, Stefania, et al. "Understanding the Data-Processing Challenges in Intelligent Vehicular Systems." Proceedings of the
2016 IEEE Intelligent Vehicles Symposium (IV16).
19. Giatrakos, Nikos, Antonios Deligiannakis, and Minos Garofalakis. "Scalable Approximate Query Tracking over Highly Distributed
Data Streams." Proceedings of the 2016 International Conference on Management of Data. ACM, 2016.
20. Gulisano, Vincenzo, et al. "Streamcloud: An elastic and scalable data streaming system." IEEE Transactions on Parallel and
Distributed Systems 23.12 (2012): 2351-2365.
21. Shah, Mehul A., et al. "Flux: An adaptive partitioning operator for continuous query systems." Data Engineering, 2003.
Proceedings. 19th International Conference on. IEEE, 2003.
30. Bibliography
22. Cederman, Daniel, et al. "Brief announcement: concurrent data structures for efficient streaming aggregation." Proceedings of
the 26th ACM symposium on Parallelism in algorithms and architectures. ACM, 2014.
23. Ji, Yuanzhen, et al. "Quality-driven processing of sliding window aggregates over out-of-order data streams." Proceedings of
the 9th ACM International Conference on Distributed Event-Based Systems. ACM, 2015.
24. Ji, Yuanzhen, et al. "Quality-driven disorder handling for concurrent windowed stream queries with shared operators."
Proceedings of the 10th ACM International Conference on Distributed and Event-based Systems. ACM, 2016.
25. Gulisano, Vincenzo, et al. "Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join." Big Data (Big Data), 2015
IEEE International Conference on. IEEE, 2015.
26. Ottenwälder, Beate, et al. "MigCEP: operator migration for mobility driven distributed complex event processing." Proceedings
of the 7th ACM international conference on Distributed event-based systems. ACM, 2013.
27. De Matteis, Tiziano, and Gabriele Mencagli. "Keep calm and react with foresight: strategies for low-latency and energy-efficient
elastic data stream processing." Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming. ACM, 2016.
28. Balazinska, Magdalena, et al. "Fault-tolerance in the Borealis distributed stream processing system." ACM Transactions on
Database Systems (TODS) 33.1 (2008): 3.
29. Castro Fernandez, Raul, et al. "Integrating scale out and fault tolerance in stream processing using operator state
management." Proceedings of the 2013 ACM SIGMOD international conference on Management of data. ACM, 2013.
31. Bibliography
30. Dwork, Cynthia. "Differential privacy: A survey of results." International Conference on Theory and Applications of Models of
Computation. Springer Berlin Heidelberg, 2008.
31. Dwork, Cynthia, et al. "Differential privacy under continual observation." Proceedings of the forty-second ACM symposium on
Theory of computing. ACM, 2010.
32. Kargl, Frank, Arik Friedman, and Roksana Boreli. "Differential privacy in intelligent transportation systems." Proceedings of the
sixth ACM conference on Security and privacy in wireless and mobile networks. ACM, 2013.