A real-time machine learning and
visualization framework for
scientific workflows
Feng Li, Fengguang Song
Indiana University-Purdue University Indianapolis
Email: lifen@iupui.edu, fgsong@iupui.edu
Outline
• Background
• A non-parametric anomaly detection method
• A scientific workflow & software framework
• Experiments
• Conclusion and future work
Background
• Recent advancements in parallel computing and HPC
• Intensive computation helps solve extremely large problems
• It also produces huge amounts of data
• Interconnects are much faster (InfiniBand, RDMA)
• How do we deal with such large data?
• Post-hoc? (simulation results are flushed to disk first)
• In-situ? (analysis is closely coupled with the simulation)
• Something in between?
Tuple Space & DataSpaces
• Tuple space
• Data is accessed by content and type rather than by raw memory address.
• Data can be described without referring to any computer architecture.
• An ideal model for coupling different simulation and analysis applications.
• DataSpaces
• Developed at Rutgers University (Docan, 2012)
• Derived from the concept of tuple space
• Couples different applications using RDMA
• Example interface:
• dspaces_put(var_name, version, elem_size, boundaries, pdata)
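The tuple-space idea of addressing data by name, version, and index range rather than by memory address can be sketched with a toy in-memory store. This is only an illustration: the class `ToyTupleSpace` and its `put`/`get` methods are hypothetical, work on 1-D index ranges, and are not the DataSpaces API or its RDMA transport.

```python
_MISSING = object()  # sentinel marking indices no producer has written yet


class ToyTupleSpace:
    """Hypothetical in-memory stand-in for a tuple space (not DataSpaces)."""

    def __init__(self):
        # (var_name, version) -> list of (lower, upper, values) blocks
        self._store = {}

    def put(self, var_name, version, lb, ub, values):
        """Publish a 1-D block covering indices lb..ub (inclusive)."""
        if len(values) != ub - lb + 1:
            raise ValueError("values do not match the declared boundaries")
        key = (var_name, version)
        self._store.setdefault(key, []).append((lb, ub, list(values)))

    def get(self, var_name, version, lb, ub):
        """Assemble indices lb..ub from whatever blocks producers published."""
        out = [_MISSING] * (ub - lb + 1)
        for blb, bub, vals in self._store.get((var_name, version), []):
            lo, hi = max(lb, blb), min(ub, bub)
            for i in range(lo, hi + 1):  # empty when the ranges do not overlap
                out[i - lb] = vals[i - blb]
        if any(v is _MISSING for v in out):
            raise KeyError("requested region is not fully available")
        return out


space = ToyTupleSpace()
# Two producers each publish their own part of the same variable and version.
space.put("velocity", 0, 0, 3, [1.0, 2.0, 3.0, 4.0])
space.put("velocity", 0, 4, 7, [5.0, 6.0, 7.0, 8.0])
# A consumer asks for a range that spans both producers' blocks.
window = space.get("velocity", 0, 2, 5)
```

The consumer never needs to know which producer held which part of the array; it only names the variable, the version, and the boundaries it wants.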
Vortex detection in turbulent flow
• Vortex?
• Swirling motion of fluid around a center region.
• Precise definition?
• Why does it matter?
• How do we identify it?
Vortex detection in turbulent flow
• A region-based non-parametric anomaly detection method (Póczos, 2012)
• Simulation data is divided into "regions"
• Each region (a collection of data points) can be regarded as a random sample from a distribution with density p.
• Features: (vx, vy, dc)
• vx, vy: velocity in each of the two dimensions
• dc: distance to center
Vortex detection in turbulent flow (cont.)
• The density at each data point can be estimated using kNN methods:
$\hat{p}_k(X_i) = \frac{k}{(n-1)\, c\, v_k(i)}$
• Divergence (distance between regions)
• L2 divergence: $L(p \| q) = \left( \int (p(x) - q(x))^2 \, dx \right)^{1/2}$
• Once we know how to measure the "difference" between two regions, we can use a distance-based clustering method (e.g. k-medoids).
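A minimal sketch of the two quantities on this slide, under stated assumptions. For the kNN density estimate it assumes the standard form in which c is the volume of the d-dimensional unit ball and the last factor is the k-th nearest-neighbour distance raised to the power d; the slide does not spell these out. For the L2 divergence it swaps in a plain midpoint-rule integration of two known 1-D densities, which is simpler than the sample-based estimator of Póczos et al. The function names are hypothetical.

```python
import math


def unit_ball_volume(d):
    """Volume of the unit ball in d dimensions (the constant c, by assumption)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)


def knn_density(points, i, k):
    """kNN density estimate at points[i]: k / ((n - 1) * c * rho_k**d)."""
    n = len(points)
    d = len(points[i])
    # Distances from point i to every other point, smallest first.
    dists = sorted(math.dist(points[i], p) for j, p in enumerate(points) if j != i)
    rho_k = dists[k - 1]  # distance to the k-th nearest neighbour
    return k / ((n - 1) * unit_ball_volume(d) * rho_k ** d)


def l2_divergence(p, q, lo, hi, steps=2000):
    """Midpoint-rule approximation of (integral of (p - q)^2 dx)^(1/2) on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps):
        x = lo + (s + 0.5) * h
        total += (p(x) - q(x)) ** 2
    return math.sqrt(total * h)


# Eleven evenly spaced 1-D points; estimate the density at the middle one.
grid_points = [(float(x),) for x in range(11)]
density_mid = knn_density(grid_points, 5, k=2)


# Two uniform densities, on [0, 1] and on [0, 2].
def uniform_01(x):
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def uniform_02(x):
    return 0.5 if 0.0 <= x <= 2.0 else 0.0


divergence = l2_divergence(uniform_01, uniform_02, 0.0, 2.0)
```

In the workflow the divergence is computed between regions, so the pairwise values form a distance matrix that a clustering method such as k-medoids can consume directly.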
Demonstration of the workflow
• Divide → Divergence → Generate new medoids → Re-assign cluster ID
• Different applications are coupled using DataSpaces.
• Components in the workflow:
• Simulators
• Data processing
• Data analysis (anomaly detection)
• Visualization tool (ParaView Catalyst)
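The "generate new medoids, re-assign cluster ID" loop can be sketched as a plain k-medoids iteration over a precomputed distance matrix, standing in for the pairwise divergences between regions. This is a single-process sketch with hypothetical function names; it does not model the DataSpaces coupling or the parallel implementation.

```python
def assign(dist, medoids):
    """Re-assign cluster IDs: each region joins its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
        for i in range(len(dist))
    ]


def update_medoids(dist, labels, medoids):
    """Generate new medoids: the member with the least total distance to its cluster."""
    new_medoids = []
    for c, old in enumerate(medoids):
        members = [i for i, label in enumerate(labels) if label == c]
        if not members:  # keep the old medoid if the cluster emptied out
            new_medoids.append(old)
            continue
        best = min(members, key=lambda m: sum(dist[m][j] for j in members))
        new_medoids.append(best)
    return new_medoids


def k_medoids(dist, initial_medoids, max_iter=100):
    """Alternate the two steps until the medoids stop changing."""
    medoids = list(initial_medoids)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = update_medoids(dist, labels, medoids)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Six "regions" placed on a line; the matrix plays the role of the divergences.
positions = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in positions] for a in positions]
medoids, labels = k_medoids(dist, initial_medoids=[0, 1])
```

Because only the distance matrix is used, any divergence between regions can be substituted without changing the loop.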
Distributed Sampling
• The original k-medoids method does not scale well with large datasets
• $O(n^2)$, where n is the number of regions
• Data processing will be the bottleneck
• CLARA (Clustering LARge Applications, Kaufman 2009)
• Such small-granularity operations are expensive in DataSpaces ("boundaries")
• Reorganize the data:
• A single DataSpaces operation will operate on a larger chunk of data.
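The sampling idea behind CLARA can be sketched as follows: run k-medoids on a few small random samples, then keep the medoid set with the lowest cost over the full dataset. This is a simplified single-process sketch; the function names, sample sizes, and seed are illustrative assumptions, not the framework's distributed implementation.

```python
import random


def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def k_medoids(points, k, dist, rng, max_iter=50):
    """Small k-medoids run on a sample; cost is quadratic in the sample size."""
    medoids = rng.sample(points, k)
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, medoids[c]))
            clusters[nearest].append(p)
        new_medoids = [
            min(cluster, key=lambda m: sum(dist(m, q) for q in cluster))
            if cluster else medoids[c]
            for c, cluster in enumerate(clusters)
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids


def clara(points, k, dist, n_samples=5, sample_size=20, seed=0):
    """Cluster several samples and keep the medoids that fit the full data best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = k_medoids(sample, k, dist, rng)
        cost = total_cost(points, medoids, dist)  # evaluated on ALL points
        if cost < best_cost:
            best_medoids, best_cost = medoids, cost
    return best_medoids, best_cost


def abs_dist(a, b):
    return abs(a - b)


# Two well-separated groups of 1-D values.
data = [float(i) for i in range(50)] + [1000.0 + i for i in range(50)]
best_medoids, best_cost = clara(data, k=2, dist=abs_dist)
```

Each k-medoids run only touches `sample_size` points, so the quadratic step stays small while the final cost check remains linear in the full dataset.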
Experiments
• Testbed: Karst, Indiana University.
• Other tools used:
• Johns Hopkins Turbulence Databases.
• ParaView Catalyst (co-processing library, Fabian 2011)
• Originally designed for the "in-situ" approach
• Generic visualization framework, supports "live visualization"
• OpenFOAM (open-source CFD toolbox)
• DataSpaces adaptor to the default writer.
Experiment with JHTDB data
• Data description
• Forced isotropic dataset from the Johns Hopkins Turbulence Databases (JHTDB)
• 1024*1024*1024 grid, 5028 frames
• Well formatted in VTK/HDF5
• All regions are clustered into:
• Steady flow
• Unsteady flow
• Random flow
• Client side
• Paraview GUI
• Two views are linked together.
Execution Time Breakdown
• Analyzes the communication overhead and efficiency of the framework
• How time is spent in each component of this workflow
• Average data transfer time and computation time for all four applications (simulator, data processing, data analysis, Catalyst visualization)
Strong Scalability
• How well the workflow scales with more computing resources
• Fixed input size of 4096*4096 (larger) in each timestep, which contains 1 GB of velocity/pressure data
• More processes for all applications?
• Latency_produce/Latency_consume
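Strong scaling is usually summarized by speedup and parallel efficiency at a fixed input size. The sketch below shows that arithmetic; the function name is hypothetical and the timings are made-up placeholder values for illustration, not measurements from these experiments.

```python
def strong_scaling(timings):
    """Return (processes, speedup, efficiency) rows relative to the smallest run.

    timings maps a process count to the wall-clock time for the same input.
    """
    base_procs = min(timings)
    base_time = timings[base_procs]
    rows = []
    for procs in sorted(timings):
        speedup = base_time / timings[procs]
        # Efficiency is speedup divided by the increase in resources.
        efficiency = speedup * base_procs / procs
        rows.append((procs, speedup, efficiency))
    return rows


# Placeholder timings in seconds (illustrative only, not measured results).
example_timings = {1: 100.0, 2: 50.0, 4: 40.0}
rows = strong_scaling(example_timings)
```

An efficiency close to 1.0 means added processes are being used well; a falling value signals that communication or a serial stage is starting to dominate.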
Conclusion
• A flexible and extensible software framework
• A parallel non-parametric clustering method
• Using DataSpaces greatly reduces I/O time and scales well.
Thanks
• Questions?
References
1. Ciprian Docan, Manish Parashar, and Scott Klasky. 2012. DataSpaces: an interaction and coordination framework for coupled simulation workflows. Cluster Computing 15, 2 (2012), 163–181.
2. Barnabás Póczos, Liang Xiong, and Jeff Schneider. 2012. Nonparametric divergence estimation with applications to machine learning on distributions. arXiv preprint arXiv:1202.3758 (2012).
3. Nathan Fabian, Kenneth Moreland, David Thompson, Andrew C. Bauer, Pat Marion, Berk Geveci, Michel Rasquin, and Kenneth E. Jansen. 2011. The ParaView coprocessing library: A scalable, general purpose in situ visualization library. In Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on. IEEE, 89–96.
4. Leonard Kaufman and Peter J. Rousseeuw. 2009. Finding groups in data: an introduction to cluster analysis. Vol. 344. John Wiley & Sons.
Special thanks go to

PEARC17: A real-time machine learning and visualization framework for scientific workflows
