Building A Scalable Data Science Platform with R
Mario Inchiosa, PhD
Principal Software Engineer
Hadoop Summit San Jose
June 30, 2016
What is R?
• The most popular statistical programming language
• A data visualization tool
• Open source
• 2.5+M users
• Taught in most universities
• Thriving user groups worldwide
• 8000+ contributed packages
• New and recent grads use it
Language
Platform
Community
Ecosystem
• Rich application & platform integration
Common R use cases
Retail
• Sales & Marketing: Demand Forecasting, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Pricing Strategy
• Customer & Channel: Personalization, Lifetime Customer Value, Product Segmentation
• Operations & Workforce: Store Location Demographics, Supply Chain Management, Inventory Management
Financial Services
• Sales & Marketing: Customer Churn, Loyalty Programs, Cross-sell & Upsell, Customer Acquisition
• Finance & Risk: Fraud Detection, Risk & Compliance, Loan Defaults
• Customer & Channel: Personalization, Lifetime Customer Value
• Operations & Workforce: Call Center Optimization, Pay for Performance
Healthcare
• Sales & Marketing: Marketing Mix Optimization, Patient Acquisition
• Finance & Risk: Fraud Detection, Bill Collection
• Customer & Channel: Population Health, Patient Demographics
• Operations & Workforce: Operational Efficiency, Pay for Performance
Manufacturing
• Sales & Marketing: Demand Forecasting, Marketing Mix Optimization
• Finance & Risk: Pricing Strategy, Perf Risk Management
• Customer & Channel: Supply Chain Optimization, Personalization
• Operations & Workforce: Remote Monitoring, Predictive Maintenance, Asset Management
R Adoption is on a Tear, but Open Source R is not Enterprise Class
IEEE Spectrum July 2015
Data Flows Overwhelm Open Source R
– In-Memory Operation
– Lack of Parallelism
– Expensive Data Movement & Duplication
Not enterprise ready
– Inadequacy of Community Support
– Lack of Guaranteed Support Timeliness
– No SLAs or Support models
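The in-memory limitation named on this slide can be worked around by hand for simple statistics. The sketch below is illustrative only and uses base R, not R Server: it streams a file in fixed-size chunks and keeps only running totals in memory. The function name chunked_mean, the chunk size, and the small temporary CSV standing in for a large data set are all assumptions made for the example.

```r
# A small temporary CSV stands in for a data set too large for memory.
path <- tempfile(fileext = ".csv")
set.seed(1)
x <- round(rnorm(1000, mean = 10), 3)
write.csv(data.frame(x = x), path, row.names = FALSE)

# Hypothetical helper: compute a mean by streaming the file chunk by chunk,
# so only one chunk plus two running totals are held in memory at a time.
chunked_mean <- function(path, chunk_rows = 100) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = 1)                      # discard the header line
  total <- 0
  n <- 0
  repeat {
    lines <- readLines(con, n = chunk_rows)  # next chunk of rows
    if (length(lines) == 0) break            # end of file reached
    values <- as.numeric(lines)
    total <- total + sum(values)
    n <- n + length(values)
  }
  total / n
}

result <- chunked_mean(path)
```

The same idea extends to any statistic that can be built from per-chunk partial results, which is also what makes it possible to spread the chunks across machines.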
R from Microsoft brings
• Peace of mind
• Efficiency
• Speed and scalability
• Flexibility and agility
Portability & investment assurance
Write Once – Deploy Anywhere
R Server portfolio
Cloud
RDBMS
Desktops & Servers
Hadoop & Spark
EDW
R Server Technology
R Script for Execution in MapReduce
Sample R Script:
rxSetComputeContext( RxHadoopMR(…) )  # Define compute context (cluster arguments elided)
hdfsFS <- RxHdfsFileSystem()  # HDFS file system object used by the data source
inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)  # Define data source
model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)  # Train predictive model
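For readers without an R Server installation, the modelling step of the sample script can be tried in open-source R. The sketch below is a stand-in, not the RevoScaleR call: it uses base R's glm() with a binomial family on a small synthetic data frame. The column names are borrowed from the script; the carrier codes, delay probabilities, and row count are invented for illustration.

```r
set.seed(42)
n <- 500

# Synthetic stand-in for the airline data; values are made up.
flights <- data.frame(
  DAY_OF_WEEK    = factor(sample(1:7, n, replace = TRUE), levels = 1:7),
  UNIQUE_CARRIER = factor(sample(c("AA", "DL", "UA"), n, replace = TRUE))
)

# Synthetic delay indicator: one carrier is given a higher delay probability.
p <- ifelse(flights$UNIQUE_CARRIER == "UA", 0.35, 0.20)
flights$ARR_DEL15 <- rbinom(n, size = 1, prob = p)

# Same formula as the slide, fitted in memory with open-source glm().
model <- glm(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER,
             data = flights, family = binomial())

# Score the training rows: predicted probability of a 15+ minute arrival delay.
pred <- predict(model, newdata = flights, type = "response")
```

This runs on a single machine and holds all data in memory, which is exactly the constraint the compute-context approach on the slide is meant to remove.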
Easy to Switch From MapReduce to Spark
Sample R Script:
rxSetComputeContext( RxSpark(…) )  # Change the compute context
# Keep other code unchanged
R Server: scale-out R, Enterprise Class!
• 100% compatible with open source R
• Any code/package that works today with R will work in R Server
• Wide range of scalable and distributed R functions
• Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict()
• Ability to parallelize any R function
• Ideal for parameter sweeps, simulation, scoring
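The "parallelize any R function" idea can be illustrated on a single machine with base R's parallel package. This is a sketch under stated assumptions, not R Server's own mechanism: estimate_pi is a hypothetical task, the sample sizes are arbitrary, and two local worker processes stand in for cluster nodes.

```r
library(parallel)

# Hypothetical task: Monte Carlo estimate of pi for a given number of draws.
estimate_pi <- function(n_draws) {
  x <- runif(n_draws)
  y <- runif(n_draws)
  4 * mean(x^2 + y^2 <= 1)   # share of points inside the quarter circle
}

# Parameter sweep: one task per sample size.
sizes <- c(1e3, 1e4, 1e5)

cl <- makeCluster(2)                   # two local worker processes
clusterSetRNGStream(cl, iseed = 123)   # independent random streams per worker
estimates <- parLapply(cl, sizes, estimate_pi)
stopCluster(cl)

results <- data.frame(n_draws = sizes, pi_hat = unlist(estimates))
```

Each task is independent of the others, which is why parameter sweeps, simulations, and scoring parallelize so readily.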
Parallelized & Distributed Algorithms
ETL
• Data import – Delimited, Fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)
Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher’s Exact Test
• Student’s t-Test
Sampling
• Subsample (observations & variables)
• Random Sampling
Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations
Predictive Statistics
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions: binomial, Gaussian, inverse Gaussian, Poisson, Tweedie. Standard link functions: cauchit, identity, log, logit, probit. User defined distributions & link functions.
• Covariance & Correlation Matrices
• Logistic Regression
• Predictions/scoring for models
• Residuals for all models
Clustering
• K-Means
Machine Learning
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes
Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation
Custom Parallelization
• rxDataStep
• rxExec
• PEMA-R API
Variable Selection
• Stepwise Regression
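Several entries in this list (cross product matrices, multiple linear regression) share a property that makes them easy to distribute: the result can be assembled from per-chunk partial sums. The base R sketch below illustrates that general idea on synthetic data; it is not the R Server implementation, and the variable names, chunk count, and coefficients are invented for the example.

```r
set.seed(7)
n <- 1000

# Synthetic data with known coefficients (values chosen for illustration).
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1.5 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n, sd = 0.1)

# Split the rows into 10 chunks, as if each lived on a different worker.
chunks <- split(dat, rep(1:10, length.out = n))

# Each chunk contributes its own cross products X'X and X'y.
partial <- lapply(chunks, function(chunk) {
  X <- cbind(1, chunk$x1, chunk$x2)
  list(xtx = crossprod(X), xty = crossprod(X, chunk$y))
})

# Combine step: cross products are additive across chunks.
xtx <- Reduce(`+`, lapply(partial, `[[`, "xtx"))
xty <- Reduce(`+`, lapply(partial, `[[`, "xty"))

# Solve the normal equations once, on the small combined matrices.
beta <- drop(solve(xtx, xty))
names(beta) <- c("(Intercept)", "x1", "x2")
```

Only the small 3 x 3 and 3 x 1 summaries travel between workers, so the raw rows never need to be gathered in one place. The coefficients agree with those from lm() on the full data.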
R Server Hadoop Architecture
[Diagram labels, top to bottom]
• R Server: Master R process on Edge Node
• Apache YARN and Spark
• Worker R processes on Data Nodes
• Data in Distributed Storage
R Server for Hadoop - Connectivity
[Diagram labels]
• Edge Node: R Server Master Task (Initiator, Finalizer)
• Worker Tasks (three shown); "or MapReduce"
• Remote Execution: ssh; ssh or R Tools for Visual Studio
• Web Services: DeployR
• Clients: BI Tools & Applications, Jupyter Notebooks, Thin Client IDEs
• Connections: https://
HDInsight + R Server: Managed Hadoop for Advanced Analytics in the Cloud
[Diagram labels: SparkR functions, RevoScaleR functions; R; Spark and Hadoop; Blob Storage; Data Lake Storage]
• Easy setup, elastic, SLA
• Spark
  – Integrated notebooks experience
  – Upgraded to latest Version 1.6.1
• R Server
  – Leverage R skills with massively scalable algorithms and statistical functions
  – Reuse existing R functions over multiple machines
R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data
[Figure: Logistic Regression on NYC Taxi Dataset (2.2 TB). x-axis: billions of rows (0 to 13); y-axis: elapsed time.]
Typical advanced analytics lifecycle
Prepare → Model → Operationalize
• Clean/Join – Using SparkR from R Server
• Train/Score/Evaluate – Scalable R Server functions
• Deploy/Consume – Using AzureML from R Server
Airline Arrival Delay Prediction Demo
Airline data set
• Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection
• >20 years of data
• 300+ Airports
• Every carrier, every commercial flight
• http://www.transtats.bts.gov
Weather data set
• Hourly land-based weather observations from NOAA
• > 2,000 weather stations
• http://www.ncdc.noaa.gov/orders/qclcd/
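Combining the two data sets means matching each flight to the weather observed at its airport in the same hour. The demo does this with SparkR at scale; the sketch below shows only the shape of the join, using base R's merge() on a few made-up rows. All column names and values here are hypothetical and are not taken from the actual TranStats or NOAA schemas.

```r
# Hypothetical flight records: origin airport, departure hour, delay flag.
flights <- data.frame(
  origin    = c("SFO", "SFO", "JFK", "ORD"),
  dep_hour  = c("2012-01-01 08", "2012-01-01 09",
                "2012-01-01 08", "2012-01-01 10"),
  arr_del15 = c(0L, 1L, 0L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical hourly weather observations keyed by airport and hour.
weather <- data.frame(
  airport    = c("SFO", "SFO", "JFK"),
  obs_hour   = c("2012-01-01 08", "2012-01-01 09", "2012-01-01 08"),
  visibility = c(10, 2, 8),
  wind_speed = c(5, 20, 12),
  stringsAsFactors = FALSE
)

# Left join: keep every flight, attach weather where an observation exists.
joined <- merge(flights, weather,
                by.x = c("origin", "dep_hour"),
                by.y = c("airport", "obs_hour"),
                all.x = TRUE)
```

Flights with no matching observation (the ORD row here) keep NA in the weather columns, so the cleaning step has to decide whether to drop or impute them before training.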
Provisioning a cluster with R Server
Scaling a cluster
Clean and Join using SparkR in R Server
Train, Score, and Evaluate using R Server
Publish Web Service from R
Demo Technologies
• HDInsight Premium Hadoop cluster
• Spark on YARN distributed computing
• R Server R interpreter
• SparkR data manipulation functions
• RevoScaleR Statistical & Machine Learning functions
• AzureML R package and Azure ML web service
Building a genetic disease risk application with R
Data
• Public genome data from 1000 Genomes
• About 2TB of raw data
Processing
• VariantTools R package (Bioconductor)
• Match against NHGRI GWAS catalog
Analytics
• Disease Risk
• Ancestry
Presentation
• Expose as Web Service APIs
• Phone app, Web page, Enterprise applications
[Diagram labels: BAM files, VariantTools, GWAS]
Platform
• HDInsight Hadoop (8 clusters)
• 1500 cores, 4 data centers
• Microsoft R Server
The Four Transformational Trends
• Cloud computing: 5x increase from 2011 to 2016
• Data science: universities filling 300,000 US talent gap
• Big data: 90% of the data in the world today has been created in the last two years alone
• Open source: including R, Linux, Hadoop
microsoft.com/r-server
microsoft.com/hdinsight
Microsoft R Server for Data Sciencea
Data Science Thailand
 
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Big Data Predictive Analytics with Revolution R Enterprise (Gartner BI Summit...
Revolution Analytics
 
In-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and RevolutionIn-Database Analytics Deep Dive with Teradata and Revolution
In-Database Analytics Deep Dive with Teradata and Revolution
Revolution Analytics
 
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Performance and Scale Options for R with Hadoop: A comparison of potential ar...
Revolution Analytics
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Value Association
 
Ad

Building a Scalable Data Science Platform with R

  • 1. Building a Scalable Data Science Platform with R. Mario Inchiosa, PhD, Principal Software Engineer. Hadoop Summit San Jose, June 30, 2016
  • 2. What is R? (slide labels: Language, Platform, Community, Ecosystem)
      – The most popular statistical programming language
      – A data visualization tool
      – Open source
      – 2.5+M users
      – Taught in most universities
      – Thriving user groups worldwide
      – 8000+ contributed packages
      – New and recent grads use it
      – Rich application & platform integration
  • 3. Common R use cases by vertical (table columns: Sales & Marketing; Finance & Risk; Customer & Channel; Operations & Workforce)
      – Retail: Demand Forecasting; Loyalty Programs; Cross-sell & Upsell; Customer Acquisition; Fraud Detection; Pricing Strategy; Personalization; Lifetime Customer Value; Product Segmentation; Store Location Demographics; Supply Chain Management; Inventory Management
      – Financial Services: Customer Churn; Loyalty Programs; Cross-sell & Upsell; Customer Acquisition; Fraud Detection; Risk & Compliance; Loan Defaults; Personalization; Lifetime Customer Value; Call Center Optimization; Pay for Performance
      – Healthcare: Marketing Mix Optimization; Patient Acquisition; Fraud Detection; Bill Collection; Population Health; Patient Demographics; Operational Efficiency; Pay for Performance
      – Manufacturing: Demand Forecasting; Marketing Mix Optimization; Pricing Strategy; Perf Risk Management; Supply Chain Optimization; Personalization; Remote Monitoring; Predictive Maintenance; Asset Management
  • 4. R Adoption is on a Tear, but Open Source R is not Enterprise Class (IEEE Spectrum, July 2015)
      – Data flows overwhelm open source R: In-Memory Operation; Lack of Parallelism; Expensive Data Movement & Duplication
      – Not enterprise ready: Inadequacy of Community Support; Lack of Guaranteed Support Timeliness; No SLAs or Support models
  • 5. R from Microsoft brings: Peace of mind; Efficiency; Speed and scalability; Flexibility and agility
  • 6. Portability & investment assurance: Write Once – Deploy Anywhere. R Server portfolio (R Server Technology): Cloud; RDBMS; Desktops & Servers; Hadoop & Spark; EDW
  • 7. R Script for Execution in MapReduce. Sample R script:
      – Define the compute context: rxSetComputeContext( RxHadoopMR(…) )
      – Define the data source: inData <- RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)
      – Train the predictive model: model <- rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)
  • 8. Easy to Switch from MapReduce to Spark. Change the compute context and keep the other code unchanged. Sample R script: rxSetComputeContext( RxSpark(…) )
  • 9. R Server: scale-out R, Enterprise Class!
      – 100% compatible with open source R: any code/package that works today with R will work in R Server
      – Wide range of scalable and distributed R functions. Examples: rxDataStep(), rxSummary(), rxGlm(), rxDForest(), rxPredict()
      – Ability to parallelize any R function: ideal for parameter sweeps, simulation, scoring
  • 10. Parallelized & Distributed Algorithms
      – ETL: Data import (Delimited, Fixed, SAS, SPSS, ODBC); Variable creation & transformation; Recode variables; Factor variables; Missing value handling; Sort, Merge, Split; Aggregate by category (means, sums)
      – Statistical Tests: Chi Square Test; Kendall Rank Correlation; Fisher’s Exact Test; Student’s t-Test
      – Sampling: Subsample (observations & variables); Random Sampling
      – Descriptive Statistics: Min / Max, Mean, Median (approx.); Quantiles (approx.); Standard Deviation; Variance; Correlation; Covariance; Sum of Squares (cross product matrix for set variables); Pairwise Cross tabs; Risk Ratio & Odds Ratio; Cross-Tabulation of Data (standard tables & long form); Marginal Summaries of Cross Tabulations
      – Predictive Statistics: Sum of Squares (cross product matrix for set variables); Multiple Linear Regression; Generalized Linear Models (GLM) with exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie), standard link functions (cauchit, identity, log, logit, probit), and user defined distributions & link functions; Covariance & Correlation Matrices; Logistic Regression; Predictions/scoring for models; Residuals for all models
      – Machine Learning: K-Means Clustering; Decision Trees; Decision Forests; Gradient Boosted Decision Trees; Naïve Bayes
      – Simulation: Simulation (e.g. Monte Carlo); Parallel Random Number Generation
      – Custom Parallelization: rxDataStep; rxExec; PEMA-R API
      – Variable Selection: Stepwise Regression
  • 11. R Server Hadoop Architecture (diagram): R Server Master R process on Edge Node; Apache YARN and Spark; Worker R processes on Data Nodes; Data in Distributed Storage
  • 12. R Server for Hadoop - Connectivity (diagram labels): Edge Node; R Server Master Task; Initiator; Finalizer; Worker Tasks; MapReduce; Remote Execution: ssh; Web Services: DeployR; R Tools for Visual Studio; BI Tools & Applications; Jupyter Notebooks; Thin Client IDEs; connections over ssh or https://
  • 13. HDInsight + R Server: Managed Hadoop for Advanced Analytics in the Cloud (diagram: SparkR functions and RevoScaleR functions in R, on Spark and Hadoop, over Blob Storage and Data Lake Storage)
      – Easy setup, elastic, SLA
      – Spark: integrated notebooks experience; upgraded to latest Version 1.6.1
      – R Server: leverage R skills with massively scalable algorithms and statistical functions; reuse existing R functions over multiple machines
  • 14. R Server on Hadoop/HDInsight scales to hundreds of nodes, billions of rows and terabytes of data (chart: Logistic Regression on NYC Taxi Dataset, 2.2 TB; axes: Elapsed Time vs. Billions of rows)
  • 16. Airline Arrival Delay Prediction Demo
      – Clean/Join: using SparkR from R Server
      – Train/Score/Evaluate: scalable R Server functions
      – Deploy/Consume: using AzureML from R Server
  • 17. Airline data set
      – Passenger flight on-time performance data from the US Department of Transportation’s TranStats data collection
      – >20 years of data
      – 300+ Airports
      – Every carrier, every commercial flight
      – http://www.transtats.bts.gov
  • 18. Weather data set
      – Hourly land-based weather observations from NOAA
      – > 2,000 weather stations
      – http://www.ncdc.noaa.gov/orders/qclcd/
  • 19. Provisioning a cluster with R Server
  • 21. Clean and Join using SparkR in R Server
  • 22. Train, Score, and Evaluate using R Server
  • 24. Demo Technologies
      – HDInsight Premium Hadoop cluster
      – Spark on YARN distributed computing
      – R Server R interpreter
      – SparkR data manipulation functions
      – RevoScaleR Statistical & Machine Learning functions
      – AzureML R package and Azure ML web service
  • 25. Building a genetic disease risk application with R (diagram: BAM files processed by VariantTools and matched against GWAS)
      – Data: public genome data from 1000 Genomes; about 2TB of raw data
      – Processing: VariantTools R package (Bioconductor); match against NHGRI GWAS catalog
      – Analytics: Disease Risk; Ancestry
      – Presentation: expose as Web Service APIs; phone app, web page, enterprise applications
      – Platform: HDInsight Hadoop (8 clusters); 1500 cores, 4 data centers; Microsoft R Server
  • 26. The Four Transformational Trends
      – Cloud computing: 5x increase, 2011 to 2016
      – Data science: universities filling 300,000 US talent gap
      – Big data: 90% of the data in the world today has been created in the last two years alone
      – Open source: including R, Linux, Hadoop
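
Slides 7 and 8 show the same three-step script with only the compute context changed. The sketch below puts both variants in one function to make that point concrete. It is an illustration, not the presenter's code: it assumes Microsoft R Server (the RevoScaleR package) running on a Hadoop edge node, the function name fit_delay_model and its engine argument are hypothetical, and the compute-context arguments that the slides elide ("…") are left at their defaults. Without RevoScaleR installed, the block only defines the function.

```r
# Sketch only: needs Microsoft R Server (RevoScaleR) on a Hadoop edge node.
# fit_delay_model() and its `engine` argument are hypothetical names.
fit_delay_model <- function(engine = c("mapreduce", "spark")) {
  engine <- match.arg(engine)

  # 1. Define the compute context: the only step that differs per engine.
  #    The slides elide the constructor arguments, so defaults are used here.
  if (engine == "spark") {
    RevoScaleR::rxSetComputeContext(RevoScaleR::RxSpark())
  } else {
    RevoScaleR::rxSetComputeContext(RevoScaleR::RxHadoopMR())
  }

  # 2. Define the data source: a CSV file stored in HDFS.
  hdfsFS <- RevoScaleR::RxHdfsFileSystem()
  inData <- RevoScaleR::RxTextData("/ds/AirOnTime.csv", fileSystem = hdfsFS)

  # 3. Train the predictive model: the logistic regression from slide 7.
  RevoScaleR::rxLogit(ARR_DEL15 ~ DAY_OF_WEEK + UNIQUE_CARRIER, data = inData)
}

# Only attempt the distributed run where RevoScaleR is actually available.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  model <- fit_delay_model("spark")
  print(summary(model))
}
```

Calling fit_delay_model("mapreduce") instead would run the identical data-source and modeling steps under MapReduce, which is the portability claim the two slides make.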

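Slide 9 lists the ability to parallelize any R function for parameter sweeps and simulation, and slide 10 names rxExec for custom parallelization. As a rough illustration of the idea, the sketch below uses the parallel package that ships with open source R as a stand-in, not rxExec itself. The function estimate_pi and the sample sizes are made up for the example.

```r
library(parallel)

# An ordinary R function with no parallel code in it:
# a Monte Carlo estimate of pi from n random points in the unit square.
estimate_pi <- function(n) {
  x <- runif(n)
  y <- runif(n)
  4 * mean(x^2 + y^2 <= 1)
}

# Start two local worker R processes (stand-ins for cluster nodes).
cl <- makeCluster(2)
clusterSetRNGStream(cl, iseed = 42)  # independent random streams per worker

# Parameter sweep: run the same function once per sample size,
# with the calls spread across the workers.
sample_sizes <- c(1e5, 2e5, 4e5, 8e5)
estimates <- unlist(parLapply(cl, sample_sizes, estimate_pi))

stopCluster(cl)
print(data.frame(n = sample_sizes, estimate = estimates))
```

The function being swept does not need to know it is running in parallel; only the dispatch call changes, which is the property that makes this pattern a fit for simulations and scoring jobs.
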
Editor's Notes

  • #2: Abstract: Hadoop is famously scalable. Cloud Computing is famously scalable. R – the thriving and extensible open source Data Science software – not so much. But what if we seamlessly combined Hadoop, Cloud Computing, and R to create a scalable Data Science platform? Imagine exploring, transforming, modeling, and scoring data at any scale from the comfort of your favorite R environment. Now, imagine calling a simple R function to operationalize your predictive model as a scalable, cloud-based Web Services API. Come learn how to leverage the magic of Hadoop on-premises or in the cloud to run your R code, thousands of open source R extension packages, and distributed implementations of the most popular machine learning algorithms at scale.
  • #26: Disease Risk Prediction Pipeline
      – Alignment of unaligned short reads: done beforehand using MSR’s SNAP
      – Variant Calling: pick the most likely letter (SNP) from ~30 overlapping reads
      – Risk Scoring: look up disease risks for each SNP in the GWAS table and aggregate by disease
      – The Thousand Genomes Project: raw sequence data, unaligned short reads
      – The Bioconductor R Packages: VariantTools
      – Genome-Wide Association Studies (GWAS): associations between SNPs (i.e. genetic mutations) and disease risk, compiled by the National Human Genome Research Institute
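
The risk-scoring step in note #26 (look up each SNP in the GWAS table, then aggregate by disease) can be sketched in base R as a join followed by a grouped summary. Everything in the example is hypothetical: the SNP identifiers, diseases, and effect sizes are invented, and summing log odds ratios is just one simple way to aggregate, since the notes do not say which method the application used.

```r
# Hypothetical variant calls for one person: one row per SNP.
calls <- data.frame(
  snp      = c("rs0001", "rs0002", "rs0003", "rs0004"),
  genotype = c("A", "G", "T", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical GWAS-style lookup table: risk allele and effect size per SNP.
gwas <- data.frame(
  snp         = c("rs0001", "rs0002", "rs0003", "rs0005"),
  risk_allele = c("A", "C", "T", "G"),
  disease     = c("disease_x", "disease_x", "disease_y", "disease_y"),
  log_odds    = c(0.20, 0.35, 0.10, 0.50),
  stringsAsFactors = FALSE
)

# Look up each called SNP in the table; SNPs missing from either side drop out.
hits <- merge(calls, gwas, by = "snp")

# Keep only the SNPs where the called letter matches the risk allele.
hits <- hits[hits$genotype == hits$risk_allele, ]

# Aggregate by disease (assumption: a plain sum of log odds ratios).
risk <- aggregate(log_odds ~ disease, data = hits, FUN = sum)
print(risk)
```

At the scale described on slide 25, the same join-and-aggregate shape would run per genome across the cluster rather than on in-memory data frames.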