Visualizing the
Model Selection
Process
Benjamin Bengfort
@bbengfort
District Data Labs
Abstract
Machine learning is the hacker art of describing the features of instances that we want to
make predictions about, then fitting the data that describes those instances to a model
form. Applied machine learning has come a long way from its beginnings in academia, and
with tools like Scikit-Learn, it's easier than ever to generate operational models for a wide
variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the
primary job of the data scientist is model selection. Model selection involves performing
feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of
machine learning often lead computer scientists towards automatic model selection via
optimization (maximization) of a model's evaluation metric. However, the search space is
large, and grid search approaches to machine learning can easily lead to failure and
frustration. Human intuition is still essential to machine learning, and visual analysis in
concert with automatic methods can allow data scientists to steer model selection towards
better fitted models, faster. In this talk, we will discuss interactive visual methods for better
understanding, steering, and tuning machine learning models.
So I read about this
great ML model
Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for
recommender systems." Computer 42.8 (2009): 30-37.
import numpy as np

def nnmf(R, k=2, steps=5000, alpha=0.0002, beta=0.02):
    n, m = R.shape
    P = np.random.rand(n, k)
    Q = np.random.rand(m, k).T
    for step in range(steps):
        # Gradient descent update on each observed (non-zero) entry of R
        for idx in range(n):
            for jdx in range(m):
                if R[idx][jdx] > 0:
                    eij = R[idx][jdx] - np.dot(P[idx,:], Q[:,jdx])
                    for kdx in range(k):
                        P[idx][kdx] = P[idx][kdx] + alpha * (2 * eij * Q[kdx][jdx] - beta * P[idx][kdx])
                        Q[kdx][jdx] = Q[kdx][jdx] + alpha * (2 * eij * P[idx][kdx] - beta * Q[kdx][jdx])
        # Total squared error over the observed entries
        e = 0
        for idx in range(n):
            for jdx in range(m):
                if R[idx][jdx] > 0:
                    e += (R[idx][jdx] - np.dot(P[idx,:], Q[:,jdx])) ** 2
        if e < 0.001:
            break
    return P, Q.T
Life with Scikit-Learn
from sklearn.decomposition import NMF
model = NMF(n_components=2, init='random', random_state=0)
model.fit(R)
from sklearn.decomposition import NMF, TruncatedSVD, PCA
models = [
    NMF(n_components=2, init='random', random_state=0),
    TruncatedSVD(n_components=2),
    PCA(n_components=2),
]
for model in models:
    model.fit(R)
So now I’m all
Made Possible by the Scikit-Learn API
Buitinck, Lars, et al. "API design for machine learning software: experiences from
the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).
class Estimator(object):
    def fit(self, X, y=None):
        """
        Fits estimator to data.
        """
        # set state of self
        return self
    def predict(self, X):
        """
        Predict response of X
        """
        # compute predictions pred
        return pred
class Transformer(Estimator):
    def transform(self, X):
        """
        Transforms the input data.
        """
        # transform X to X_prime
        return X_prime
class Pipeline(Transformer):
    @property
    def named_steps(self):
        """
        Returns a sequence of estimators
        """
        return self.steps
    @property
    def _final_estimator(self):
        """
        Terminating estimator
        """
        return self.steps[-1]
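A minimal sketch of the fit/predict contract using Scikit-Learn's real base classes. `MeanRegressor` is a hypothetical estimator invented for illustration; it is not part of Scikit-Learn.

```python
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin


class MeanRegressor(BaseEstimator, RegressorMixin):
    """Hypothetical estimator that always predicts the training mean."""

    def fit(self, X, y):
        # Learned state lives on attributes that end with an underscore
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        # One prediction per row of X
        return np.full(len(X), self.mean_)


X = np.arange(10, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

# fit() returns self, so calls can be chained
model = MeanRegressor().fit(X, y)
preds = model.predict(X)
```

Because it follows the same interface, an object like this can be dropped into a loop over models or a Pipeline alongside built-in estimators.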
Algorithm design
stays in the hands of
Academia
Wizardry When Applied
The Model Selection Triple
Arun Kumar http://bit.ly/2abVNrI
Feature Analysis
Algorithm Selection
Hyperparameter
Tuning
The Model Selection Triple
- Define a bounded, high
dimensional feature space
that can be effectively
modeled.
- Transform and manipulate
the space to make
modeling easier.
- Extract a feature
representation of each
instance in the space.
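The three feature analysis steps can be sketched as follows. The `instances` records are invented for illustration; `DictVectorizer` and `StandardScaler` are Scikit-Learn classes, chosen here as one possible way to extract and transform features.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical raw instances described by named attributes
instances = [
    {"color": "red", "width": 2.0, "height": 1.0},
    {"color": "blue", "width": 4.0, "height": 3.0},
    {"color": "red", "width": 6.0, "height": 5.0},
]

# Extract: one numeric vector per instance (categories are one-hot encoded)
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(instances)

# Transform: rescale every dimension to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X)

feature_names = vec.feature_names_
```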
Feature Analysis
Algorithm Selection
The Model Selection Triple
- Select a model family that
best/correctly defines the
relationship between the
variables of interest.
- Define a model form that
specifies exactly how
features interact to make a
prediction.
- Train a fitted model by
optimizing internal
parameters to the data.
Hyperparameter
Tuning
The Model Selection Triple
- Evaluate how the model
form is interacting with the
feature space.
- Identify hyperparameters
(parameters that affect
training or the prior, not
prediction)
- Tune the fitting and
prediction process by
modifying these params.
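A small sketch of the distinction between hyperparameters and fitted parameters, assuming ridge regression on synthetic data. `alpha` is set before training; `coef_` is learned from the data.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression data (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

norms = {}
for alpha in (0.01, 1.0, 100.0):
    # alpha is a hyperparameter: chosen by us, never learned
    model = Ridge(alpha=alpha).fit(X, y)
    # coef_ holds the internal parameters optimized during fit()
    norms[alpha] = float(np.linalg.norm(model.coef_))
```

Larger values of `alpha` shrink the learned coefficients, which is the kind of effect that tuning tries to steer.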
Can it be automated?
Regularization is a form of automatic feature analysis.
[Figure: two panels, each with axes X0 and X1]
L1 Regularization
A feature may be eliminated when its
coefficient is set to zero.
L2 Regularization
Features are kept balanced by minimizing the
relative change of coefficients during learning.
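A sketch comparing the two penalties on synthetic data where only the first two of ten features matter. The data, the seed, and the `alpha` values are assumptions for illustration; `Lasso` applies an L1 penalty and `Ridge` an L2 penalty.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.RandomState(42)
X = rng.normal(size=(200, 10))

# Only the first two features influence the target
true_coef = np.zeros(10)
true_coef[:2] = [5.0, -3.0]
y = X @ true_coef + rng.normal(scale=0.5, size=200)

lasso = Lasso(alpha=0.5).fit(X, y)
ridge = Ridge(alpha=0.5).fit(X, y)

# L1 can set coefficients exactly to zero; L2 only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```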
Automatic Model Selection Criteria
from sklearn.model_selection import KFold
kfolds = KFold(n_splits=12)
scores = [
    model.fit(
        X[train], y[train]
    ).score(
        X[test], y[test]
    )
    for train, test in kfolds.split(X)
]
F1
R2
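A sketch of scoring with these two criteria through `cross_val_score`, assuming synthetic datasets and simple linear models. F1 applies to classifiers and R2 to regressors.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, one classification task and one regression task
Xc, yc = make_classification(n_samples=240, n_features=8, random_state=0)
Xr, yr = make_regression(n_samples=240, n_features=8, noise=5.0, random_state=0)

kfold = KFold(n_splits=12, shuffle=True, random_state=0)

# One score per fold for each criterion
f1_scores = cross_val_score(
    LogisticRegression(max_iter=1000), Xc, yc, cv=kfold, scoring="f1"
)
r2_scores = cross_val_score(LinearRegression(), Xr, yr, cv=kfold, scoring="r2")
```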
Automatic Model Selection: Try Them All!
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import KFold, cross_val_score
classifiers = [
    KNeighborsClassifier(5),
    SVC(kernel="linear", C=0.025),
    RandomForestClassifier(max_depth=5),
    AdaBoostClassifier(),
    GaussianNB(),
]
kfold = KFold(n_splits=12)
max([
    cross_val_score(model, X, y, cv=kfold).mean()
    for model in classifiers
])
Automatic Model Selection: Search Param Space
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', SGDClassifier()),
])
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'model__alpha': (0.00001, 0.000001),
    'model__penalty': ('l2', 'elasticnet'),
}
search = GridSearchCV(pipeline, parameters)
search.fit(X, y)
Maybe not so Wizard?
Automatic Model Selection: Search?
Search is difficult, particularly
in high-dimensional space.
Even with techniques like
genetic algorithms or particle
swarm optimization, there is
no guarantee of a solution.
As the search space gets
larger, the amount of time
increases exponentially.
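A quick way to see the growth is to count the candidates in a parameter grid. The sketch below reuses the grid from the text pipeline slide; the 5-fold figure is an assumption about the cross-validation setting.

```python
from sklearn.model_selection import ParameterGrid

parameters = {
    "vect__max_df": (0.5, 0.75, 1.0),
    "vect__max_features": (None, 5000, 10000),
    "tfidf__use_idf": (True, False),
    "tfidf__norm": ("l1", "l2"),
    "model__alpha": (0.00001, 0.000001),
    "model__penalty": ("l2", "elasticnet"),
}

# Every combination is a candidate: 3 * 3 * 2 * 2 * 2 * 2
n_candidates = len(ParameterGrid(parameters))

# Assuming 5-fold cross validation, each candidate is fitted 5 times
n_fits = n_candidates * 5
```

Each additional hyperparameter multiplies the count by its number of values, so the total grows exponentially with the number of dimensions searched.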
Anscombe, Francis J. "Graphs in statistical analysis."
The American Statistician 27.1 (1973): 17-21.
Anscombe’s Quartet
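A sketch that computes summary statistics for the four datasets from Anscombe (1973). The means and correlations come out nearly identical, which is why the differences only show up when the data is plotted.

```python
import numpy as np

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (
        [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
    ),
}

# (mean of x, mean of y, Pearson correlation) for each dataset
summary = {}
for name, (x, y) in quartet.items():
    x = np.array(x, dtype=float)
    y = np.array(y, dtype=float)
    summary[name] = (x.mean(), y.mean(), np.corrcoef(x, y)[0, 1])
```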
Through visualization
we can steer the model
selection process
Model Selection Management Systems
Kumar, Arun, et al. "Model selection management systems: The next frontier of
advanced analytics." ACM SIGMOD Record 44.4 (2016): 17-22.
Optimized Implementations
User Interfaces and DSLs
Model Selection Triples
{ {FE} × {AS} × {HT} }
Can we visualize
machine learning?
Data Management
Wrangling
Standardization
Normalization
Selection & Joins
Model Evaluation +
Hyperparameter Tuning
Model Selection
Feature Analysis
Linear
Models
Nearest
Neighbors
SVM
Ensemble Trees Bayes
Feature
Analysis
Feature
Selection
Model
Selection
Revisit
Features
Iterate!
Initial
Model
Model
Storage
Data and Model Management
Is “GitHub for Data” Enough?
Visualizing Feature Analysis
SPLOM (Scatterplot Matrices)
Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive
exploration of multidimensional data." Information visualization 4.2 (2005): 96-113.
Visual Rank by Feature: 1 Dimension
Rank by:
1. Normality of distribution
(Shapiro-Wilk and
Kolmogorov-Smirnov)
2. Uniformity of distribution
(entropy)
3. Number of potential outliers
4. Number of hapaxes
5. Size of gap
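A sketch of the first ranking criterion, assuming three synthetic features. Each feature is scored with the Shapiro-Wilk statistic from SciPy, where values closer to 1 indicate a more normal-looking distribution.

```python
import numpy as np
from scipy.stats import shapiro

# Synthetic features with different shapes (an assumption for illustration)
rng = np.random.RandomState(0)
features = {
    "gaussian": rng.normal(size=300),
    "uniform": rng.uniform(size=300),
    "exponential": rng.exponential(size=300),
}

# shapiro returns (statistic, p-value); keep the statistic as the score
scores = {name: shapiro(values)[0] for name, values in features.items()}

# Most normal-looking feature first
ranking = sorted(scores, key=scores.get, reverse=True)
```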
Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive
exploration of multidimensional data." Information visualization 4.2 (2005): 96-113.
Visual Rank by Feature: 2 Dimensions
Rank by:
1. Correlation Coefficient
(Pearson, Spearman)
2. Least-squares error
3. Quadracity
4. Density based outlier
detection.
5. Uniformity (entropy of grids)
6. Number of items in the most
dense region of the plot.
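A sketch of the first two-dimensional criterion, assuming three synthetic features. Every pair of features is scored by the absolute Pearson correlation, then the pairs are ranked.

```python
from itertools import combinations

import numpy as np

rng = np.random.RandomState(1)
a = rng.normal(size=500)
data = {
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=500),  # nearly linear in a
    "c": rng.normal(size=500),  # unrelated noise
}

# Score every unordered pair of features
pair_scores = {
    (u, v): abs(np.corrcoef(data[u], data[v])[0, 1])
    for u, v in combinations(data, 2)
}

# Most strongly correlated pair first
ranked_pairs = sorted(pair_scores, key=pair_scores.get, reverse=True)
```

The top-ranked pairs are the ones worth opening in a scatterplot or joint plot first.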
Joint Plots: Diving Deeper after Rank by Feature
Special thanks to Seaborn for doing statistical visualization right!
Detecting Separability
Radviz: Radial Visualization
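A sketch of the projection behind a RadViz plot, written with NumPy only. `radviz_coordinates` is a hypothetical helper, not a library function: each feature gets an anchor on the unit circle, and each instance is placed at the average of the anchors weighted by its min-max scaled feature values.

```python
import numpy as np


def radviz_coordinates(X):
    """Project rows of X to 2D points inside the unit circle."""
    X = np.asarray(X, dtype=float)

    # Min-max scale each feature to [0, 1]; constant columns become 0
    low = X.min(axis=0)
    span = X.max(axis=0) - low
    span[span == 0] = 1.0
    W = (X - low) / span

    # One anchor per feature, evenly spaced on the unit circle
    angles = 2 * np.pi * np.arange(X.shape[1]) / X.shape[1]
    anchors = np.column_stack([np.cos(angles), np.sin(angles)])

    # Weighted average of anchors; all-zero rows stay at the origin
    totals = W.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0
    return (W @ anchors) / totals


X = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
points = radviz_coordinates(X)
```

An instance dominated by one feature lands near that feature's anchor, while an instance with balanced features lands near the center.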
Parallel Coordinates
Decomposition (PCA, SVD) of Feature Space
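A sketch of reducing a feature space to two dimensions for plotting, assuming the iris dataset bundled with Scikit-Learn and standardized features.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Project the four features onto two components
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
X_svd = TruncatedSVD(n_components=2, random_state=0).fit_transform(X_std)

# Share of the total variance kept by the two PCA components
explained = pca.explained_variance_ratio_.sum()
```

The two columns of `X_pca` can be used as scatterplot coordinates, colored by `y` to inspect class separability.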
Visualizing Model Selection
Confusion Matrices
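A sketch of computing the matrix behind such a plot, using invented labels. In Scikit-Learn's `confusion_matrix`, rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels
y_true = ["cat", "cat", "dog", "dog", "dog", "bird"]
y_pred = ["cat", "dog", "dog", "dog", "cat", "bird"]
labels = ["bird", "cat", "dog"]

# cm[i, j] counts instances of class labels[i] predicted as labels[j]
cm = confusion_matrix(y_true, y_pred, labels=labels)
```

Off-diagonal cells show which classes are confused with each other, which a single accuracy number hides.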
Receiver Operating Characteristic (ROC) and Area Under Curve (AUC)
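A sketch of computing the points of the curve and the area under it, using four invented scores for a binary problem.

```python
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Hypothetical binary labels and classifier scores
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the curve by the trapezoidal rule
area = auc(fpr, tpr)
```

Plotting `tpr` against `fpr` gives the ROC curve; `roc_auc_score` computes the same area directly from labels and scores.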
Prediction Error Plots
Visualizing Residuals
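A sketch of what a residual plot can reveal, assuming synthetic data with a curved relationship that a straight line cannot capture.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends on x squared (an assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=200)

model = LinearRegression().fit(X, y)

# Residual = observed value minus predicted value
residuals = y - model.predict(X)

# Leftover structure: the residuals still track x squared
pattern = np.corrcoef(X.ravel() ** 2, residuals)[0, 1]
```

Residuals from a well-specified model look like noise around zero; a clear pattern like this one suggests a different model form.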
Model Families vs. Model Forms vs. Fitted Models
Rebecca Bilbro http://bit.ly/2a1YoTs
kNN Tuning Slider in 2 Dimensions
Scott Fortmann-Roe http://bit.ly/29P4SS1
Visualizing Evaluation/Tuning
Cross Validation Curves
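A sketch of the data behind a cross validation curve, assuming a k-nearest neighbors classifier on the iris dataset. `validation_curve` returns one row of scores per hyperparameter value and one column per fold.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
k_values = np.array([1, 3, 5, 9, 15, 25])

train_scores, test_scores = validation_curve(
    KNeighborsClassifier(),
    X,
    y,
    param_name="n_neighbors",
    param_range=k_values,
    cv=5,
)

# Average over folds to get one point per value of k
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
```

Plotting `train_mean` and `test_mean` against `k_values` shows where the gap between training and validation scores opens or closes.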
Visual Grid Search
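A sketch of arranging grid search results into a matrix that can be drawn as a heatmap, assuming an RBF support vector classifier on the iris dataset and a small grid over `C` and `gamma`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
C_values = [0.1, 1.0, 10.0]
gamma_values = [0.01, 0.1, 1.0]

search = GridSearchCV(
    SVC(kernel="rbf"), {"C": C_values, "gamma": gamma_values}, cv=5
)
search.fit(X, y)

# Rows index C, columns index gamma; each cell is a mean validation score
grid = np.zeros((len(C_values), len(gamma_values)))
results = search.cv_results_
for params, score in zip(results["params"], results["mean_test_score"]):
    row = C_values.index(params["C"])
    col = gamma_values.index(params["gamma"])
    grid[row, col] = score
```

The matrix shows the whole response surface, not just the single best cell, which helps decide where to extend or refine the search.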
Integrating Visual Model
Selection with Scikit-Learn
Yellowbrick
Scikit-Learn Pipelines: fit() and predict()
Data Loader
Transformer
Transformer
Estimator
Data Loader
Transformer
Transformer
Estimator
Transformer
Yellowbrick Visual Transformers
Data Loader
Transformer(s)
Feature
Visualization
Estimator
fit()
draw()
predict()
Data Loader
Transformer(s)
EstimatorCV
Evaluation
Visualization
fit()
predict()
score()
draw()
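A sketch of the idea of a visual transformer with fit() and draw(). `VarianceRankVisualizer` is a hypothetical class written for illustration and is not Yellowbrick's API; `draw` assumes it is handed a matplotlib Axes or any object with a compatible `bar` method.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class VarianceRankVisualizer(BaseEstimator, TransformerMixin):
    """Hypothetical visual transformer that ranks features by variance."""

    def fit(self, X, y=None):
        # Compute what will be drawn while the pipeline is being fitted
        X = np.asarray(X, dtype=float)
        self.scores_ = X.var(axis=0)
        self.order_ = np.argsort(self.scores_)[::-1]
        return self

    def transform(self, X):
        # Pass data through unchanged so later steps are unaffected
        return X

    def draw(self, ax):
        # ax is assumed to be a matplotlib Axes
        labels = [f"x{i}" for i in self.order_]
        ax.bar(labels, self.scores_[self.order_])
        return ax


X = np.array([[1.0, 10.0, 0.0], [2.0, 30.0, 0.0], [3.0, 50.0, 0.0]])
viz = VarianceRankVisualizer()
X_out = viz.fit_transform(X)
```

Because the step leaves the data untouched, it can sit between transformers and the final estimator without changing what the model sees.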
Model Selection Pipelines
Multi-Estimator
Visualization
Data Loader
Transformer(s)
Estimator Estimator Estimator Estimator
Cross Validation Cross Validation Cross Validation Cross Validation
Employ Interactivity to Visualize More
Health and Wealth of Nations Recreated by Mike Bostock
Originally by Hans Rosling http://bit.ly/29RYBJD
Visual Analytics Mantra:
Overview First; Zoom & Filter; Details on Demand
Heer, Jeffrey, and Ben Shneiderman. "Interactive dynamics
for visual analysis." Queue 10.2 (2012): 30.
Codename Trinket
Visual Model Management System
Yellowbrick
http://bit.ly/2a5otxB
DDL Trinket
http://bit.ly/2a2Y0jy
DDL Open Source Projects on GitHub
Questions!
An Overview of Spanner: Google's Globally Distributed DatabaseAn Overview of Spanner: Google's Globally Distributed Database
An Overview of Spanner: Google's Globally Distributed Database
Benjamin Bengfort
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkX
Benjamin Bengfort
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
Benjamin Bengfort
 
Building Data Apps with Python
Building Data Apps with PythonBuilding Data Apps with Python
Building Data Apps with Python
Benjamin Bengfort
 

Recently uploaded (20)

Preclinical Advances in Nuclear Neurology.pptx
Preclinical Advances in Nuclear Neurology.pptxPreclinical Advances in Nuclear Neurology.pptx
Preclinical Advances in Nuclear Neurology.pptx
MahitaLaveti
 
Introduction to Black Hole and how its formed
Introduction to Black Hole and how its formedIntroduction to Black Hole and how its formed
Introduction to Black Hole and how its formed
MSafiullahALawi
 
External Application in Homoeopathy- Definition,Scope and Types.
External Application  in Homoeopathy- Definition,Scope and Types.External Application  in Homoeopathy- Definition,Scope and Types.
External Application in Homoeopathy- Definition,Scope and Types.
AdharshnaPatrick
 
Mycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes FungiMycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes Fungi
SAYANTANMALLICK5
 
Applications of Radioisotopes in Cancer Research.pptx
Applications of Radioisotopes in Cancer Research.pptxApplications of Radioisotopes in Cancer Research.pptx
Applications of Radioisotopes in Cancer Research.pptx
MahitaLaveti
 
Somato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptxSomato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptx
klynct
 
Sleep_physiology_types_duration_underlying mech.
Sleep_physiology_types_duration_underlying mech.Sleep_physiology_types_duration_underlying mech.
Sleep_physiology_types_duration_underlying mech.
klynct
 
Eric Schott- Environment, Animal and Human Health (3).pptx
Eric Schott- Environment, Animal and Human Health (3).pptxEric Schott- Environment, Animal and Human Health (3).pptx
Eric Schott- Environment, Animal and Human Health (3).pptx
ttalbert1
 
The Microbial World. Microbiology , Microbes, infections
The Microbial World. Microbiology , Microbes, infectionsThe Microbial World. Microbiology , Microbes, infections
The Microbial World. Microbiology , Microbes, infections
NABIHANAEEM2
 
Subject name: Introduction to psychology
Subject name: Introduction to psychologySubject name: Introduction to psychology
Subject name: Introduction to psychology
beebussy155
 
Freshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and FactorsFreshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and Factors
mytriplemonlineshop
 
Secondary metabolite ,Plants and Health Care
Secondary metabolite ,Plants and Health CareSecondary metabolite ,Plants and Health Care
Secondary metabolite ,Plants and Health Care
Nistarini College, Purulia (W.B) India
 
AP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of LifeAP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of Life
mseileenlinden
 
Carboxylic-Acid-Derivatives.lecture.presentation
Carboxylic-Acid-Derivatives.lecture.presentationCarboxylic-Acid-Derivatives.lecture.presentation
Carboxylic-Acid-Derivatives.lecture.presentation
GLAEXISAJULGA
 
Fatigue and its management in aviation medicine
Fatigue and its management in aviation medicineFatigue and its management in aviation medicine
Fatigue and its management in aviation medicine
ImranJewel2
 
Black hole and its division and categories
Black hole and its division and categoriesBlack hole and its division and categories
Black hole and its division and categories
MSafiullahALawi
 
Water Pollution control using microorganisms
Water Pollution control using microorganismsWater Pollution control using microorganisms
Water Pollution control using microorganisms
gerefam247
 
Batteries and fuel cells for btech first year
Batteries and fuel cells for btech first yearBatteries and fuel cells for btech first year
Batteries and fuel cells for btech first year
MithilPillai1
 
Astrobiological implications of the stability andreactivity of peptide nuclei...
Astrobiological implications of the stability andreactivity of peptide nuclei...Astrobiological implications of the stability andreactivity of peptide nuclei...
Astrobiological implications of the stability andreactivity of peptide nuclei...
Sérgio Sacani
 
Anti fungal agents Medicinal Chemistry III
Anti fungal agents Medicinal Chemistry  IIIAnti fungal agents Medicinal Chemistry  III
Anti fungal agents Medicinal Chemistry III
HRUTUJA WAGH
 
Preclinical Advances in Nuclear Neurology.pptx
Preclinical Advances in Nuclear Neurology.pptxPreclinical Advances in Nuclear Neurology.pptx
Preclinical Advances in Nuclear Neurology.pptx
MahitaLaveti
 
Introduction to Black Hole and how its formed
Introduction to Black Hole and how its formedIntroduction to Black Hole and how its formed
Introduction to Black Hole and how its formed
MSafiullahALawi
 
External Application in Homoeopathy- Definition,Scope and Types.
External Application  in Homoeopathy- Definition,Scope and Types.External Application  in Homoeopathy- Definition,Scope and Types.
External Application in Homoeopathy- Definition,Scope and Types.
AdharshnaPatrick
 
Mycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes FungiMycology:Characteristics of Ascomycetes Fungi
Mycology:Characteristics of Ascomycetes Fungi
SAYANTANMALLICK5
 
Applications of Radioisotopes in Cancer Research.pptx
Applications of Radioisotopes in Cancer Research.pptxApplications of Radioisotopes in Cancer Research.pptx
Applications of Radioisotopes in Cancer Research.pptx
MahitaLaveti
 
Somato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptxSomato_Sensory _ somatomotor_Nervous_System.pptx
Somato_Sensory _ somatomotor_Nervous_System.pptx
klynct
 
Sleep_physiology_types_duration_underlying mech.
Sleep_physiology_types_duration_underlying mech.Sleep_physiology_types_duration_underlying mech.
Sleep_physiology_types_duration_underlying mech.
klynct
 
Eric Schott- Environment, Animal and Human Health (3).pptx
Eric Schott- Environment, Animal and Human Health (3).pptxEric Schott- Environment, Animal and Human Health (3).pptx
Eric Schott- Environment, Animal and Human Health (3).pptx
ttalbert1
 
The Microbial World. Microbiology , Microbes, infections
The Microbial World. Microbiology , Microbes, infectionsThe Microbial World. Microbiology , Microbes, infections
The Microbial World. Microbiology , Microbes, infections
NABIHANAEEM2
 
Subject name: Introduction to psychology
Subject name: Introduction to psychologySubject name: Introduction to psychology
Subject name: Introduction to psychology
beebussy155
 
Freshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and FactorsFreshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and Factors
mytriplemonlineshop
 
AP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of LifeAP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of Life
mseileenlinden
 
Carboxylic-Acid-Derivatives.lecture.presentation
Carboxylic-Acid-Derivatives.lecture.presentationCarboxylic-Acid-Derivatives.lecture.presentation
Carboxylic-Acid-Derivatives.lecture.presentation
GLAEXISAJULGA
 
Fatigue and its management in aviation medicine
Fatigue and its management in aviation medicineFatigue and its management in aviation medicine
Fatigue and its management in aviation medicine
ImranJewel2
 
Black hole and its division and categories
Black hole and its division and categoriesBlack hole and its division and categories
Black hole and its division and categories
MSafiullahALawi
 
Water Pollution control using microorganisms
Water Pollution control using microorganismsWater Pollution control using microorganisms
Water Pollution control using microorganisms
gerefam247
 
Batteries and fuel cells for btech first year
Batteries and fuel cells for btech first yearBatteries and fuel cells for btech first year
Batteries and fuel cells for btech first year
MithilPillai1
 
Astrobiological implications of the stability andreactivity of peptide nuclei...
Astrobiological implications of the stability andreactivity of peptide nuclei...Astrobiological implications of the stability andreactivity of peptide nuclei...
Astrobiological implications of the stability andreactivity of peptide nuclei...
Sérgio Sacani
 
Anti fungal agents Medicinal Chemistry III
Anti fungal agents Medicinal Chemistry  IIIAnti fungal agents Medicinal Chemistry  III
Anti fungal agents Medicinal Chemistry III
HRUTUJA WAGH
 

Visualizing the Model Selection Process

  • 1. Visualizing the Model Selection Process Benjamin Bengfort @bbengfort District Data Labs
  • 2. Abstract Machine learning is the hacker art of describing the features of instances that we want to make predictions about, then fitting the data that describes those instances to a model form. Applied machine learning has come a long way from its beginnings in academia, and with tools like Scikit-Learn, it's easier than ever to generate operational models for a wide variety of applications. Thanks to the ease and variety of the tools in Scikit-Learn, the primary job of the data scientist is model selection. Model selection involves performing feature engineering, hyperparameter tuning, and algorithm selection. These dimensions of machine learning often lead computer scientists towards automatic model selection via optimization (maximization) of a model's evaluation metric. However, the search space is large, and grid search approaches to machine learning can easily lead to failure and frustration. Human intuition is still essential to machine learning, and visual analysis in concert with automatic methods can allow data scientists to steer model selection towards better fitted models, faster. In this talk, we will discuss interactive visual methods for better understanding, steering, and tuning machine learning models.
  • 3. So I read about this great ML model
  • 4. Koren, Yehuda, Robert Bell, and Chris Volinsky. "Matrix factorization techniques for recommender systems." Computer 42.8 (2009): 30-37.
  • 5.
      import numpy as np

      def nnmf(R, k=2, steps=5000, alpha=0.0002, beta=0.02):
          # Factor R (n x m) into P (n x k) and Q (m x k) by gradient descent
          n, m = R.shape
          P = np.random.rand(n, k)
          Q = np.random.rand(m, k).T

          for step in range(steps):
              # Update both factors on every observed (non-zero) entry of R
              for idx in range(n):
                  for jdx in range(m):
                      if R[idx][jdx] > 0:
                          eij = R[idx][jdx] - np.dot(P[idx, :], Q[:, jdx])
                          for kdx in range(k):
                              P[idx][kdx] = P[idx][kdx] + alpha * (2 * eij * Q[kdx][jdx] - beta * P[idx][kdx])
                              Q[kdx][jdx] = Q[kdx][jdx] + alpha * (2 * eij * P[idx][kdx] - beta * Q[kdx][jdx])

              # Squared reconstruction error over the observed entries
              e = 0
              for idx in range(n):
                  for jdx in range(m):
                      if R[idx][jdx] > 0:
                          e += (R[idx][jdx] - np.dot(P[idx, :], Q[:, jdx])) ** 2

              # Stop early once the error is small enough
              if e < 0.001:
                  break

          return P, Q.T
  • 7.
      from sklearn.decomposition import NMF

      model = NMF(n_components=2, init='random', random_state=0)
      model.fit(R)
  • 8.
      from sklearn.decomposition import NMF, TruncatedSVD, PCA

      models = [
          NMF(n_components=2, init='random', random_state=0),
          TruncatedSVD(n_components=2),
          PCA(n_components=2),
      ]

      for model in models:
          model.fit(R)
  • 10. Made Possible by the Scikit-Learn API
    Buitinck, Lars, et al. "API design for machine learning software: experiences from the scikit-learn project." arXiv preprint arXiv:1309.0238 (2013).

      class Estimator(object):

          def fit(self, X, y=None):
              """ Fits estimator to data. """
              # set state of self
              return self

          def predict(self, X):
              """ Predict response of X """
              # compute predictions pred
              return pred

      class Transformer(Estimator):

          def transform(self, X):
              """ Transforms the input data. """
              # transform X to X_prime
              return X_prime

      class Pipeline(Transformer):

          @property
          def named_steps(self):
              """ Returns a sequence of estimators """
              return self.steps

          @property
          def _final_estimator(self):
              """ Terminating estimator """
              return self.steps[-1]
  • 11. Algorithm design stays in the hands of Academia
  • 13. The Model Selection Triple Arun Kumar http://bit.ly/2abVNrI Feature Analysis Algorithm Selection Hyperparameter Tuning
  • 14. The Model Selection Triple - Define a bounded, high dimensional feature space that can be effectively modeled. - Transform and manipulate the space to make modeling easier. - Extract a feature representation of each instance in the space. Feature Analysis
  • 15. Algorithm Selection The Model Selection Triple - Select a model family that best/correctly defines the relationship between the variables of interest. - Define a model form that specifies exactly how features interact to make a prediction. - Train a fitted model by optimizing internal parameters to the data.
  • 16. Hyperparameter Tuning The Model Selection Triple - Evaluate how the model form is interacting with the feature space. - Identify hyperparameters (parameters that affect training or the prior, not prediction) - Tune the fitting and prediction process by modifying these params.
  • 17. Can it be automated?
  • 18. Regularization is a form of automatic feature analysis. (Plots over axes X0 and X1.) L1 Normalization: possibility that a feature is eliminated by setting its coefficient equal to zero. L2 Normalization: features are kept balanced by minimizing the relative change of coefficients during learning.
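The contrast on slide 18 can be sketched in the simplest setting, where the features are orthonormal and both penalties have closed-form solutions. This is an illustration under that assumption, not the slide's own code: the penalty strength `lam`, the objectives ½‖y − Xb‖² + λ‖b‖₁ (L1) and ½‖y − Xb‖² + (λ/2)‖b‖² (L2), and the example coefficients are all chosen here for the example.

```python
import math

def l1_shrink(coefs, lam):
    """Soft-thresholding: the L1-penalized solution for orthonormal features.

    Any coefficient whose magnitude is at most lam becomes exactly zero.
    """
    return [math.copysign(max(abs(c) - lam, 0.0), c) for c in coefs]

def l2_shrink(coefs, lam):
    """Uniform shrinkage: the L2-penalized solution for orthonormal features.

    Every coefficient is scaled down, but none is eliminated.
    """
    return [c / (1.0 + lam) for c in coefs]

# Hypothetical unregularized (least-squares) coefficients
ols = [3.0, -0.2, 0.5]

l1 = l1_shrink(ols, lam=0.5)
l2 = l2_shrink(ols, lam=0.5)

print("L1:", l1)  # small coefficients are dropped entirely
print("L2:", l2)  # all coefficients survive, shrunk by the same factor
```

With correlated features the solutions no longer have this closed form, but the qualitative behaviour (L1 can zero out features, L2 only shrinks them) is the point the slide makes.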
  • 19. Automatic Model Selection Criteria (F1, R2)

      # sklearn.model_selection replaces the removed sklearn.cross_validation module
      from sklearn.model_selection import KFold

      kfolds = KFold(n_splits=12)
      scores = [
          model.fit(X[train], y[train]).score(X[test], y[test])
          for train, test in kfolds.split(X)
      ]
  • 20. Automatic Model Selection: Try Them All!

      from sklearn.svm import SVC
      from sklearn.neighbors import KNeighborsClassifier
      from sklearn.ensemble import RandomForestClassifier
      from sklearn.ensemble import AdaBoostClassifier
      from sklearn.naive_bayes import GaussianNB
      from sklearn.model_selection import KFold, cross_val_score

      classifiers = [
          KNeighborsClassifier(5),
          SVC(kernel="linear", C=0.025),
          RandomForestClassifier(max_depth=5),
          AdaBoostClassifier(),
          GaussianNB(),
      ]

      kfold = KFold(n_splits=12)

      # Best mean cross-validated score across all candidate models
      max([
          cross_val_score(model, X, y, cv=kfold).mean()
          for model in classifiers
      ])
  • 21. Automatic Model Selection: Search Param Space

      from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
      from sklearn.linear_model import SGDClassifier
      from sklearn.model_selection import GridSearchCV
      from sklearn.pipeline import Pipeline

      pipeline = Pipeline([
          ('vect', CountVectorizer()),
          ('tfidf', TfidfTransformer()),
          ('model', SGDClassifier()),
      ])

      # Keys are <step name>__<parameter name>
      parameters = {
          'vect__max_df': (0.5, 0.75, 1.0),
          'vect__max_features': (None, 5000, 10000),
          'tfidf__use_idf': (True, False),
          'tfidf__norm': ('l1', 'l2'),
          'model__alpha': (0.00001, 0.000001),
          'model__penalty': ('l2', 'elasticnet'),
      }

      search = GridSearchCV(pipeline, parameters)
      search.fit(X, y)
  • 22. Maybe not so Wizard?
  • 23. Automatic Model Selection: Search? Search is difficult particularly in high dimensional space. Even with techniques like genetic algorithms or particle swarm optimization, there is no guarantee of a solution. As the search space gets larger, the amount of time increases exponentially.
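To make the growth claim on slide 23 concrete, the sketch below counts the candidates in the parameter grid from slide 21. The number of candidates is the product of the number of values per parameter, so each added parameter multiplies the work. The helper name `grid_size` and the extra `vect__ngram_range` entry are assumptions added for illustration.

```python
from math import prod

def grid_size(grid):
    """Number of parameter combinations an exhaustive grid search must fit."""
    return prod(len(values) for values in grid.values())

# The grid from slide 21
parameters = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'model__alpha': (0.00001, 0.000001),
    'model__penalty': ('l2', 'elasticnet'),
}

n_candidates = grid_size(parameters)

# Hypothetical: one more parameter with three values triples the search
bigger = dict(parameters, vect__ngram_range=((1, 1), (1, 2), (1, 3)))
n_bigger = grid_size(bigger)

print(n_candidates, n_bigger)
```

Each candidate is also refit once per cross-validation fold, so the total number of model fits is the candidate count times the number of folds.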
  • 24. Anscombe, Francis J. "Graphs in statistical analysis." The American Statistician 27.1 (1973): 17-21. Anscombe’s Quartet
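The point of Anscombe's quartet can be checked numerically. The sketch below uses the four datasets from the 1973 paper and computes the mean of x, the mean of y, and the Pearson correlation for each. The helper names `mean` and `pearson` are written here for the example; only the data comes from the paper.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summary statistics per dataset: (mean of x, mean of y, correlation)
summary = {name: (mean(xs), mean(ys), pearson(xs, ys))
           for name, (xs, ys) in quartet.items()}

for name, (mx, my, r) in summary.items():
    print(f"{name:>3}: mean(x)={mx:.2f} mean(y)={my:.2f} r={r:.3f}")
```

The four rows print nearly identical numbers, yet the scatter plots look nothing alike, which is why summary statistics alone are a weak basis for model selection.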
  • 25. Through visualization we can steer the model selection process
  • 26. Model Selection Management Systems Kumar, Arun, et al. "Model selection management systems: The next frontier of advanced analytics." ACM SIGMOD Record 44.4 (2016): 17-22. Optimized Implementations User Interfaces and DSLs Model Selection Triples { {FE} × {AS} × {HT} }
  • 28. Data Management Wrangling Standardization Normalization Selection & Joins Model Evaluation + Hyperparameter Tuning Model Selection Feature Analysis Linear Models Nearest Neighbors SVM Ensemble Trees Bayes Feature Analysis Feature Selection Model Selection Revisit Features Iterate! Initial Model Model Storage
  • 29. Data and Model Management
  • 30. Is “GitHub for Data” Enough?
  • 33. Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive exploration of multidimensional data." Information visualization 4.2 (2005): 96-113. Visual Rank by Feature: 1 Dimension Rank by: 1. Normality of distribution (Shapiro-Wilk and Kolmogorov-Smirnov) 2. Uniformity of distribution (entropy) 3. Number of potential outliers 4. Number of hapaxes 5. Size of gap
  • 34. Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive exploration of multidimensional data." Information visualization 4.2 (2005): 96-113. Visual Rank by Feature: 1 Dimension Rank by: 1. Normality of distribution (Shapiro-Wilk and Kolmogorov-Smirnov) 2. Uniformity of distribution (entropy) 3. Number of potential outliers 4. Number of hapaxes 5. Size of gap
  • 35. Seo, Jinwook, and Ben Shneiderman. "A rank-by-feature framework for interactive exploration of multidimensional data." Information visualization 4.2 (2005): 96-113. Visual Rank by Feature: 2 Dimensions Rank by: 1. Correlation Coefficient (Pearson, Spearman) 2. Least-squares error 3. Quadracity 4. Density based outlier detection. 5. Uniformity (entropy of grids) 6. Number of items in the most dense region of the plot.
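As a rough illustration of the first two-dimensional criterion on slide 35, the sketch below ranks every pair of features by the absolute value of their Pearson correlation. It is a minimal stand-in for the idea, not the rank-by-feature framework from the paper; the function names and the toy columns are assumptions.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def rank_pairs(columns):
    """Return ((name_a, name_b), |r|) tuples, strongest correlation first."""
    scored = [
        ((a, b), abs(pearson(columns[a], columns[b])))
        for a, b in combinations(columns, 2)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical toy data: two related features and one unrelated feature
columns = {
    "height": [1, 2, 3, 4, 5],
    "weight": [2.1, 3.9, 6.2, 8.0, 9.9],
    "noise":  [3, 1, 4, 1, 5],
}

ranking = rank_pairs(columns)
for pair, score in ranking:
    print(pair, round(score, 3))
```

The top-ranked pairs are the ones worth inspecting first in a joint plot; the other criteria on the slide would each supply their own scoring function in place of `pearson`.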
  • 36. Joint Plots: Diving Deeper after Rank by Feature Special thanks to Seaborn for doing statistical visualization right!
  • 40. Decomposition (PCA, SVD) of Feature Space
  • 43. Receiver Operating Characteristic (ROC) and Area Under Curve (AUC)
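One way to read the AUC shown on this slide: it equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition. The function name `roc_auc` and the sample labels and scores are assumptions for the example, and the pairwise loop is only practical for small inputs.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true holds binary labels (1 = positive, 0 = negative).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.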
  • 46. Model Families vs. Model Forms vs. Fitted Models Rebecca Bilbro http://bit.ly/2a1YoTs
  • 47. kNN Tuning Slider in 2 Dimensions Scott Fortmann-Roe http://bit.ly/29P4SS1
  • 51. Integrating Visual Model Selection with Scikit-Learn Yellowbrick
  • 52. Scikit-Learn Pipelines: fit() and predict() Data Loader Transformer Transformer Estimator Data Loader Transformer Transformer Estimator Transformer
  • 53. Yellowbrick Visual Transformers Data Loader Transformer(s) Feature Visualization Estimator fit() draw() predict() Data Loader Transformer(s) EstimatorCV Evaluation Visualization fit() predict() score() draw()
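The visual transformer idea on slide 53 can be sketched as an object that follows the fit/transform convention, passes the data through unchanged, and adds a `draw()` step. The class below is a hypothetical illustration of that pattern only: it is not Yellowbrick's actual API, and `draw()` returns text bars as a stand-in for a real plot so the example has no plotting dependency.

```python
class FeatureRangeVisualizer:
    """Hypothetical visual transformer: learns per-feature ranges in fit(),
    leaves the data untouched in transform(), and renders them in draw()."""

    def __init__(self, width=20):
        self.width = width
        self.ranges_ = None

    def fit(self, X, y=None):
        # Columns of X are features; record (min, max) for each one
        self.ranges_ = [(min(col), max(col)) for col in zip(*X)]
        return self

    def transform(self, X):
        # A visual step must not alter what later pipeline steps see
        return X

    def draw(self):
        if self.ranges_ is None:
            raise RuntimeError("call fit() before draw()")
        spans = [hi - lo for lo, hi in self.ranges_]
        largest = max(spans)
        lines = []
        for i, ((lo, hi), span) in enumerate(zip(self.ranges_, spans)):
            bar = "#" * round(self.width * span / largest) if largest else ""
            lines.append(f"x{i} [{lo}, {hi}] {bar}")
        return lines

X = [[1, 10], [2, 30], [3, 20]]
viz = FeatureRangeVisualizer().fit(X)
lines = viz.draw()
print("\n".join(lines))
```

Because `transform()` returns its input unchanged, a step like this can sit anywhere in a pipeline without affecting the fitted model.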
  • 54. Model Selection Pipelines Multi-Estimator Visualization Data Loader Transformer(s) Estimator Estimator Estimator Estimator Cross Validation Cross Validation Cross Validation Cross Validation
  • 55. Employ Interactivity to Visualize More Health and Wealth of Nations Recreated by Mike Bostock Originally by Hans Rosling http://bit.ly/29RYBJD
  • 56. Visual Analytics Mantra: Overview First; Zoom & Filter; Details on Demand Heer, Jeffrey, and Ben Shneiderman. "Interactive dynamics for visual analysis." Queue 10.2 (2012): 30.
  • 57. Codename Trinket Visual Model Management System