Machine
Learning
Source: Introduction to Machine Learning with Python
Authors: Andreas C. Müller and Sarah Guido
Unit – II
Supervised Learning
Agenda
Classification and
Regression
Generalization, Overfitting
and Underfitting
Relation of Model
Complexity to Dataset Size
k-Nearest Neighbors
Agenda
Linear Models
Linear Models for
Classification
Naïve Bayes Classifiers
Decision Trees
Agenda
Ensembles of Decision Trees
Kernelized Support Vector
Machines
Uncertainty Estimates from
Classifiers
Classification
and
Regression
Introduction to Machine
learning
Classification
Regression
Supervised
Learning
§ Supervised learning is used whenever we want to
predict a certain outcome from a given input
§ Goal is to make accurate predictions for new,
never-before-seen data
§ Supervised learning often requires human
effort to build the training set, but afterward
automates and often speeds up an otherwise
laborious or infeasible task
Classification
and
Regression
§ Two major types of supervised machine
learning problems –
§ Classification
§ Regression
Classification
and
Regression
§ Classification
§ Goal is to predict a class label, which is a choice
from a predefined list of possibilities
§ Classification is sometimes separated into
§ Binary classification -
§ Distinguishing between exactly two classes
§ Multiclass classification -
§ Classification between more than two classes
§ Example -
§ Binary Classification -
§ Classifying emails as either spam or not spam
§ Multiclass Classification -
§ Classifying irises into one of several species (the iris dataset)
Classification
and
Regression
Regression
§ Goal is to predict -
§ continuous number or a floating-point number in
programming terms
§ Example -
§ Person’s annual income
§ Predicting the yield of a corn farm
Generalization,
Overfitting and
Underfitting
Generalization
Overfitting
Underfitting
Generalization
Generalization
§ If a model is able to make accurate predictions on
unseen data, we say it is able to generalize from
the training set to the test set.
§ Always build a model that is able to generalize as
accurately as possible
Example -
§ Boat Buyers Prediction -
§ Goal is to send out promotional emails to people who are
likely to actually make a purchase but not bother those
customers who are not interested
Generalization
Example – Boat Buyers Prediction
§ If the customer is older than 45, and has less than 3
children or is not divorced, then they want to buy a boat
Generalization
§ Rule 1: Complex Rule (Complex Model)
§ If the customer is older than 45, and has less than 3 children or is not
divorced, then they want to buy a boat
§ We can make up many rules that work well on this data
§ Our goal is to find whether new customers are likely to buy a boat
§ We therefore want to find a rule that will work well for new
customers, and achieving 100 percent accuracy on the training
set does not help
§ The only measure of whether an algorithm will perform well
on new data is the evaluation on the test set
§ Note:
§ Simple models are expected to generalize better to new data
§ Example:
§ “Customers older than 50 want to buy a boat” (Simple rule/Simple
Model)
§ This simple rule does not involve the children and divorce features
§ So it is a simpler, more general model
Overfitting
Overfitting
§ Building a model that is too complex for the amount
of information we have is called overfitting
§ Overfitting occurs when you fit a model too closely
to the particularities of the training set and obtain a
model that works well on the training set but is
not able to generalize to new data
§ Example -
§ Rule 1 - If the customer is older than 45, and has less
than 3 children or is not divorced, then they want to buy
a boat
Underfitting
Underfitting
§ Rule 3 -
§ Everybody who owns a house buys a boat
§ Might not be able to capture all the aspects of and
variability in the data, and your model will do
badly even on the training set
§ If the model is too simple then it will lead to
underfitting
Tradeoff
between
Overfitting and
Underfitting
§ The more complex we allow our model to be, the better
we will be able to predict on the training data
§ But if we start focusing too much on each individual
data point in our training set, the model will not
generalize well to new data
§ Sweet Spot -
§ The model in between that will yield the best generalization performance
§ This is the model we want to find
Relation of
Model
Complexity to
Dataset Size
Intro to Supervised Machine
Learning Algorithms
Classification
Regression
Relation of
Model
Complexity to
Dataset Size
Relation of Model Complexity to Dataset Size
§ Model complexity is tied to the variation of
inputs contained in your training dataset
§ The larger variety of data points your dataset
contains, the more complex a model you can
use without overfitting
§ Collecting more data points will yield more
variety
§ So larger datasets allow building more
complex models
Relation of
Model
Complexity to
Dataset Size
Example –
Boat Purchase
§ Added 10,000 more rows of customer data
§ Rule 1 -
§ If the customer is older than 45, and has less than
3 children or is not divorced, then they want to
buy a boat
§ This rule is now much more reliable than when it was
developed using only the 12 rows
§ Note 1:
§ In the real world, we often have the ability to decide
how much data to collect
§ Collecting more data might be more beneficial than
tweaking and tuning your model
§ Note 2:
§ Never underestimate the power of more data
Supervised
Machine
Learning
Algorithms
Introduction to Supervised Machine
Learning Algorithms
§ Note:
§ Many of the machine learning algorithms have a classification and
regression variant
§ Data Sets -
§ Some datasets will be small and synthetic
§ Some datasets will be large (real-world examples)
§ Forge Dataset (Classification Example)
§ The forge dataset is a synthetic two-class classification dataset with two
features
§ Scatter plot
§ The plot has the first feature on the x-axis and the second feature on the y-
axis
§ Each data point is represented as one dot
§ The color and shape of the dot indicates its class
Supervised
Machine
Learning
Algorithms
Example –
§ Input
§ Output
Supervised
Machine
Learning
Algorithms
Supervised
Machine
Learning
Algorithms
§ Synthetic wave dataset (Regression
Example)
§ A single input feature and a continuous target
variable (or response)
§ Shows the single feature on x-axis and the
regression target (the output) on the y-axis
Supervised
Machine
Learning
Algorithms
Supervised
Machine
Learning
Algorithms
Note 1:
§ Any intuition derived from datasets with
few features (called low-dimensional
datasets) might not hold in datasets with
many features (called high-dimensional
datasets)
Supervised
Machine
Learning
Algorithms
Breast Cancer Example
§ Scikit-learn includes two real-world datasets
§ Wisconsin breast cancer dataset
§ Records clinical measurements of breast cancer
tumors
§ Labeled as “benign” (for harmless tumors)
§ “Malignant” (for cancerous tumors)
§ Task is to learn to predict whether a tumor
is malignant based on the measurements of
the tissue
Supervised
Machine
Learning
Algorithms
§ Input :
Output :
Supervised
Machine
Learning
Algorithms
Note:
§ Datasets included in scikit-learn are
usually stored as Bunch objects
§ which contain some information about the
dataset as well as the actual data
§ Bunch Objects is that they behave like
dictionaries
Supervised
Machine
Learning
Algorithms
§ The dataset consists of 569 data points, with 30
features each:
§ Input :
§ Output :
Supervised
Machine
Learning
Algorithms
§ Of these 569 data points, 212 are labeled as
malignant and 357 as benign:
§ Input :
§ Output :
Supervised
Machine
Learning
Algorithms
§ To get a description of the semantic meaning of
each feature, we can have a look at the
feature_names attribute:
§ Input :
§ Output :
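The input/output cells above are screenshots that did not survive the export. Below is a minimal sketch of the kind of exploration code involved, assuming scikit-learn's load_breast_cancer loader; the exact printed values on the slides are not reproduced.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset (returned as a Bunch object).
cancer = load_breast_cancer()

# 569 data points with 30 features each.
print("Shape of cancer data:", cancer.data.shape)

# Class counts: 212 malignant, 357 benign.
print("Sample counts per class:",
      {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))})

# Semantic meaning of each feature.
print("Feature names:", cancer.feature_names)
```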
Supervised
Machine
Learning
Algorithms
Regression Example
§ Boston Housing dataset
§ The task associated with this dataset is to
predict the median value of homes in
several Boston neighborhoods in the 1970s
with information such as
§ Crime rate
§ Proximity to the Charles River
§ Highway accessibility
Supervised
Machine
Learning
Algorithms
§ The dataset contains 506 data points, described
by 13 features
§ Input -
§ Output –
Supervised
Machine
Learning
Algorithms
load_extended_boston function
§ The dataset contains 506 data points, described by
104 features
§ The 104 features are the 13 original features together with
the 91 possible combinations of two features within
those 13 (with replacement)
§ Input -
§ Output –
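The loading code is not shown in this export. A minimal sketch, assuming the mglearn helper package that accompanies the book is installed:

```python
import mglearn

# 13 original features expanded with pairwise products (with replacement): 13 + 91 = 104.
X, y = mglearn.datasets.load_extended_boston()
print("X.shape:", X.shape)  # expected: (506, 104)
```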
k-Nearest
Neighbors
k-Neighbors Classification
k-Neighbors Regression
k-Nearest
Neighbors
k-Nearest Neighbors
§ Simplest machine learning algorithm
§ Building the model consists only of storing the
training dataset
§ To make a prediction for a new data point, the
algorithm finds the closest data points in
the training dataset — its “nearest
neighbors.”
k-Nearest
Neighbors
Classification
k-Neighbors classification
§ In the simplest version, the k-NN algorithm only
considers exactly one nearest neighbor
§ i.e., Closest training data point to the point we want to
make a prediction for
§ Prediction is then simply the known output for this
training point
k-Nearest
Neighbors
Classification
§ Input –
§ Output -
k-Nearest
Neighbors
Classification
§ Added three new data points, shown as stars
§ Marked the closest point in the training set
§ The prediction of the one nearest-neighbor
algorithm is the label of that point (shown by the
color of the cross).
§ Instead of considering only the closest neighbor,
we can also consider an arbitrary number, k, of
neighbors
§ This is where the name of the k-nearest neighbors
algorithm comes from
k-Nearest
Neighbors
Classification
§ When considering more than one neighbor, we
use voting to assign a label
§ This means that for each test point, we count
how many neighbors belong to class 0 and
how many neighbors belong to class 1
§ We assign the class that is more frequent: the
majority class among the k-nearest
neighbors
k-Nearest
Neighbors
Classification
Three closest Neighbors
Input –
§ Output -
k-Nearest
Neighbors
Classification
§ Step 1 – Split the data into a training and a test set (train_test_split)
§ Step 2 – Instantiate the KNeighborsClassifier, e.g. with n_neighbors=3
§ Step 3 – Fit the classifier using the training set
k-Nearest
Neighbors
Classification
Step 4 -
§ To make predictions on the test data, we call the
predict method
§ Input –
§ Output -
§ Step 5 -
§ To evaluate how well our model generalizes, we can call the
score method
§ Input -
§ Output -
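The code screenshots for these steps are not reproduced. A minimal end-to-end sketch of the workflow, assuming the book's mglearn helper for the forge dataset:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import mglearn

# Step 1: split the synthetic forge data into a training and a test set.
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 2: instantiate the classifier with k=3 neighbors.
clf = KNeighborsClassifier(n_neighbors=3)

# Step 3: fit the classifier (for k-NN this just stores the training data).
clf.fit(X_train, y_train)

# Step 4: predict labels for the test data.
print("Test set predictions:", clf.predict(X_test))

# Step 5: evaluate generalization using mean accuracy.
print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))
```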
k-Nearest
Neighbors
Classification
Step 6 – (Analysis using visualization)
§ Visualization
§ Input -
§ Output -
k-Nearest
Neighbors
Classification
Example –
§ Breast Cancer (Real world Dataset)
k-Nearest
Neighbors
Classification
Example –
§ Breast Cancer (Real world Dataset)
k-Nearest
Neighbors
Regression
k-Neighbors Regression (Simple Example)
§ wave dataset
§ Added three test data points as green
stars on the x-axis
k-Nearest
Neighbors
Regression
§ Input - (Single Neighbour)
§ Output -
k-Nearest
Neighbors
Regressor
§ Input - (Three Neighbours)
§ Output - (Prediction is the average or mean of
the relevant neighbours)
k-Nearest
Neighbors
Regressor
§ K Neighbors Regressor
§ Example -
k-Nearest
Neighbors
Regressor
§ Evaluation -
§ Evaluate the model using the score method
§ For regressors, score returns the R² score
§ The R² score, also known as the coefficient of
determination, is a measure of the goodness of a prediction for a
regression model
§ Typically yields a score between 0 and 1
§ 1 corresponds to a perfect prediction
§ 0 corresponds to a constant model that just predicts the
mean of the training set targets (the score can even be negative for worse models)
§ Input -
§ Output –
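The regression code is omitted from this export. A minimal sketch on the wave dataset, assuming the mglearn helper package:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
import mglearn

# Synthetic one-feature wave dataset.
X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predictions are the mean target of the 3 nearest training neighbors.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)

print("Test set predictions:", reg.predict(X_test))
# score() returns R^2 for regressors (1.0 = perfect, 0.0 = predicting the mean).
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))
```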
k-Nearest
Neighbors
Regressor
Analyzing KNeighborsRegressor
k-Nearest
Neighbors
Regressor
Analyzing KNeighborsRegressor
k-Nearest
Neighbors
Regressor
§ Using only a single neighbor, each point in the
training set has an obvious influence on the
predictions, and the predicted values go through
all of the data points
§ More neighbors leads to smoother predictions,
but these do not fit the training data as well
k-Nearest
Neighbors
Classifier
Strengths, weaknesses, and parameters
§ Two important parameters to the KNeighbors
classifier
§ Number of neighbors -
§ Using a small number of neighbours like three or five
often works well
§ you should certainly adjust this parameter
§ How you measure distance between data points
§ Euclidean Distance is used which works well in many
settings
k-Nearest
Neighbors
Strengths
§ Very easy to understand and implement
§ Often gives reasonable performance without a lot of
adjustments
§ Good baseline method to try
§ Few hyperparameters
Weaknesses
§ Model is usually very fast, but when your training set is
very large (either in number of features or in number of
samples) prediction can be slow
§ Mandatory to preprocess the data
§ Performs poorly with datasets consisting of many zeros
(Sparse Datasets)
§ Lazy learning algorithm
§ Prone to overfitting
§ Prone to curse of dimensionality
Linear Models
Linear Regression (aka
ordinary least squares)
Ridge Regression
Lasso Regression
Linear Models
§ Introduction
§ Class of models that are widely used in practice
§ Studied extensively in the last few decades
§ With roots going back over a hundred years
§ Linear models make a prediction using a linear
function of the input features
§ Building block for many complex machine learning
algorithms, including deep neural networks
§ They assume a linear relationship between the features and
the target and try to learn the weight of each feature
Linear Models
§ Linear Models for Regression
§ The general prediction formula is
§ ŷ = w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b
§ x[0] to x[p] denotes the features of a single data
point
§ w and b are parameters of the model that are
learned
§ ŷ is the prediction the model makes
§ Single feature
§ ŷ = w[0]·x[0] + b
§ where w[0] is the slope and b is the y-axis offset
§ Note:
§ The predicted response is a weighted sum of the
input features, with weights (which can be negative)
given by the entries of w
Linear Models
One -dimensional wave dataset
§ Input -
§ Output -
Linear Models
§ Y-Intercept -
§ The intercept is slightly below zero, which you can also
confirm in the image
§ Linear models for regression can be
characterized as regression models for
which the prediction
§ is a line for a single feature
§ A plane when using two features
§ Hyperplane in higher dimensions
Linear Models
§ Note 1:
§ Using a straight line to make predictions is very restrictive
§ Note 2:
§ It is a strong assumption (somewhat unrealistic) that
our target y is a linear combination of the features
§ Note 3:
§ Linear models are very powerful with datasets having
many features
§ Note 4:
§ Many different models exist for regression
§ The differences between these models lie in
§ How the model parameters w and b are learned from the
training data
§ How the model complexity can be controlled
Linear
Regression
Linear regression (aka ordinary least
squares)
§ Linear regression -
§ also known as Ordinary Least Squares (OLS)
§ Simplest and most classic linear method for
regression
§ Linear regression finds the parameters w
and b that
§ Minimize the mean squared error between
predictions and the true regression targets, y,
on the training set
Linear
Regression
§ Mean Squared Error
§ The mean squared error is the sum of the
squared differences between the predictions
and the true values, divided by the number of
samples
§ Linear regression has no parameters to tune
§ Which is a benefit
§ But it also has no way to control model complexity
Linear
Regression
Example –
Linear
Regression
§ The “slope” parameters (w), also called weights or
coefficients
§ Stored in the coef_ attribute
§ Offset or intercept (b) is stored in the intercept_
attribute
Linear
Regression
Example -
§ Input -
§ Output -
§ The intercept_ attribute is always a single float
number, while the coef_ attribute is a NumPy array
with one entry per input feature
Linear
Regression
Training and Test Score (R2) -
§ Input -
§ Output -
Note -
§ An R² value of around 0.66 is not very good
§ Since the training and test scores are very close, we are
likely underfitting rather than overfitting
§ For this one-dimensional dataset there is little danger of
overfitting
§ On higher-dimensional datasets, linear models
become more powerful and the chance of
overfitting grows
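The fitting code for the wave example is not shown. A minimal sketch, assuming the mglearn helper for the dataset:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)

# The slope w[0] is stored in coef_, the intercept b in intercept_.
print("lr.coef_:", lr.coef_)
print("lr.intercept_:", lr.intercept_)

# Similar, mediocre R^2 on training and test data suggests underfitting, not overfitting.
print("Training set R^2: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set R^2: {:.2f}".format(lr.score(X_test, y_test)))
```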
Linear
Regression
Extended Boston Housing Dataset -
§ Consists of 506 samples and 104 derived
features
§ Input -
§ Output -
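A minimal sketch of linear regression on the extended Boston data, again assuming the mglearn loader; the gap between the two scores illustrates the overfitting discussed on the next slide.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
# A much higher training score than test score signals overfitting.
print("Training set R^2: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set R^2: {:.2f}".format(lr.score(X_test, y_test)))
```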
Linear
Regression
§ Note -
§ The discrepancy between
performance on the training set and
test set is a clear sign of overfitting
§ Solution -
§ Find a model that allows us to control
complexity
§ Regularized alternatives to plain linear
regression are Ridge Regression and Lasso
Regression
Ridge
Regression
Ridge regression
§ Ridge regression is also a linear model for
regression
§ The formula it uses to make predictions is the
same one used for ordinary least squares
§ Coefficients (w) are chosen not only so that
they predict well on the training data, but also
to fit an additional constraint
§ All entries of w should be close to zero
§ This means each feature should have as little effect
on the outcome as possible (which translates to
having a small slope), while still predicting well
§ This constraint is an example of what is called
regularization
Ridge
Regression
Regularization -
§ It is a process of explicitly restricting a
model to avoid overfitting
§ The kind of regularization used in Ridge
Regression is L2 Regularization
Ridge
Regression
§ Input -
§ Output -
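The Ridge code is not reproduced here. A minimal sketch with the default alpha=1.0, assuming the same extended Boston split as before (via mglearn):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2-regularized linear regression with the default alpha=1.0.
ridge = Ridge().fit(X_train, y_train)
print("Training set R^2: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set R^2: {:.2f}".format(ridge.score(X_test, y_test)))
```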
Ridge
Regression
§ Note 1-
§ Training set score of Ridge is lower than for
LinearRegression
§ Note 2-
§ Test set score of Ridge is higher than for
LinearRegression
§ Note 3-
§ Ridge is a more restricted model, so it is less likely to
overfit
§ Note 4-
§ A less complex model means worse
performance on the training set but better
generalization
§ Note 5-
§ We are only interested in generalization
performance
Ridge
Regression
§ Note 6-
§ The Ridge model makes a trade-off between the
simplicity of the model (near-zero coefficients) and its
performance on the training set
§ Note 7-
§ The importance the model places on simplicity versus
training set performance can be specified by the user
using the alpha parameter
§ The default value of the alpha parameter is 1.0
§ The optimum setting of alpha depends on the
particular dataset we are using
§ Increasing alpha forces the coefficients to
move closer toward zero
§ Note 8-
§ Moving coefficients toward zero may decrease
training set performance but might help
generalization
Ridge
Regression
§ Input -
§ Output -
Ridge
Regression
§ For very small values of alpha, coefficients are
barely restricted at all, and we end up with a model
that resembles LinearRegression
§ Input -
§ Output -
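A minimal sketch of how the alpha parameter might be varied, under the same assumptions (mglearn extended Boston data):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A larger alpha pushes coefficients closer to zero (stronger regularization).
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("alpha=10  test R^2: {:.2f}".format(ridge10.score(X_test, y_test)))

# A very small alpha barely restricts the coefficients, approaching LinearRegression.
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("alpha=0.1 test R^2: {:.2f}".format(ridge01.score(X_test, y_test)))
```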
Ridge
Regression
§ Input -
§ Output -
Ridge
Regression
§ Regularization
§ Another way to understand the influence of
regularization is to fix a value of alpha but vary the
amount of training data available
§ Input –
§ Output -
Ridge
Regression
§ Note -
§ As more and more data becomes available to
the model, both models improve
§ With enough training data, regularization
becomes less important
§ Given enough data, ridge and linear regression
will have the same performance
Lasso
Regression
Lasso
§ An alternative to Ridge for regularizing
linear regression is Lasso
§ Lasso also restricts coefficients to be
close to zero, but in a slightly different way,
called L1 regularization
§ When using the lasso, some
coefficients are exactly zero
Lasso
Regression
Advantages of Lasso
§ A form of automatic feature selection
§ Having some coefficients be exactly zero often
makes a model easier to interpret, and can
reveal the most important features of your
model
Lasso
Regression
Disadvantages of Lasso
§ Some features are entirely ignored by the
model
Lasso
Regression
Extended Boston Housing dataset
§ Input -
§ Output -
Lasso
Regression
§ Lasso does quite badly, both on the training set
and the test set
§ Indicates that we are underfitting
§ It used only 4 of the 104 features
§ Lasso also has a regularization parameter,
alpha, that controls how strongly coefficients are
pushed toward zero
§ When we decrease the value of alpha, the
maximum number of iterations to run needs to be
increased (max_iter)
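The Lasso code itself is not shown in this export. A minimal sketch covering the default setting and a smaller alpha, assuming the mglearn extended Boston data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default alpha=1.0: strong L1 regularization, most coefficients become exactly zero.
lasso = Lasso().fit(X_train, y_train)
print("default    train/test R^2: {:.2f} / {:.2f}".format(
    lasso.score(X_train, y_train), lasso.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso.coef_ != 0))

# A lower alpha fits a more complex model; max_iter is raised so the solver converges.
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("alpha=0.01 train/test R^2: {:.2f} / {:.2f}".format(
    lasso001.score(X_train, y_train), lasso001.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso001.coef_ != 0))
```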
Lasso
Regression
§ Input -
§ Output -
Lasso
Regression
§ A lower alpha allowed us to fit a more complex
model
§ This makes this model potentially easier to
understand
§ If we set alpha too low, however, we again remove
the effect of regularization and end up overfitting
Lasso
Regression
§ Input -
§ Output -
Lasso
Regression
§ Input -
§ Output -
Lasso
Regression
§ Ridge regression is usually the first choice
between these two models
§ If you have a large number of features and
expect only a few of them to be important,
Lasso might be the better choice
§ If we would like to have a model that is easy to
interpret, lasso will provide a model that is easier
to understand as it will select only a subset of
the input features
Linear models
for
classification
Linear Models for
classification
Linear Models for multiclass
classification
Linear models
for classification
Linear models for classification
§ Linear models are also extensively used for
classification
§ Binary Classification -
§ The formula looks very similar to the one for
linear regression:
§ ŷ = w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b > 0
§ Instead of just returning the weighted sum of the
features, we threshold the predicted value at zero
§ If the function is smaller than zero, we predict the class –1
§ If it is larger than zero, we predict the class +1
Linear models
for classification
Linear models for classification
§ For linear models for regression, the output, ŷ, is a
linear function of the features:
§ a line
§ plane
§ hyperplane (in higher dimensions)
§ For linear models for classification separates two
classes using a
§ line
§ plane
§ hyperplane
§ There are many algorithms for learning linear
models; they differ in
§ The way in which they measure how well a particular
combination of coefficients and intercept fits the
training data
§ What kind of regularization, if any, they use
Linear models
for classification
§ The two most common linear classification
algorithms are
§ Logistic regression, implemented in
linear_model.LogisticRegression
§ Linear support vector machines (linear SVMs),
implemented in svm.LinearSVC (SVC stands for
support vector classifier)
Linear models
for classification
Example – Despite its name, logistic regression is a classification algorithm, not a regression algorithm
§ Input -
§ Output -
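The plotting code behind this example is not reproduced. A minimal sketch that fits both classifiers on the forge data, assuming the mglearn helper:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import mglearn

X, y = mglearn.datasets.make_forge()

# Both models learn a straight decision boundary: points on one side
# are predicted as class 1, points on the other side as class 0.
logreg = LogisticRegression().fit(X, y)
svc = LinearSVC().fit(X, y)
print("LogisticRegression training accuracy: {:.2f}".format(logreg.score(X, y)))
print("LinearSVC training accuracy: {:.2f}".format(svc.score(X, y)))
```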
Linear models
for classification
§ Note 1 -
§ Both the models are depicted with straight lines
separating the areas classified by class 0 and class 1
§ Note 2 -
§ Any new data point that lies above the black line
will be classified as class 1 and point below the
black line will be classified as class 0
§ Note 3 -
§ The two models Linear SVC and Logistic Regression
both come up with similar decision boundaries
§ Note 4 -
§ By default both models apply an L2 Regularization
Linear models
for classification
§ Trade-off parameter (“C”)
§ For LogisticRegression and LinearSVC the trade-off
parameter that determines the strength of the
regularization is called C
§ For a high value of the parameter C,
LogisticRegression and LinearSVC try to fit the
training set as well as possible
§ A higher value of C stresses the importance that each
individual data point be classified correctly
§ For low values of the parameter C, the models put
more emphasis on finding a coefficient vector (w)
that is close to zero
§ Using low values of C will cause the algorithms to try
to adjust to the “majority” of data points
Linear models for
classification
Example – Decision boundaries of Linear SVM
for different values of C
Input –
Output -
Linear models
for classification
§ Left Graph -
§ Very small C - corresponds to a lot of regularization
§ Most of the points in class 0 are at the bottom, and most of the points in class 1
are at the top
§ The strongly regularized model chooses a relatively horizontal line,
misclassifying two points
§ Center Graph -
§ Value of C is slightly higher
§ Model focuses more on the two misclassified samples, tilting the decision
boundary
§ Right Graph -
§ Very high value of C in the model tilts the decision boundary a lot
§ Now correctly classifying all points in class 0
§ One of the points in class 1 is still misclassified, as it is not possible to
correctly classify all points in this dataset using a straight line.
§ The model illustrated on the righthand side tries hard to correctly classify all
points, but might not capture the overall layout of the classes well.
§ In other words, this model is likely overfitting.
§ Similarly to the case of regression, linear models for classification might seem very
restrictive in low-dimensional spaces, only allowing for decision boundaries that are
straight lines or planes
Linear models
for classification
Example – Breast Cancer
§ Input –
§ Output -
§ The default value of C=1
§ Good training and test accuracy
§ Training and Test accuracy are very close - Likely to underfit
Linear models
for classification
Example –
§ Input –
§ Output -
Linear models
for classification
Example –
§ Input –
§ Output -
Underfit
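The breast cancer code comparing values of C is not reproduced. A minimal sketch of such a comparison; max_iter is raised here only so the default solver converges on this data, which is an assumption not shown on the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Small C = strong regularization (simpler model), large C = weak regularization.
for C in [0.01, 1, 100]:
    logreg = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    print("C={:<5} train acc: {:.3f}  test acc: {:.3f}".format(
        C, logreg.score(X_train, y_train), logreg.score(X_test, y_test)))
```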
Linear models
for classification
§ As LogisticRegression applies an L2 regularization by
default, the result looks similar to that produced by Ridge
§ Stronger regularization pushes coefficients more and
more toward zero, though coefficients never become
exactly zero
§ For a more interpretable model, using L1 regularization
might help, as it limits the model to using only a few
features
Linear models for
classification
§ Coefficients learned by the models with the three
different settings of parameter C
Linear models
for classification
Linear models for
classification
§ Input – (Lasso)
§ Output -
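A minimal sketch of L1-regularized logistic regression on the breast cancer data; passing penalty="l1" together with solver="liblinear" is how this is done in current scikit-learn versions, which is an assumption about the omitted code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# L1 regularization drives many coefficients to exactly zero,
# so only a few features are used by the model.
for C in [0.001, 1, 100]:
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    print("C={:<6} train acc: {:.2f}  test acc: {:.2f}  nonzero coefs: {}".format(
        C, lr_l1.score(X_train, y_train), lr_l1.score(X_test, y_test),
        (lr_l1.coef_ != 0).sum()))
```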
Linear models for
classification
Linear models
for Multiclass
Classification
Linear models for multiclass classification
§ Many linear classification models are for binary
classification only and don't extend naturally to the
multiclass case
§ But Logistic Regression is an exception
§ The technique used to extend a binary classification
algorithm to a multiclass classification algorithm is
the one-vs-rest approach
§ A binary model is learned for each class that tries to
separate that class from all of the other classes,
resulting in as many binary models as there are
classes
§ To make a prediction, all binary classifiers are run on
a test point
§ The classifier that has the highest score on its single
class “wins,” and this class label is returned as the
prediction
Linear models
for Multiclass
Classification
Linear models for multiclass classification
§ Having one binary classifier per class results in
having one vector of coefficients (w) and one
intercept (b) for each class
§ The class for which the result of the classification
confidence formula given here is highest is assigned as the
predicted class label
§ The mathematics behind multiclass logistic
regression differ somewhat from the one-vs-rest approach
§ but they also result in one coefficient vector and
intercept per class
§ the same method of making a prediction is applied
§ Classification confidence formula
§ w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b
Linear models
for Multiclass
Classification
Example – one vs rest
§ Input –
§ Output -
Linear models
for Multiclass
Classification
Example –
§ Input –
§ Output -
§ The shape of coef_ is (3, 2)
§ each row contains the coefficient vector for one of the three classes
§ each column holds the coefficient value for a specific
feature
§ The intercept_ is a one-dimensional array with one intercept per class
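The one-vs-rest code is not shown. A minimal sketch on a three-class blob dataset; the specific random_state is an assumption:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Three-class toy dataset; LinearSVC fits one binary classifier per class (one-vs-rest).
X, y = make_blobs(random_state=42)
linear_svm = LinearSVC().fit(X, y)

# One row of coefficients and one intercept per class.
print("Coefficient shape:", linear_svm.coef_.shape)      # expected (3, 2)
print("Intercept shape:", linear_svm.intercept_.shape)   # expected (3,)
```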
Linear models
for Multiclass
Classification
§ Input –
§ Output -
Linear models
for Multiclass
Classification
Input –
Output -
Strengths,
Weaknesses,
Parameters
§ First Decision - Regularization Parameters (alpha and C)
§ The main parameter of linear models is the regularization
parameter
§ alpha in the regression models
§ C in the classification models (LinearSVC and logistic
regression)
§ Large values for alpha or small values for C mean simple
models
§ For regression models, tuning these parameters is quite
important
§ Second Decision - Regularization Technique (L1 or L2)
§ Deciding which regularization to use is also important
§ L1 regularization
§ L2 regularization
§ L1 regularization
§ L2 regularization
§ When only a few of your features are actually important, you
should use L1
§ L1 can also be useful if interpretability of the model is important
§ As L1 will use only a few features, it is easier to explain
which features are important to the model
Strengths,
Weaknesses,
Parameters
§Strengths
§ Linear models are very fast to train and also very fast to
predict
§ They scale to very large datasets
§ Works well with sparse data
§ Linear models make it relatively easy to understand how a
prediction is made, using the formulas we saw earlier for
regression and classification
Strengths,
Weaknesses,
Parameters
§ Weaknesses -
§ It is often not entirely clear why coefficients are
the way they are in linear models
§ If the dataset has highly correlated features,
the coefficients can be hard to interpret
§ Note: linear models often perform well when the
number of features is large compared to the
number of samples, and they are
often used on very large datasets
Naive Bayes
Classifiers
Introduction
Types
Naive Bayes
Classifiers
Advantages
Disadvantages
Naive Bayes
Classifiers
Naïve Bayes Classifiers
§ A family of classifiers that are quite similar to the
linear models
§ Advantages
§ They tend to be even faster in training
§ Disadvantages
§ Generalization performance that is slightly
worse than that of linear classifiers (i.e.,
LogisticRegression and LinearSVC)
Naive Bayes
Classifiers
Naïve Bayes Classifiers
§ It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object
§ mainly used in text classification that includes a high-
dimensional training dataset
§ The Naïve Bayes algorithm is comprised of two words
Naïve and Bayes
§ Naïve:
§ It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other
features
§ Naive Bayes assumes that each parameter, also called
features or predictors, has an independent capacity of
predicting the output variable
§ Example -
§ An apple is identified by its shape (round), color (red), and taste (sweet) -
each feature contributes independently to the model
§ Bayes:
§ It is called Bayes because it depends on the principle of
Bayes' Theorem (also called Bayes' Rule or Bayes' Law)
NaiveBayes
Classifiers
Naïve Bayes Theorem
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given
the observed event B.
P(B|A) is the Likelihood: the probability of the evidence B given
that hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before
observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Naive Bayes
Classifiers
Steps to solve Naïve Bayes
§ Convert the given dataset into frequency tables.
§ Generate Likelihood table by finding the
probabilities of given features.
§ Now, use Bayes theorem to calculate the posterior
probability
NaiveBayes
Classifiers
§ Initial Dataset
NaiveBayes
Classifiers
§ Frequency Table
NaiveBayes
Classifiers
§ Likelihood Table
NaiveBayes
Classifiers
§ Problem
§ If the weather is sunny, should the player
play or not?
§ Solution
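The frequency and likelihood tables on the preceding slides are images that did not survive the export, so the counts below are assumed, illustrative values from a typical 14-day "play" table; only the use of Bayes' theorem itself is the point.

```python
# Assumed illustrative counts (not the slide's actual table):
p_sunny_given_yes = 3 / 9    # P(Sunny | Play=Yes)
p_yes = 9 / 14               # P(Play=Yes)
p_sunny = 5 / 14             # P(Sunny)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print("P(Play=Yes | Sunny) = {:.2f}".format(p_yes_given_sunny))  # 0.60 for these counts
```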
Naive Bayes
Classifiers
§ The reason Naive Bayes models are so efficient is that they
learn parameters by looking at each feature
individually and collect simple per-class
statistics from each feature
Naive Bayes
Classifiers
§ Types of Naive Bayes Classifiers
§ GaussianNB
§ BernoulliNB
§ MultinomialNB
GaussianNB
§ GaussianNB
§ GaussianNB can be applied to any continuous data
§ GaussianNB stores the average value as well as the
standard deviation of each feature for each class
§ If predictors take continuous values instead of
discrete, then the model assumes that these values are
sampled from the Gaussian distribution
§ Gaussian Naive Bayes is a machine learning
classification technique based on a probabilistic
approach that assumes each class follows a normal
distribution
§ The combination of the prediction for all parameters is
the final prediction that returns a probability of the
dependent variable to be classified in each group
§ The final classification is assigned to the group with
the higher probability
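The slides do not include a GaussianNB code example; a minimal illustrative sketch on continuous data (the breast cancer dataset is chosen here as an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# GaussianNB stores the per-class mean and standard deviation of each
# feature and predicts the class with the highest posterior probability.
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(gnb.score(X_test, y_test)))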
GaussianNB
§ The Gaussian model assumes that features follow a
normal distribution
§ Normal Distribution -
§ Describes the distributions of continuous random variables in
nature and is defined by its bell-shaped curve
§ A normal distribution has a probability distribution that is
centered around the mean
§ This means that the distribution has more data around the
mean
§ The data distribution decreases as you move away from the
center
§ The resulting curve is symmetrical about the mean and
forms a bell-shaped distribution
BernoulliNB
§ BernoulliNB
§ BernoulliNB assumes binary data
§ Used for discrete probability calculation
§ The predictor variables are the independent
Boolean variables
§ Mostly used in text data classification/Document
classification
§ Counts how often every feature of each class is
not zero
BernoulliNB
§ Example –
§ Four data points
§ Four binary features for each data point
§ Two classes 0 and 1
§ For class 0 (the first and third data points), the first
feature is zero two times and nonzero zero times, the
second feature is zero one time and nonzero one
time, and so on
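The toy data matrix for this example is an image that is missing from the export; the sketch below uses assumed binary data consistent with the description above and shows the per-class nonzero counting that BernoulliNB performs.

```python
import numpy as np

# Assumed toy data: four points, four binary features, two classes (0 and 1).
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

# BernoulliNB-style statistic: how often each feature is nonzero per class.
counts = {}
for label in np.unique(y):
    counts[label] = X[y == label].sum(axis=0)
print("Feature counts per class:", counts)
# class 0 -> [0 1 0 2], class 1 -> [2 0 2 1]
```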
MultinomialNB
§ MultinomialNB
§ MultinomialNB assumes count data
§ Example - that each feature represents an integer
count of something, like how often a word appears in
a sentence
§ Mostly used in text data classification as
BernouliNB
§ MultinomialNB takes into account the average
value of each feature for each class
Naive Bayes
Classifiers
§ Note 1 -
§ To make a prediction a data point is compared to
the statistics for each of the classes and the best
matching class is predicted.
§ Note 2 -
§ The prediction formula for MultinomialNB and
BernoulliNB has the same form as in the linear models
§ Note 3 -
§ coef_ for the naive Bayes models has a different
meaning than in the linear models
Strengths,
Weaknesses,
and Parameters
§ Parameters -
§ MultinomialNB and BernoulliNB have a
single parameter alpha which controls
model complexity
§ A large alpha means more smoothing,
resulting in a less complex model
§ Note -
§ The algorithm's performance is relatively
robust to the value of alpha
§ Setting alpha is NOT critical for
good performance, but tuning it
usually improves accuracy
somewhat
Strengths,
Weaknesses,
and Parameters
§ GaussianNB is mostly used on very high-
dimensional data
§ BernoulliNB and MultinomialNB are widely used
for sparse count data such as text
§ MultinomialNB usually performs better than
BernoulliNB, particularly on datasets with a relatively
large number of nonzero features
Strengths,
Weaknesses,
and Parameters
§ Advantages -
§ Very fast to train and to predict
§ Easiest algorithm
§ Training procedure is easy to understand
§ Models work very well with high dimensional
sparse data and are relatively robust to the
parameters
§ Great baseline models
§ Often used on very large datasets
§ Works well for both binary classification and
multiclass classification problems
§ Best model for Text Classification Problems
Strengths,
Weaknesses,
and Parameters
§ Weakness-
§ Naive Bayes assumes that all features are independent
or unrelated
§ so it cannot learn the relationship between features
Decision
Trees
Building Decision trees
Controlling complexity of
Decision trees
Feature importance in trees
DecisionTrees
§ Widely used models for both classification and
regression
§ They learn a hierarchy of if/else questions,
leading to a decision
Example –
§ Distinguish between the following four animals
§ Bears
§ Hawks
§ Hen
§ Dolphins
DecisionTrees
Example -
§ Input –
§ Output -
DecisionTrees
§ Each node in the tree either
§ represents a question, or
§ is a terminal node (also called a leaf) that contains the
answer
§ In ML we build a model to distinguish between
four classes of animals using the three features
“has feathers,”“can fly,” and “has fins.”
DecisionTrees
§ Building decision trees
§ Example - two_moons dataset
§ The dataset consists of two half-moon shapes, with each class
consisting of 75 data points
§ Learning a decision tree means learning the sequence of
if/else questions that gets us to the true answer most
quickly
§ In machine learning, these if/else questions are called
tests
§ Question format in case of continuous data -
§ In real life data does not come in the form of binary yes/no
features as in the animal example
§ Data can be continuous in real life situations
§ The tests that are used on continuous data are of the form “Is
feature i larger than value a?”
DecisionTrees
§ Splitting the dataset horizontally at x[1]=0.0596 yields
the most information; it best separates the points in
class 0 from the points in class 1
§ The top node, also called the root, represents the
whole dataset, consisting of 50 points belonging to class
0 and 50 points belonging to class 1
DecisionTrees
DecisionTrees
§ The split is done by testing whether x[1] <= 0.0596
(test), indicated by a black line
§ If test is True -
§ Assigned to the left node, which contains 2 points
belonging to class 0 and 32 points belonging to class 1
§ If test is False -
§ Assigned to the right node, which contains 48 points
belonging to class 0 and 18 points belonging to class 1
§ Though the first split did a good job of separating the
two classes, the bottom region still contains points
belonging to class 0, and the top region still contains
points belonging to class 1
§ Figure 2-25 shows that the most informative next split
for the left and the right region is based on x[0]
DecisionTrees
§ This recursive process yields a binary tree of decisions, with
each node containing a test
§ Each test splits the part of the data that is currently being
considered along one axis
§ This yields a view of the algorithm as building a hierarchical
partition
§ Each test concerns only a single feature
§ which results in partitions into regions that are always parallel
to the axes
§ The recursive partitioning of the data is repeated until each region
in the partition (each leaf in the decision tree) only contains a
single target value (a single class or a single regression value)
§ Pure Leaves-
§ The leaf of the tree that contains data points that all share the same
target value is called PURE
DecisionTrees § The figure above shows the final partition
§ A prediction on a new data point is made by checking
which region of the partition of the feature space the
point lies in, and then predicting the majority target in
that region
§ It is also possible to use trees for regression
tasks
§ Where the output for this data point is the mean
target of the training points in this leaf
DecisionTrees
Controlling complexity of decision trees
§ Drawback -
§ Building a tree until all leaves are PURE leads to
models that are very complex and highly overfit
to the training data
§ The overfitting can be seen on the left of Figure 2-26
§ We can see a small strip predicted as class 0 around
the point belonging to class 1
DecisionTrees
Controlling complexity of decision trees
§ Common strategies to prevent overfitting
§ pre-pruning -
§ Stopping the creation of the tree early (also called
pre-pruning)
§ Possible criteria for pre-pruning
§ Limiting the maximum depth of the tree
§ Limiting the maximum number of leaves
§ Requiring a minimum number of data points in a node
to keep splitting it
§ post-pruning -
§ Building the tree but then removing or collapsing
nodes that contain little information
§ Also called as pruning
DecisionTrees
§ Decision trees in scikit-learn are implemented in
the
§ DecisionTreeRegressor
§ DecisionTreeClassifier
§ Scikit-learn only implements pre-pruning but
NOT post-pruning
DecisionTrees
Breast Cancer dataset
§ Import the dataset and split it into a training and a test part.
§ Then we build a model using the default setting of fully
developing the tree (growing the tree until all leaves are pure).
§ We fix the random_state in the tree, which is used for tie-
breaking internally
§ Input -
§ Output -
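The code screenshot is not reproduced; a minimal sketch of the fully grown (unpruned) tree described above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Fully grown tree (all leaves pure); random_state only affects internal tie-breaking.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Training accuracy: {:.3f}".format(tree.score(X_train, y_train)))  # 1.000
print("Test accuracy: {:.3f}".format(tree.score(X_test, y_test)))
```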
DecisionTrees
§ The accuracy on the training set is 100% —
because the leaves are pure
§ The tree was grown deep enough that it could
perfectly memorize all the labels on the
training data
§ The test set accuracy is slightly worse than for the
linear models
§ Limiting the depth of the tree decreases
overfitting
DecisionTrees
§ Limiting the depth of the tree decreases
overfitting
§ If we don’t restrict the depth of a decision tree, the
tree can become arbitrarily deep and complex
§ Unpruned trees are therefore prone to overfitting
and not generalizing well to new data
§ Prepruning to the tree -
§ will stop developing the tree before we perfectly fit
to the training data
§ One option is to stop building the tree after a certain depth
has been reached
§ Set max_depth=4 - meaning only four consecutive
questions can be asked
§ This will lead to lower training accuracy and
improve test accuracy
DecisionTrees
§ Input -
§ Output -
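A minimal sketch of the pre-pruned variant, under the same assumptions about the split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Pre-pruning: at most four consecutive questions (splits) along any path.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("Training accuracy: {:.3f}".format(tree.score(X_train, y_train)))
print("Test accuracy: {:.3f}".format(tree.score(X_test, y_test)))
```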
Analyzing
DecisionTrees
§ Input -
§ Output -
Analyzing
DecisionTrees
§ The visualization provides a good description of how the
decision tree algorithm makes its predictions and can
be easily explained to nonexperts
§ With a tree of depth four, as seen here, the tree can
become a bit overwhelming.
§ Deeper trees are even harder to grasp
§ One method of inspecting the tree that may be helpful
is to find out which path most of the data actually
takes
Feature
importance in
DecisionTrees
Feature importance in trees
§ Instead of looking at the whole tree, some useful
properties can be used to summarize the tree
§ The most commonly used summary is feature
importance
§ it rates how important each feature is for the
decision a tree makes
§ It is a number between 0 and 1 for each feature,
§ 0 means “not used at all”
§ 1 means “perfectly predicts the target.”
DecisionTrees
§ Input -
§ Output -
DecisionTrees
§ Input-
§ Output-
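The feature-importance code and bar chart are not reproduced; a minimal sketch, assuming the depth-4 breast cancer tree from the earlier sketches:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# One importance value in [0, 1] per feature; the values sum to 1.
importances = tree.feature_importances_
print("Feature importances:", importances)

# Horizontal bar chart, one bar per feature.
plt.barh(np.arange(len(importances)), importances)
plt.yticks(np.arange(len(importances)), cancer.feature_names)
plt.xlabel("Feature importance")
plt.show()
```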
DecisionTrees
§ Worst radius is by far the most important feature
§ Note 1 -
§ If a feature has a low value in feature_importance_, it
doesn’t mean that this feature is uninformative
§ It only means that the feature was not picked by the tree,
likely because another feature encodes the same
information
§ Note 2 -
§ Feature importances are always positive
§ Note 3 -
§ The feature importances tell us that “worst radius” is
important, but not whether a high radius is
indicative of a sample being benign or
malignant
DecisionTrees
Regressor
§ Decision trees for regression, as implemented in
DecisionTreeRegressor
§ The usage and analysis of regression trees is very
similar to that of classification trees
§ The DecisionTreeRegressor is not able to
extrapolate -
§ make predictions outside of the range of the
training data
DecisionTrees
§ Input-
§ Output-
DecisionTrees
Regressor
§ Compare two simple models -
§ Decision Tree Regressor
§ Linear Regression
§ Rescale the prices using a logarithm
§ This doesn’t make a difference for the Decision
Tree Regressor, but it makes a big difference
for Linear Regression
§ After training the models and making predictions,
we apply the exponential map to undo the
logarithm transform
DecisionTrees
Regressor
DecisionTrees
Regressor
DecisionTrees
Regressor
§ The linear model approximates the data with a line and provides
quite a good forecast for the test data
DecisionTrees
Regressor
§ The tree model, on the other hand, makes perfect
predictions on the training data
§ We did not restrict the complexity of the tree, so it
learned the whole dataset by heart
§ Once we leave the data range for which the model
has data, the model simply keeps predicting the
last known point
§ The tree has no ability to generate “new”
responses, outside of what was seen in the training
data
§ This shortcoming applies to all models based
on trees
DecisionTrees
Regressor
(Strengths,
Weaknessesand
Parameters)
§ Parameters -
§ The parameters that control model complexity in
decision trees are the pre-pruning parameters
that stop the building of the tree before it is fully
developed
§ max_depth
§ max_leaf_nodes
§ min_samples_leaf
§ Choosing any one of these pre-pruning strategies is
sufficient to prevent overfitting
DecisionTrees
Regressor
(Strengths,
Weaknessesand
Parameters)
§ Strengths -
§ The resulting model can easily be visualized and
understood by nonexperts (at least for smaller trees)
§ Algorithms are completely invariant to scaling of the data
§ Each feature is processed separately
§ The splits of the data don't depend on scaling
§ NO preprocessing like normalization or standardization of
features is needed for decision tree algorithms
§ Decision trees work well when you have features that are
§ on completely different scales
§ a mix of binary and continuous features
DecisionTrees
Regressor
(Strengths,
Weaknessesand
Parameters)
§ Weaknesses -
§ Without the use of pre-pruning, they tend
to overfit and provide poor
generalization performance
Ensembles of
Decision
Trees
Random forests
Gradient boosted regression
trees
Ensembles of
DecisionTrees
Ensembles of Decision Trees
§ What are ensembles?
§ Ensembles are methods that combine multiple machine
learning models to create more powerful models
§ Two ensemble models that have proven to be effective on
a wide range of datasets for classification and regression
§ Random forests
§ Gradient boosted decision trees
§ Both use decision trees as their building blocks
RandomForests
Random Forest
§ Main drawback of decision trees is that they tend to
overfit the training data
§ Random forests are one way to address this problem
§ What?
§ A random forest is essentially a collection of decision
trees, where each tree is slightly different from the
others
§ Idea behind Random Forests -
§ Each tree might do a relatively good job of predicting,
but will likely overfit on part of the data
§ If we build many trees, all of which work well and
overfit in different ways
§ We can reduce the amount of overfitting by averaging
their results
RandomForest
Random Forest
§ Need to build many decision trees
§ Each tree should do an acceptable job of predicting
the target, and should also be different from the
other trees
§ Why Random Forest ?
§ Random forests get their name from injecting
randomness into the tree building to ensure each
tree is different
§ Two ways of randomizing
§ By selecting the data points used to build a tree
§ By selecting the features in each split test
RandomForest
Randomness in RandomForest is decided by
§ Bootstrap sample
§ Selection of features (max_features)
RandomForest
Building Random forests
§ Step 1 -
§ You need to decide on the number of trees to build
(n_estimators parameter)
§ Note -
§ Trees will be built completely independently from
each other
§ Algorithm will make different random choices for
each tree to make sure the trees are distinct
RandomForest
Bootstrap sample
§ To build a tree first we need to take a bootstrap
sample
§ How?
§ From our n_samples data points, we repeatedly draw a
sample randomly with replacement n_samples times
§ Replacement meaning the same sample can be picked
multiple times
§ Example on Boot Strap Sample -
§ Creating a bootstrap sample of the list ['a', 'b', 'c', 'd']
§ A possible bootstrap sample would be ['b', 'd', 'd', 'c']
§ Another possible sample would be ['d', 'a', 'd', 'a']
§ This will create a dataset that is as big as the original
dataset, but some data points will be missing from
it, and some will be repeated (see the sketch after this list)
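A minimal sketch of drawing a bootstrap sample with NumPy; the seed and the example list are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
data = ['a', 'b', 'c', 'd']

# Sampling n_samples times *with replacement*: the result is as big as the
# original list, but some items may repeat and others may be missing.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```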
RandomForest
§ Step 2 -
§ A decision tree is built based on this newly created
dataset
§ Instead of looking for the best test for each node, in each node
the algorithm randomly selects a subset of the features, and
it looks for the best possible test involving one of these
features
§ The number of features that are selected is controlled by the
max_features parameter.
§ This selection of a subset of features is repeated separately
in each node, so that each node in a tree can make a decision
using a different subset of the features
§ The bootstrap sampling leads to each decision tree in the
random forest being built on a slightly different dataset
§ Because of the selection of features in each node, each split in
each tree operates on a different subset of features
RandomForest
§ A critical parameter in this process is max_features
§ max_features = n_features means
§ that each split can look at all features in the dataset
§ NO randomness will be injected in the feature selection
§ max_features =1, means
§ that the splits have no choice at all on which feature to
test
§ max_features = HIGH means
§ that the trees in the random forest will be quite similar
§ they will be able to fit the data easily, using the most
distinctive features
§ max_features = LOW means
§ that the trees in the random forest will be quite
different
RandomForest
§Prediction
§ The random forest algorithm predicts by first making
a prediction for every tree in the forest
§ For regression -
§ Average - we can average these results of all the
decision trees to get our final prediction
§ For classification -
§ Soft voting -
§ Each Decision Tree makes a “soft” prediction,
providing a probability for each possible
output label
§ The probabilities predicted by all the trees are
averaged, and the class with the highest
probability is predicted
RandomForest
Analyzing random forests
§ Input –
§ The trees that are built as part of the random forest
are stored in the estimators_ attribute
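The analysis code is not shown; a minimal sketch of a five-tree forest on a two-moons dataset (the specific n_samples, noise, and random states are assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Two-moons toy data; a forest of five randomized trees.
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(X_train, y_train)

# The individual fitted trees live in the estimators_ attribute.
print("Number of trees:", len(forest.estimators_))
print("Test accuracy: {:.2f}".format(forest.score(X_test, y_test)))
```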
RandomForest
§ Input –
§ Decision boundaries learned by the five trees are quite
different
§ some of the training points that are plotted here were not
actually included in the training sets of the trees, due to
the bootstrap sampling
§ Note -
§ The random forest overfits less than any of the trees
individually
RandomForest
§ In any real application, we would use many more
trees (often hundreds or thousands), leading to
even smoother boundaries
RandomForest
§ Random forest consisting of 100 trees
§ Input –
§ Output -
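A minimal sketch of the 100-tree forest; the breast cancer dataset is assumed here as the real-world example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Training accuracy: {:.3f}".format(forest.score(X_train, y_train)))
print("Test accuracy: {:.3f}".format(forest.score(X_test, y_test)))
```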
RandomForest
RandomForest
(Strengths,
Weaknesses,and
Parameters)
Strengths -
§ They are very powerful
§ Works well without heavy tuning of the
parameters
§ Don’t require scaling of the data
RandomForest
(Strengths,
Weaknesses,and
Parameters)
§ Why is a single decision tree sometimes still used instead of a
random forest?
§ A single decision tree offers a compact, easily explained
representation of the decision-making process
RandomForest
(Strengths,
Weaknesses,and
Parameters)
Weaknesses -
§ It is basically impossible to interpret tens
or hundreds of trees in detail
§ Random forests tend to be deeper than
decision trees (because of the use of feature
subsets)
§ Building random forests on large datasets
might be somewhat time consuming
RandomForest
§Multi-Core Processing -
§ To increase the speed of building random
forests on large datasets
§ Use the n_jobs parameter to adjust the number
of cores to use
§ Using more CPU cores will result in linear
speedups
§ n_jobs=-1 to use all the cores in your computer
RandomForest
(Strengths,
Weaknesses,and
Parameters)
§ Parameters -
§ The important parameters to adjust are
§ n_estimators
§ max_features
§ Possibly pre-pruning options like max_depth
§ Note 1 -
§ For n_estimators, larger is always better
§ Thumb rule is to build as many as you have
time/memory for
§ Note 2 -
§ max_features - determines how random each
tree is
§ Smaller max_features reduces overfitting
§ Thumb rule is
§ max_features = sqrt(n_features) for classification
§ max_features = n_features for regression
RandomForest
(Strengths,
Weaknesses,and
Parameters)
§ Note 1 -
§ The more trees there are in the forest, the more robust it will be against the
choice of random state
§ Note 2 -
§ Random forests don’t tend to perform well on very high dimensional,
sparse data, such as text data
§ Linear models are best choice for very high dimensional and sparse data
§ Note 3 -
§ Random forests usually work well even on very large datasets
§ Note 4 -
§ Training can easily be parallelized over many CPU cores within a
powerful computer
§ Note 5 -
§ Random Forests are slower to train
§ Note 6 -
§ Random forests require more memory
§ Note 7 -
§ If time and memory are crucial linear models are best choice than
Random Forests
Gradient
Boosting
Gradient boosted regression trees
§ Also called gradient boosting machines
§ Another ensemble method -
§ combines multiple decision trees to create a more powerful model
§ Basic Idea -
§ Combine many simple models (weak learners)
§ Each weak learner (tree) can only provide good predictions on part of
the data
§ More and more trees are added iteratively to improve performance
§ Despite the “regression” in the name, these models can be used
for regression and classification
§ Gradient boosting works by building trees in a serial manner -
§ where each tree tries to correct the mistakes of the previous one
§ By default, there is no randomization in gradient boosted
regression trees
§ But, Strong pre-pruning is used
§ Gradient boosted trees often use very shallow trees, of depth
one to five
Gradient
Boosting
Advantages of Gradient Boosted Regression
Trees
§ Smaller in terms of memory (because the
trees are shallow)
§ Makes predictions faster
§ Gradient boosted trees are frequently winning
entries in machine learning competitions
§ Widely used in industry
§ A bit more sensitive to parameter settings than
random forests, but
§ Provide better accuracy if the parameters are set
correctly
Gradient
Boosting
Parameter of gradient boosting
§ Apart from Pre-pruning and Number of trees
(n_estimators)
§ Another important parameter of gradient boosting
is the learning_rate
§ Controls how strongly each tree tries to correct the
mistakes of the previous trees
§ Note 1 -
§ Higher learning_rate means each tree can make
stronger corrections, allowing for more complex
models
§ Note 2 -
§ Adding more trees to the ensemble, which can be
accomplished by increasing n_estimators, also increases
the model complexity
Gradient
Boosting
Gradient Boosting Classifier
§ Input –
§ Output -
Gradient
Boosting
§ Training accuracy of 100% - Overfitting
§ To reduce Overfit we can apply
§ Stronger pre-pruning (limiting the max depth)
§ Lower the learning rate
Gradient
Boosting
Pre-pruning
§ Input –
§ Output -
Gradient
Boosting
Learning_rate
§ Input –
§ Output -
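The three code screenshots (default model, stronger pre-pruning, lower learning rate) are not reproduced; a minimal combined sketch, assuming the breast cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Default settings (100 trees of depth 3, learning_rate=0.1) can overfit.
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("default      train/test: {:.3f} / {:.3f}".format(
    gbrt.score(X_train, y_train), gbrt.score(X_test, y_test)))

# Stronger pre-pruning: limit each tree to depth 1 (decision stumps).
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1).fit(X_train, y_train)
print("max_depth=1  train/test: {:.3f} / {:.3f}".format(
    gbrt.score(X_train, y_train), gbrt.score(X_test, y_test)))

# Lower learning rate: each tree makes weaker corrections.
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01).fit(X_train, y_train)
print("lr=0.01      train/test: {:.3f} / {:.3f}".format(
    gbrt.score(X_train, y_train), gbrt.score(X_test, y_test)))
```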
Gradient
Boosting
§ Feature Importance
§ Input –
§ Output –
Gradient
Boosting
§ Feature Importance -
§ Gradient boosting and random forests perform
well on similar kinds of data
§ Note -
§ First try random forests, which work quite robustly; if
prediction time is critical or the last bit of accuracy matters,
moving to GradientBoosting can help
§ Note -
§ If gradient boosting needs to be applied to a large-
scale problem, it is better to use the xgboost package
Strengths,
Weaknesses,
and Parameters
Strengths -
§ Most powerful and widely used models for
supervised learning
§ Algorithm works well without scaling and on a
mixture of binary and continuous features
Strengths,
Weaknesses,
and Parameters
§ Weaknesses -
§ They require careful tuning of the parameters
§ May take a long time to train
§ Does not work well on high-dimensional sparse
data
Strengths,
Weaknesses,
and Parameters
Parameters
§ max_depth
§ used to reduce the complexity of each tree
§ Usually max_depth is set very low
§ n_estimators
§ Unlike in random forests, a higher n_estimators is not always better:
§ increasing n_estimators in gradient boosting leads to
a more complex model, which may lead to
overfitting
§ Fit n_estimators depending on the time and memory budget,
and then search over different learning_rates
§ learning_rate
§ Controls the degree to which each tree is allowed to
correct the mistakes of the previous trees
Kernelized
SupportVector
Machines
The Kernelized Support
Vector Machines
The Kernel Trick
Understanding SVMs
Tuning SVM Parameters
Kernelized
SupportVector
Machines
Kernelized support vector machines
§ Kernelized support vector machines
§ Often just referred to as SVMs
§ Allows for more complex models that are not
defined simply by hyperplanes in the input space
§ Classification and regression
§ SVC – Classification
§ SVR - Regression
Kernelized
SupportVector
Machines
Kernelized
SupportVector
Machines
§ Terminology
§ Margin – Margin is the gap between the
hyperplane and the support vectors
§ Hyperplane – Hyperplanes are decision
boundaries that aid in classifying the data points
§ Support Vectors – Support Vectors are the data
points that are on or nearest to the hyperplane and
influence the position of the hyperplane
§ Kernel function – These are the functions used to
determine the shape of the hyperplane and
decision boundary
Kernelized
SupportVector
Machines
Linear models and nonlinear features
§ Linear models can be quite limiting in
low-dimensional spaces, as lines and hyperplanes
have limited flexibility
§ One way to make a linear model more flexible is by
adding more features
Kernelized
SupportVector
Machines
§ Input –
§ Output –
Kernelized
SupportVector
Machines
§ A linear model for classification can only separate
points using a line, and will not be able to do a
very good job on this dataset
§ Input -
Kernelized
SupportVector
Machines
Kernelized
SupportVector
Machines
§ Expand the set of input features
§ feature2 = feature1 ** 2 ---> (a non-linear
feature)
§ The square of the second feature, added as a new feature
§ Instead of representing each data point as a two-
dimensional point, (feature0, feature1)
§ We now represent it as a three-dimensional point,
(feature0, feature1, feature1 ** 2)
Kernelized
SupportVector
Machines
§ Example –
Kernelized
SupportVector
Machines
§ Input –
Kernelized
SupportVector
Machines
§ Output –
Kernelized
SupportVector
Machines
§ Input –
§ Output –
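A sketch of the feature expansion described above, continuing from the previous blobs sketch: add feature1 ** 2 as a third feature and fit a linear SVC in the expanded three-dimensional space (the linear decision plane in 3D corresponds to a nonlinear boundary in the original two features):
import numpy as np
from sklearn.svm import LinearSVC

X_new = np.hstack([X, X[:, 1:] ** 2])   # (feature0, feature1, feature1 ** 2)
linear_svm_3d = LinearSVC().fit(X_new, y)
print("Training accuracy in the expanded space:", linear_svm_3d.score(X_new, y))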
Kernelized
SupportVector
Machines
The kernel trick
§ Adding nonlinear features to the representation of our data
can make linear models much more powerful
§ Drawbacks
§ Which features to add?
§ Adding many features might make computation very
expensive
§ Kernel Trick -
§ It is a clever mathematical trick -
§ Allows us to learn a classifier in a higher-dimensional space without
actually computing the new representation
§ Works by directly computing the distance of the data points for the
expanded feature representation, without ever actually computing the
expansion
Kernelized
SupportVector
Machines
§ Two ways to map your data into a higher-
dimensional space in SVM’s (Types of Kernel)
§ Polynomial Kernel
§ Radial Basis Function (RBF) (or) Gaussian Kernel
Kernelized
SupportVector
Machines
§ Polynomial kernel
§ Computes all possible polynomials up to a certain
degree of the original features (like feature1 ** 2 *
feature2 ** 5)
§ Radial Basis Function (RBF)
§ Also known as Gaussian Kernel
§ A bit harder to explain -
§ as it corresponds to an infinite dimensional feature space
§ It considers all possible polynomials of all degrees
§ But the importance of the features decreases for
higher degrees
Kernelized
SupportVector
Machines
Understanding SVMs
§ During training, the SVM learns how important
each of the training data points is to represent the
decision boundary between the two classes
§ Typically only a subset of the training points
matter for defining the decision boundary
§ Ones that lie on the border between the classes
§ These are called support vectors
Kernelized
SupportVector
Machines
§ To make a prediction for a new point
§ The distance to each of the support vectors is
measured
§ A classification decision is made based on the
distances to the support vector and importance
of the support vectors which is learned during
training
§ the importance of the support vectors is stored in the dual_coef_ attribute of SVC
Kernelized
SupportVector
Machines
§ The distance between data points is measured by
the Gaussian kernel
§ k_rbf(x1, x2) = exp(-ɣ ǁ x1 - x2 ǁ²)
§ Here, x1 and x2 are data points
§ ǁ x1 - x2 ǁ denotes Euclidean distance
§ ɣ (gamma) is a parameter that controls the width of
the Gaussian kernel
Kernelized
SupportVector
Machines
§ Example –
§ Input -
Kernelized
SupportVector
Machines
§ The SVM yields a very smooth and nonlinear boundary
§ Output –
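A sketch of fitting an RBF-kernel SVC; the book uses a small handcrafted 2D dataset via mglearn, but any small 2D classification set (such as the blobs above) illustrates the same idea. C=10 and gamma=0.1 are assumed values:
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
print("Number of support vectors per class:", svm.n_support_)
print("dual_coef_ shape:", svm.dual_coef_.shape)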
Kernelized
SupportVector
Machines
Tuning SVM Parameters
§ Gamma parameter
§ Kernel coefficient
§ only used in case of rbf, poly and sigmoid kernels
§ Corresponds to the inverse of the width of the
Gaussian kernel (RBF)
§ The gamma parameter determines how far the influence of a single training example reaches, with low values corresponding to a far reach and high values to a limited reach
§ The wider the radius of the Gaussian kernel, the
further the influence of each training example
§ C parameter
§ Regularization parameter
§ It limits the importance of each point
Kernelized
SupportVector
Machines
§ Input –
§ Output –
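A sketch of the parameter sweep the Input/Output above presumably visualizes: fit an RBF SVC for several values of C and gamma on a small 2D dataset and compare training accuracy (the plotted decision boundaries are omitted here):
from sklearn.svm import SVC

for C in [0.1, 1, 1000]:
    for gamma in [0.1, 1, 10]:
        svm = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y)
        print(f"C={C}, gamma={gamma}: training accuracy = {svm.score(X, y):.2f}")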
Kernelized
SupportVector
Machines
Explanation -
§ Left to Right (Gamma Parameter)
§ Increase the value of the parameter gamma from
0.1 to 10
§ A small gamma means a large radius for the
Gaussian kernel -
§ which means that many points are considered close by
§ Smooth boundaries on the left
§ Boundaries that focus more on single points
towards the right
§ Gamma value -
§ Low value - decision boundary will vary slowly
§ High value - yields a more complex model
Kernelized
SupportVector
Machines
Explanation -
§ Top to bottom (C Parameter)
§ Increase the C parameter from 0.1 to 1000
§ C values -
§ Low value -
§ Restricted model
§ Decision boundary is nearly linear
§ Each data point has limited influence
§ High value -
§ Decision boundary bends to classify the data points (non-linear)
§ Each data point has a stronger influence on the model
Kernelized
SupportVector
Machines
§ Example - (Breast Cancer Dataset)
§ Input -
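A minimal sketch of fitting an RBF-kernel SVC with default parameters on the (unscaled) breast cancer data:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

svc = SVC()                     # RBF kernel by default
svc.fit(X_train, y_train)
print("Accuracy on training set:", svc.score(X_train, y_train))
print("Accuracy on test set:", svc.score(X_test, y_test))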
Kernelized
SupportVector
Machines
§ SVMs often perform quite well
§ Very sensitive
§ to the settings of the parameters
§ to the scaling of the data
§ Require all the features to vary on a similar
scale
Kernelized
SupportVector
Machines
Example -
§ Features in the Breast Cancer dataset are of completely different orders of magnitude
§ Input –
§ Output -
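A sketch of checking the feature magnitudes (the book visualizes this with a log-scale plot; printing the per-feature minima and maxima of the training set makes the same point):
# Per-feature minimum and maximum of the training set
print("Feature minima:", X_train.min(axis=0))
print("Feature maxima:", X_train.max(axis=0))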
Kernelized
SupportVector
Machines
Problem with SVM-
§ Features in the Breast Cancer dataset are of completely different orders of magnitude
§ This can have devastating effects for the kernel SVM
§ Solutions -
§ Preprocessing data for SVMs
§ Rescaling each feature so that they are all approximately on
the same scale
§ A common rescaling method for kernel SVMs is to scale the
data such that all features are between 0 and 1
Kernelized
SupportVector
Machines
§ MinMaxScaler preprocessing method
§ Input - (Training Dataset)
§ Output -
Kernelized
SupportVector
Machines
§ Input - (Test Data Set)
§ Input -
§ Output -
§ Scaling the data made a huge difference -
§ It led to underfitting -
§ where training and test set performance are quite similar
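A sketch of the rescaling step using MinMaxScaler (the slides may instead show the manual min-max computation from the book; the result is the same): fit the scaler on the training set only, apply the same transformation to the test set, and refit the SVC:
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

scaler = MinMaxScaler().fit(X_train)          # learn min/range on the training set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)      # same transformation for the test set

svc = SVC().fit(X_train_scaled, y_train)
print("Accuracy on training set:", svc.score(X_train_scaled, y_train))
print("Accuracy on test set:", svc.score(X_test_scaled, y_test))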
Kernelized
SupportVector
Machines
§ We can try increasing either C or gamma to fit a
more complex model
§ Input -
§ Output -
§ Increasing C allows us to improve the model
significantly, resulting in 97.2% accuracy
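A sketch of the more complex model, continuing from the scaled data above; C=1000 follows the book's example and is an assumed value:
svc = SVC(C=1000).fit(X_train_scaled, y_train)
print("Accuracy on training set:", svc.score(X_train_scaled, y_train))
print("Accuracy on test set:", svc.score(X_test_scaled, y_test))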
Strengths,
Weaknesses,
and Parameters
Strengths -
§ Kernelized support vector machines are powerful
models
§ Perform well on a variety of datasets
§ Allow for complex decision boundaries, even if
the data has only a few features
§ Work well on low-dimensional and high-
dimensional data (i.e., few and many features)
Strengths,
Weaknesses,
and Parameters
Weaknesses -
§ Don’t scale very well with the number of samples
§ Running an SVM on data with up to 10,000 samples
might work well, but working with datasets of size
100,000 or more can become challenging in terms of
runtime and memory usage
§ Require careful preprocessing of the data and
tuning of the parameters
§ SVM models are hard to inspect -
§ It can be difficult to understand why a particular
prediction was made
§ It is tricky to explain the model to a nonexpert
Strengths,
Weaknesses,
and Parameters
§ Note -
§ Try SVMs particularly if all of your features
represent measurements in similar units and they
are on similar scales
Strengths,
Weaknesses,
and Parameters
Parameters
§ Regularization parameter C
§ Choice of the kernel (Polynomial kernel or RBF
Kernel)
§ Kernel-specific parameters (e.g., gamma for the RBF kernel)
§ gamma and C both control the complexity of the model, with large values in either resulting in a more complex model
Uncertainty
Estimates
from
Classifiers
The Decision Function
Predicting Probabilities
Uncertainty
Estimates from
Classifiers
Uncertainty Estimates from Classifiers
§ In scikit-learn, classifiers provide uncertainty estimates of predictions
§ We are not only interested in which class a
classifier predicts for a certain test point, but also
how certain it is that this is the right class
§ Different kinds of mistakes lead to very different
outcomes in real-world applications
§ Testing for cancer
§ False positive prediction might lead to a patient
undergoing additional tests
§ False negative prediction might lead to a serious
disease not being treated
Uncertainty
Estimates from
Classifiers
§ Two different functions used to obtain uncertainty
estimates from classifiers:
§ decision_function
§ predict_proba
§ Most classifiers have at least one of them
§ Many classifiers have both
Uncertainty
Estimates from
Classifiers
§ GradientBoostingClassifier, which has both a decision_function and a predict_proba method:
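A sketch of the setup this section presumably uses: the book builds a small two-class circles dataset and fits a gradient boosting classifier, which provides both methods (class names simplified to the integers 0 and 1 here):
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)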
Uncertainty
Estimates from
Classifiers
The Decision Function in Gradient Boosting
§ In Binary classification
§ Return value of decision_function is of shape
(n_samples,), and it returns one floating-point
number for each sample:
§ Input -
§ Output -
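A sketch of inspecting decision_function in the binary case, continuing from the classifier above:
print("X_test.shape:", X_test.shape)
print("Decision function shape:", gbrt.decision_function(X_test).shape)   # (n_samples,)
print("Decision function:", gbrt.decision_function(X_test)[:6])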
Uncertainty
Estimates from
Classifiers
§ This value encodes how strongly the model believes a
data point to belong to the “positive” class, in this case
class 1
§ Input -
§ Output -
§ Positive values indicate a preference for the positive class (class 1)
§ Negative values indicate a preference for the negative class (class 0)
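A sketch showing that thresholding the decision function at zero reproduces the predictions (here the classes are the integers 0 and 1, so the comparison is direct):
import numpy as np

greater_zero = (gbrt.decision_function(X_test) > 0).astype(int)
print("Thresholded decision function matches predict:",
      np.all(greater_zero == gbrt.predict(X_test)))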
Uncertainty
Estimates from
Classifiers
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ Input – (The range of decision_function can be arbitrary and depends on the data and the model parameters)
§ Output -
§ Note 1 -
§ Arbitrary scaling makes the output of decision_function
often hard to interpret
Uncertainty
Estimates from
Classifiers
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
Predicting Probabilities
§ The output of predict_proba is a probability for
each class
§ Often more easily understood than the output of
decision_function
§ It is always of shape (n_samples, 2) for binary
classification:
§ Input -
§ Output -
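A sketch of predict_proba for the same classifier:
print("Shape of probabilities:", gbrt.predict_proba(X_test).shape)   # (n_samples, 2)
print("Predicted probabilities:")
print(gbrt.predict_proba(X_test[:6]))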
Uncertainty
Estimates from
Classifiers
§ The first entry in each row is the estimated
probability of the first class, and the second entry
is the estimated probability of the second class
§ Input -
§ Output -
Uncertainty
Estimates from
Classifiers
§ Because the probabilities for the two classes sum to
1, exactly one of the classes will be above 50%
certainty
§ That class is the one that is predicted
§ From the above example the classifier is relatively
certain for most points
§ How well the uncertainty actually reflects
uncertainty in the data depends on the model and
the parameters
§ Note 1 -
§ A model that is more overfitted tends to make more
certain predictions, even if they might be wrong
§ A model with less complexity usually has more
uncertainty in its predictions
Uncertainty
Estimates from
Classifiers
Calibrated model
§ A model is called calibrated if the reported
uncertainty actually matches how correct it is — in
a calibrated model, a prediction made with 70%
certainty would be correct 70% of the time
Uncertainty
Estimates from
Classifiers
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
Uncertainty in Multiclass Classification
§ decision_function and predict_proba methods also work in the
multiclass setting
§ In multiclass case, the shape of the decision_function is (n_samples,
n_classes)
§ each column provides a “certainty score” for each class, where
§ large score means that a class is more likely
§ small score means the class is less likely
§ Example –
§ Iris dataset
§ Input -
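A sketch of the multiclass case on the Iris dataset; learning_rate=0.01 and the random states follow the book's example and are assumptions:
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)
print("Decision function shape:", gbrt.decision_function(X_test).shape)   # (n_samples, 3)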
Uncertainty
Estimates from
Classifiers
§ Example -
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ Example -
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ Example - (predict_proba)
§ Has shape (n_samples, n_classes)
§ The class with the maximum probability is the predicted class
§ The probabilities of the possible classes for each
datapoint sum to 1
§ Input –
§ Output -
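A sketch of predict_proba in the multiclass case, continuing from the Iris classifier above; the argmax over columns recovers the prediction because the Iris classes are the integers 0, 1, 2:
import numpy as np

print("Predicted probabilities shape:", gbrt.predict_proba(X_test).shape)
print("Sums over rows:", gbrt.predict_proba(X_test)[:6].sum(axis=1))
print("Argmax of predict_proba matches predict:",
      np.all(np.argmax(gbrt.predict_proba(X_test), axis=1) == gbrt.predict(X_test)))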
Uncertainty
Estimates from
Classifiers
§ Example - (predict_proba)
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ predict_proba and decision_function always have shape (n_samples, n_classes), apart from decision_function in the binary case
§ In the binary case, decision_function has only one column, corresponding to the “positive” class, classes_[1]
Summary and
Outlook
Nearest neighbors
§ For small datasets
§ Good as a baseline
§ Easy to explain
Summary and
Outlook
Linear models
§ Go-to as a first algorithm to try
§ Good for very large datasets
§ Good for very high-dimensional data
Summary and
Outlook
Naive Bayes
§ Only for classification
§ Even faster than linear models
§ Good for very large datasets and high-dimensional
data
§ Often less accurate than linear models
Summary and
Outlook
Decision trees
§ Very fast
§ Don’t need scaling of the data
§ Can be visualized
§ Easily explained
Summary and
Outlook
Random forests
§ Nearly always perform better than a single decision
tree, very robust and powerful
§ Don’t need scaling of data
§ Not good for very high dimensional sparse data
Summary and
Outlook
Gradient boosted decision trees
§ Often slightly more accurate than random forests
§ Slower to train but faster to predict than random
forests
§ Smaller in memory
§ Need more parameter tuning than random forests
Summary and
Outlook
Support vector machines
§ Powerful for medium-sized datasets of features with
similar meaning
§ Require scaling of data
§ Sensitive to parameters
Summary and
Outlook
Neural networks
§ Can build very complex models, particularly for
large datasets
§ Sensitive to scaling of the data and to the choice of
parameters
§ Large models need a long time to train
Thank you
  • 1. Machine Learning Source: Introduction to Machine Learning with Python Authors: Andreas C. Müller and Sarah Guido
  • 3. Agenda Classification and Regression Generalization, Overfitting and Underfitting Relation of Model Complexity to Dataset Size K- Nearest Neighbors
  • 4. Agenda Linear Models Linear Models for Classification Naïve Bayes Classifiers Decision Trees
  • 5. Agenda Ensembles of Decision Tress Kernalized Support Vector Machines Uncertainity Estimates from Classifiers
  • 7. Supervised Learning § Supervised learning is used whenever we want to predict a certain outcome from a given input § Goal is to make accurate predictions for new, never-before-seen data § Supervised learning often requires human effort to build the training set, but afterward automates and often speeds up an otherwise laborious or infeasible task
  • 8. Classification and Regression § Two major types of supervised machine learning problems – § Classification § Regression
  • 9. Classification and Regression § Classification § Goal is to predict a class label, which is a choice from a predefined list of possibilities § Classification is sometimes separated into § Binary classification - § Distingution between two classes § Multiclass classification - § Which is classification between more than two classes § Example - § Binary Classification - § Classifying emails as either spam or not spam § Multiclass Classification - § Iris
  • 10. Classification and Regression Regression § Goal is to predict - § continuous number or a floating-point number in programming terms § Example - § Person’s annual income § Predicting the yield of a corn
  • 12. Generalization Generalization § If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set. § Always build a model that is able to generalize as accurately as possible Example - § Boat Buyers Prediction - § Goal is to send out promotional emails to people who are likely to actually make a purchase but not bother those customers who are not interested
  • 13. Generalization Example – Boat Buyers Prediction § If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat
  • 14. Generalization § Rule 1: Complex Rule (Complex Model) § If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat § We can make up many rules that work well on this data § Our goal is to find whether new customers are likely to buy a boat § We therefore want to find a rule that will work well for new customers, and achieving 100 percent accuracy on the training set does not help § The only way or measure of whether an algorithm will perform well on new data is the evaluation on the test set § Note: § Simple models are expected to generalise better to new data § Example: § “Customer older than 50 want to buy a boat” (Simple rule/Simple Model) § is simple rule which did not involve children and divorce features § So it is more generalized or simple model
  • 15. Overfitting Overfitting § Building a model that is too complex for the amount of information we have is called overfitting § Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data § Example - § Rule 1 - If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat
  • 16. Underfitting Underfitting § Rule 3 - § Everybody who owns a house buys a boat § Might not be able to capture all the aspects of and variability in the data, and your model will do badly even on the training set § If the model is too simple then it will lead to underfitting
  • 17. Tradeoff between Overfitting and Underfitting § The more complex we allow our model to be, the better we will be able to predict on the training data § But when we start focusing too much on each individual data point in our training set, and the model will not generalize well to new data § Sweet Spot - § will yield the best generalization performance § This is the model we want to find
  • 18. Relation of Model Complexity to Dataset Size Intro to Supevised Machine Learning Algorithms Classification Regression
  • 19. Relation of Model Complexity to Dataset Size Relation of Model Complexity to Dataset Size § Model complexity is tied to the variation of inputs contained in your training dataset § The larger variety of data points your dataset contains, the more complex a model you can use without overfitting § Collecting more data points will yield more variety § So larger datasets allow building more complex models
  • 20. Relation of Model Complexity to Dataset Size Example – Boat Purchase § Added 10,000 more rows of customer data § Rule 1 - § If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat § This will be a good rule than when it was developed using only the 12 rows § Note 1: § In the real world, we often have the ability to decide how much data to collect? § Large collection of data might be more beneficial than tweaking and tuning your model § Note 2: § Never understimate the power of more data
  • 21. Supervised Machine Learning Algorithms Introduction to Supervised Machine Learning Algorithms § Note: § Many of the machine learning algorithms have a classification and regression variant § Data Sets - § Some datasets will be small and synthetic § Some datasets will be large (real-world examples) § Forge Dataset (Classification Exampe) § A synthetic two-class classification dataset is the forge dataset has two features § Scatter plot § The plot has the first feature on the x-axis and the second feature on the y- axis § Each data point is represented as one dot § The color and shape of the dot indicates its class
  • 24. Supervised Machine Learning Algorithms § Synthetic wave dataset (Regression Example) § A single input feature and a continuous target variable (or response) § Shows the single feature on x-axis and the regression target (the output) on the y-axis
  • 26. Supervised Machine Learning Algorithms Note 1: § Any intution derived from datasets with few features (called low-dimensional datasets) might not hold in datasets with many features (called high-dimensional datasets)
  • 27. Supervised Machine Learning Algorithms Breast Cancer Example § Scikit-learn includes two realworld datasets § Wisconsin breast cancer dataset § Records clinical measurements of breast cancer tumors § Labeled as “benign” (for harmless tumors) § “Malignant” (for cancerous tumors) § Task is to learn to predict whether a tumor is malignant based on the measurements of the tissue
  • 29. Supervised Machine Learning Algorithms Note: § Datasets included in scikit-learn are usually stored as Bunch objects § which contain some information about the dataset as well as the actual data § Bunch Objects is that they behave like dictionaries
  • 30. Supervised Machine Learning Algorithms § The dataset consists of 569 data points, with 30 features each: § Input : § Output :
  • 31. Supervised Machine Learning Algorithms § Of these 569 data points, 212 are labeled as malignant and 357 as benign: § Input : § Output :
  • 32. Supervised Machine Learning Algorithms § To get a description of the semantic meaning of each f eature, we can have a look at the feature_names attribute: § Input : § Output :
  • 33. Supervised Machine Learning Algorithms Regression Example § Boston Housing dataset § The task associated with this dataset is to predict the median value of homes in several Boston neighborhoods in the 1970s with information such as § Crime rate § Proximity to the charles river § Highway accessibility
  • 34. Supervised Machine Learning Algorithms § The dataset contains 506 data points, described by 13 features § Input - § Output –
  • 35. Supervised Machine Learning Algorithms Load_extended_boston function § The dataset contains 506 data points, described by 104 features § 104 features are the 13 original features together with the 91 possible ccombinations of two features within those 13 (with replacement) § Input - § Output –
  • 37. k-Nearest Neighbors k-Nearest Neighbors § Simplest machine learning algorithm § Building the model consists only of storing the training dataset § To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset — its “nearest neighbors.”
  • 38. k-Nearest Neighbors Classification k-Neighbors classification § In the simplest version, the k-NN algorithm only considers exactly one nearest neighbor § i.e., Closest training data point to the point we want to make a prediction for § Prediction is then simply the known output for this training point
  • 40. k-Nearest Neighbors Classification § Added three new data points, shown as stars § Marked the closest point in the training set § The prediction of the one nearest-neighbor algorithm is the label of that point (shown by the color of the cross). § Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of neighbors § This is how the name of the k-nearest neighbors algorithm comes from
  • 41. k-Nearest Neighbors Classification § When considering more than one neighbor, we use voting to assign a label § This means that for each test point, we count how many neighbors belong to class 0 and how many neighbors belong to class 1 § Assign the class that is more frequent the major ity class among the k-nearest neighbors
  • 43. k-Nearest Neighbors Classification § Step 1 – § Step 2 – § Step 3 -
  • 44. k-Nearest Neighbors Classification Step 4 - § To make predictions on the test data, we call the predict method § Input – § Output - § Step 5 - § How well our model generalizes, we can call the score method § Input - § Output -
  • 45. k-Nearest Neighbors Classification Step 6 – (Analysis using visualization) § Visualization § Input - § Output -
  • 48. k-Nearest Neighbors Regression K-Neigh bors regr es si on ( Simpl e Example) § wave dataset § Added three test data points as green stars on the x-axis
  • 49. k-Nearest Neighbors Regression § Input - (Single Neighbour) § Output -
  • 50. k-Nearest Neighbors Regressor § Input - (Three Neighbours) § Output - (Prediction is the average or mean of the relevant neighbours)
  • 52. k-Nearest Neighbors Regressor § Evaluation - § Evaluate the model using the score method § For regressors returns the R^2 score § The R^2 score, also known as the coefficient of determination § is a measure of goodness of a prediction for a regression model § Yields a score between 0 and 1 § 1 corresponds to perfect prediction § 0 corresponds to a constant model (just predicts the mean of the training set) § Input - § Output –
  • 55. k-Nearest Neighbors Regressor § Using only a single neighbor, each point in the training set has an obvious influence on the predictions, and the predicted values go through all of the data points § More neighbors leads to smoother predictions, but these do not fit the training data as well
  • 56. k-Nearest Neighbors Classifier Strengths, weaknesses, and parameters § Two important parameters to the KNeighbors classifier § Number of neighbors - § Using a small number of neighbours like three or five often works well § you should certainly adjust this parameter § How you measure distance between data points § Euclidean Distance is used which works well in many settings
  • 57. k-Nearest Neighbors Strengths § Very easy to understand, implement § Often gives reasonable performance without a lot of adjustments § Good baseline method to try § Few hyperparameters Weaknesses § Model is usually very fast, but when your training set is very large (either in number of features or in number of samples) prediction can be slow § Mandatory to preprocess the data § Performs poorly with datasets consisting of many zeros (Sparse Datasets) § Lazy learning algorithm § Prone to overfitting § Prone to curse of dimensionality
  • 58. Linear Models Linear Regression (aka ordinary least squares) Ridge Regression Lasso Regression
  • 59. Linear Models § Introduction § Class of models that are widely used in practice § Studied extensively in the last few decades § With roots going back over a hundred years § Linear models make a prediction using a linear function of the input features § Building block for many complex machine learning algorithms, including deep neural networks § It assumes that the data is linearly separable and tries to learn the weight of each feature
  • 60. Linear Models § Linear Models for Regression § x[0] to x[p] denotes the features of a single data point § w and b are parameters of the model that are learned § ŷ is the prediction the model makes § Single feature § where w[0] is the slope and b is the y-axis offset § Note: § Predicted response being a weighted sum of the input features, with weights (which an be negative) given by entries of w
  • 61. Linear Models One -dimensional wave dataset § Input - § Output -
  • 62. Linear Models § Y-Intercept - § this is slightly below which you can also confirm in the image § Linear models for regression can be characterized as regression models for which the prediction § is a line for a single feature § A plane when using two features § Hyperplane in higher dimensions
  • 63. Linear Models § Note 1: § Using a straight line to make predictions is very restrictive § Note 2: § It is a strong assumption (somewhat unrealistic) that our target y is a linear combination of the features § Note 3: § Linear models are very powerful with datasets having many features § Note 4: § Many different models exist for regression § Difference between these models lies in § How the model parameters W and b are learned from the training data? § How the model complexity can be controlled?
  • 64. Linear Regression Linear regression (aka ordinary least squares) § Linear regression - § also known as Ordinary Least Squares (OLS) § Simplest and most classic linear method for regression § Linear regression finds the parameters w and b that § Minimize the mean squared error between predictions and the true regression targets, y, on the training set
  • 65. Linear Regression § Mean Squared Error § The mean squared error is the sum of the squared differences between the predictions and the true values, divided by the number of samples § Linear regression has no parameters § Which is a benefit § But it also has no way to control model
  • 67. Linear Regression § The “slope” parameters (w), also called weights or coefficients § Stored in the coef_ attribute § Offset or intercept (b) is stored in the intercept_ attribute
  • 68. Linear Regression Example - § Input - § Output - § The intercept_ attribute is always a single float number, while the coef_ attribute is a NumPy array with one entry per input feature
  • 69. Linear Regression Training and Test Score (R2) - § Input - § Output - Note - § R2 value of around 0.66 is not very good. § One-dimensional dataset there is a little danger of underfitting § Higher-dimensional datasets, linear models become more powerful and there is a chance of overfitting
  • 70. Linear Regression Boston Housing Dataset - § Consists of 506 samples and 105 derived features § Input - § Output -
  • 71. Linear Regression § Note - § T h e d i s c r e p a n c y b e t w e e n performance on the training set and test set is a clear sign of overfitting § Solution - § Find a model that allows us to control complexity § Some of the alternatives for linear models are Ridge Regression, Lasso Regression
  • 72. Ridge Regression Ridge regression § Ridge regression is also a linear model for regression § The formula it uses to make predictions is the same one used for ordinary least squares § Coefficients (w) are chosen not only so that they predict well on the training data, but also to fit an additional constraint § All entries of w should be close to zero § This means each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well § This constraint is an example of what is called as Regularization
  • 73. Ridge Regression Regularization - § It is a process of explicitly restricting a model to avoid overfitting § The kind of regularization used in Ridge Regression is L2 Regularization
  • 75. Ridge Regression §Note 1- § Training set score of Ridge is lower than for LinearRegression § Note 2- § Tes t s et s core of Ridge is greater than f or LinearRegression § Note 3- § Ridge is more restricted model, so it is less likely to overfit § Note 4- § a l e s s c o m p l e x m o d e l m e a n s w o r s e performance on the training set but better generalization § Note 5- § We are only interested in Generalization performance
  • 76. Ridge Regression §Note 6- § Ridge model makes a trade-off between the simpicity of the model (near-zero coefficients) and its performance on the training set. § Note 7- § The importance the model places on simplicity versus training set performance can be specified by the user using alpha parameter § Default value of alpha parameter is 1.0 § The optimum setting of alpha depends on the particular dataset we are using § Increase in value of alpha forces the coefficients to move more closer towards zero § Note 8- § Moving coefficients towards zero may decrease t ra i n i n g s e t p e r f o r m a n c e b u t m i g h t h e l p generalization
  • 78. Ridge Regression § For very small values of alpha, coefficients are barely restricted at all, and we end up with a model that resembles LinearRegression § Input - § Output -
  • 80. Ridge Regression § Regularization § Another way to understand the influence of regularization is to fix a value of alpha but vary the amount of training data available § Input – § Output -
  • 81. Ridge Regression § Note - § As more and more data becomes available to the model, both models improve § With enough training data, regularization becomes less important § Given enough data, ridge and linear regression will have the same performance
  • 82. Lasso Regression Lasso § An alternative to Ridge for regularizing linear regression is Lasso § Lasso also restricts coefficients to be close to zero called L1 regularization § W h e n u s i n g t h e l a s s o , s o m e coefficients are exactly zero
  • 83. Lasso Regression Advantages of Lasso § Form of automatic feature selection § Some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model
  • 84. Lasso Regression Disadvantages of Lasso § Some features are entirely ignored by the model
  • 85. Lasso Regression Extended Boston Housing dataset § Input - § Output -
  • 86. Lasso Regression § Lasso does quite badly, both on the training set and test set § Indicates that we are underfitting § It used only 4 of the 105 features § Lasso also has a regularization parameter, alpha, that controls how strongly coefficients are pushed toward zero § When we decerease the value of alpha, the maximum number of iterations to run need to be increased (max_iter)
  • 88. Lasso Regression § A lower alpha allowed us to fit a more complex model § This makes this model potentially easier to understand § If we set alpha too low, however, we again remove the effect of regularization and end up overfitting
  • 91. Lasso Regression § Ridge regression is usually the first choice between these two models § If you have a large amount of features and expect only a few of them to be important, lasso might be better choice § If we would like to have a model that is easy to interpret, lasso will provide a model that is easier to understand as it will select only a subset of the input features
  • 92. Linear models for classification Linear Models for classification Linear Models for multiclass classification
  • 93. Linear models for classification Linear models for classification § Linear models are also extensively used for classification § Binary Classification - § The formula looks very similar to the one for linear regression § Instead of just returning the weighted sum of the features, we threshold the predicted value at zero § Function is smaller than zero, we predict the class –1 § If it is larger than zero, we predict the class +1
  • 94. Linear models for classification Linear models for classification § For linear models for regression, the output, ŷ, is a linear function of the features: § a line § plane § hyperplane (in higher dimensions) § For linear models for classification separates two classes using a § line § plane § hyperplane § There are many algorithms for learning linear models § The way in which they measure how well a particular combination of coefficients and intercept fits the training data § What kind of regularization they use?
  • 95. Linear models for classification § The two most common linear classification algorithms are § L o g i s t i c r e g r e s s i o n i m p l e m e n t e d i n linear_model.LogisticRegression § Linear support vector machines (linear SVMs), implemented in svm.LinearSVC (SVC stands for support vector classifier)
  • 96. Linear models for classification Example – Despite its name logistic regression is a classification algorithm § Input - § Output -
  • 97. Linear models for classification § Note 1 - § Both the models are depicted with straight lines separating the areas classified by class 0 and class 1 § Note 2 - § Any new data point that lies above the black line will be classified as class 1 and point below the black line will be classified as class 0 § Note 3 - § The two models Linear SVC and Logistic Regression both come up with similar decision boundaries § Note 4 - § By default both models apply an L2 Regularization
  • 98. Linear models for classification § Trade-off parameter(“c”) § For LogisticRegression and LinearSVC the trade-off parameter that determines the strength of the regularization is called C § A h i g h v a l u e f o r t h e p a r a m e t e r C , LogisticRegression and LinearSVC try to fit the training set as best as possible § higher value of C stresses the importance that each individual data point be classified correctly § Low values of the parameter C, the models put more emphasis on finding a coefficient vector (w) that is close to zero § Using low values of C will cause the algorithms to try to adjust to the “majority” of data points
  • 99. Linear models for classification Example – Decision boundaries of Linear SVM for different values of C Input – Output -
  • 100. Linear models for classification § Left Graph - § Very small C - corresponds to a lot of regularization § Most of the points in class 0 are at the bottom, and most of the points in class 1 are at the top § The strongly regularized model chooses a relatively horizontal line, misclassifying two points § Center Graph - § Value of C is slightly higher § Model focuses more on the two misclassified samples, tilting the decision boundary § Right Graph - § Very high value of C in the model tilts the decision boundary a lot § Now correctly classifying all points in class 0 § One of the points in class 1 is still misclassified, as it is not possible to correctly classify all points in this dataset using a straight line. § The model illustrated on the righthand side tries hard to correctly classify all points, but might not capture the overall layout of the classes well. § In other words, this model is likely overfitting. § Similarly to the case of regression, linear models for classification might seem very restrictive in low-dimensional spaces, only allowing for decision boundaries that are straight lines or planes
  • 101. Linear models for classification Example – Breast Cancer § Input – § Output - § The default value of C=1 § Good training and test accuracy § Training and Test accuracy are very close - Likely to underfit
  • 102. Linear models for classification Example – § Input – § Output -
  • 103. Linear models for classification Example – § Input – § Output - Underfit
  • 104. Linear models for classification § As LogisticRegression applies an L2 regularization by default the result looks similar to that produced by RIDGE § Stronger regularization pushes coefficients more and more toward zero, though coefficients never become exactly zero § More interpretable model, using L1 regularization might help,as it limits the model to using only a few features
  • 105. Linear models for classification § Coefficients learned by the models with the three different settings of parameter C
  • 107. Linear models for classification § Input – (Lasso) § Output -
  • 109. Linear models for Multiclass Classification Linear models for multiclass classification § Many linear classification models are for binary classification only and dont extend naturally to the multiclass case § But, Logistic Regression is an exception § Techinique used to extend a binary classification algorithm to a multiclass classification algorithm is the one-vs-rest approach § A binary model is learned for each class that tries to separate that class from all of the other classes, resulting in as many binary models as there are classes § To make a prediction, all binary classifiers are run on a test point § The classifier that has the highest score on its single class “wins,” and this class label is returned as the prediction
  • 110. Linear models for Multiclass Classification Linear models for multiclass classification § Having one binary classifier per class results in having one vector of coefficients (w) and one intercept (b) for each class § The class for which the result of the classification confidence formula given here is highest § The mathematics behind multiclass logistic regression differ from one-vs-rest approach § but they also result in one coefficient vector and intercept § same method of making a prediction is applied § Classification confidence formula
  • 111. Linear models for Multiclass Classification Example – one vs rest § Input – § Output -
  • 112. Linear models for Multiclass Classification Example – § Input – § Output - § coef_ is (3, 2) § each row coefficient vector for one of the three classes § each column holds the coefficient value for a specific feature § The intercept_ is a one-dimensional array
  • 115. Strengths, Weaknesses, Parameters § First Decision - Regularization Parameters (alpha & c) § The main parameter of linear models is the regularization parameter § alpha in the regression models § C in classification models (linear svc and logistic regression) § Large values for alpha or small values for C mean simple models § For regression models, tuning these parameters is quite important § Second Decision - Regularization Techniques (L1 and L2) § Decision on what regularization is also important § L1 regularization § L2 regularization § When only a few of your features are actually important, you should use L1 § L1 can also be useful if interpretability of the model is important § As L1 will use only a few features, it is easier to explain which features are important to the model
  • 116. Strengths, Weaknesses, Parameters §Strengths § Linear models are very fast to train and also very fast to predict § They scale to very large datasets § Works well with sparse data § Linear models make it relatively easy to understand how a prediction is made, using the formulas we saw earlier for regression and classification
  • 117. Strengths, Weaknesses, Parameters §Weaknesses - § It is often not entirely clear why the coefficients take the values they do § If the dataset has highly correlated features, the coefficients can be especially hard to interpret § Note - § Linear models often perform well when the number of features is large compared to the number of samples § They are also often used on very large datasets, simply because training other models may not be feasible
  • 120. Naive Bayes Classifiers Naïve Bayes Classifiers § A family of classifiers that are quite similar to the linear models § Advantages § They tend to be even faster in training § Disadvantages § Generalization performance that is slightly worse than that of linear classifiers (i.e., LogisticRegression and LinearSVC)
  • 121. Naive Bayes Classifiers Naïve Bayes Classifiers § It is a probabilistic classifier, which means it predicts on the basis of the probability of an object § Mainly used in text classification with high-dimensional training datasets § The Naïve Bayes algorithm is made up of the two words Naïve and Bayes § Naïve: § It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features § Naive Bayes assumes that each parameter, also called a feature or predictor, has an independent capacity to predict the output variable § Example - § An apple is identified by Shape (Round), Color (Red), Taste (Sweet) - each feature contributes to the model independently § Bayes: § It is called Bayes because it depends on the principle of Bayes' Theorem (also called Bayes' Rule or Bayes' Law)
  • 122. Naive Bayes Classifiers Naïve Bayes Theorem § P(A|B) = P(B|A) * P(A) / P(B) § Where, § P(A|B) is the Posterior probability: probability of hypothesis A given the observed event B § P(B|A) is the Likelihood: probability of the evidence B given that hypothesis A is true § P(A) is the Prior probability: probability of the hypothesis before observing the evidence § P(B) is the Marginal probability: probability of the evidence
  • 123. Naive Bayes Classifiers Steps to solve Naïve Bayes § Convert the given dataset into frequency tables. § Generate Likelihood table by finding the probabilities of given features. § Now, use Bayes theorem to calculate the posterior probability
  • 127. Naive Bayes Classifiers § Problem § If the weather is sunny, should the player play or not? § Solution - apply the three steps above (see the sketch below)
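The frequency table referenced above is not reproduced on the slide, so the sketch below uses hypothetical weather/play counts purely to illustrate the three steps (frequency table, likelihoods, Bayes' theorem); with the real table the arithmetic is identical.

# Sketch: applying Bayes' theorem to "should the player play when the weather is sunny?"
# The counts below are hypothetical placeholders for the table referenced on the slide.
counts = {                      # (weather, play) -> number of observed days
    ("Sunny", "Yes"): 3, ("Sunny", "No"): 2,
    ("Rainy", "Yes"): 2, ("Rainy", "No"): 2,
    ("Overcast", "Yes"): 4, ("Overcast", "No"): 1,
}
total = sum(counts.values())
p_yes = sum(v for (w, p), v in counts.items() if p == "Yes") / total      # prior P(Yes)
p_sunny = sum(v for (w, p), v in counts.items() if w == "Sunny") / total  # evidence P(Sunny)
p_sunny_given_yes = counts[("Sunny", "Yes")] / sum(
    v for (w, p), v in counts.items() if p == "Yes")                      # likelihood P(Sunny|Yes)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny                   # posterior P(Yes|Sunny)
print("P(Yes | Sunny) =", round(p_yes_given_sunny, 3))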
  • 128. Naive Bayes Classifiers § The reason Naive Bayes models are so efficient is that they learn parameters by looking at each feature individually and collect simple per-class statistics from each feature
  • 129. Naive Bayes Classifiers § Types of Naive Bayes Classifiers § GaussianNB § BernoulliNB § MultinomialNB
  • 130. GaussianNB § GaussianNB § GaussianNB can be applied to any continuous data § GaussianNB stores the average value as well as the standard deviation of each feature for each class § If predictors take continuous values instead of discrete ones, the model assumes that these values are sampled from a Gaussian distribution § Gaussian Naive Bayes is a machine learning classification technique based on a probabilistic approach that assumes each class follows a normal distribution § The combination of the predictions for all parameters is the final prediction, which returns a probability of the dependent variable being classified in each group § The final classification is assigned to the group with the highest probability
  • 131. GaussianNB § The Gaussian model assumes that features follow a normal distribution § Normal Distribution - § Describes the distributions of continuous random variables in nature and is defined by its bell-shaped curve § A normal distribution has a probability distribution that is centered around the mean § This means that the distribution has more data around the mean § The data distribution decreases as you move away from the center § The resulting curve is symmetrical about the mean and forms a bell-shaped distribution
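A minimal GaussianNB sketch on continuous data; the Iris dataset is used here only as a stand-in for "any continuous data".

# Sketch: GaussianNB on continuous features (Iris used purely as an illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)   # stores per-class mean and variance of each feature
print("test accuracy:", gnb.score(X_test, y_test))
print("class probabilities for one sample:", gnb.predict_proba(X_test[:1]))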
  • 132. BernoulliNB § BernoulliNB § BernoulliNB assumes binary data § Used for discrete probability calculation § The predictor variables are the independent Boolean variables § Mostly used in text data classification/Document classification § Counts how often every feature of each class is not zero
  • 133. BernoulliNB § Example – § Four data points § Four binary features for each data point § Two classes, 0 and 1 § For class 0 (the first and third data points), the first feature is zero two times and nonzero zero times, the second feature is zero one time and nonzero one time, and so on (see the counting sketch below)
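The per-class counting described above can be reproduced directly; the sketch below uses a small binary dataset of the kind the slide describes (four points, four binary features, two classes).

# Sketch: per-class nonzero counts, the statistic BernoulliNB collects from binary features.
import numpy as np

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])      # two classes; the first and third points are class 0

counts = {}
for label in np.unique(y):
    # for each class, count how often each feature is nonzero
    counts[label] = X[y == label].sum(axis=0)
print("Feature counts per class:", counts)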
  • 134. MultinomialNB § MultinomialNB § MultinomialNB assumes count data § Example - each feature represents an integer count of something, like how often a word appears in a sentence § Mostly used in text data classification, like BernoulliNB § MultinomialNB takes into account the average value of each feature for each class
  • 135. Naive Bayes Classifiers § Note 1 - § To make a prediction, a data point is compared to the statistics for each of the classes, and the best matching class is predicted § Note 2 - § The prediction formula for MultinomialNB and BernoulliNB has the same form as in the linear models § Note 3 - § coef_ for the naive Bayes models has a different meaning than in the linear models
  • 136. Strengths, Weaknesses, and Parameters § Parameters - § MultinomialNB and BernoulliNB have a single parameter, alpha, which controls model complexity § A large alpha means more smoothing, resulting in less complex models § Note - § The algorithm's performance is relatively robust to the value of alpha § Setting alpha is NOT critical for good performance, but tuning it usually improves accuracy somewhat
  • 137. Strengths, Weaknesses, and Parameters § GaussianNB is mostly used on very high-dimensional data § BernoulliNB and MultinomialNB are widely used for sparse count data such as text § MultinomialNB usually performs better than BernoulliNB, particularly on datasets with a relatively large number of nonzero features
  • 138. Strengths, Weaknesses, and Parameters § Advantages - § Very fast to train and to predict § One of the simplest algorithms § Training procedure is easy to understand § Models work very well with high-dimensional sparse data and are relatively robust to the parameters § Great baseline models § Often used on very large datasets § Works well for both binary classification and multiclass classification problems § A strong choice for text classification problems
  • 139. Strengths, Weaknesses, and Parameters § Weakness- § Naive Bayes assumes that all features are independent or unrelated § so it cannot learn the relationship between features
  • 140. Decision Trees Building Decision trees Controlling complexity of Decision trees Feature importance in trees
  • 141. DecisionTrees § Widely used models for both classification and regression § They learn a hierarchy of if/else questions, leading to a decision Example – § Distinguish between the following four animals § Bears § Hawks § Hen § Dolphins
  • 143. DecisionTrees § Each node in the tree either § represents a question § Terminal node (also called a leaf) that contains the answer § In ML we build a model to distinguish between four classes of animals using the three features “has feathers,”“can fly,” and “has fins.”
  • 144. DecisionTrees § Building decision trees § Example - two_moons dataset § The dataset consists of two half-moon shapes, with each class consisting of 75 data points § Learning a decision tree means learning the sequence of if/else questions that gets us to the true answer most quickly § In machine learning, these if/else questions are called tests § Question format in the case of continuous data - § In real life, data does not come in the form of binary yes/no features as in the animal example § Data can be continuous in real-life situations § The tests that are used on continuous data are of the form "Is feature i larger than value a?"
  • 145. DecisionTrees § Splitting the dataset horizontally at x[1]=0.0596 yields the most information; it best separates the points in class 0 from the points in class 1 § The top node, also called the root, represents the whole dataset, consisting of 50 points belonging to class 0 and 50 points belonging to class 1
  • 147. DecisionTrees § The split is done by testing whether x[1] <= 0.0596 (test), indicated by a black line § If test is True - § Assigned to the left node, which contains 2 points belonging to class 0 and 32 points belonging to class 1 § If test is False - § Assigned to the right node, which contains 48 points belonging to class 0 and 18 points belonging to class 1 § Though the first split did a good job of separating the two classes, the bottom region still contains points belonging to class 0, and the top region still contains points belonging to class 1 § Figure 2-25 shows that the most informative next split for the left and the right region is based on x[0]
  • 148. DecisionTrees § This recursive process yields a binary tree of decisions, with each node containing a test § Each test splits the part of the data that is currently being considered along one axis § This yields a view of the algorithm as building a hierarchical partition § Each test concerns only a single feature § which results in partitions into regions that are always parallel to the axes § The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) only contains a single target value (a single class or a single regression value) § Pure Leaves- § The leaf of the tree that contains data points that all share the same target value is called PURE
  • 149. DecisionTrees § The above fig is the final partition § A prediction on a new data point is made by checking which region of the partition of the feature space the point lies in, and then predicting the majority target in that region § It is also possible to use trees for regression tasks § Where the output for this data point is the mean target of the training points in this leaf
  • 150. DecisionTrees Controlling complexity of decision trees § Drawback - § Building a tree until all leaves are PURE leads to models that are very complex and highly overfit to the training data § The overfitting can be seen on the left of Figure 2-26 § We can see a small strip predicted as class 0 around the point belonging to class 1
  • 151. DecisionTrees Controlling complexity of decision trees § Common strategies to prevent overfitting § Pre-pruning - § Stopping the creation of the tree early § Possible criteria for pre-pruning § Limiting the maximum depth of the tree § Limiting the maximum number of leaves § Requiring a minimum number of data points in a node to keep splitting it § Post-pruning - § Building the tree and then removing or collapsing nodes that contain little information § Also called simply pruning
  • 152. DecisionTrees § Decision trees in scikit-learn are implemented in the § DecisionTreeRegressor § DecisionTreeClassifier § Scikit-learn only implements pre-pruning but NOT post-pruning
  • 153. DecisionTrees Breast Cancer dataset § Import the dataset and split it into a training and a test part § Then we build a model using the default setting of fully developing the tree (growing the tree until all leaves are pure) § We fix the random_state in the tree, which is used for tie-breaking internally § Input - § Output -
  • 154. DecisionTrees § The accuracy on the training set is 100% — because the leaves are pure § The tree was grown deep enough that it could perfectly memorize all the labels on the training data § The test set accuracy is slightly worse than for the linear models § Limiting the depth of the tree decreases overfitting
  • 155. DecisionTrees § Limiting the depth of the tree decreases overfitting § If we don't restrict the depth of a decision tree, the tree can become arbitrarily deep and complex § Unpruned trees are therefore prone to overfitting and not generalizing well to new data § Pre-pruning the tree - § will stop developing the tree before we perfectly fit to the training data § One option is to stop building the tree after a certain depth has been reached § Set max_depth=4 - meaning only four consecutive questions can be asked § This lowers the training accuracy but improves the test accuracy (see the sketch below)
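A sketch of the unpruned versus pre-pruned comparison, assuming the scikit-learn breast cancer loader; the exact scores depend on the split and library version.

# Sketch: unpruned tree vs. pre-pruned tree (max_depth=4) on the Breast Cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

for depth in [None, 4]:     # None = grow the tree until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print("max_depth={!s:<5} train: {:.3f}  test: {:.3f}".format(
        depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))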
  • 158. Analyzing DecisionTrees § The example provides a good description for the decision tree machine learning algorithm which can be easily explained to nonexperts § With a tree of depth four, as seen here, the tree can become a bit overwhelming. § Deeper trees are even harder to grasp § One method of inspecting the tree that may be helpful is to find out which path most of the data actually takes
  • 159. Feature importance in DecisionTrees Feature importance in trees § Instead of looking at the whole tree, some useful properties can be used to summarize the tree § The most commonly used summary is feature importance § it rates how important each feature is for the decision a tree makes § It is a number between 0 and 1 for each feature, § 0 means “not used at all” § 1 means “perfectly predicts the target.”
  • 162. DecisionTrees § Worst radius is by far the most important feature § Note 1 - § If a feature has a low value in feature_importance_, it doesn’t mean that this feature is uninformative § It only means that the feature was not picked by the tree, likely because another feature encodes the same information § Note 2 - § Feature importances are always positive § Note 3 - § The feature importances tell us that “worst radius” is important, but not whether a high radius is indicative of a sample being benign or malignant
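A sketch of reading feature_importances_ from a pruned tree on the Breast Cancer dataset; the exact ranking can differ slightly with the split and library version.

# Sketch: inspecting feature_importances_ of a pruned tree on the Breast Cancer dataset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
importances = tree.feature_importances_       # one non-negative number per feature, summing to 1
for idx in np.argsort(importances)[::-1][:5]: # the five most important features
    print("{:<25} {:.3f}".format(cancer.feature_names[idx], importances[idx]))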
  • 163. DecisionTrees Regressor § Decision trees for regression, as implemented in DecisionTreeRegressor § The usage and analysis of regression trees is very similar to that of classification trees § The DecisionTreeRegressor is not able to extrapolate - § make predictions outside of the range of the training data
  • 165. DecisionTrees Regressor § Compare two simple models - § Decision Tree Regressor § Linear Regression § Rescale the prices using a logarithm § This doesn’t make a difference for the Decision Tree Regressor, but it makes a big difference for Linear Regression § After training the models and making predictions, we apply the exponential map to undo the logarithm transform
  • 168. DecisionTrees Regressor § The linear model approximates the data with a line and provides quite a good forecast for the test data
  • 169. DecisionTrees Regressor § The tree model, on the other hand, makes perfect predictions on the training data § We did not restrict the complexity of the tree, so it learned the whole dataset by heart § Once we leave the data range for which the model has data, the model simply keeps predicting the last known point § The tree has no ability to generate “new” responses, outside of what was seen in the training data § This shortcoming applies to all models based on trees
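The price data behind this figure is not included here, so the sketch below substitutes a synthetic, exponentially decaying price series purely to illustrate the behaviour: the tree repeats its last known value outside the training range, while the linear model fitted on log-prices keeps following the trend.

# Sketch: tree vs. linear regression on log-transformed prices (synthetic data for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

dates = np.arange(2000, 2020, 0.25)               # hypothetical time axis
prices = 1000 * np.exp(-0.3 * (dates - 2000))     # hypothetical exponentially falling prices

train = dates < 2015                              # train on the past, forecast the future
X_train, y_train = dates[train].reshape(-1, 1), np.log(prices[train])

tree = DecisionTreeRegressor().fit(X_train, y_train)
linreg = LinearRegression().fit(X_train, y_train)

# undo the log transform; beyond 2015 the tree simply repeats its last leaf value
print("tree forecast for 2019:", np.exp(tree.predict([[2019.0]]))[0])
print("linear forecast for 2019:", np.exp(linreg.predict([[2019.0]]))[0])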
  • 170. DecisionTrees Regressor (Strengths, Weaknesses and Parameters) § Parameters - § The parameters that control model complexity in decision trees are the pre-pruning parameters that stop the building of the tree before it is fully developed § max_depth § max_leaf_nodes § min_samples_leaf § Setting one of these parameters is sufficient to prevent overfitting
  • 171. DecisionTrees Regressor (Strengths, Weaknesses and Parameters) § Strengths - § The resulting model can easily be visualized and understood by nonexperts (at least for smaller trees) § The algorithms are completely invariant to scaling of the data § Each feature is processed separately § The splits of the data don't depend on scaling § NO preprocessing like normalization or standardization of features is needed for decision tree algorithms § Decision trees work well when you have features that are § on completely different scales § a mix of binary and continuous features
  • 172. DecisionTrees Regressor (Strengths, Weaknesses and Parameters) § Weaknesses - § Without the use of pre-pruning, they tend to overfit and provide poor generalization performance
  • 174. Ensembles of DecisionTrees Ensembles of Decision Trees § What are ensembles? § Ensembles are methods that combine multiple machine learning models to create more powerful models § Two ensemble models that have proven to be effective on a wide range of datasets for classification and regression § Random forests § Gradient boosted decision trees § Both use decision trees as their building blocks
  • 175. RandomForests Random Forest § Main drawback of decision trees is that they tend to overfit the training data § Random forests are one way to address this problem § What? § A random forest is essentially a collection of decision trees, where each tree is slightly different from the others § Idea behind Random Forests - § Each tree might do a relatively good job of predicting, but will likely overfit on part of the data § If we build many trees, all of which work well and overfit in different ways § We can reduce the amount of overfitting by averaging their results
  • 176. RandomForest Random Forest § Need to build many decision trees § Each tree should do an acceptable job of predicting the target, and should also be different from the other trees § Why Random Forest ? § Random forests get their name from injecting randomness into the tree building to ensure each tree is different § Two ways of randomizing § By selecting the data points used to build a tree § By selecting the features in each split test
  • 177. RandomForest Randomness in RandomForest is decided by § Bootstrap sample § Selection of features (max_features)
  • 178. RandomForest Building Random forests § Step 1 - § You need to decide on the number of trees to build (n_estimators parameter) § Note - § Trees will be built completely independently from each other § Algorithm will make different random choices for each tree to make sure the trees are distinct
  • 179. RandomForest Bootstrap sample § To build a tree, we first need to take a bootstrap sample § How? § From our n_samples data points, we repeatedly draw a sample randomly with replacement, n_samples times § Replacement means the same sample can be picked multiple times § Example of a bootstrap sample - § Creating a bootstrap sample of the list ['a', 'b', 'c', 'd'] § A possible bootstrap sample would be ['b', 'd', 'd', 'c'] § Another possible sample would be ['d', 'a', 'd', 'a'] § This creates a dataset that is as big as the original dataset, but some data points will be missing from it, and some will be repeated (see the sketch below)
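A minimal sketch of drawing a bootstrap sample with NumPy; the particular sample drawn depends on the random seed.

# Sketch: drawing a bootstrap sample (sampling with replacement) from a small list.
import numpy as np

rng = np.random.RandomState(0)
data = np.array(['a', 'b', 'c', 'd'])
bootstrap = rng.choice(data, size=len(data), replace=True)  # same size, repeats allowed
print("bootstrap sample:", list(bootstrap))
print("missing from sample:", sorted(set(data) - set(bootstrap)))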
  • 180. RandomForest § Step 2 - § A decision tree is built based on this newly created dataset § Instead of looking for the best test for each node, in each node the algorithm randomly selects a subset of the features, and it looks for the best possible test involving one of these features § The number of features that are selected is controlled by the max_features parameter. § This selection of a subset of features is repeated separately in each node, so that each node in a tree can make a decision using a different subset of the features § The bootstrap sampling leads to each decision tree in the random forest being built on a slightly different dataset § Because of the selection of features in each node, each split in each tree operates on a different subset of features
  • 181. RandomForest § A critical parameter in this process is max_features § max_features = n_features means § that each split can look at all features in the dataset § NO randomness will be injected in the feature selection § max_features =1, means § that the splits have no choice at all on which feature to test § max_features = HIGH means § that the trees in the random forest will be quite similar § they will be able to fit the data easily, using the most distinctive features § max_features = LOW means § that the trees in the random forest will be quite different
  • 182. RandomForest §Prediction § The random forest algorithm predicts by first making a prediction for every tree in the forest § For regression - § Average - we average the results of all the decision trees to get our final prediction § For classification - § Soft voting - § Each decision tree makes a "soft" prediction, providing a probability for each possible output label § The probabilities predicted by all the trees are averaged, and the class with the highest probability is predicted
  • 183. RandomForest Analyzing random forests § Input – § The trees that are built as part of the random forest are stored in the estimators_ attribute
  • 184. RandomForest § Input – § Decision boundaries learned by the five trees are quite different § some of the training points that are plotted here were not actually included in the training sets of the trees, due to the bootstrap sampling § Note - § The random forest overfits less than any of the trees individually
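A sketch of the five-tree forest described above, assuming the two_moons dataset; individual tree and forest accuracies will vary with the random seeds.

# Sketch: a small random forest of five trees on the two_moons data.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(X_train, y_train)
print("number of trees stored in estimators_:", len(forest.estimators_))
print("forest test accuracy: {:.3f}".format(forest.score(X_test, y_test)))
for i, tree in enumerate(forest.estimators_):   # each tree typically overfits in a different way
    print("tree", i, "test accuracy: {:.3f}".format(tree.score(X_test, y_test)))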
  • 185. RandomForest § In any real application, we would use many more trees (often hundreds or thousands), leading to even smoother boundaries
  • 186. RandomForest § Random forest consisting of 100 trees § Input – § Output -
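A sketch of the 100-tree random forest on the Breast Cancer dataset; the n_jobs=-1 setting (use all CPU cores) is optional and only affects training speed.

# Sketch: random forest with 100 trees on the Breast Cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
print("train accuracy: {:.3f}".format(forest.score(X_train, y_train)))
print("test accuracy: {:.3f}".format(forest.score(X_test, y_test)))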
  • 188. RandomForest (Strengths, Weaknesses, and Parameters) Strengths - § They are very powerful § Work well without heavy tuning of the parameters § Don't require scaling of the data
  • 189. RandomForest (Strengths, Weaknesses, and Parameters) § Why is a single decision tree sometimes still used instead of a random forest? § A single decision tree offers a compact representation of the decision-making process, which can be explained to nonexperts
  • 190. RandomForest (Strengths, Weaknesses, and Parameters) Weaknesses - § It is basically impossible to interpret tens or hundreds of trees in detail § Trees in a random forest tend to be deeper than single decision trees (because of the use of feature subsets) § Building random forests on large datasets might be somewhat time consuming
  • 191. RandomForest §Multi-Core Processing - § To increase the speed of building random forests on large datasets § Use the n_jobs parameter to adjust the number of cores to use § Using more CPU cores will result in linear speedups § n_jobs=-1 to use all the cores in your computer
  • 192. RandomForest (Strengths, Weaknesses, and Parameters) § Parameters - § The important parameters to adjust are § n_estimators § max_features § Possibly pre-pruning options like max_depth § Note 1 - § For n_estimators, larger is always better § Rule of thumb: build as many trees as you have time/memory for § Note 2 - § max_features determines how random each tree is § A smaller max_features reduces overfitting § Rule of thumb: § max_features = sqrt(n_features) for classification § max_features = n_features for regression
  • 193. RandomForest (Strengths, Weaknesses, and Parameters) § Note 1 - § The more trees there are in the forest, the more robust it will be against the choice of random_state § Note 2 - § Random forests don't tend to perform well on very high-dimensional, sparse data, such as text data § Linear models are a better choice for very high-dimensional, sparse data § Note 3 - § Random forests usually work well even on very large datasets § Note 4 - § Training can easily be parallelized over many CPU cores within a powerful computer § Note 5 - § Random forests are slower to train than linear models § Note 6 - § Random forests require more memory than linear models § Note 7 - § If time and memory are at a premium, linear models are often a better choice than random forests
  • 194. Gradient Boosting Gradient boosted regression trees § Also called gradient boosting machines § Another ensemble method - § combines multiple decision trees to create a more powerful model § Basic Idea - § Combine many simple models (weak learners) § Each weak learner (tree) can only provide good predictions on part of the data § More and more trees are added iteratively to improve performance § Despite the "regression" in the name, these models can be used for both regression and classification § Gradient boosting works by building trees in a serial manner - § where each tree tries to correct the mistakes of the previous one § By default, there is no randomization in gradient boosted regression trees § Instead, strong pre-pruning is used § Gradient boosted trees often use very shallow trees, of depth one to five
  • 195. Gradient Boosting Advantages of Gradient Boosted Regression Trees § Smaller in terms of memory (because of the shallow trees) § Make predictions faster § Gradient boosted trees are frequently winning entries in machine learning competitions § Widely used in industry § Main caveat: a bit more sensitive to parameter settings than random forests, but provide better accuracy if the parameters are set correctly
  • 196. Gradient Boosting Parameters of gradient boosting § Apart from pre-pruning and the number of trees (n_estimators) § Another important parameter of gradient boosting is the learning_rate § Controls how strongly each tree tries to correct the mistakes of the previous trees § Note 1 - § A higher learning_rate means each tree can make stronger corrections, allowing for more complex models § Note 2 - § Adding more trees to the ensemble, which can be accomplished by increasing n_estimators, also increases the model complexity
  • 198. Gradient Boosting § Training accuracy of 100% - overfitting § To reduce overfitting we can apply § Stronger pre-pruning (limiting the max depth) § A lower learning rate (see the sketch below)
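A sketch of the two remedies above, assuming the Breast Cancer dataset: default settings versus max_depth=1 versus learning_rate=0.01; the exact scores depend on the split and library version.

# Sketch: reducing overfitting in gradient boosting via max_depth and learning_rate.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

for kwargs in [{}, {"max_depth": 1}, {"learning_rate": 0.01}]:
    gbrt = GradientBoostingClassifier(random_state=0, **kwargs).fit(X_train, y_train)
    print(kwargs or "defaults",
          "train: {:.3f}".format(gbrt.score(X_train, y_train)),
          "test: {:.3f}".format(gbrt.score(X_test, y_test)))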
  • 201. Gradient Boosting § Feature Importance § Input – § Output –
  • 202. Gradient Boosting § Feature Importance - § Gradient boosting and random forests perform well on similar kinds of data § Note - § First try random forests, which work quite robustly; if prediction time is at a premium or the last bit of accuracy matters, moving to gradient boosting can help § Note - § If gradient boosting needs to be applied to a large-scale problem, it may be better to use the xgboost package
  • 203. Strengths, Weaknesses, and Parameters Strengths - § Gradient boosted decision trees are among the most powerful and widely used models for supervised learning § The algorithm works well without scaling and on a mixture of binary and continuous features
  • 204. Strengths, Weaknesses, and Parameters § Weaknesses - § They require careful tuning of the parameters § May take a long time to train § Does not work well on high-dimensional sparse data
  • 205. Strengths, Weaknesses, and Parameters Parameters § max_depth § used to reduce the complexity of each tree § Usually max_depth is set very low § n_estimators § Unlike in random forests, a higher n_estimators is not always better § increasing n_estimators in gradient boosting leads to a more complex model, which may lead to overfitting § Common practice: fit n_estimators depending on the time and memory budget, and then search over different learning_rates § learning_rate § Controls the degree to which each tree is allowed to correct the mistakes of the previous trees
  • 206. Kernelized SupportVector Machines The Kernelized Support Vector Machines The Kernel Trick Understanding SVMs Tuning SVM Parameters
  • 207. Kernelized SupportVector Machines Kernelized support vector machines § Kernelized support vector machines § Often just referred to as SVMs § Allows for more complex models that are not defined simply by hyperplanes in the input space § Classification and regression § SVC – Classification § SVR - Regression
  • 209. Kernelized SupportVector Machines § Terminology § Margin – the gap between the hyperplane and the support vectors § Hyperplane – hyperplanes are decision boundaries that aid in classifying the data points § Support Vectors – the data points that are on or nearest to the hyperplane and influence its position § Kernel function – the functions used to determine the shape of the hyperplane and decision boundary
  • 210. Kernelized SupportVector Machines Linear models and nonlinear features § Linear models can be quite limiting in low-dimensional spaces, as lines and hyperplanes have limited flexibility § One way to make a linear model more flexible is by adding more features
  • 212. Kernelized SupportVector Machines § A linear model for classification can only separate points using a line, and will not be able to do a very good job on this dataset § Input -
  • 214. Kernelized SupportVector Machines § Expand the set of input features § feature2 = feature1 ** 2 ---> (non-linear feature) § i.e., the square of the second feature is added as a new feature § Instead of representing each data point as a two-dimensional point (feature0, feature1) § We now represent it as a three-dimensional point (feature0, feature1, feature1 ** 2) (see the sketch below)
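A sketch of the feature-expansion idea, assuming a four-blob dataset collapsed to two classes so that a straight line cannot separate them well; adding feature1 ** 2 gives the linear classifier a third dimension to work with.

# Sketch: adding a squared feature so a linear classifier can separate the classes better.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(centers=4, random_state=8)
y = y % 2                                   # collapse to two classes that a line separates poorly

X_new = np.hstack([X, X[:, 1:] ** 2])       # add feature1 ** 2 as a third feature
clf_2d = LinearSVC(max_iter=10000).fit(X, y)
clf_3d = LinearSVC(max_iter=10000).fit(X_new, y)
print("accuracy with 2 features: {:.3f}".format(clf_2d.score(X, y)))
print("accuracy with added squared feature: {:.3f}".format(clf_3d.score(X_new, y)))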
  • 219. Kernelized SupportVector Machines The kernel trick § Adding nonlinear features to the representation of our data can make linear models much more powerful § Drawbacks § Which features to add? § Adding many features might make computation very expensive § Kernel Trick - § It is a clever mathematical trick - § Allows us to learn a classifier in a higher-dimensional space without actually computing the new representation § Works by directly computing the distance of the data points for the expanded feature representation, without ever actually computing the expansion
  • 220. Kernelized SupportVector Machines § Two ways to map your data into a higher-dimensional space in SVMs (types of kernel) § Polynomial Kernel § Radial Basis Function (RBF) (or) Gaussian Kernel
  • 221. Kernelized SupportVector Machines § Polynomial kernel § Computes all possible polynomials up to a certain degree of the original features (like feature1 ** 2 * feature2 ** 5) § Radial Basis Function (RBF) § Also known as Gaussian Kernel § A bit harder to explain - § as it corresponds to an infinite dimensional feature space § It considers all possible polynomials of all degrees § But the importance of the features decreases for higher degrees
  • 222. Kernelized SupportVector Machines Understanding SVMs § During training, the SVM learns how important each of the training data points is to represent the decision boundary between the two classes § Typically only a subset of the training points matter for defining the decision boundary § Ones that lie on the border between the classes § These are called support vectors
  • 223. Kernelized SupportVector Machines § To make a prediction for a new point § The distance to each of the support vectors is measured § A classification decision is made based on the distances to the support vector and importance of the support vectors which is learned during training § importance of support vectors is stored in an attribute called dual_coef_ attribute of svc
  • 224. Kernelized SupportVector Machines § The distance between data points is measured by the Gaussian (RBF) kernel § k_rbf(x1, x2) = exp(-ɣ ‖x1 - x2‖²) § Here, x1 and x2 are data points § ‖x1 - x2‖ denotes Euclidean distance § ɣ (gamma) is a parameter that controls the width of the Gaussian kernel
  • 226. Kernelized SupportVector Machines § The SVM yields a very smooth and nonlinear boundary § Output –
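A sketch of an RBF-kernel SVC and its support vectors; the two_moons dataset and the C=10, gamma=0.1 settings are illustrative choices.

# Sketch: RBF-kernel SVC and its support vectors on the two_moons data.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
svm = SVC(kernel="rbf", C=10, gamma=0.1).fit(X, y)

print("training accuracy: {:.3f}".format(svm.score(X, y)))
print("number of support vectors per class:", svm.n_support_)
print("dual_coef_ shape:", svm.dual_coef_.shape)   # learned importance of each support vector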
  • 227. Kernelized SupportVector Machines Tuning SVM Parameters § Gamma parameter § Kernel coefficient § only used in the case of rbf, poly and sigmoid kernels § Corresponds to the inverse of the width of the Gaussian kernel (RBF) § The gamma parameter determines how far the influence of a single training example reaches, with low values corresponding to a far reach and high values to a limited reach § The wider the radius of the Gaussian kernel, the further the influence of each training example § C parameter § Regularization parameter § It limits the importance of each point
  • 229. Kernelized SupportVector Machines Explanation - § Left to Right (Gamma Parameter) § Increase the value of the parameter gamma from 0.1 to 10 § A small gamma means a large radius for the Gaussian kernel - § which means that many points are considered close by § Smooth boundaries on the left § Boundaries that focus more on single points towards the right § Gamma Value - § Low value - the decision boundary will vary slowly § High value - yields a more complex model
  • 230. Kernelized SupportVector Machines Explanation - § Top to Bottom (C Parameter) § Increase the C parameter from 0.1 to 1000 § C values - § Low Value - § Restricted model § Decision boundary is nearly linear § Each data point will have limited influence § High Value - § Decision boundary bends to classify the data points (non-linear) § Each data point has a stronger influence on the model
  • 231. Kernelized SupportVector Machines § Example - (Breast Cancer Dataset) § Input -
  • 232. Kernelized SupportVector Machines § SVMs often perform quite well § Very sensitive § to the settings of the parameters § to the scaling of the data § Require all the features to vary on a similar scale
  • 233. Kernelized SupportVector Machines Example - § The features of the Breast Cancer dataset are of completely different orders of magnitude § Input – § Output -
  • 234. Kernelized SupportVector Machines Problem with SVM - § The features of the Breast Cancer dataset are of completely different orders of magnitude § This can have devastating effects for the kernel SVM § Solutions - § Preprocess the data for SVMs § Rescale each feature so that they are all approximately on the same scale § A common rescaling method for kernel SVMs is to scale the data such that all features are between 0 and 1
  • 235. Kernelized SupportVector Machines § MinMaxScaler preprocessing method § Input - (Training Dataset) § Output -
  • 236. Kernelized SupportVector Machines § Input - (Test Data Set) § Input - § Output - § Scaling the data made a huge difference § It led to underfitting - § where training and test set performance are quite similar
  • 237. Kernelized SupportVector Machines § We can try increasing either C or gamma to fit a more complex model § Input - § Output - § Increasing C allows us to improve the model significantly, resulting in 97.2% accuracy
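A sketch of the rescaling-plus-larger-C workflow described above, assuming MinMaxScaler and the scikit-learn breast cancer loader; the exact accuracies depend on the split and library version.

# Sketch: rescaling features to [0, 1] for the kernel SVM, then increasing C.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

scaler = MinMaxScaler().fit(X_train)       # fit the scaler on the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

for C in [1, 1000]:
    svc = SVC(C=C).fit(X_train_scaled, y_train)
    print("C={:<5} train: {:.3f}  test: {:.3f}".format(
        C, svc.score(X_train_scaled, y_train), svc.score(X_test_scaled, y_test)))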
  • 238. Strengths, Weaknesses, and Parameters Strengths - § Kernelized support vector machines are powerful models § Perform well on a variety of datasets § Allow for complex decision boundaries, even if the data has only a few features § Work well on low-dimensional and high- dimensional data (i.e., few and many features)
  • 239. Strengths, Weaknesses, and Parameters Weaknesses - § Don't scale very well with the number of samples § Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage § Require careful preprocessing of the data and tuning of the parameters § SVM models are hard to inspect - § It can be difficult to understand why a particular prediction was made § It is tricky to explain the model to a nonexpert
  • 240. Strengths, Weaknesses, and Parameters § Note - § Try SVMs particularly if all of your features represent measurements in similar units and they are on similar scales
  • 241. Strengths, Weaknesses, and Parameters Parameters § Regularization parameter C § Choice of the kernel (polynomial kernel or RBF kernel) § Kernel-specific parameters (e.g., gamma for the RBF kernel) § gamma and C both control the complexity of the model, with large values in either resulting in a more complex model
  • 243. Uncertainty Estimates from Classifiers Uncertainty Estimates from Classifiers § In scikit-learn - classifiers provide uncertainty estimates of predictions § We are not only interested in which class a classifier predicts for a certain test point, but also how certain it is that this is the right class § Different kinds of mistakes lead to very different outcomes in real-world applications § Testing for cancer § False positive prediction might lead to a patient undergoing additional tests § False negative prediction might lead to a serious disease not being treated
  • 244. Uncertainty Estimates from Classifiers § Two different functions used to obtain uncertainty estimates from classifiers: § decision_function § predict_proba § Most classifiers have at least one of them § Many classifiers have both
  • 245. Uncertainty Estimates from Classifiers § Example - GradientBoostingClassifier, which has both a decision_function and a predict_proba method:
  • 246. Uncertainty Estimates from Classifiers The Decision Function in Gradient Boosting § In Binary classification § Return value of decision_function is of shape (n_samples,), and it returns one floating-point number for each sample: § Input - § Output -
  • 247. Uncertainty Estimates from Classifiers § This value encodes how strongly the model believes a data point belongs to the "positive" class, in this case class 1 § Input - § Output - § Positive values indicate a preference for the positive class (class 1) § Negative values indicate a preference for the negative class (class 0)
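A sketch of decision_function for a binary problem; the two_moons dataset is an illustrative stand-in for the binary dataset used on the slides.

# Sketch: decision_function of a GradientBoostingClassifier on a binary problem.
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
scores = gbrt.decision_function(X_test)    # shape (n_samples,): one float per sample
print("decision_function shape:", scores.shape)
print("first few scores:", scores[:4])
print("positive score means class", gbrt.classes_[1], "is predicted:", scores[:4] > 0)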
  • 249. Uncertainty Estimates from Classifiers § Input – (the range of decision_function can be arbitrary and depends on the data and the model parameters) § Output - § Note 1 - § This arbitrary scaling makes the output of decision_function often hard to interpret
  • 251. Uncertainty Estimates from Classifiers Predicting Probabilities § The output of predict_proba is a probability for each class § Often more easily understood than the output of decision_function § It is always of shape (n_samples, 2) for binary classification: § Input - § Output -
  • 252. Uncertainty Estimates from Classifiers § The first entry in each row is the estimated probability of the first class, and the second entry is the estimated probability of the second class § Input - § Output -
  • 253. Uncertainty Estimates from Classifiers § Because the probabilities for the two classes sum to 1, exactly one of the classes will be above 50% certainty § That class is the one that is predicted § In the example above, the classifier is relatively certain for most points § How well the reported uncertainty actually reflects uncertainty in the data depends on the model and the parameters § Note 1 - § A model that is more overfitted tends to make more certain predictions, even if they might be wrong § A model with less complexity usually has more uncertainty in its predictions
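A matching sketch for predict_proba on the same kind of binary problem; each row holds the probabilities of the two classes and sums to 1.

# Sketch: predict_proba of a GradientBoostingClassifier on a binary problem.
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
proba = gbrt.predict_proba(X_test)          # shape (n_samples, 2)
print("predict_proba shape:", proba.shape)
print("first few probabilities:\n", proba[:4])
print("rows sum to 1:", proba[:4].sum(axis=1))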
  • 254. Uncertainty Estimates from Classifiers Calibrated model § A model is called calibrated if the reported uncertainty actually matches how correct it is — in a calibrated model, a prediction made with 70% certainty would be correct 70% of the time
  • 256. Uncertainty Estimates from Classifiers Uncertainty in Multiclass Classification § decision_function and predict_proba methods also work in the multiclass setting § In multiclass case, the shape of the decision_function is (n_samples, n_classes) § each column provides a “certainty score” for each class, where § large score means that a class is more likely § small score means the class is less likely § Example – § Iris dataset § Input -
  • 259. Uncertainty Estimates from Classifiers § Example - (predict_proba) § Has shape as (n_samples, n_classes) § Maximum probability value is the prediction value § The probabilities of the possible classes for each datapoint sum to 1 § Input – § Output -
  • 260. Uncertainty Estimates from Classifiers § Example - (predict_proba) § Input – § Output -
  • 261. Uncertainty Estimates from Classifiers § predict_proba and decision_function always have shape (n_samples, n_classes) - apart from decision_function in the binary case § In the binary case, decision_function only has one column, corresponding to the "positive" class, classes_[1]
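A sketch of the multiclass case on the Iris dataset, showing the (n_samples, n_classes) shapes and how taking the argmax of predict_proba (mapped through classes_) recovers the predictions.

# Sketch: multiclass uncertainty estimates on the Iris dataset.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0).fit(X_train, y_train)
print("decision_function shape:", gbrt.decision_function(X_test).shape)   # (n_samples, 3)
print("predict_proba shape:", gbrt.predict_proba(X_test).shape)           # (n_samples, 3)

# recover predictions by taking the class with the largest probability
argmax = np.argmax(gbrt.predict_proba(X_test), axis=1)
print("argmax matches predict:", np.all(gbrt.classes_[argmax] == gbrt.predict(X_test)))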
  • 262. Summary and Outlook Nearest neighbors § For small datasets § Good as a baseline § Easy to explain
  • 263. Summary and Outlook Linear models § Go-to as a first algorithm to try § Good for very large datasets § Good for very high-dimensional data
  • 264. Summary and Outlook Naive Bayes § Only for classification § Even faster than linear models § Good for very large datasets and high-dimensional data § Often less accurate than linear models
  • 265. Summary and Outlook Decision trees § Very fast § Don’t need scaling of the data § Can be visualized § Easily explained
  • 266. Summary and Outlook Random forests § Nearly always perform better than a single decision tree, very robust and powerful § Don’t need scaling of data § Not good for very high dimensional sparse data
  • 267. Summary and Outlook Gradient boosted decision trees § Often slightly more accurate than random forests § Slower to train but faster to predict than random forests § Smaller in memory § Need more parameter tuning than random forests
  • 268. Summary and Outlook Support vector machines § Powerful for medium-sized datasets of features with similar meaning § Require scaling of data § Sensitive to parameters
  • 269. Summary and Outlook Neural networks § Can build very complex models, particularly for large datasets § Sensitive to scaling of the data and to the choice of parameters § Large models need a long time to train