Machine
Learning
Source: Introduction to Machine Learning with Python
Authors: Andreas C. Müller and Sarah Guido
Unit – II
Supervised Learning
Agenda
Classification and
Regression
Generalization, Overfitting
and Underfitting
Relation of Model
Complexity to Dataset Size
k-Nearest Neighbors
Agenda
Linear Models
Linear Models for
Classification
Naïve Bayes Classifiers
Decision Trees
Agenda
Ensembles of Decision Trees
Kernelized Support Vector
Machines
Uncertainty Estimates from
Classifiers
Classification
and
Regression
Introduction to Machine
learning
Classification
Regression
Supervised
Learning
§ Supervised learning is used whenever we want to
predict a certain outcome from a given input
§ Goal is to make accurate predictions for new,
never-before-seen data
§ Supervised learning often requires human
effort to build the training set, but afterward
automates and often speeds up an otherwise
laborious or infeasible task
Classification
and
Regression
§ Two major types of supervised machine
learning problems –
§ Classification
§ Regression
Classification
and
Regression
§ Classification
§ Goal is to predict a class label, which is a choice
from a predefined list of possibilities
§ Classification is sometimes separated into
§ Binary classification -
§ Distinguishing between exactly two classes
§ Multiclass classification -
§ Classification between more than two classes
§ Example -
§ Binary Classification -
§ Classifying emails as either spam or not spam
§ Multiclass Classification -
§ Classifying irises into one of several species (the iris dataset)
Classification
and
Regression
Regression
§ Goal is to predict -
§ continuous number or a floating-point number in
programming terms
§ Example -
§ Person’s annual income
§ Predicting the yield of a corn farm
Generalization,
Overfitting and
Underfitting
Generalization
Overfitting
Underfitting
Generalization
Generalization
§ If a model is able to make accurate predictions on
unseen data, we say it is able to generalize from
the training set to the test set.
§ Always build a model that is able to generalize as
accurately as possible
Example -
§ Boat Buyers Prediction -
§ Goal is to send out promotional emails to people who are
likely to actually make a purchase but not bother those
customers who are not interested
Generalization
Example – Boat Buyers Prediction
§ If the customer is older than 45, and has less than 3
children or is not divorced, then they want to buy a boat
Generalization
§ Rule 1: Complex Rule (Complex Model)
§ If the customer is older than 45, and has less than 3 children or is not
divorced, then they want to buy a boat
§ We can make up many rules that work well on this data
§ Our goal is to find whether new customers are likely to buy a boat
§ We therefore want to find a rule that will work well for new
customers, and achieving 100 percent accuracy on the training
set does not help
§ The only measure of whether an algorithm will perform well
on new data is the evaluation on the test set
§ Note:
§ Simple models are expected to generalize better to new data
§ Example:
§ “Customers older than 50 want to buy a boat” (Simple rule/Simple
Model)
§ This simple rule does not involve the children and divorce features
§ So it is a simpler, more general model
Overfitting
Overfitting
§ Building a model that is too complex for the amount
of information we have is called overfitting
§ Overfitting occurs when you fit a model too closely
to the particularities of the training set and obtain a
model that works well on the training set but is
not able to generalize to new data
§ Example -
§ Rule 1 - If the customer is older than 45, and has less
than 3 children or is not divorced, then they want to buy
a boat
Underfitting
Underfitting
§ Rule 3 -
§ Everybody who owns a house buys a boat
§ Might not be able to capture all the aspects of and
variability in the data, and your model will do
badly even on the training set
§ If the model is too simple then it will lead to
underfitting
Tradeoff
between
Overfitting and
Underfitting
§ The more complex we allow our model to be, the better
we will be able to predict on the training data
§ But if we start focusing too much on each individual
data point in our training set, the model will not
generalize well to new data
§ Sweet Spot -
§ The model in between that will yield the best generalization performance
§ This is the model we want to find
Relation of
Model
Complexity to
Dataset Size
Intro to Supervised Machine
Learning Algorithms
Classification
Regression
Relation of
Model
Complexity to
Dataset Size
Relation of Model Complexity to Dataset Size
§ Model complexity is tied to the variation of
inputs contained in your training dataset
§ The larger variety of data points your dataset
contains, the more complex a model you can
use without overfitting
§ Collecting more data points will yield more
variety
§ So larger datasets allow building more
complex models
Relation of
Model
Complexity to
Dataset Size
Example –
Boat Purchase
§ Added 10,000 more rows of customer data
§ Rule 1 -
§ If the customer is older than 45, and has less than
3 children or is not divorced, then they want to
buy a boat
§ This rule is now much more reliable than when it was
developed using only the 12 rows
§ Note 1:
§ In the real world, we often have the ability to decide
how much data to collect
§ Collecting more data might be more beneficial than
tweaking and tuning your model
§ Note 2:
§ Never underestimate the power of more data
Supervised
Machine
Learning
Algorithms
Introduction to Supervised Machine
Learning Algorithms
§ Note:
§ Many of the machine learning algorithms have a classification and
regression variant
§ Data Sets -
§ Some datasets will be small and synthetic
§ Some datasets will be large (real-world examples)
§ Forge Dataset (Classification Example)
§ The forge dataset is a synthetic two-class classification dataset with two
features
§ Scatter plot
§ The plot has the first feature on the x-axis and the second feature on the y-
axis
§ Each data point is represented as one dot
§ The color and shape of the dot indicates its class
Supervised
Machine
Learning
Algorithms
Example –
§ Input
§ Output
Supervised
Machine
Learning
Algorithms
Supervised
Machine
Learning
Algorithms
§ Synthetic wave dataset (Regression
Example)
§ A single input feature and a continuous target
variable (or response)
§ Shows the single feature on x-axis and the
regression target (the output) on the y-axis
Supervised
Machine
Learning
Algorithms
Supervised
Machine
Learning
Algorithms
Note 1:
§ Any intuition derived from datasets with
few features (called low-dimensional
datasets) might not hold in datasets with
many features (called high-dimensional
datasets)
Supervised
Machine
Learning
Algorithms
Breast Cancer Example
§ Scikit-learn includes two real-world datasets
§ Wisconsin breast cancer dataset
§ Records clinical measurements of breast cancer
tumors
§ Labeled as “benign” (for harmless tumors)
§ “Malignant” (for cancerous tumors)
§ Task is to learn to predict whether a tumor
is malignant based on the measurements of
the tissue
Supervised
Machine
Learning
Algorithms
§ Input :
Output :
Supervised
Machine
Learning
Algorithms
Note:
§ Datasets included in scikit-learn are
usually stored as Bunch objects
§ which contain some information about the
dataset as well as the actual data
§ Bunch Objects is that they behave like
dictionaries
Supervised
Machine
Learning
Algorithms
§ The dataset consists of 569 data points, with 30
features each:
§ Input :
§ Output :
Supervised
Machine
Learning
Algorithms
§ Of these 569 data points, 212 are labeled as
malignant and 357 as benign:
§ Input :
§ Output :
Supervised
Machine
Learning
Algorithms
§ To get a description of the semantic meaning of
each feature, we can have a look at the
feature_names attribute:
§ Input :
§ Output :
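The input/output cells above are screenshots that did not survive the export. Below is a minimal sketch of the kind of exploration code involved, assuming scikit-learn's load_breast_cancer loader; the exact printed values on the slides are not reproduced.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the Wisconsin breast cancer dataset (returned as a Bunch object).
cancer = load_breast_cancer()

# 569 data points with 30 features each.
print("Shape of cancer data:", cancer.data.shape)

# Class counts: 212 malignant, 357 benign.
print("Sample counts per class:",
      {n: v for n, v in zip(cancer.target_names, np.bincount(cancer.target))})

# Semantic meaning of each feature.
print("Feature names:", cancer.feature_names)
```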
Supervised
Machine
Learning
Algorithms
Regression Example
§ Boston Housing dataset
§ The task associated with this dataset is to
predict the median value of homes in
several Boston neighborhoods in the 1970s
with information such as
§ Crime rate
§ Proximity to the Charles River
§ Highway accessibility
Supervised
Machine
Learning
Algorithms
§ The dataset contains 506 data points, described
by 13 features
§ Input -
§ Output –
Supervised
Machine
Learning
Algorithms
load_extended_boston function
§ The dataset contains 506 data points, described by
104 features
§ The 104 features are the 13 original features together with
the 91 possible combinations of two features within
those 13 (with replacement)
§ Input -
§ Output –
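The loading code is not shown in this export. A minimal sketch, assuming the mglearn helper package that accompanies the book is installed:

```python
import mglearn

# 13 original features expanded with pairwise products (with replacement): 13 + 91 = 104.
X, y = mglearn.datasets.load_extended_boston()
print("X.shape:", X.shape)  # expected: (506, 104)
```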
k-Nearest
Neighbors
k-Neighbors Classification
k-Neighbors Regression
k-Nearest
Neighbors
k-Nearest Neighbors
§ Simplest machine learning algorithm
§ Building the model consists only of storing the
training dataset
§ To make a prediction for a new data point, the
algorithm finds the closest data points in
the training dataset — its “nearest
neighbors.”
k-Nearest
Neighbors
Classification
k-Neighbors classification
§ In the simplest version, the k-NN algorithm only
considers exactly one nearest neighbor
§ i.e., Closest training data point to the point we want to
make a prediction for
§ Prediction is then simply the known output for this
training point
k-Nearest
Neighbors
Classification
§ Input –
§ Output -
k-Nearest
Neighbors
Classification
§ Added three new data points, shown as stars
§ Marked the closest point in the training set
§ The prediction of the one nearest-neighbor
algorithm is the label of that point (shown by the
color of the cross).
§ Instead of considering only the closest neighbor,
we can also consider an arbitrary number, k, of
neighbors
§ This is where the name of the k-nearest neighbors
algorithm comes from
k-Nearest
Neighbors
Classification
§ When considering more than one neighbor, we
use voting to assign a label
§ This means that for each test point, we count
how many neighbors belong to class 0 and
how many neighbors belong to class 1
§ We assign the class that is more frequent: the
majority class among the k-nearest
neighbors
k-Nearest
Neighbors
Classification
Three closest Neighbors
Input –
§ Output -
k-Nearest
Neighbors
Classification
§ Step 1 – Split the data into a training and a test set (train_test_split)
§ Step 2 – Instantiate the KNeighborsClassifier, e.g. with n_neighbors=3
§ Step 3 – Fit the classifier using the training set
k-Nearest
Neighbors
Classification
Step 4 -
§ To make predictions on the test data, we call the
predict method
§ Input –
§ Output -
§ Step 5 -
§ To evaluate how well our model generalizes, we can call the
score method
§ Input -
§ Output -
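The code screenshots for these steps are not reproduced. A minimal end-to-end sketch of the workflow, assuming the book's mglearn helper for the forge dataset:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import mglearn

# Step 1: split the synthetic forge data into a training and a test set.
X, y = mglearn.datasets.make_forge()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 2: instantiate the classifier with k=3 neighbors.
clf = KNeighborsClassifier(n_neighbors=3)

# Step 3: fit the classifier (for k-NN this just stores the training data).
clf.fit(X_train, y_train)

# Step 4: predict labels for the test data.
print("Test set predictions:", clf.predict(X_test))

# Step 5: evaluate generalization using mean accuracy.
print("Test set accuracy: {:.2f}".format(clf.score(X_test, y_test)))
```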
k-Nearest
Neighbors
Classification
Step 6 – (Analysis using visualization)
§ Visualization
§ Input -
§ Output -
k-Nearest
Neighbors
Classification
Example –
§ Breast Cancer (Real world Dataset)
k-Nearest
Neighbors
Classification
Example –
§ Breast Cancer (Real world Dataset)
k-Nearest
Neighbors
Regression
k-Neighbors Regression (Simple Example)
§ wave dataset
§ Added three test data points as green
stars on the x-axis
k-Nearest
Neighbors
Regression
§ Input - (Single Neighbour)
§ Output -
k-Nearest
Neighbors
Regressor
§ Input - (Three Neighbours)
§ Output - (Prediction is the average or mean of
the relevant neighbours)
k-Nearest
Neighbors
Regressor
§ K Neighbors Regressor
§ Example -
k-Nearest
Neighbors
Regressor
§ Evaluation -
§ Evaluate the model using the score method
§ For regressors, score returns the R² score
§ The R² score, also known as the coefficient of
determination, is a measure of the goodness of a prediction for a
regression model
§ Typically yields a score between 0 and 1
§ 1 corresponds to a perfect prediction
§ 0 corresponds to a constant model that just predicts the
mean of the training set targets (the score can even be negative for worse models)
§ Input -
§ Output –
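The regression code is omitted from this export. A minimal sketch on the wave dataset, assuming the mglearn helper package:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
import mglearn

# Synthetic one-feature wave dataset.
X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Predictions are the mean target of the 3 nearest training neighbors.
reg = KNeighborsRegressor(n_neighbors=3)
reg.fit(X_train, y_train)

print("Test set predictions:", reg.predict(X_test))
# score() returns R^2 for regressors (1.0 = perfect, 0.0 = predicting the mean).
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))
```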
k-Nearest
Neighbors
Regressor
Analyzing KNeighborsRegressor
k-Nearest
Neighbors
Regressor
Analyzing KNeighborsRegressor
k-Nearest
Neighbors
Regressor
§ Using only a single neighbor, each point in the
training set has an obvious influence on the
predictions, and the predicted values go through
all of the data points
§ More neighbors leads to smoother predictions,
but these do not fit the training data as well
k-Nearest
Neighbors
Classifier
Strengths, weaknesses, and parameters
§ Two important parameters to the KNeighbors
classifier
§ Number of neighbors -
§ Using a small number of neighbours like three or five
often works well
§ you should certainly adjust this parameter
§ How you measure distance between data points
§ Euclidean Distance is used which works well in many
settings
k-Nearest
Neighbors
Strengths
§ Very easy to understand and implement
§ Often gives reasonable performance without a lot of
adjustments
§ Good baseline method to try
§ Few hyperparameters
Weaknesses
§ Model is usually very fast, but when your training set is
very large (either in number of features or in number of
samples) prediction can be slow
§ Mandatory to preprocess the data
§ Performs poorly with datasets consisting of many zeros
(Sparse Datasets)
§ Lazy learning algorithm
§ Prone to overfitting
§ Prone to curse of dimensionality
Linear Models
Linear Regression (aka
ordinary least squares)
Ridge Regression
Lasso Regression
Linear Models
§ Introduction
§ Class of models that are widely used in practice
§ Studied extensively in the last few decades
§ With roots going back over a hundred years
§ Linear models make a prediction using a linear
function of the input features
§ Building block for many complex machine learning
algorithms, including deep neural networks
§ They assume a linear relationship between the features and
the target and try to learn the weight of each feature
Linear Models
§ Linear Models for Regression
§ The general prediction formula is
§ ŷ = w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b
§ x[0] to x[p] denotes the features of a single data
point
§ w and b are parameters of the model that are
learned
§ ŷ is the prediction the model makes
§ Single feature
§ ŷ = w[0]·x[0] + b
§ where w[0] is the slope and b is the y-axis offset
§ Note:
§ The predicted response is a weighted sum of the
input features, with weights (which can be negative)
given by the entries of w
Linear Models
One -dimensional wave dataset
§ Input -
§ Output -
Linear Models
§ Y-Intercept -
§ The intercept is slightly below zero, which you can also
confirm in the image
§ Linear models for regression can be
characterized as regression models for
which the prediction
§ is a line for a single feature
§ A plane when using two features
§ Hyperplane in higher dimensions
Linear Models
§ Note 1:
§ Using a straight line to make predictions is very restrictive
§ Note 2:
§ It is a strong assumption (somewhat unrealistic) that
our target y is a linear combination of the features
§ Note 3:
§ Linear models are very powerful with datasets having
many features
§ Note 4:
§ Many different models exist for regression
§ The differences between these models lie in
§ How the model parameters w and b are learned from the
training data
§ How the model complexity can be controlled
Linear
Regression
Linear regression (aka ordinary least
squares)
§ Linear regression -
§ also known as Ordinary Least Squares (OLS)
§ Simplest and most classic linear method for
regression
§ Linear regression finds the parameters w
and b that
§ Minimize the mean squared error between
predictions and the true regression targets, y,
on the training set
Linear
Regression
§ Mean Squared Error
§ The mean squared error is the sum of the
squared differences between the predictions
and the true values, divided by the number of
samples
§ Linear regression has no parameters to tune
§ Which is a benefit
§ But it also has no way to control model complexity
Linear
Regression
Example –
Linear
Regression
§ The “slope” parameters (w), also called weights or
coefficients
§ Stored in the coef_ attribute
§ Offset or intercept (b) is stored in the intercept_
attribute
Linear
Regression
Example -
§ Input -
§ Output -
§ The intercept_ attribute is always a single float
number, while the coef_ attribute is a NumPy array
with one entry per input feature
Linear
Regression
Training and Test Score (R2) -
§ Input -
§ Output -
Note -
§ An R² value of around 0.66 is not very good
§ Since the training and test scores are very close, we are
likely underfitting rather than overfitting
§ For this one-dimensional dataset there is little danger of
overfitting
§ On higher-dimensional datasets, linear models
become more powerful and the chance of
overfitting grows
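The fitting code for the wave example is not shown. A minimal sketch, assuming the mglearn helper for the dataset:

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.make_wave(n_samples=60)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

lr = LinearRegression().fit(X_train, y_train)

# The slope w[0] is stored in coef_, the intercept b in intercept_.
print("lr.coef_:", lr.coef_)
print("lr.intercept_:", lr.intercept_)

# Similar, mediocre R^2 on training and test data suggests underfitting, not overfitting.
print("Training set R^2: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set R^2: {:.2f}".format(lr.score(X_test, y_test)))
```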
Linear
Regression
Extended Boston Housing Dataset -
§ Consists of 506 samples and 104 derived
features
§ Input -
§ Output -
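A minimal sketch of linear regression on the extended Boston data, again assuming the mglearn loader; the gap between the two scores illustrates the overfitting discussed on the next slide.

```python
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lr = LinearRegression().fit(X_train, y_train)
# A much higher training score than test score signals overfitting.
print("Training set R^2: {:.2f}".format(lr.score(X_train, y_train)))
print("Test set R^2: {:.2f}".format(lr.score(X_test, y_test)))
```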
Linear
Regression
§ Note -
§ The discrepancy between
performance on the training set and
test set is a clear sign of overfitting
§ Solution -
§ Find a model that allows us to control
complexity
§ Regularized alternatives to plain linear
regression are Ridge Regression and Lasso
Regression
Ridge
Regression
Ridge regression
§ Ridge regression is also a linear model for
regression
§ The formula it uses to make predictions is the
same one used for ordinary least squares
§ Coefficients (w) are chosen not only so that
they predict well on the training data, but also
to fit an additional constraint
§ All entries of w should be close to zero
§ This means each feature should have as little effect
on the outcome as possible (which translates to
having a small slope), while still predicting well
§ This constraint is an example of what is called
regularization
Ridge
Regression
Regularization -
§ It is a process of explicitly restricting a
model to avoid overfitting
§ The kind of regularization used in Ridge
Regression is L2 Regularization
Ridge
Regression
§ Input -
§ Output -
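The Ridge code is not reproduced here. A minimal sketch with the default alpha=1.0, assuming the same extended Boston split as before (via mglearn):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2-regularized linear regression with the default alpha=1.0.
ridge = Ridge().fit(X_train, y_train)
print("Training set R^2: {:.2f}".format(ridge.score(X_train, y_train)))
print("Test set R^2: {:.2f}".format(ridge.score(X_test, y_test)))
```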
Ridge
Regression
§ Note 1-
§ Training set score of Ridge is lower than for
LinearRegression
§ Note 2-
§ Test set score of Ridge is higher than for
LinearRegression
§ Note 3-
§ Ridge is a more restricted model, so it is less likely to
overfit
§ Note 4-
§ A less complex model means worse
performance on the training set but better
generalization
§ Note 5-
§ We are only interested in generalization
performance
Ridge
Regression
§ Note 6-
§ The Ridge model makes a trade-off between the
simplicity of the model (near-zero coefficients) and its
performance on the training set
§ Note 7-
§ The importance the model places on simplicity versus
training set performance can be specified by the user
using the alpha parameter
§ The default value of the alpha parameter is 1.0
§ The optimum setting of alpha depends on the
particular dataset we are using
§ Increasing alpha forces the coefficients to
move closer toward zero
§ Note 8-
§ Moving coefficients toward zero may decrease
training set performance but might help
generalization
Ridge
Regression
§ Input -
§ Output -
Ridge
Regression
§ For very small values of alpha, coefficients are
barely restricted at all, and we end up with a model
that resembles LinearRegression
§ Input -
§ Output -
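A minimal sketch of how the alpha parameter might be varied, under the same assumptions (mglearn extended Boston data):

```python
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A larger alpha pushes coefficients closer to zero (stronger regularization).
ridge10 = Ridge(alpha=10).fit(X_train, y_train)
print("alpha=10  test R^2: {:.2f}".format(ridge10.score(X_test, y_test)))

# A very small alpha barely restricts the coefficients, approaching LinearRegression.
ridge01 = Ridge(alpha=0.1).fit(X_train, y_train)
print("alpha=0.1 test R^2: {:.2f}".format(ridge01.score(X_test, y_test)))
```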
Ridge
Regression
§ Input -
§ Output -
Ridge
Regression
§ Regularization
§ Another way to understand the influence of
regularization is to fix a value of alpha but vary the
amount of training data available
§ Input –
§ Output -
Ridge
Regression
§ Note -
§ As more and more data becomes available to
the model, both models improve
§ With enough training data, regularization
becomes less important
§ Given enough data, ridge and linear regression
will have the same performance
Lasso
Regression
Lasso
§ An alternative to Ridge for regularizing
linear regression is Lasso
§ Lasso also restricts coefficients to be
close to zero, but in a slightly different way,
called L1 regularization
§ When using the lasso, some
coefficients are exactly zero
Lasso
Regression
Advantages of Lasso
§ A form of automatic feature selection
§ Having some coefficients be exactly zero often
makes a model easier to interpret, and can
reveal the most important features of your
model
Lasso
Regression
Disadvantages of Lasso
§ Some features are entirely ignored by the
model
Lasso
Regression
Extended Boston Housing dataset
§ Input -
§ Output -
Lasso
Regression
§ Lasso does quite badly, both on the training set
and the test set
§ Indicates that we are underfitting
§ It used only 4 of the 104 features
§ Lasso also has a regularization parameter,
alpha, that controls how strongly coefficients are
pushed toward zero
§ When we decrease the value of alpha, the
maximum number of iterations to run needs to be
increased (max_iter)
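The Lasso code itself is not shown in this export. A minimal sketch covering the default setting and a smaller alpha, assuming the mglearn extended Boston data:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
import mglearn

X, y = mglearn.datasets.load_extended_boston()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Default alpha=1.0: strong L1 regularization, most coefficients become exactly zero.
lasso = Lasso().fit(X_train, y_train)
print("default    train/test R^2: {:.2f} / {:.2f}".format(
    lasso.score(X_train, y_train), lasso.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso.coef_ != 0))

# A lower alpha fits a more complex model; max_iter is raised so the solver converges.
lasso001 = Lasso(alpha=0.01, max_iter=100000).fit(X_train, y_train)
print("alpha=0.01 train/test R^2: {:.2f} / {:.2f}".format(
    lasso001.score(X_train, y_train), lasso001.score(X_test, y_test)))
print("Number of features used:", np.sum(lasso001.coef_ != 0))
```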
Lasso
Regression
§ Input -
§ Output -
Lasso
Regression
§ A lower alpha allowed us to fit a more complex
model
§ This makes this model potentially easier to
understand
§ If we set alpha too low, however, we again remove
the effect of regularization and end up overfitting
Lasso
Regression
§ Input -
§ Output -
Lasso
Regression
§ Input -
§ Output -
Lasso
Regression
§ Ridge regression is usually the first choice
between these two models
§ If you have a large number of features and
expect only a few of them to be important,
Lasso might be the better choice
§ If we would like to have a model that is easy to
interpret, lasso will provide a model that is easier
to understand as it will select only a subset of
the input features
Linear models
for
classification
Linear Models for
classification
Linear Models for multiclass
classification
Linear models
for classification
Linear models for classification
§ Linear models are also extensively used for
classification
§ Binary Classification -
§ The formula looks very similar to the one for
linear regression:
§ ŷ = w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b > 0
§ Instead of just returning the weighted sum of the
features, we threshold the predicted value at zero
§ If the function is smaller than zero, we predict the class –1
§ If it is larger than zero, we predict the class +1
Linear models
for classification
Linear models for classification
§ For linear models for regression, the output, ŷ, is a
linear function of the features:
§ a line
§ plane
§ hyperplane (in higher dimensions)
§ For linear models for classification separates two
classes using a
§ line
§ plane
§ hyperplane
§ There are many algorithms for learning linear
models; they differ in
§ The way in which they measure how well a particular
combination of coefficients and intercept fits the
training data
§ What kind of regularization, if any, they use
Linear models
for classification
§ The two most common linear classification
algorithms are
§ Logistic regression, implemented in
linear_model.LogisticRegression
§ Linear support vector machines (linear SVMs),
implemented in svm.LinearSVC (SVC stands for
support vector classifier)
Linear models
for classification
Example – Despite its name, logistic regression is a classification algorithm, not a regression algorithm
§ Input -
§ Output -
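The plotting code behind this example is not reproduced. A minimal sketch that fits both classifiers on the forge data, assuming the mglearn helper:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
import mglearn

X, y = mglearn.datasets.make_forge()

# Both models learn a straight decision boundary: points on one side
# are predicted as class 1, points on the other side as class 0.
logreg = LogisticRegression().fit(X, y)
svc = LinearSVC().fit(X, y)
print("LogisticRegression training accuracy: {:.2f}".format(logreg.score(X, y)))
print("LinearSVC training accuracy: {:.2f}".format(svc.score(X, y)))
```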
Linear models
for classification
§ Note 1 -
§ Both the models are depicted with straight lines
separating the areas classified by class 0 and class 1
§ Note 2 -
§ Any new data point that lies above the black line
will be classified as class 1 and point below the
black line will be classified as class 0
§ Note 3 -
§ The two models Linear SVC and Logistic Regression
both come up with similar decision boundaries
§ Note 4 -
§ By default both models apply an L2 Regularization
Linear models
for classification
§ Trade-off parameter (“C”)
§ For LogisticRegression and LinearSVC the trade-off
parameter that determines the strength of the
regularization is called C
§ For a high value of the parameter C,
LogisticRegression and LinearSVC try to fit the
training set as well as possible
§ A higher value of C stresses the importance that each
individual data point be classified correctly
§ For low values of the parameter C, the models put
more emphasis on finding a coefficient vector (w)
that is close to zero
§ Using low values of C will cause the algorithms to try
to adjust to the “majority” of data points
Linear models for
classification
Example – Decision boundaries of Linear SVM
for different values of C
Input –
Output -
Linear models
for classification
§ Left Graph -
§ Very small C - corresponds to a lot of regularization
§ Most of the points in class 0 are at the bottom, and most of the points in class 1
are at the top
§ The strongly regularized model chooses a relatively horizontal line,
misclassifying two points
§ Center Graph -
§ Value of C is slightly higher
§ Model focuses more on the two misclassified samples, tilting the decision
boundary
§ Right Graph -
§ Very high value of C in the model tilts the decision boundary a lot
§ Now correctly classifying all points in class 0
§ One of the points in class 1 is still misclassified, as it is not possible to
correctly classify all points in this dataset using a straight line.
§ The model illustrated on the righthand side tries hard to correctly classify all
points, but might not capture the overall layout of the classes well.
§ In other words, this model is likely overfitting.
§ Similarly to the case of regression, linear models for classification might seem very
restrictive in low-dimensional spaces, only allowing for decision boundaries that are
straight lines or planes
Linear models
for classification
Example – Breast Cancer
§ Input –
§ Output -
§ The default value of C=1
§ Good training and test accuracy
§ Training and Test accuracy are very close - Likely to underfit
Linear models
for classification
Example –
§ Input –
§ Output -
Linear models
for classification
Example –
§ Input –
§ Output -
Underfit
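The breast cancer code comparing values of C is not reproduced. A minimal sketch of such a comparison; max_iter is raised here only so the default solver converges on this data, which is an assumption not shown on the slides.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Small C = strong regularization (simpler model), large C = weak regularization.
for C in [0.01, 1, 100]:
    logreg = LogisticRegression(C=C, max_iter=10000).fit(X_train, y_train)
    print("C={:<5} train acc: {:.3f}  test acc: {:.3f}".format(
        C, logreg.score(X_train, y_train), logreg.score(X_test, y_test)))
```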
Linear models
for classification
§ As LogisticRegression applies an L2 regularization by
default, the result looks similar to that produced by Ridge
§ Stronger regularization pushes coefficients more and
more toward zero, though coefficients never become
exactly zero
§ For a more interpretable model, using L1 regularization
might help, as it limits the model to using only a few
features
Linear models for
classification
§ Coefficients learned by the models with the three
different settings of parameter C
Linear models
for classification
Linear models for
classification
§ Input – (Lasso)
§ Output -
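A minimal sketch of L1-regularized logistic regression on the breast cancer data; passing penalty="l1" together with solver="liblinear" is how this is done in current scikit-learn versions, which is an assumption about the omitted code.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# L1 regularization drives many coefficients to exactly zero,
# so only a few features are used by the model.
for C in [0.001, 1, 100]:
    lr_l1 = LogisticRegression(C=C, penalty="l1", solver="liblinear").fit(X_train, y_train)
    print("C={:<6} train acc: {:.2f}  test acc: {:.2f}  nonzero coefs: {}".format(
        C, lr_l1.score(X_train, y_train), lr_l1.score(X_test, y_test),
        (lr_l1.coef_ != 0).sum()))
```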
Linear models for
classification
Linear models
for Multiclass
Classification
Linear models for multiclass classification
§ Many linear classification models are for binary
classification only and don't extend naturally to the
multiclass case
§ But Logistic Regression is an exception
§ The technique used to extend a binary classification
algorithm to a multiclass classification algorithm is
the one-vs-rest approach
§ A binary model is learned for each class that tries to
separate that class from all of the other classes,
resulting in as many binary models as there are
classes
§ To make a prediction, all binary classifiers are run on
a test point
§ The classifier that has the highest score on its single
class “wins,” and this class label is returned as the
prediction
Linear models
for Multiclass
Classification
Linear models for multiclass classification
§ Having one binary classifier per class results in
having one vector of coefficients (w) and one
intercept (b) for each class
§ The class for which the result of the classification
confidence formula given here is highest is assigned as the
predicted class label
§ The mathematics behind multiclass logistic
regression differ somewhat from the one-vs-rest approach
§ but they also result in one coefficient vector and
intercept per class
§ the same method of making a prediction is applied
§ Classification confidence formula
§ w[0]·x[0] + w[1]·x[1] + … + w[p]·x[p] + b
Linear models
for Multiclass
Classification
Example – one vs rest
§ Input –
§ Output -
Linear models
for Multiclass
Classification
Example –
§ Input –
§ Output -
§ The shape of coef_ is (3, 2)
§ each row contains the coefficient vector for one of the three classes
§ each column holds the coefficient value for a specific
feature
§ The intercept_ is a one-dimensional array with one intercept per class
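The one-vs-rest code is not shown. A minimal sketch on a three-class blob dataset; the specific random_state is an assumption:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

# Three-class toy dataset; LinearSVC fits one binary classifier per class (one-vs-rest).
X, y = make_blobs(random_state=42)
linear_svm = LinearSVC().fit(X, y)

# One row of coefficients and one intercept per class.
print("Coefficient shape:", linear_svm.coef_.shape)      # expected (3, 2)
print("Intercept shape:", linear_svm.intercept_.shape)   # expected (3,)
```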
Linear models
for Multiclass
Classification
§ Input –
§ Output -
Linear models
for Multiclass
Classification
Input –
Output -
Strengths,
Weaknesses,
Parameters
§ First Decision - Regularization Parameters (alpha and C)
§ The main parameter of linear models is the regularization
parameter
§ alpha in the regression models
§ C in the classification models (LinearSVC and logistic
regression)
§ Large values for alpha or small values for C mean simple
models
§ For regression models, tuning these parameters is quite
important
§ Second Decision - Regularization Technique (L1 or L2)
§ Deciding which regularization to use is also important
§ L1 regularization
§ L2 regularization
§ L1 regularization
§ L2 regularization
§ When only a few of your features are actually important, you
should use L1
§ L1 can also be useful if interpretability of the model is important
§ As L1 will use only a few features, it is easier to explain
which features are important to the model
Strengths,
Weaknesses,
Parameters
§Strengths
§ Linear models are very fast to train and also very fast to
predict
§ They scale to very large datasets
§ Works well with sparse data
§ Linear models make it relatively easy to understand how a
prediction is made, using the formulas we saw earlier for
regression and classification
Strengths,
Weaknesses,
Parameters
§ Weaknesses -
§ It is often not entirely clear why coefficients are
the way they are in linear models
§ If the dataset has highly correlated features,
the coefficients can be hard to interpret
§ Note: linear models often perform well when the
number of features is large compared to the
number of samples, and they are
often used on very large datasets
Naive Bayes
Classifiers
Introduction
Types
Naive Bayes
Classifiers
Advantages
Disadvantages
Naive Bayes
Classifiers
Naïve Bayes Classifiers
§ A family of classifiers that are quite similar to the
linear models
§ Advantages
§ They tend to be even faster in training
§ Disadvantages
§ Generalization performance that is slightly
worse than that of linear classifiers (i.e.,
LogisticRegression and LinearSVC)
Naive Bayes
Classifiers
Naïve Bayes Classifiers
§ It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object
§ mainly used in text classification that includes a high-
dimensional training dataset
§ The Naïve Bayes algorithm is comprised of two words
Naïve and Bayes
§ Naïve:
§ It is called Naïve because it assumes that the occurrence of a
certain feature is independent of the occurrence of other
features
§ Naive Bayes assumes that each parameter, also called
features or predictors, has an independent capacity of
predicting the output variable
§ Example -
§ An apple is identified by its shape (round), color (red), and taste (sweet) -
each feature contributes independently to the model
§ Bayes:
§ It is called Bayes because it depends on the principle of
Bayes' Theorem (also called Bayes' Rule or Bayes' Law)
NaiveBayes
Classifiers
Naïve Bayes Theorem
P(A|B) = P(B|A) · P(A) / P(B)
Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given
the observed event B.
P(B|A) is the Likelihood: the probability of the evidence B given
that hypothesis A is true.
P(A) is the Prior probability: the probability of the hypothesis before
observing the evidence.
P(B) is the Marginal probability: the probability of the evidence.
Naive Bayes
Classifiers
Steps to solve Naïve Bayes
§ Convert the given dataset into frequency tables.
§ Generate Likelihood table by finding the
probabilities of given features.
§ Now, use Bayes theorem to calculate the posterior
probability
NaiveBayes
Classifiers
§ Initial Dataset
NaiveBayes
Classifiers
§ Frequency Table
NaiveBayes
Classifiers
§ Likelihood Table
NaiveBayes
Classifiers
§ Problem
§ If the weather is sunny, should the player
play or not?
§ Solution
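The frequency and likelihood tables on the preceding slides are images that did not survive the export, so the counts below are assumed, illustrative values from a typical 14-day "play" table; only the use of Bayes' theorem itself is the point.

```python
# Assumed illustrative counts (not the slide's actual table):
p_sunny_given_yes = 3 / 9    # P(Sunny | Play=Yes)
p_yes = 9 / 14               # P(Play=Yes)
p_sunny = 5 / 14             # P(Sunny)

# Bayes' theorem: P(Yes | Sunny) = P(Sunny | Yes) * P(Yes) / P(Sunny)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print("P(Play=Yes | Sunny) = {:.2f}".format(p_yes_given_sunny))  # 0.60 for these counts
```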
Naive Bayes
Classifiers
§ The reason Naive Bayes models are so efficient is that they
learn parameters by looking at each feature
individually and collect simple per-class
statistics from each feature
Naive Bayes
Classifiers
§ Types of Naive Bayes Classifiers
§ GaussianNB
§ BernoulliNB
§ MultinomialNB
GaussianNB
§ GaussianNB
§ GaussianNB can be applied to any continuous data
§ GaussianNB stores the average value as well as the
standard deviation of each feature for each class
§ If predictors take continuous values instead of
discrete, then the model assumes that these values are
sampled from the Gaussian distribution
§ Gaussian Naive Bayes is a machine learning
classification technique based on a probabilistic
approach that assumes each class follows a normal
distribution
§ The combination of the prediction for all parameters is
the final prediction that returns a probability of the
dependent variable to be classified in each group
§ The final classification is assigned to the group with
the higher probability
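The slides do not include a GaussianNB code example; a minimal illustrative sketch on continuous data (the breast cancer dataset is chosen here as an assumption):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# GaussianNB stores the per-class mean and standard deviation of each
# feature and predicts the class with the highest posterior probability.
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)
print("Test set accuracy: {:.2f}".format(gnb.score(X_test, y_test)))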
GaussianNB
§ The Gaussian model assumes that features follow a
normal distribution
§ Normal Distribution -
§ Describes the distributions of continuous random variables in
nature and is defined by its bell-shaped curve
§ A normal distribution has a probability distribution that is
centered around the mean
§ This means that the distribution has more data around the
mean
§ The data distribution decreases as you move away from the
center
§ The resulting curve is symmetrical about the mean and
forms a bell-shaped distribution
BernoulliNB
§ BernoulliNB
§ BernoulliNB assumes binary data
§ Used for discrete probability calculation
§ The predictor variables are the independent
Boolean variables
§ Mostly used in text data classification/Document
classification
§ Counts how often every feature of each class is
not zero
BernoulliNB
§ Example –
§ Four data points
§ Four binary features for each data point
§ Two classes 0 and 1
§ For class 0 (the first and third data points), the first
feature is zero two times and nonzero zero times, the
second feature is zero one time and nonzero one
time, and so on
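The toy data matrix for this example is an image that is missing from the export; the sketch below uses assumed binary data consistent with the description above and shows the per-class nonzero counting that BernoulliNB performs.

```python
import numpy as np

# Assumed toy data: four points, four binary features, two classes (0 and 1).
X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])

# BernoulliNB-style statistic: how often each feature is nonzero per class.
counts = {}
for label in np.unique(y):
    counts[label] = X[y == label].sum(axis=0)
print("Feature counts per class:", counts)
# class 0 -> [0 1 0 2], class 1 -> [2 0 2 1]
```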
MultinomialNB
§ MultinomialNB
§ MultinomialNB assumes count data
§ Example - that each feature represents an integer
count of something, like how often a word appears in
a sentence
§ Mostly used in text data classification as
BernouliNB
§ MultinomialNB takes into account the average
value of each feature for each class
Naive Bayes
Classifiers
§ Note 1 -
§ To make a prediction a data point is compared to
the statistics for each of the classes and the best
matching class is predicted.
§ Note 2 -
§ The prediction formula for MultinomialNB and
BernoulliNB has the same form as in the linear models
§ Note 3 -
§ coef_ for the naive Bayes models has a different
meaning than in the linear models
Strengths,
Weaknesses,
and Parameters
§ Parameters -
§ MultinomialNB and BernoulliNB have a
single parameter alpha which controls
model complexity
§ A large alpha means more smoothing,
resulting in a less complex model
§ Note -
§ The algorithm's performance is relatively
robust to the value of alpha
§ Setting alpha is NOT critical for
good performance, but tuning it
usually improves accuracy
somewhat
Strengths,
Weaknesses,
and Parameters
§ GaussianNB is mostly used on very high-
dimensional data
§ BernoulliNB and MultinomialNB are widely used
for sparse count data such as text
§ MultinomialNB usually performs better than
BernoulliNB, particularly on datasets with a relatively
large number of nonzero features
Strengths,
Weaknesses,
and Parameters
§ Advantages -
§ Very fast to train and to predict
§ Easiest algorithm
§ Training procedure is easy to understand
§ Models work very well with high dimensional
sparse data and are relatively robust to the
parameters
§ Great baseline models
§ Often used on very large datasets
§ Works well for both binary classification and
multiclass classification problems
§ Best model for Text Classification Problems
Strengths,
Weaknesses,
and Parameters
§ Weakness-
§ Naive Bayes assumes that all features are independent
or unrelated
§ so it cannot learn the relationship between features
Decision
Trees
Building Decision trees
Controlling complexity of
Decision trees
Feature importance in trees
DecisionTrees
§ Widely used models for both classification and
regression
§ They learn a hierarchy of if/else questions,
leading to a decision
Example –
§ Distinguish between the following four animals
§ Bears
§ Hawks
§ Hen
§ Dolphins
DecisionTrees
Example -
§ Input –
§ Output -
DecisionTrees
§ Each node in the tree either
§ represents a question, or
§ is a terminal node (also called a leaf) that contains the
answer
§ In ML we build a model to distinguish between
four classes of animals using the three features
“has feathers,”“can fly,” and “has fins.”
DecisionTrees
§ Building decision trees
§ Example - two_moons dataset
§ The dataset consists of two half-moon shapes, with each class
consisting of 75 data points
§ Learning a decision tree means learning the sequence of
if/else questions that gets us to the true answer most
quickly
§ In machine learning, these if/else questions are called
tests
§ Question format in case of continuous data -
§ In real life data does not come in the form of binary yes/no
features as in the animal example
§ Data can be continuous in real life situations
§ The tests that are used on continuous data are of the form “Is
feature i larger than value a?”
DecisionTrees
§ Splitting the dataset horizontally at x[1]=0.0596 yields
the most information; it best separates the points in
class 0 from the points in class 1
§ The top node, also called the root, represents the
whole dataset, consisting of 50 points belonging to class
0 and 50 points belonging to class 1
DecisionTrees
DecisionTrees
§ The split is done by testing whether x[1] <= 0.0596
(test), indicated by a black line
§ If test is True -
§ Assigned to the left node, which contains 2 points
belonging to class 0 and 32 points belonging to class 1
§ If test is False -
§ Assigned to the right node, which contains 48 points
belonging to class 0 and 18 points belonging to class 1
§ Though the first split did a good job of separating the
two classes, the bottom region still contains points
belonging to class 0, and the top region still contains
points belonging to class 1
§ Figure 2-25 shows that the most informative next split
for the left and the right region is based on x[0]
DecisionTrees
§ This recursive process yields a binary tree of decisions, with
each node containing a test
§ Each test splits the part of the data that is currently being
considered along one axis
§ This yields a view of the algorithm as building a hierarchical
partition
§ Each test concerns only a single feature
§ which results in partitions into regions that are always parallel
to the axes
§ The recursive partitioning of the data is repeated until each region
in the partition (each leaf in the decision tree) only contains a
single target value (a single class or a single regression value)
§ Pure Leaves-
§ The leaf of the tree that contains data points that all share the same
target value is called PURE
DecisionTrees § The figure above shows the final partition
§ A prediction on a new data point is made by checking
which region of the partition of the feature space the
point lies in, and then predicting the majority target in
that region
§ It is also possible to use trees for regression
tasks
§ Where the output for this data point is the mean
target of the training points in this leaf
DecisionTrees
Controlling complexity of decision trees
§ Drawback -
§ Building a tree until all leaves are PURE leads to
models that are very complex and highly overfit
to the training data
§ The overfitting can be seen on the left of Figure 2-26
§ We can see a small strip predicted as class 0 around
the point belonging to class 1
DecisionTrees
Controlling complexity of decision trees
§ Common strategies to prevent overfitting
§ pre-pruning -
§ Stopping the creation of the tree early (also called
pre-pruning)
§ Possible criteria for pre-pruning
§ Limiting the maximum depth of the tree
§ Limiting the maximum number of leaves
§ Requiring a minimum number of data points in a node
to keep splitting it
§ post-pruning -
§ Building the tree but then removing or collapsing
nodes that contain little information
§ Also called as pruning
DecisionTrees
§ Decision trees in scikit-learn are implemented in
the
§ DecisionTreeRegressor
§ DecisionTreeClassifier
§ Scikit-learn only implements pre-pruning but
NOT post-pruning
DecisionTrees
Breast Cancer dataset
§ Import the dataset and split it into a training and a test part.
§ Then we build a model using the default setting of fully
developing the tree (growing the tree until all leaves are pure).
§ We fix the random_state in the tree, which is used for tie-
breaking internally
§ Input -
§ Output -
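The code screenshot is not reproduced; a minimal sketch of the fully grown (unpruned) tree described above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Fully grown tree (all leaves pure); random_state only affects internal tie-breaking.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("Training accuracy: {:.3f}".format(tree.score(X_train, y_train)))  # 1.000
print("Test accuracy: {:.3f}".format(tree.score(X_test, y_test)))
```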
DecisionTrees
§ The accuracy on the training set is 100% —
because the leaves are pure
§ The tree was grown deep enough that it could
perfectly memorize all the labels on the
training data
§ The test set accuracy is slightly worse than for the
linear models
§ Limiting the depth of the tree decreases
overfitting
DecisionTrees
§ Limiting the depth of the tree decreases
overfitting
§ If we don’t restrict the depth of a decision tree, the
tree can become arbitrarily deep and complex
§ Unpruned trees are therefore prone to overfitting
and not generalizing well to new data
§ Prepruning to the tree -
§ will stop developing the tree before we perfectly fit
to the training data
§ One option is to stop building the tree after a certain depth
has been reached
§ Set max_depth=4 - meaning only four consecutive
questions can be asked
§ This will lead to lower training accuracy and
improve test accuracy
DecisionTrees
§ Input -
§ Output -
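A minimal sketch of the pre-pruned variant, under the same assumptions about the split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

# Pre-pruning: at most four consecutive questions (splits) along any path.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
print("Training accuracy: {:.3f}".format(tree.score(X_train, y_train)))
print("Test accuracy: {:.3f}".format(tree.score(X_test, y_test)))
```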
Analyzing
DecisionTrees
§ Input -
§ Output -
Analyzing
DecisionTrees
§ The visualization provides a good description of how the
decision tree algorithm makes its predictions and can
be easily explained to nonexperts
§ With a tree of depth four, as seen here, the tree can
become a bit overwhelming.
§ Deeper trees are even harder to grasp
§ One method of inspecting the tree that may be helpful
is to find out which path most of the data actually
takes
Feature
importance in
DecisionTrees
Feature importance in trees
§ Instead of looking at the whole tree, some useful
properties can be used to summarize the tree
§ The most commonly used summary is feature
importance
§ it rates how important each feature is for the
decision a tree makes
§ It is a number between 0 and 1 for each feature,
§ 0 means “not used at all”
§ 1 means “perfectly predicts the target.”
DecisionTrees
§ Input -
§ Output -
DecisionTrees
§ Input-
§ Output-
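The feature-importance code and bar chart are not reproduced; a minimal sketch, assuming the depth-4 breast cancer tree from the earlier sketches:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)

# One importance value in [0, 1] per feature; the values sum to 1.
importances = tree.feature_importances_
print("Feature importances:", importances)

# Horizontal bar chart, one bar per feature.
plt.barh(np.arange(len(importances)), importances)
plt.yticks(np.arange(len(importances)), cancer.feature_names)
plt.xlabel("Feature importance")
plt.show()
```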
DecisionTrees
§ Worst radius is by far the most important feature
§ Note 1 -
§ If a feature has a low value in feature_importance_, it
doesn’t mean that this feature is uninformative
§ It only means that the feature was not picked by the tree,
likely because another feature encodes the same
information
§ Note 2 -
§ Feature importances are always positive
§ Note 3 -
§ The feature importances tell us that “worst radius” is
important, but not whether a high radius is
indicative of a sample being benign or
malignant
DecisionTrees
Regressor
§ Decision trees for regression, as implemented in
DecisionTreeRegressor
§ The usage and analysis of regression trees is very
similar to that of classification trees
§ The DecisionTreeRegressor is not able to
extrapolate -
§ make predictions outside of the range of the
training data
DecisionTrees
§ Input-
§ Output-
DecisionTrees
Regressor
§ Compare two simple models -
§ Decision Tree Regressor
§ Linear Regression
§ Rescale the prices using a logarithm
§ This doesn’t make a difference for the Decision
Tree Regressor, but it makes a big difference
for Linear Regression
§ After training the models and making predictions,
we apply the exponential map to undo the
logarithm transform
DecisionTrees
Regressor
DecisionTrees
Regressor
DecisionTrees
Regressor
§ The linear model approximates the data with a line and provides
quite a good forecast for the test data
DecisionTrees
Regressor
§ The tree model, on the other hand, makes perfect
predictions on the training data
§ We did not restrict the complexity of the tree, so it
learned the whole dataset by heart
§ Once we leave the data range for which the model
has data, the model simply keeps predicting the
last known point
§ The tree has no ability to generate “new”
responses, outside of what was seen in the training
data
§ This shortcoming applies to all models based
on trees
DecisionTrees
Regressor
(Strengths,
Weaknessesand
Parameters)
§ Parameters -
§ The parameters that control model complexity in
decision trees are the pre-pruning parameters
that stop the building of the tree before it is fully
developed
§ max_depth
§ max_leaf_nodes
§ min_samples_leaf
§ Choosing any one of these pre-pruning strategies is
sufficient to prevent overfitting
DecisionTrees
Regressor
(Strengths,
Weaknessesand
Parameters)
§ Strengths -
§ The resulting model can easily be visualized and
understood by nonexperts (at least for smaller trees)
§ Algorithms are completely invariant to scaling of the data
§ Each feature is processed separately
§ The splits of the data don't depend on scaling
§ NO preprocessing like normalization or standardization of
features is needed for decision tree algorithms
§ Decision trees work well when you have features that are
§ on completely different scales
§ a mix of binary and continuous features
DecisionTrees
Regressor
(Strengths,
Weaknessesand
Parameters)
§ Weaknesses -
§ Without the use of pre-pruning, they tend
to overfit and provide poor
generalization performance
Ensembles of
Decision
Trees
Random forests
Gradient boosted regression
trees
Ensembles of
DecisionTrees
Ensembles of Decision Trees
§ What are ensembles?
§ Ensembles are methods that combine multiple machine
learning models to create more powerful models
§ Two ensemble models that have proven to be effective on
a wide range of datasets for classification and regression
§ Random forests
§ Gradient boosted decision trees
§ Both use decision trees as their building blocks
RandomForests
Random Forest
§ Main drawback of decision trees is that they tend to
overfit the training data
§ Random forests are one way to address this problem
§ What?
§ A random forest is essentially a collection of decision
trees, where each tree is slightly different from the
others
§ Idea behind Random Forests -
§ Each tree might do a relatively good job of predicting,
but will likely overfit on part of the data
§ If we build many trees, all of which work well and
overfit in different ways
§ We can reduce the amount of overfitting by averaging
their results
RandomForest
Random Forest
§ Need to build many decision trees
§ Each tree should do an acceptable job of predicting
the target, and should also be different from the
other trees
§ Why Random Forest ?
§ Random forests get their name from injecting
randomness into the tree building to ensure each
tree is different
§ Two ways of randomizing
§ By selecting the data points used to build a tree
§ By selecting the features in each split test
RandomForest
Randomness in RandomForest is decided by
§ Bootstrap sample
§ Selection of features (max_features)
RandomForest
Building Random forests
§ Step 1 -
§ You need to decide on the number of trees to build
(n_estimators parameter)
§ Note -
§ Trees will be built completely independently from
each other
§ Algorithm will make different random choices for
each tree to make sure the trees are distinct
RandomForest
Bootstrap sample
§ To build a tree first we need to take a bootstrap
sample
§ How?
§ From our n_samples data points, we repeatedly draw a
sample randomly with replacement n_samples times
§ Replacement meaning the same sample can be picked
multiple times
§ Example on Boot Strap Sample -
§ Creating a bootstrap sample of the list ['a', 'b', 'c', 'd']
§ A possible bootstrap sample would be ['b', 'd', 'd', 'c']
§ Another possible sample would be ['d', 'a', 'd', 'a']
§ This will create a dataset that is as big as the original
dataset, but some data points will be missing from
it, and some will be repeated (see the sketch after this list)
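A minimal sketch of drawing a bootstrap sample with NumPy; the seed and the example list are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
data = ['a', 'b', 'c', 'd']

# Sampling n_samples times *with replacement*: the result is as big as the
# original list, but some items may repeat and others may be missing.
bootstrap_sample = rng.choice(data, size=len(data), replace=True)
print(bootstrap_sample)
```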
RandomForest
§ Step 2 -
§ A decision tree is built based on this newly created
dataset
§ Instead of looking for the best test for each node, in each node
the algorithm randomly selects a subset of the features, and
it looks for the best possible test involving one of these
features
§ The number of features that are selected is controlled by the
max_features parameter.
§ This selection of a subset of features is repeated separately
in each node, so that each node in a tree can make a decision
using a different subset of the features
§ The bootstrap sampling leads to each decision tree in the
random forest being built on a slightly different dataset
§ Because of the selection of features in each node, each split in
each tree operates on a different subset of features
RandomForest
§ A critical parameter in this process is max_features
§ max_features = n_features means
§ that each split can look at all features in the dataset
§ NO randomness will be injected in the feature selection
§ max_features =1, means
§ that the splits have no choice at all on which feature to
test
§ max_features = HIGH means
§ that the trees in the random forest will be quite similar
§ they will be able to fit the data easily, using the most
distinctive features
§ max_features = LOW means
§ that the trees in the random forest will be quite
different
RandomForest
§Prediction
§ The random forest algorithm predicts by first making
a prediction for every tree in the forest
§ For regression -
§ Average - we can average these results of all the
decision trees to get our final prediction
§ For classification -
§ Soft voting -
§ Each Decision Tree makes a “soft” prediction,
providing a probability for each possible
output label
§ The probabilities predicted by all the trees are
averaged, and the class with the highest
probability is predicted
RandomForest
Analyzing random forests
§ Input –
§ The trees that are built as part of the random forest
are stored in the estimators_ attribute
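The analysis code is not shown; a minimal sketch of a five-tree forest on a two-moons dataset (the specific n_samples, noise, and random states are assumptions):

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Two-moons toy data; a forest of five randomized trees.
X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(X_train, y_train)

# The individual fitted trees live in the estimators_ attribute.
print("Number of trees:", len(forest.estimators_))
print("Test accuracy: {:.2f}".format(forest.score(X_test, y_test)))
```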
RandomForest
§ Input –
§ Decision boundaries learned by the five trees are quite
different
§ some of the training points that are plotted here were not
actually included in the training sets of the trees, due to
the bootstrap sampling
§ Note -
§ The random forest overfits less than any of the trees
individually
RandomForest
§ In any real application, we would use many more
trees (often hundreds or thousands), leading to
even smoother boundaries
RandomForest
§ Random forest consisting of 100 trees
§ Input –
§ Output -
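A minimal sketch of the 100-tree forest; the breast cancer dataset is assumed here as the real-world example:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("Training accuracy: {:.3f}".format(forest.score(X_train, y_train)))
print("Test accuracy: {:.3f}".format(forest.score(X_test, y_test)))
```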
RandomForest
RandomForest
(Strengths,
Weaknesses,and
Parameters)
Strengths -
§ They are very powerful
§ Works well without heavy tuning of the
parameters
§ Don’t require scaling of the data
RandomForest
(Strengths,
Weaknesses,and
Parameters)
§ Why is a single decision tree sometimes still used instead of a
random forest?
§ A single decision tree offers a compact, easily explained
representation of the decision-making process
RandomForest
(Strengths,
Weaknesses,and
Parameters)
Weaknesses -
§ It is basically impossible to interpret tens
or hundreds of trees in detail
§ Random forests tend to be deeper than
decision trees (because of the use of feature
subsets)
§ Building random forests on large datasets
might be somewhat time consuming
RandomForest
§Multi-Core Processing -
§ To increase the speed of building random
forests on large datasets
§ Use the n_jobs parameter to adjust the number
of cores to use
§ Using more CPU cores will result in linear
speedups
§ n_jobs=-1 to use all the cores in your computer
RandomForest
(Strengths,
Weaknesses,and
Parameters)
§ Parameters -
§ The important parameters to adjust are
§ n_estimators
§ max_features
§ Possibly pre-pruning options like max_depth
§ Note 1 -
§ For n_estimators, larger is always better
§ Thumb rule is to build as many as you have
time/memory for
§ Note 2 -
§ max_features - determines how random each
tree is
§ Smaller max_features reduces overfitting
§ Thumb rule is
§ max_features = sqrt(n_features) for classification
§ max_features = n_features for regression
RandomForest
(Strengths,
Weaknesses,and
Parameters)
§ Note 1 -
§ The more trees there are in the forest, the more robust it will be against the
choice of random state
§ Note 2 -
§ Random forests don’t tend to perform well on very high dimensional,
sparse data, such as text data
§ Linear models are best choice for very high dimensional and sparse data
§ Note 3 -
§ Random forests usually work well even on very large datasets
§ Note 4 -
§ Training can easily be parallelized over many CPU cores within a
powerful computer
§ Note 5 -
§ Random Forests are slower to train
§ Note 6 -
§ Random forests require more memory
§ Note 7 -
§ If time and memory are crucial linear models are best choice than
Random Forests
Gradient
Boosting
Gradient boosted regression trees
§ Also called gradient boosting machines
§ Another ensemble method -
§ combines multiple decision trees to create a more powerful model
§ Basic Idea -
§ Combine many simple models (weak learners)
§ Each weak learner (tree) can only provide good predictions on part of
the data
§ More and more trees are added iteratively to improve performance
§ Despite the “regression” in the name, these models can be used
for regression and classification
§ Gradient boosting works by building trees in a serial manner -
§ where each tree tries to correct the mistakes of the previous one
§ By default, there is no randomization in gradient boosted
regression trees
§ But, Strong pre-pruning is used
§ Gradient boosted trees often use very shallow trees, of depth
one to five
Gradient
Boosting
Advantages of Gradient Boosted Regression
Trees
§ Smaller in terms of memory (because the
trees are shallow)
§ Makes predictions faster
§ Gradient boosted trees are frequently winning
entries in machine learning competitions
§ Widely used in industry
§ A bit more sensitive to parameter settings than
random forests, but
§ Provide better accuracy if the parameters are set
correctly
Gradient
Boosting
Parameter of gradient boosting
§ Apart from Pre-pruning and Number of trees
(n_estimators)
§ Another important parameter of gradient boosting
is the learning_rate
§ Controls how strongly each tree tries to correct the
mistakes of the previous trees
§ Note 1 -
§ Higher learning_rate means each tree can make
stronger corrections, allowing for more complex
models
§ Note 2 -
§ Adding more trees to the ensemble, which can be
accomplished by increasing n_estimators, also increases
the model complexity
Gradient
Boosting
Gradient Boosting Classifier
§ Input –
§ Output -
Gradient
Boosting
§ Training accuracy of 100% - Overfitting
§ To reduce Overfit we can apply
§ Stronger pre-pruning (limiting the max depth)
§ Lower the learning rate
Gradient
Boosting
Pre-pruning
§ Input –
§ Output -
Gradient
Boosting
Learning_rate
§ Input –
§ Output -
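The three code screenshots (default model, stronger pre-pruning, lower learning rate) are not reproduced; a minimal combined sketch, assuming the breast cancer data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# Default settings (100 trees of depth 3, learning_rate=0.1) can overfit.
gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
print("default      train/test: {:.3f} / {:.3f}".format(
    gbrt.score(X_train, y_train), gbrt.score(X_test, y_test)))

# Stronger pre-pruning: limit each tree to depth 1 (decision stumps).
gbrt = GradientBoostingClassifier(random_state=0, max_depth=1).fit(X_train, y_train)
print("max_depth=1  train/test: {:.3f} / {:.3f}".format(
    gbrt.score(X_train, y_train), gbrt.score(X_test, y_test)))

# Lower learning rate: each tree makes weaker corrections.
gbrt = GradientBoostingClassifier(random_state=0, learning_rate=0.01).fit(X_train, y_train)
print("lr=0.01      train/test: {:.3f} / {:.3f}".format(
    gbrt.score(X_train, y_train), gbrt.score(X_test, y_test)))
```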
Gradient
Boosting
§ Feature Importance
§ Input –
§ Output –
Gradient
Boosting
§ Feature Importance -
§ Gradient boosting and random forests perform
well on similar kinds of data
§ Note -
§ First try random forests, which work quite robustly; if
prediction time is critical or the last bit of accuracy matters,
moving to GradientBoosting can help
§ Note -
§ If gradient boosting needs to be applied to a large-
scale problem, it is better to use the xgboost package
Strengths,
Weaknesses,
and Parameters
Strengths -
§ Most powerful and widely used models for
supervised learning
§ Algorithm works well without scaling and on a
mixture of binary and continuous features
Strengths,
Weaknesses,
and Parameters
§ Weaknesses -
§ They require careful tuning of the parameters
§ May take a long time to train
§ Does not work well on high-dimensional sparse
data
Strengths,
Weaknesses,
and Parameters
Parameters
§ max_depth
§ used to reduce the complexity of each tree
§ Usually max_depth is set very low
§ n_estimators
§ Unlike in random forests, a higher n_estimators is not always better:
§ increasing n_estimators in gradient boosting leads to
a more complex model, which may lead to
overfitting
§ Fit n_estimators depending on the time and memory budget,
and then search over different learning_rates
§ learning_rate
§ Controls the degree to which each tree is allowed to
correct the mistakes of the previous trees
Kernelized
SupportVector
Machines
The Kernelized Support
Vector Machines
The Kernel Trick
Understanding SVMs
Tuning SVM Parameters
Kernelized
SupportVector
Machines
Kernelized support vector machines
§ Kernelized support vector machines
§ Often just referred to as SVMs
§ Allows for more complex models that are not
defined simply by hyperplanes in the input space
§ Classification and regression
§ SVC – Classification
§ SVR - Regression
Kernelized
SupportVector
Machines
Kernelized
SupportVector
Machines
§ Terminology
§ Margin – Margin is the gap between the
hyperplane and the support vectors
§ Hyperplane – Hyperplanes are decision
boundaries that aid in classifying the data points
§ Support Vectors – Support Vectors are the data
points that are on or nearest to the hyperplane and
influence the position of the hyperplane
§ Kernel function – These are the functions used to
determine the shape of the hyperplane and
decision boundary
Kernelized
SupportVector
Machines
Linear models and nonlinear features
§ Linear models can be quite limiting in
low-dimensional spaces, as lines and hyperplanes
have limited flexibility
§ One way to make a linear model more flexible is by
adding more features
Kernelized
SupportVector
Machines
§ Input –
§ Output –
Kernelized
SupportVector
Machines
§ A linear model for classification can only separate
points using a line, and will not be able to do a
very good job on this dataset
§ Input -
Kernelized
SupportVector
Machines
Kernelized
SupportVector
Machines
§ Expand the set of input features
§ feature2 = feature1 ** 2 ---> (a non-linear
feature)
§ The square of the second feature, added as a new feature
§ Instead of representing each data point as a two-
dimensional point, (feature0, feature1)
§ We now represent it as a three-dimensional point,
(feature0, feature1, feature1 ** 2)
Kernelized
SupportVector
Machines
§ Example –
Kernelized
SupportVector
Machines
§ Input –
Kernelized
SupportVector
Machines
§ Output –
Kernelized
SupportVector
Machines
§ Input –
§ Output –
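A sketch of the feature expansion described above, continuing from the previous blobs sketch: add feature1 ** 2 as a third feature and fit a linear SVC in the expanded three-dimensional space (the linear decision plane in 3D corresponds to a nonlinear boundary in the original two features):
import numpy as np
from sklearn.svm import LinearSVC

X_new = np.hstack([X, X[:, 1:] ** 2])   # (feature0, feature1, feature1 ** 2)
linear_svm_3d = LinearSVC().fit(X_new, y)
print("Training accuracy in the expanded space:", linear_svm_3d.score(X_new, y))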
Kernelized
SupportVector
Machines
The kernel trick
§ Adding nonlinear features to the representation of our data
can make linear models much more powerful
§ Drawbacks
§ Which features to add?
§ Adding many features might make computation very
expensive
§ Kernel Trick -
§ It is a clever mathematical trick -
§ Allows us to learn a classifier in a higher-dimensional space without
actually computing the new representation
§ Works by directly computing the distance of the data points for the
expanded feature representation, without ever actually computing the
expansion
Kernelized
SupportVector
Machines
§ Two ways to map your data into a higher-
dimensional space in SVM’s (Types of Kernel)
§ Polynomial Kernel
§ Radial Basis Function (RBF) (or) Gaussian Kernel
Kernelized
SupportVector
Machines
§ Polynomial kernel
§ Computes all possible polynomials up to a certain
degree of the original features (like feature1 ** 2 *
feature2 ** 5)
§ Radial Basis Function (RBF)
§ Also known as Gaussian Kernel
§ A bit harder to explain -
§ as it corresponds to an infinite dimensional feature space
§ It considers all possible polynomials of all degrees
§ But the importance of the features decreases for
higher degrees
Kernelized
SupportVector
Machines
Understanding SVMs
§ During training, the SVM learns how important
each of the training data points is to represent the
decision boundary between the two classes
§ Typically only a subset of the training points
matter for defining the decision boundary
§ Ones that lie on the border between the classes
§ These are called support vectors
Kernelized
SupportVector
Machines
§ To make a prediction for a new point
§ The distance to each of the support vectors is
measured
§ A classification decision is made based on the
distances to the support vector and importance
of the support vectors which is learned during
training
§ the importance of the support vectors is stored in the dual_coef_ attribute of SVC
Kernelized
SupportVector
Machines
§ The distance between data points is measured by
the Gaussian kernel
§ k_rbf(x1, x2) = exp(-ɣ ǁ x1 - x2 ǁ²)
§ Here, x1 and x2 are data points
§ ǁ x1 - x2 ǁ denotes Euclidean distance
§ ɣ (gamma) is a parameter that controls the width of
the Gaussian kernel
Kernelized
SupportVector
Machines
§ Example –
§ Input -
Kernelized
SupportVector
Machines
§ The SVM yields a very smooth and nonlinear boundary
§ Output –
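A sketch of fitting an RBF-kernel SVC; the book uses a small handcrafted 2D dataset via mglearn, but any small 2D classification set (such as the blobs above) illustrates the same idea. C=10 and gamma=0.1 are assumed values:
from sklearn.svm import SVC

svm = SVC(kernel='rbf', C=10, gamma=0.1).fit(X, y)
print("Number of support vectors per class:", svm.n_support_)
print("dual_coef_ shape:", svm.dual_coef_.shape)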
Kernelized
SupportVector
Machines
Tuning SVM Parameters
§ Gamma parameter
§ Kernel coefficient
§ only used in case of rbf, poly and sigmoid kernels
§ Corresponds to the inverse of the width of the
Gaussian kernel (RBF)
§ The gamma parameter determines how far the influence of a single training example reaches, with low values corresponding to a far reach and high values to a limited reach
§ The wider the radius of the Gaussian kernel, the
further the influence of each training example
§ C parameter
§ Regularization parameter
§ It limits the importance of each point
Kernelized
SupportVector
Machines
§ Input –
§ Output –
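A sketch of the parameter sweep the Input/Output above presumably visualizes: fit an RBF SVC for several values of C and gamma on a small 2D dataset and compare training accuracy (the plotted decision boundaries are omitted here):
from sklearn.svm import SVC

for C in [0.1, 1, 1000]:
    for gamma in [0.1, 1, 10]:
        svm = SVC(kernel='rbf', C=C, gamma=gamma).fit(X, y)
        print(f"C={C}, gamma={gamma}: training accuracy = {svm.score(X, y):.2f}")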
Kernelized
SupportVector
Machines
Explanation -
§ Left to Right (Gamma Parameter)
§ Increase the value of the parameter gamma from
0.1 to 10
§ A small gamma means a large radius for the
Gaussian kernel -
§ which means that many points are considered close by
§ Smooth boundaries on the left
§ Boundaries that focus more on single points
towards the right
§ Gamma value -
§ Low value - decision boundary will vary slowly
§ High value - yields a more complex model
Kernelized
SupportVector
Machines
Explanation -
§ Top to bottom (C Parameter)
§ Increase the C parameter from 0.1 to 1000
§ C values -
§ Low value -
§ Restricted model
§ Decision boundary is nearly linear
§ Each data point has limited influence
§ High value -
§ Decision boundary bends to classify the data points (non-linear)
§ Each data point has a stronger influence on the model
Kernelized
SupportVector
Machines
§ Example - (Breast Cancer Dataset)
§ Input -
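A minimal sketch of fitting an RBF-kernel SVC with default parameters on the (unscaled) breast cancer data:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

svc = SVC()                     # RBF kernel by default
svc.fit(X_train, y_train)
print("Accuracy on training set:", svc.score(X_train, y_train))
print("Accuracy on test set:", svc.score(X_test, y_test))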
Kernelized
SupportVector
Machines
§ SVMs often perform quite well
§ Very sensitive
§ to the settings of the parameters
§ to the scaling of the data
§ Require all the features to vary on a similar
scale
Kernelized
SupportVector
Machines
Example -
§ Features in the Breast Cancer dataset are of completely different orders of magnitude
§ Input –
§ Output -
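A sketch of checking the feature magnitudes (the book visualizes this with a log-scale plot; printing the per-feature minima and maxima of the training set makes the same point):
# Per-feature minimum and maximum of the training set
print("Feature minima:", X_train.min(axis=0))
print("Feature maxima:", X_train.max(axis=0))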
Kernelized
SupportVector
Machines
Problem with SVM-
§ Features in the Breast Cancer dataset are of completely different orders of magnitude
§ This can have devastating effects for the kernel SVM
§ Solutions -
§ Preprocessing data for SVMs
§ Rescaling each feature so that they are all approximately on
the same scale
§ A common rescaling method for kernel SVMs is to scale the
data such that all features are between 0 and 1
Kernelized
SupportVector
Machines
§ MinMaxScaler preprocessing method
§ Input - (Training Dataset)
§ Output -
Kernelized
SupportVector
Machines
§ Input - (Test Data Set)
§ Input -
§ Output -
§ Scaling the data made a huge difference -
§ It led to underfitting -
§ where training and test set performance are quite similar
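A sketch of the rescaling step using MinMaxScaler (the slides may instead show the manual min-max computation from the book; the result is the same): fit the scaler on the training set only, apply the same transformation to the test set, and refit the SVC:
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

scaler = MinMaxScaler().fit(X_train)          # learn min/range on the training set
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)      # same transformation for the test set

svc = SVC().fit(X_train_scaled, y_train)
print("Accuracy on training set:", svc.score(X_train_scaled, y_train))
print("Accuracy on test set:", svc.score(X_test_scaled, y_test))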
Kernelized
SupportVector
Machines
§ We can try increasing either C or gamma to fit a
more complex model
§ Input -
§ Output -
§ Increasing C allows us to improve the model
significantly, resulting in 97.2% accuracy
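A sketch of the more complex model, continuing from the scaled data above; C=1000 follows the book's example and is an assumed value:
svc = SVC(C=1000).fit(X_train_scaled, y_train)
print("Accuracy on training set:", svc.score(X_train_scaled, y_train))
print("Accuracy on test set:", svc.score(X_test_scaled, y_test))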
Strengths,
Weaknesses,
and Parameters
Strengths -
§ Kernelized support vector machines are powerful
models
§ Perform well on a variety of datasets
§ Allow for complex decision boundaries, even if
the data has only a few features
§ Work well on low-dimensional and high-
dimensional data (i.e., few and many features)
Strengths,
Weaknesses,
and Parameters
Weaknesses -
§ Don’t scale very well with the number of samples
§ Running an SVM on data with up to 10,000 samples
might work well, but working with datasets of size
100,000 or more can become challenging in terms of
runtime and memory usage
§ Require careful preprocessing of the data and
tuning of the parameters
§ SVM models are hard to inspect -
§ It can be difficult to understand why a particular
prediction was made
§ It is tricky to explain the model to a nonexpert
Strengths,
Weaknesses,
and Parameters
§ Note -
§ Try SVMs particularly if all of your features
represent measurements in similar units and they
are on similar scales
Strengths,
Weaknesses,
and Parameters
Parameters
§ Regularization parameter C
§ Choice of the kernel (Polynomial kernel or RBF
Kernel)
§ Kernel-specific parameters (e.g., gamma for the RBF kernel)
§ gamma and C both control the complexity of the model, with large values in either resulting in a more complex model
Uncertainty
Estimates
from
Classifiers
The Decision Function
Predicting Probabilities
Uncertainty
Estimates from
Classifiers
Uncertainty Estimates from Classifiers
§ In scikit-learn, classifiers provide uncertainty estimates of predictions
§ We are not only interested in which class a
classifier predicts for a certain test point, but also
how certain it is that this is the right class
§ Different kinds of mistakes lead to very different
outcomes in real-world applications
§ Testing for cancer
§ False positive prediction might lead to a patient
undergoing additional tests
§ False negative prediction might lead to a serious
disease not being treated
Uncertainty
Estimates from
Classifiers
§ Two different functions used to obtain uncertainty
estimates from classifiers:
§ decision_function
§ predict_proba
§ Most classifiers have at least one of them
§ Many classifiers have both
Uncertainty
Estimates from
Classifiers
§ GradientBoostingClassifier, which has both a decision_function and a predict_proba method:
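A sketch of the setup this section presumably uses: the book builds a small two-class circles dataset and fits a gradient boosting classifier, which provides both methods (class names simplified to the integers 0 and 1 here):
from sklearn.datasets import make_circles
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_circles(noise=0.25, factor=0.5, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)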
Uncertainty
Estimates from
Classifiers
The Decision Function in Gradient Boosting
§ In Binary classification
§ Return value of decision_function is of shape
(n_samples,), and it returns one floating-point
number for each sample:
§ Input -
§ Output -
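A sketch of inspecting decision_function in the binary case, continuing from the classifier above:
print("X_test.shape:", X_test.shape)
print("Decision function shape:", gbrt.decision_function(X_test).shape)   # (n_samples,)
print("Decision function:", gbrt.decision_function(X_test)[:6])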
Uncertainty
Estimates from
Classifiers
§ This value encodes how strongly the model believes a
data point to belong to the “positive” class, in this case
class 1
§ Input -
§ Output -
§ Positive values indicate a preference for the positive class (class 1)
§ Negative values indicate a preference for the negative class (class 0)
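A sketch showing that thresholding the decision function at zero reproduces the predictions (here the classes are the integers 0 and 1, so the comparison is direct):
import numpy as np

greater_zero = (gbrt.decision_function(X_test) > 0).astype(int)
print("Thresholded decision function matches predict:",
      np.all(greater_zero == gbrt.predict(X_test)))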
Uncertainty
Estimates from
Classifiers
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ Input – (The range of decision_function can be arbitrary and depends on the data and the model parameters)
§ Output -
§ Note 1 -
§ Arbitrary scaling makes the output of decision_function
often hard to interpret
Uncertainty
Estimates from
Classifiers
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
Predicting Probabilities
§ The output of predict_proba is a probability for
each class
§ Often more easily understood than the output of
decision_function
§ It is always of shape (n_samples, 2) for binary
classification:
§ Input -
§ Output -
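A sketch of predict_proba for the same classifier:
print("Shape of probabilities:", gbrt.predict_proba(X_test).shape)   # (n_samples, 2)
print("Predicted probabilities:")
print(gbrt.predict_proba(X_test[:6]))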
Uncertainty
Estimates from
Classifiers
§ The first entry in each row is the estimated
probability of the first class, and the second entry
is the estimated probability of the second class
§ Input -
§ Output -
Uncertainty
Estimates from
Classifiers
§ Because the probabilities for the two classes sum to
1, exactly one of the classes will be above 50%
certainty
§ That class is the one that is predicted
§ From the above example the classifier is relatively
certain for most points
§ How well the uncertainty actually reflects
uncertainty in the data depends on the model and
the parameters
§ Note 1 -
§ A model that is more overfitted tends to make more
certain predictions, even if they might be wrong
§ A model with less complexity usually has more
uncertainty in its predictions
Uncertainty
Estimates from
Classifiers
Calibrated model
§ A model is called calibrated if the reported
uncertainty actually matches how correct it is — in
a calibrated model, a prediction made with 70%
certainty would be correct 70% of the time
Uncertainty
Estimates from
Classifiers
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
Uncertainty in Multiclass Classification
§ decision_function and predict_proba methods also work in the
multiclass setting
§ In multiclass case, the shape of the decision_function is (n_samples,
n_classes)
§ each column provides a “certainty score” for each class, where
§ large score means that a class is more likely
§ small score means the class is less likely
§ Example –
§ Iris dataset
§ Input -
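A sketch of the multiclass case on the Iris dataset; learning_rate=0.01 and the random states follow the book's example and are assumptions:
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0)
gbrt.fit(X_train, y_train)
print("Decision function shape:", gbrt.decision_function(X_test).shape)   # (n_samples, 3)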
Uncertainty
Estimates from
Classifiers
§ Example -
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ Example -
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ Example - (predict_proba)
§ Has shape (n_samples, n_classes)
§ The class with the maximum probability is the predicted class
§ The probabilities of the possible classes for each
datapoint sum to 1
§ Input –
§ Output -
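A sketch of predict_proba in the multiclass case, continuing from the Iris classifier above; the argmax over columns recovers the prediction because the Iris classes are the integers 0, 1, 2:
import numpy as np

print("Predicted probabilities shape:", gbrt.predict_proba(X_test).shape)
print("Sums over rows:", gbrt.predict_proba(X_test)[:6].sum(axis=1))
print("Argmax of predict_proba matches predict:",
      np.all(np.argmax(gbrt.predict_proba(X_test), axis=1) == gbrt.predict(X_test)))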
Uncertainty
Estimates from
Classifiers
§ Example - (predict_proba)
§ Input –
§ Output -
Uncertainty
Estimates from
Classifiers
§ predict_proba and decision_function always have shape (n_samples, n_classes), apart from decision_function in the binary case
§ In the binary case, decision_function has only one column, corresponding to the “positive” class, classes_[1]
Summary and
Outlook
Nearest neighbors
§ For small datasets
§ Good as a baseline
§ Easy to explain
Summary and
Outlook
Linear models
§ Go-to as a first algorithm to try
§ Good for very large datasets
§ Good for very high-dimensional data
Summary and
Outlook
Naive Bayes
§ Only for classification
§ Even faster than linear models
§ Good for very large datasets and high-dimensional
data
§ Often less accurate than linear models
Summary and
Outlook
Decision trees
§ Very fast
§ Don’t need scaling of the data
§ Can be visualized
§ Easily explained
Summary and
Outlook
Random forests
§ Nearly always perform better than a single decision
tree, very robust and powerful
§ Don’t need scaling of data
§ Not good for very high dimensional sparse data
Summary and
Outlook
Gradient boosted decision trees
§ Often slightly more accurate than random forests
§ Slower to train but faster to predict than random
forests
§ Smaller in memory
§ Need more parameter tuning than random forests
Summary and
Outlook
Support vector machines
§ Powerful for medium-sized datasets of features with
similar meaning
§ Require scaling of data
§ Sensitive to parameters
Summary and
Outlook
Neural networks
§ Can build very complex models, particularly for
large datasets
§ Sensitive to scaling of the data and to the choice of
parameters
§ Large models need a long time to train
Thank you
  • 1. Machine Learning Source: Introduction to Machine Learning with Python Authors: Andreas C. Müller and Sarah Guido
  • 3. Agenda Classification and Regression Generalization, Overfitting and Underfitting Relation of Model Complexity to Dataset Size K- Nearest Neighbors
  • 4. Agenda Linear Models Linear Models for Classification Naïve Bayes Classifiers Decision Trees
  • 5. Agenda Ensembles of Decision Tress Kernalized Support Vector Machines Uncertainity Estimates from Classifiers
  • 7. Supervised Learning § Supervised learning is used whenever we want to predict a certain outcome from a given input § Goal is to make accurate predictions for new, never-before-seen data § Supervised learning often requires human effort to build the training set, but afterward automates and often speeds up an otherwise laborious or infeasible task
  • 8. Classification and Regression § Two major types of supervised machine learning problems – § Classification § Regression
  • 9. Classification and Regression § Classification § Goal is to predict a class label, which is a choice from a predefined list of possibilities § Classification is sometimes separated into § Binary classification - § Distingution between two classes § Multiclass classification - § Which is classification between more than two classes § Example - § Binary Classification - § Classifying emails as either spam or not spam § Multiclass Classification - § Iris
  • 10. Classification and Regression Regression § Goal is to predict - § continuous number or a floating-point number in programming terms § Example - § Person’s annual income § Predicting the yield of a corn
  • 12. Generalization Generalization § If a model is able to make accurate predictions on unseen data, we say it is able to generalize from the training set to the test set. § Always build a model that is able to generalize as accurately as possible Example - § Boat Buyers Prediction - § Goal is to send out promotional emails to people who are likely to actually make a purchase but not bother those customers who are not interested
  • 13. Generalization Example – Boat Buyers Prediction § If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat
  • 14. Generalization § Rule 1: Complex Rule (Complex Model) § If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat § We can make up many rules that work well on this data § Our goal is to find whether new customers are likely to buy a boat § We therefore want to find a rule that will work well for new customers, and achieving 100 percent accuracy on the training set does not help § The only way or measure of whether an algorithm will perform well on new data is the evaluation on the test set § Note: § Simple models are expected to generalise better to new data § Example: § “Customer older than 50 want to buy a boat” (Simple rule/Simple Model) § is simple rule which did not involve children and divorce features § So it is more generalized or simple model
  • 15. Overfitting Overfitting § Building a model that is too complex for the amount of information we have is called overfitting § Overfitting occurs when you fit a model too closely to the particularities of the training set and obtain a model that works well on the training set but is not able to generalize to new data § Example - § Rule 1 - If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat
  • 16. Underfitting Underfitting § Rule 3 - § Everybody who owns a house buys a boat § Might not be able to capture all the aspects of and variability in the data, and your model will do badly even on the training set § If the model is too simple then it will lead to underfitting
  • 17. Tradeoff between Overfitting and Underfitting § The more complex we allow our model to be, the better we will be able to predict on the training data § But when we start focusing too much on each individual data point in our training set, and the model will not generalize well to new data § Sweet Spot - § will yield the best generalization performance § This is the model we want to find
  • 18. Relation of Model Complexity to Dataset Size Intro to Supevised Machine Learning Algorithms Classification Regression
  • 19. Relation of Model Complexity to Dataset Size Relation of Model Complexity to Dataset Size § Model complexity is tied to the variation of inputs contained in your training dataset § The larger variety of data points your dataset contains, the more complex a model you can use without overfitting § Collecting more data points will yield more variety § So larger datasets allow building more complex models
  • 20. Relation of Model Complexity to Dataset Size Example – Boat Purchase § Added 10,000 more rows of customer data § Rule 1 - § If the customer is older than 45, and has less than 3 children or is not divorced, then they want to buy a boat § This will be a good rule than when it was developed using only the 12 rows § Note 1: § In the real world, we often have the ability to decide how much data to collect? § Large collection of data might be more beneficial than tweaking and tuning your model § Note 2: § Never understimate the power of more data
  • 21. Supervised Machine Learning Algorithms Introduction to Supervised Machine Learning Algorithms § Note: § Many of the machine learning algorithms have a classification and regression variant § Data Sets - § Some datasets will be small and synthetic § Some datasets will be large (real-world examples) § Forge Dataset (Classification Exampe) § A synthetic two-class classification dataset is the forge dataset has two features § Scatter plot § The plot has the first feature on the x-axis and the second feature on the y- axis § Each data point is represented as one dot § The color and shape of the dot indicates its class
  • 24. Supervised Machine Learning Algorithms § Synthetic wave dataset (Regression Example) § A single input feature and a continuous target variable (or response) § Shows the single feature on x-axis and the regression target (the output) on the y-axis
  • 26. Supervised Machine Learning Algorithms Note 1: § Any intution derived from datasets with few features (called low-dimensional datasets) might not hold in datasets with many features (called high-dimensional datasets)
  • 27. Supervised Machine Learning Algorithms Breast Cancer Example § Scikit-learn includes two realworld datasets § Wisconsin breast cancer dataset § Records clinical measurements of breast cancer tumors § Labeled as “benign” (for harmless tumors) § “Malignant” (for cancerous tumors) § Task is to learn to predict whether a tumor is malignant based on the measurements of the tissue
  • 29. Supervised Machine Learning Algorithms Note: § Datasets included in scikit-learn are usually stored as Bunch objects § which contain some information about the dataset as well as the actual data § Bunch Objects is that they behave like dictionaries
  • 30. Supervised Machine Learning Algorithms § The dataset consists of 569 data points, with 30 features each: § Input : § Output :
  • 31. Supervised Machine Learning Algorithms § Of these 569 data points, 212 are labeled as malignant and 357 as benign: § Input : § Output :
  • 32. Supervised Machine Learning Algorithms § To get a description of the semantic meaning of each f eature, we can have a look at the feature_names attribute: § Input : § Output :
  • 33. Supervised Machine Learning Algorithms Regression Example § Boston Housing dataset § The task associated with this dataset is to predict the median value of homes in several Boston neighborhoods in the 1970s with information such as § Crime rate § Proximity to the charles river § Highway accessibility
  • 34. Supervised Machine Learning Algorithms § The dataset contains 506 data points, described by 13 features § Input - § Output –
  • 35. Supervised Machine Learning Algorithms Load_extended_boston function § The dataset contains 506 data points, described by 104 features § 104 features are the 13 original features together with the 91 possible ccombinations of two features within those 13 (with replacement) § Input - § Output –
  • 37. k-Nearest Neighbors k-Nearest Neighbors § Simplest machine learning algorithm § Building the model consists only of storing the training dataset § To make a prediction for a new data point, the algorithm finds the closest data points in the training dataset — its “nearest neighbors.”
  • 38. k-Nearest Neighbors Classification k-Neighbors classification § In the simplest version, the k-NN algorithm only considers exactly one nearest neighbor § i.e., Closest training data point to the point we want to make a prediction for § Prediction is then simply the known output for this training point
  • 40. k-Nearest Neighbors Classification § Added three new data points, shown as stars § Marked the closest point in the training set § The prediction of the one nearest-neighbor algorithm is the label of that point (shown by the color of the cross). § Instead of considering only the closest neighbor, we can also consider an arbitrary number, k, of neighbors § This is how the name of the k-nearest neighbors algorithm comes from
  • 41. k-Nearest Neighbors Classification § When considering more than one neighbor, we use voting to assign a label § This means that for each test point, we count how many neighbors belong to class 0 and how many neighbors belong to class 1 § Assign the class that is more frequent the major ity class among the k-nearest neighbors
  • 43. k-Nearest Neighbors Classification § Step 1 – § Step 2 – § Step 3 -
  • 44. k-Nearest Neighbors Classification Step 4 - § To make predictions on the test data, we call the predict method § Input – § Output - § Step 5 - § How well our model generalizes, we can call the score method § Input - § Output -
  • 45. k-Nearest Neighbors Classification Step 6 – (Analysis using visualization) § Visualization § Input - § Output -
  • 48. k-Nearest Neighbors Regression K-Neigh bors regr es si on ( Simpl e Example) § wave dataset § Added three test data points as green stars on the x-axis
  • 49. k-Nearest Neighbors Regression § Input - (Single Neighbour) § Output -
  • 50. k-Nearest Neighbors Regressor § Input - (Three Neighbours) § Output - (Prediction is the average or mean of the relevant neighbours)
  • 52. k-Nearest Neighbors Regressor § Evaluation - § Evaluate the model using the score method § For regressors returns the R^2 score § The R^2 score, also known as the coefficient of determination § is a measure of goodness of a prediction for a regression model § Yields a score between 0 and 1 § 1 corresponds to perfect prediction § 0 corresponds to a constant model (just predicts the mean of the training set) § Input - § Output –
  • 55. k-Nearest Neighbors Regressor § Using only a single neighbor, each point in the training set has an obvious influence on the predictions, and the predicted values go through all of the data points § More neighbors leads to smoother predictions, but these do not fit the training data as well
  • 56. k-Nearest Neighbors Classifier Strengths, weaknesses, and parameters § Two important parameters to the KNeighbors classifier § Number of neighbors - § Using a small number of neighbours like three or five often works well § you should certainly adjust this parameter § How you measure distance between data points § Euclidean Distance is used which works well in many settings
  • 57. k-Nearest Neighbors Strengths § Very easy to understand, implement § Often gives reasonable performance without a lot of adjustments § Good baseline method to try § Few hyperparameters Weaknesses § Model is usually very fast, but when your training set is very large (either in number of features or in number of samples) prediction can be slow § Mandatory to preprocess the data § Performs poorly with datasets consisting of many zeros (Sparse Datasets) § Lazy learning algorithm § Prone to overfitting § Prone to curse of dimensionality
  • 58. Linear Models Linear Regression (aka ordinary least squares) Ridge Regression Lasso Regression
  • 59. Linear Models § Introduction § Class of models that are widely used in practice § Studied extensively in the last few decades § With roots going back over a hundred years § Linear models make a prediction using a linear function of the input features § Building block for many complex machine learning algorithms, including deep neural networks § It assumes that the data is linearly separable and tries to learn the weight of each feature
  • 60. Linear Models § Linear Models for Regression § x[0] to x[p] denotes the features of a single data point § w and b are parameters of the model that are learned § ŷ is the prediction the model makes § Single feature § where w[0] is the slope and b is the y-axis offset § Note: § Predicted response being a weighted sum of the input features, with weights (which an be negative) given by entries of w
  • 61. Linear Models One -dimensional wave dataset § Input - § Output -
  • 62. Linear Models § Y-Intercept - § this is slightly below which you can also confirm in the image § Linear models for regression can be characterized as regression models for which the prediction § is a line for a single feature § A plane when using two features § Hyperplane in higher dimensions
  • 63. Linear Models § Note 1: § Using a straight line to make predictions is very restrictive § Note 2: § It is a strong assumption (somewhat unrealistic) that our target y is a linear combination of the features § Note 3: § Linear models are very powerful with datasets having many features § Note 4: § Many different models exist for regression § Difference between these models lies in § How the model parameters W and b are learned from the training data? § How the model complexity can be controlled?
  • 64. Linear Regression Linear regression (aka ordinary least squares) § Linear regression - § also known as Ordinary Least Squares (OLS) § Simplest and most classic linear method for regression § Linear regression finds the parameters w and b that § Minimize the mean squared error between predictions and the true regression targets, y, on the training set
  • 65. Linear Regression § Mean Squared Error § The mean squared error is the sum of the squared differences between the predictions and the true values, divided by the number of samples § Linear regression has no parameters § Which is a benefit § But it also has no way to control model
  • 67. Linear Regression § The “slope” parameters (w), also called weights or coefficients § Stored in the coef_ attribute § Offset or intercept (b) is stored in the intercept_ attribute
  • 68. Linear Regression Example - § Input - § Output - § The intercept_ attribute is always a single float number, while the coef_ attribute is a NumPy array with one entry per input feature
  • 69. Linear Regression Training and Test Score (R2) - § Input - § Output - Note - § R2 value of around 0.66 is not very good. § One-dimensional dataset there is a little danger of underfitting § Higher-dimensional datasets, linear models become more powerful and there is a chance of overfitting
  • 70. Linear Regression Boston Housing Dataset - § Consists of 506 samples and 105 derived features § Input - § Output -
  • 71. Linear Regression § Note - § T h e d i s c r e p a n c y b e t w e e n performance on the training set and test set is a clear sign of overfitting § Solution - § Find a model that allows us to control complexity § Some of the alternatives for linear models are Ridge Regression, Lasso Regression
  • 72. Ridge Regression Ridge regression § Ridge regression is also a linear model for regression § The formula it uses to make predictions is the same one used for ordinary least squares § Coefficients (w) are chosen not only so that they predict well on the training data, but also to fit an additional constraint § All entries of w should be close to zero § This means each feature should have as little effect on the outcome as possible (which translates to having a small slope), while still predicting well § This constraint is an example of what is called as Regularization
  • 73. Ridge Regression Regularization - § It is a process of explicitly restricting a model to avoid overfitting § The kind of regularization used in Ridge Regression is L2 Regularization
  • 75. Ridge Regression §Note 1- § Training set score of Ridge is lower than for LinearRegression § Note 2- § Tes t s et s core of Ridge is greater than f or LinearRegression § Note 3- § Ridge is more restricted model, so it is less likely to overfit § Note 4- § a l e s s c o m p l e x m o d e l m e a n s w o r s e performance on the training set but better generalization § Note 5- § We are only interested in Generalization performance
  • 76. Ridge Regression §Note 6- § Ridge model makes a trade-off between the simpicity of the model (near-zero coefficients) and its performance on the training set. § Note 7- § The importance the model places on simplicity versus training set performance can be specified by the user using alpha parameter § Default value of alpha parameter is 1.0 § The optimum setting of alpha depends on the particular dataset we are using § Increase in value of alpha forces the coefficients to move more closer towards zero § Note 8- § Moving coefficients towards zero may decrease t ra i n i n g s e t p e r f o r m a n c e b u t m i g h t h e l p generalization
  • 78. Ridge Regression § For very small values of alpha, coefficients are barely restricted at all, and we end up with a model that resembles LinearRegression § Input - § Output -
  • 80. Ridge Regression § Regularization § Another way to understand the influence of regularization is to fix a value of alpha but vary the amount of training data available § Input – § Output -
  • 81. Ridge Regression § Note - § As more and more data becomes available to the model, both models improve § With enough training data, regularization becomes less important § Given enough data, ridge and linear regression will have the same performance
  • 82. Lasso Regression Lasso § An alternative to Ridge for regularizing linear regression is Lasso § Lasso also restricts coefficients to be close to zero called L1 regularization § W h e n u s i n g t h e l a s s o , s o m e coefficients are exactly zero
  • 83. Lasso Regression Advantages of Lasso § Form of automatic feature selection § Some coefficients be exactly zero often makes a model easier to interpret, and can reveal the most important features of your model
  • 84. Lasso Regression Disadvantages of Lasso § Some features are entirely ignored by the model
  • 85. Lasso Regression Extended Boston Housing dataset § Input - § Output -
  • 86. Lasso Regression § Lasso does quite badly, both on the training set and test set § Indicates that we are underfitting § It used only 4 of the 105 features § Lasso also has a regularization parameter, alpha, that controls how strongly coefficients are pushed toward zero § When we decerease the value of alpha, the maximum number of iterations to run need to be increased (max_iter)
  • 88. Lasso Regression § A lower alpha allowed us to fit a more complex model § This makes this model potentially easier to understand § If we set alpha too low, however, we again remove the effect of regularization and end up overfitting
  • 91. Lasso Regression § Ridge regression is usually the first choice between these two models § If you have a large amount of features and expect only a few of them to be important, lasso might be better choice § If we would like to have a model that is easy to interpret, lasso will provide a model that is easier to understand as it will select only a subset of the input features
  • 92. Linear models for classification Linear Models for classification Linear Models for multiclass classification
  • 93. Linear models for classification Linear models for classification § Linear models are also extensively used for classification § Binary Classification - § The formula looks very similar to the one for linear regression § Instead of just returning the weighted sum of the features, we threshold the predicted value at zero § Function is smaller than zero, we predict the class –1 § If it is larger than zero, we predict the class +1
  • 94. Linear models for classification Linear models for classification § For linear models for regression, the output, ŷ, is a linear function of the features: § a line § plane § hyperplane (in higher dimensions) § For linear models for classification separates two classes using a § line § plane § hyperplane § There are many algorithms for learning linear models § The way in which they measure how well a particular combination of coefficients and intercept fits the training data § What kind of regularization they use?
  • 95. Linear models for classification § The two most common linear classification algorithms are § L o g i s t i c r e g r e s s i o n i m p l e m e n t e d i n linear_model.LogisticRegression § Linear support vector machines (linear SVMs), implemented in svm.LinearSVC (SVC stands for support vector classifier)
  • 96. Linear models for classification Example – Despite its name logistic regression is a classification algorithm § Input - § Output -
  • 97. Linear models for classification § Note 1 - § Both the models are depicted with straight lines separating the areas classified by class 0 and class 1 § Note 2 - § Any new data point that lies above the black line will be classified as class 1 and point below the black line will be classified as class 0 § Note 3 - § The two models Linear SVC and Logistic Regression both come up with similar decision boundaries § Note 4 - § By default both models apply an L2 Regularization
  • 98. Linear models for classification § Trade-off parameter(“c”) § For LogisticRegression and LinearSVC the trade-off parameter that determines the strength of the regularization is called C § A h i g h v a l u e f o r t h e p a r a m e t e r C , LogisticRegression and LinearSVC try to fit the training set as best as possible § higher value of C stresses the importance that each individual data point be classified correctly § Low values of the parameter C, the models put more emphasis on finding a coefficient vector (w) that is close to zero § Using low values of C will cause the algorithms to try to adjust to the “majority” of data points
  • 99. Linear models for classification Example – Decision boundaries of Linear SVM for different values of C Input – Output -
  • 100. Linear models for classification § Left Graph - § Very small C - corresponds to a lot of regularization § Most of the points in class 0 are at the bottom, and most of the points in class 1 are at the top § The strongly regularized model chooses a relatively horizontal line, misclassifying two points § Center Graph - § Value of C is slightly higher § Model focuses more on the two misclassified samples, tilting the decision boundary § Right Graph - § Very high value of C in the model tilts the decision boundary a lot § Now correctly classifying all points in class 0 § One of the points in class 1 is still misclassified, as it is not possible to correctly classify all points in this dataset using a straight line. § The model illustrated on the righthand side tries hard to correctly classify all points, but might not capture the overall layout of the classes well. § In other words, this model is likely overfitting. § Similarly to the case of regression, linear models for classification might seem very restrictive in low-dimensional spaces, only allowing for decision boundaries that are straight lines or planes
  • 101. Linear models for classification Example – Breast Cancer § Input – § Output - § The default value of C=1 § Good training and test accuracy § Training and Test accuracy are very close - Likely to underfit
  • 102. Linear models for classification Example – § Input – § Output -
  • 103. Linear models for classification Example – § Input – § Output - Underfit
  • 104. Linear models for classification § As LogisticRegression applies an L2 regularization by default the result looks similar to that produced by RIDGE § Stronger regularization pushes coefficients more and more toward zero, though coefficients never become exactly zero § More interpretable model, using L1 regularization might help,as it limits the model to using only a few features
  • 105. Linear models for classification § Coefficients learned by the models with the three different settings of parameter C
  • 107. Linear models for classification § Input – (Lasso) § Output -
  • 109. Linear models for Multiclass Classification Linear models for multiclass classification § Many linear classification models are for binary classification only and dont extend naturally to the multiclass case § But, Logistic Regression is an exception § Techinique used to extend a binary classification algorithm to a multiclass classification algorithm is the one-vs-rest approach § A binary model is learned for each class that tries to separate that class from all of the other classes, resulting in as many binary models as there are classes § To make a prediction, all binary classifiers are run on a test point § The classifier that has the highest score on its single class “wins,” and this class label is returned as the prediction
  • 110. Linear models for Multiclass Classification Linear models for multiclass classification § Having one binary classifier per class results in having one vector of coefficients (w) and one intercept (b) for each class § The class for which the result of the classification confidence formula given here is highest § The mathematics behind multiclass logistic regression differ from one-vs-rest approach § but they also result in one coefficient vector and intercept § same method of making a prediction is applied § Classification confidence formula
  • 111. Linear models for Multiclass Classification Example – one vs rest § Input – § Output -
  • 112. Linear models for Multiclass Classification Example – § Input – § Output - § coef_ is (3, 2) § each row coefficient vector for one of the three classes § each column holds the coefficient value for a specific feature § The intercept_ is a one-dimensional array
  • 115. Strengths, Weaknesses, Parameters § First Decision - Regularization Parameters (alpha & c) § The main parameter of linear models is the regularization parameter § alpha in the regression models § C in classification models (linear svc and logistic regression) § Large values for alpha or small values for C mean simple models § For regression models, tuning these parameters is quite important § Second Decision - Regularization Techniques (L1 and L2) § Decision on what regularization is also important § L1 regularization § L2 regularization § When only a few of your features are actually important, you should use L1 § L1 can also be useful if interpretability of the model is important § As L1 will use only a few features, it is easier to explain which features are important to the model
  • 116. Strengths, Weaknesses, Parameters §Strengths § Linear models are very fast to train and also very fast to predict § They scale to very large datasets § Works well with sparse data § Linear models make it relatively easy to understand how a prediction is made, using the formulas we saw earlier for regression and classification
  • 117. Strengths, Weaknesses, Parameters §Weaknesses - § It is often not entirely clear why the coefficients take the values they do § If the dataset has highly correlated features, the coefficients can be especially hard to interpret § Note - § Linear models often perform well when the number of features is large compared to the number of samples § They are also often used on very large datasets, simply because training other models may not be feasible
  • 120. Naive Bayes Classifiers Naïve Bayes Classifiers § A family of classifiers that are quite similar to the linear models § Advantages § They tend to be even faster in training § Disadvantages § Generalization performance that is slightly worse than that of linear classifiers (i.e., LogisticRegression and LinearSVC)
  • 121. Naive Bayes Classifiers Naïve Bayes Classifiers § It is a probabilistic classifier, which means it predicts on the basis of the probability of an object § Mainly used in text classification with high-dimensional training datasets § The Naïve Bayes algorithm is made up of the two words Naïve and Bayes § Naïve: § It is called Naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features § Naive Bayes assumes that each parameter, also called a feature or predictor, has an independent capacity to predict the output variable § Example - § An apple is identified by Shape (Round), Color (Red), Taste (Sweet) - each feature contributes to the model independently § Bayes: § It is called Bayes because it depends on the principle of Bayes' Theorem (also called Bayes' Rule or Bayes' Law)
  • 122. Naive Bayes Classifiers Naïve Bayes Theorem § P(A|B) = P(B|A) * P(A) / P(B) § Where, § P(A|B) is the Posterior probability: probability of hypothesis A given the observed event B § P(B|A) is the Likelihood: probability of the evidence B given that hypothesis A is true § P(A) is the Prior probability: probability of the hypothesis before observing the evidence § P(B) is the Marginal probability: probability of the evidence
  • 123. Naive Bayes Classifiers Steps to solve Naïve Bayes § Convert the given dataset into frequency tables. § Generate Likelihood table by finding the probabilities of given features. § Now, use Bayes theorem to calculate the posterior probability
  • 127. Naive Bayes Classifiers § Problem § If the weather is sunny, should the player play or not? § Solution - apply the three steps above (see the sketch below)
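The frequency table referenced above is not reproduced on the slide, so the sketch below uses hypothetical weather/play counts purely to illustrate the three steps (frequency table, likelihoods, Bayes' theorem); with the real table the arithmetic is identical.

# Sketch: applying Bayes' theorem to "should the player play when the weather is sunny?"
# The counts below are hypothetical placeholders for the table referenced on the slide.
counts = {                      # (weather, play) -> number of observed days
    ("Sunny", "Yes"): 3, ("Sunny", "No"): 2,
    ("Rainy", "Yes"): 2, ("Rainy", "No"): 2,
    ("Overcast", "Yes"): 4, ("Overcast", "No"): 1,
}
total = sum(counts.values())
p_yes = sum(v for (w, p), v in counts.items() if p == "Yes") / total      # prior P(Yes)
p_sunny = sum(v for (w, p), v in counts.items() if w == "Sunny") / total  # evidence P(Sunny)
p_sunny_given_yes = counts[("Sunny", "Yes")] / sum(
    v for (w, p), v in counts.items() if p == "Yes")                      # likelihood P(Sunny|Yes)
p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny                   # posterior P(Yes|Sunny)
print("P(Yes | Sunny) =", round(p_yes_given_sunny, 3))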
  • 128. Naive Bayes Classifiers § The reason Naive Bayes models are so efficient is that they learn parameters by looking at each feature individually and collect simple per-class statistics from each feature
  • 129. Naive Bayes Classifiers § Types of Naive Bayes Classifiers § GaussianNB § BernoulliNB § MultinomialNB
  • 130. GaussianNB § GaussianNB § GaussianNB can be applied to any continuous data § GaussianNB stores the average value as well as the standard deviation of each feature for each class § If predictors take continuous values instead of discrete ones, the model assumes that these values are sampled from a Gaussian distribution § Gaussian Naive Bayes is a machine learning classification technique based on a probabilistic approach that assumes each class follows a normal distribution § The combination of the predictions for all parameters is the final prediction, which returns a probability of the dependent variable being classified in each group § The final classification is assigned to the group with the highest probability
  • 131. GaussianNB § The Gaussian model assumes that features follow a normal distribution § Normal Distribution - § Describes the distributions of continuous random variables in nature and is defined by its bell-shaped curve § A normal distribution has a probability distribution that is centered around the mean § This means that the distribution has more data around the mean § The data distribution decreases as you move away from the center § The resulting curve is symmetrical about the mean and forms a bell-shaped distribution
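A minimal GaussianNB sketch on continuous data; the Iris dataset is used here only as a stand-in for "any continuous data".

# Sketch: GaussianNB on continuous features (Iris used purely as an illustration).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

gnb = GaussianNB().fit(X_train, y_train)   # stores per-class mean and variance of each feature
print("test accuracy:", gnb.score(X_test, y_test))
print("class probabilities for one sample:", gnb.predict_proba(X_test[:1]))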
  • 132. BernoulliNB § BernoulliNB § BernoulliNB assumes binary data § Used for discrete probability calculation § The predictor variables are the independent Boolean variables § Mostly used in text data classification/Document classification § Counts how often every feature of each class is not zero
  • 133. BernoulliNB § Example – § Four data points § Four binary features for each data point § Two classes, 0 and 1 § For class 0 (the first and third data points), the first feature is zero two times and nonzero zero times, the second feature is zero one time and nonzero one time, and so on (see the counting sketch below)
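The per-class counting described above can be reproduced directly; the sketch below uses a small binary dataset of the kind the slide describes (four points, four binary features, two classes).

# Sketch: per-class nonzero counts, the statistic BernoulliNB collects from binary features.
import numpy as np

X = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 1, 0]])
y = np.array([0, 1, 0, 1])      # two classes; the first and third points are class 0

counts = {}
for label in np.unique(y):
    # for each class, count how often each feature is nonzero
    counts[label] = X[y == label].sum(axis=0)
print("Feature counts per class:", counts)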
  • 134. MultinomialNB § MultinomialNB § MultinomialNB assumes count data § Example - each feature represents an integer count of something, like how often a word appears in a sentence § Mostly used in text data classification, like BernoulliNB § MultinomialNB takes into account the average value of each feature for each class
  • 135. Naive Bayes Classifiers § Note 1 - § To make a prediction, a data point is compared to the statistics for each of the classes, and the best matching class is predicted § Note 2 - § The prediction formula for MultinomialNB and BernoulliNB has the same form as in the linear models § Note 3 - § coef_ for the naive Bayes models has a different meaning than in the linear models
  • 136. Strengths, Weaknesses, and Parameters § Parameters - § MultinomialNB and BernoulliNB have a single parameter, alpha, which controls model complexity § A large alpha means more smoothing, resulting in less complex models § Note - § The algorithm's performance is relatively robust to the value of alpha § Setting alpha is NOT critical for good performance, but tuning it usually improves accuracy somewhat
  • 137. Strengths, Weaknesses, and Parameters § GaussianNB is mostly used on very high-dimensional data § BernoulliNB and MultinomialNB are widely used for sparse count data such as text § MultinomialNB usually performs better than BernoulliNB, particularly on datasets with a relatively large number of nonzero features
  • 138. Strengths, Weaknesses, and Parameters § Advantages - § Very fast to train and to predict § One of the simplest algorithms § Training procedure is easy to understand § Models work very well with high-dimensional sparse data and are relatively robust to the parameters § Great baseline models § Often used on very large datasets § Works well for both binary classification and multiclass classification problems § A strong choice for text classification problems
  • 139. Strengths, Weaknesses, and Parameters § Weakness- § Naive Bayes assumes that all features are independent or unrelated § so it cannot learn the relationship between features
  • 140. Decision Trees Building Decision trees Controlling complexity of Decision trees Feature importance in trees
  • 141. DecisionTrees § Widely used models for both classification and regression § They learn a hierarchy of if/else questions, leading to a decision Example – § Distinguish between the following four animals § Bears § Hawks § Hen § Dolphins
  • 143. DecisionTrees § Each node in the tree either § represents a question § Terminal node (also called a leaf) that contains the answer § In ML we build a model to distinguish between four classes of animals using the three features “has feathers,”“can fly,” and “has fins.”
  • 144. DecisionTrees § Building decision trees § Example - two_moons dataset § The dataset consists of two half-moon shapes, with each class consisting of 75 data points § Learning a decision tree means learning the sequence of if/else questions that gets us to the true answer most quickly § In machine learning, these if/else questions are called tests § Question format in the case of continuous data - § In real life, data does not come in the form of binary yes/no features as in the animal example § Data can be continuous in real-life situations § The tests that are used on continuous data are of the form "Is feature i larger than value a?"
  • 145. DecisionTrees § Splitting the dataset horizontally at x[1]=0.0596 yields the most information; it best separates the points in class 0 from the points in class 1 § The top node, also called the root, represents the whole dataset, consisting of 50 points belonging to class 0 and 50 points belonging to class 1
  • 147. DecisionTrees § The split is done by testing whether x[1] <= 0.0596 (test), indicated by a black line § If test is True - § Assigned to the left node, which contains 2 points belonging to class 0 and 32 points belonging to class 1 § If test is False - § Assigned to the right node, which contains 48 points belonging to class 0 and 18 points belonging to class 1 § Though the first split did a good job of separating the two classes, the bottom region still contains points belonging to class 0, and the top region still contains points belonging to class 1 § Figure 2-25 shows that the most informative next split for the left and the right region is based on x[0]
  • 148. DecisionTrees § This recursive process yields a binary tree of decisions, with each node containing a test § Each test splits the part of the data that is currently being considered along one axis § This yields a view of the algorithm as building a hierarchical partition § Each test concerns only a single feature § which results in partitions into regions that are always parallel to the axes § The recursive partitioning of the data is repeated until each region in the partition (each leaf in the decision tree) only contains a single target value (a single class or a single regression value) § Pure Leaves- § The leaf of the tree that contains data points that all share the same target value is called PURE
  • 149. DecisionTrees § The above fig is the final partition § A prediction on a new data point is made by checking which region of the partition of the feature space the point lies in, and then predicting the majority target in that region § It is also possible to use trees for regression tasks § Where the output for this data point is the mean target of the training points in this leaf
  • 150. DecisionTrees Controlling complexity of decision trees § Drawback - § Building a tree until all leaves are PURE leads to models that are very complex and highly overfit to the training data § The overfitting can be seen on the left of Figure 2-26 § We can see a small strip predicted as class 0 around the point belonging to class 1
  • 151. DecisionTrees Controlling complexity of decision trees § Common strategies to prevent overfitting § Pre-pruning - § Stopping the creation of the tree early § Possible criteria for pre-pruning § Limiting the maximum depth of the tree § Limiting the maximum number of leaves § Requiring a minimum number of data points in a node to keep splitting it § Post-pruning - § Building the tree and then removing or collapsing nodes that contain little information § Also called simply pruning
  • 152. DecisionTrees § Decision trees in scikit-learn are implemented in the § DecisionTreeRegressor § DecisionTreeClassifier § Scikit-learn only implements pre-pruning but NOT post-pruning
  • 153. DecisionTrees Breast Cancer dataset § Import the dataset and split it into a training and a test part § Then we build a model using the default setting of fully developing the tree (growing the tree until all leaves are pure) § We fix the random_state in the tree, which is used for tie-breaking internally § Input - § Output -
  • 154. DecisionTrees § The accuracy on the training set is 100% — because the leaves are pure § The tree was grown deep enough that it could perfectly memorize all the labels on the training data § The test set accuracy is slightly worse than for the linear models § Limiting the depth of the tree decreases overfitting
  • 155. DecisionTrees § Limiting the depth of the tree decreases overfitting § If we don't restrict the depth of a decision tree, the tree can become arbitrarily deep and complex § Unpruned trees are therefore prone to overfitting and not generalizing well to new data § Pre-pruning the tree - § will stop developing the tree before we perfectly fit to the training data § One option is to stop building the tree after a certain depth has been reached § Set max_depth=4 - meaning only four consecutive questions can be asked § This lowers the training accuracy but improves the test accuracy (see the sketch below)
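A sketch of the unpruned versus pre-pruned comparison, assuming the scikit-learn breast cancer loader; the exact scores depend on the split and library version.

# Sketch: unpruned tree vs. pre-pruned tree (max_depth=4) on the Breast Cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

for depth in [None, 4]:     # None = grow the tree until all leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print("max_depth={!s:<5} train: {:.3f}  test: {:.3f}".format(
        depth, tree.score(X_train, y_train), tree.score(X_test, y_test)))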
  • 158. Analyzing DecisionTrees § The example provides a good description for the decision tree machine learning algorithm which can be easily explained to nonexperts § With a tree of depth four, as seen here, the tree can become a bit overwhelming. § Deeper trees are even harder to grasp § One method of inspecting the tree that may be helpful is to find out which path most of the data actually takes
  • 159. Feature importance in DecisionTrees Feature importance in trees § Instead of looking at the whole tree, some useful properties can be used to summarize the tree § The most commonly used summary is feature importance § it rates how important each feature is for the decision a tree makes § It is a number between 0 and 1 for each feature, § 0 means “not used at all” § 1 means “perfectly predicts the target.”
  • 162. DecisionTrees § Worst radius is by far the most important feature § Note 1 - § If a feature has a low value in feature_importance_, it doesn’t mean that this feature is uninformative § It only means that the feature was not picked by the tree, likely because another feature encodes the same information § Note 2 - § Feature importances are always positive § Note 3 - § The feature importances tell us that “worst radius” is important, but not whether a high radius is indicative of a sample being benign or malignant
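A sketch of reading feature_importances_ from a pruned tree on the Breast Cancer dataset; the exact ranking can differ slightly with the split and library version.

# Sketch: inspecting feature_importances_ of a pruned tree on the Breast Cancer dataset.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, stratify=cancer.target, random_state=42)

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
importances = tree.feature_importances_       # one non-negative number per feature, summing to 1
for idx in np.argsort(importances)[::-1][:5]: # the five most important features
    print("{:<25} {:.3f}".format(cancer.feature_names[idx], importances[idx]))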
  • 163. DecisionTrees Regressor § Decision trees for regression, as implemented in DecisionTreeRegressor § The usage and analysis of regression trees is very similar to that of classification trees § The DecisionTreeRegressor is not able to extrapolate - § make predictions outside of the range of the training data
  • 165. DecisionTrees Regressor § Compare two simple models - § Decision Tree Regressor § Linear Regression § Rescale the prices using a logarithm § This doesn’t make a difference for the Decision Tree Regressor, but it makes a big difference for Linear Regression § After training the models and making predictions, we apply the exponential map to undo the logarithm transform
  • 168. DecisionTrees Regressor § The linear model approximates the data with a line and provides quite a good forecast for the test data
  • 169. DecisionTrees Regressor § The tree model, on the other hand, makes perfect predictions on the training data § We did not restrict the complexity of the tree, so it learned the whole dataset by heart § Once we leave the data range for which the model has data, the model simply keeps predicting the last known point § The tree has no ability to generate “new” responses, outside of what was seen in the training data § This shortcoming applies to all models based on trees
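The price data behind this figure is not included here, so the sketch below substitutes a synthetic, exponentially decaying price series purely to illustrate the behaviour: the tree repeats its last known value outside the training range, while the linear model fitted on log-prices keeps following the trend.

# Sketch: tree vs. linear regression on log-transformed prices (synthetic data for illustration).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

dates = np.arange(2000, 2020, 0.25)               # hypothetical time axis
prices = 1000 * np.exp(-0.3 * (dates - 2000))     # hypothetical exponentially falling prices

train = dates < 2015                              # train on the past, forecast the future
X_train, y_train = dates[train].reshape(-1, 1), np.log(prices[train])

tree = DecisionTreeRegressor().fit(X_train, y_train)
linreg = LinearRegression().fit(X_train, y_train)

# undo the log transform; beyond 2015 the tree simply repeats its last leaf value
print("tree forecast for 2019:", np.exp(tree.predict([[2019.0]]))[0])
print("linear forecast for 2019:", np.exp(linreg.predict([[2019.0]]))[0])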
  • 170. DecisionTrees Regressor (Strengths, Weaknesses and Parameters) § Parameters - § The parameters that control model complexity in decision trees are the pre-pruning parameters that stop the building of the tree before it is fully developed § max_depth § max_leaf_nodes § min_samples_leaf § Setting one of these parameters is sufficient to prevent overfitting
  • 171. DecisionTrees Regressor (Strengths, Weaknesses and Parameters) § Strengths - § The resulting model can easily be visualized and understood by nonexperts (at least for smaller trees) § The algorithms are completely invariant to scaling of the data § Each feature is processed separately § The splits of the data don't depend on scaling § NO preprocessing like normalization or standardization of features is needed for decision tree algorithms § Decision trees work well when you have features that are § on completely different scales § a mix of binary and continuous features
  • 172. DecisionTrees Regressor (Strengths, Weaknesses and Parameters) § Weaknesses - § Without the use of pre-pruning, they tend to overfit and provide poor generalization performance
  • 174. Ensembles of DecisionTrees Ensembles of Decision Trees § What are ensembles? § Ensembles are methods that combine multiple machine learning models to create more powerful models § Two ensemble models that have proven to be effective on a wide range of datasets for classification and regression § Random forests § Gradient boosted decision trees § Both use decision trees as their building blocks
  • 175. RandomForests Random Forest § Main drawback of decision trees is that they tend to overfit the training data § Random forests are one way to address this problem § What? § A random forest is essentially a collection of decision trees, where each tree is slightly different from the others § Idea behind Random Forests - § Each tree might do a relatively good job of predicting, but will likely overfit on part of the data § If we build many trees, all of which work well and overfit in different ways § We can reduce the amount of overfitting by averaging their results
  • 176. RandomForest Random Forest § Need to build many decision trees § Each tree should do an acceptable job of predicting the target, and should also be different from the other trees § Why Random Forest ? § Random forests get their name from injecting randomness into the tree building to ensure each tree is different § Two ways of randomizing § By selecting the data points used to build a tree § By selecting the features in each split test
  • 177. RandomForest Randomness in RandomForest is decided by § Bootstrap sample § Selection of features (max_features)
  • 178. RandomForest Building Random forests § Step 1 - § You need to decide on the number of trees to build (n_estimators parameter) § Note - § Trees will be built completely independently from each other § Algorithm will make different random choices for each tree to make sure the trees are distinct
  • 179. RandomForest Bootstrap sample § To build a tree, we first need to take a bootstrap sample § How? § From our n_samples data points, we repeatedly draw a sample randomly with replacement, n_samples times § Replacement means the same sample can be picked multiple times § Example of a bootstrap sample - § Creating a bootstrap sample of the list ['a', 'b', 'c', 'd'] § A possible bootstrap sample would be ['b', 'd', 'd', 'c'] § Another possible sample would be ['d', 'a', 'd', 'a'] § This creates a dataset that is as big as the original dataset, but some data points will be missing from it, and some will be repeated (see the sketch below)
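A minimal sketch of drawing a bootstrap sample with NumPy; the particular sample drawn depends on the random seed.

# Sketch: drawing a bootstrap sample (sampling with replacement) from a small list.
import numpy as np

rng = np.random.RandomState(0)
data = np.array(['a', 'b', 'c', 'd'])
bootstrap = rng.choice(data, size=len(data), replace=True)  # same size, repeats allowed
print("bootstrap sample:", list(bootstrap))
print("missing from sample:", sorted(set(data) - set(bootstrap)))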
  • 180. RandomForest § Step 2 - § A decision tree is built based on this newly created dataset § Instead of looking for the best test for each node, in each node the algorithm randomly selects a subset of the features, and it looks for the best possible test involving one of these features § The number of features that are selected is controlled by the max_features parameter. § This selection of a subset of features is repeated separately in each node, so that each node in a tree can make a decision using a different subset of the features § The bootstrap sampling leads to each decision tree in the random forest being built on a slightly different dataset § Because of the selection of features in each node, each split in each tree operates on a different subset of features
  • 181. RandomForest § A critical parameter in this process is max_features § max_features = n_features means § that each split can look at all features in the dataset § NO randomness will be injected in the feature selection § max_features =1, means § that the splits have no choice at all on which feature to test § max_features = HIGH means § that the trees in the random forest will be quite similar § they will be able to fit the data easily, using the most distinctive features § max_features = LOW means § that the trees in the random forest will be quite different
  • 182. RandomForest §Prediction § The random forest algorithm predicts by first making a prediction for every tree in the forest § For regression - § Average - we average the results of all the decision trees to get our final prediction § For classification - § Soft voting - § Each decision tree makes a "soft" prediction, providing a probability for each possible output label § The probabilities predicted by all the trees are averaged, and the class with the highest probability is predicted
  • 183. RandomForest Analyzing random forests § Input – § The trees that are built as part of the random forest are stored in the estimators_ attribute
  • 184. RandomForest § Input – § Decision boundaries learned by the five trees are quite different § some of the training points that are plotted here were not actually included in the training sets of the trees, due to the bootstrap sampling § Note - § The random forest overfits less than any of the trees individually
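A sketch of the five-tree forest described above, assuming the two_moons dataset; individual tree and forest accuracies will vary with the random seeds.

# Sketch: a small random forest of five trees on the two_moons data.
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

forest = RandomForestClassifier(n_estimators=5, random_state=2).fit(X_train, y_train)
print("number of trees stored in estimators_:", len(forest.estimators_))
print("forest test accuracy: {:.3f}".format(forest.score(X_test, y_test)))
for i, tree in enumerate(forest.estimators_):   # each tree typically overfits in a different way
    print("tree", i, "test accuracy: {:.3f}".format(tree.score(X_test, y_test)))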
  • 185. RandomForest § In any real application, we would use many more trees (often hundreds or thousands), leading to even smoother boundaries
  • 186. RandomForest § Random forest consisting of 100 trees § Input – § Output -
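A sketch of the 100-tree random forest on the Breast Cancer dataset; the n_jobs=-1 setting (use all CPU cores) is optional and only affects training speed.

# Sketch: random forest with 100 trees on the Breast Cancer dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
print("train accuracy: {:.3f}".format(forest.score(X_train, y_train)))
print("test accuracy: {:.3f}".format(forest.score(X_test, y_test)))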
  • 188. RandomForest (Strengths, Weaknesses, and Parameters) Strengths - § They are very powerful § Work well without heavy tuning of the parameters § Don't require scaling of the data
  • 189. RandomForest (Strengths, Weaknesses, and Parameters) § Why is a single decision tree sometimes still used instead of a random forest? § A single decision tree offers a compact representation of the decision-making process, which can be explained to nonexperts
  • 190. RandomForest (Strengths, Weaknesses, and Parameters) Weaknesses - § It is basically impossible to interpret tens or hundreds of trees in detail § Trees in a random forest tend to be deeper than single decision trees (because of the use of feature subsets) § Building random forests on large datasets might be somewhat time consuming
  • 191. RandomForest §Multi-Core Processing - § To increase the speed of building random forests on large datasets § Use the n_jobs parameter to adjust the number of cores to use § Using more CPU cores will result in linear speedups § n_jobs=-1 to use all the cores in your computer
  • 192. RandomForest (Strengths, Weaknesses, and Parameters) § Parameters - § The important parameters to adjust are § n_estimators § max_features § Possibly pre-pruning options like max_depth § Note 1 - § For n_estimators, larger is always better § Rule of thumb: build as many trees as you have time/memory for § Note 2 - § max_features determines how random each tree is § A smaller max_features reduces overfitting § Rule of thumb: § max_features = sqrt(n_features) for classification § max_features = n_features for regression
  • 193. RandomForest (Strengths, Weaknesses, and Parameters) § Note 1 - § The more trees there are in the forest, the more robust it will be against the choice of random_state § Note 2 - § Random forests don't tend to perform well on very high-dimensional, sparse data, such as text data § Linear models are a better choice for very high-dimensional, sparse data § Note 3 - § Random forests usually work well even on very large datasets § Note 4 - § Training can easily be parallelized over many CPU cores within a powerful computer § Note 5 - § Random forests are slower to train than linear models § Note 6 - § Random forests require more memory than linear models § Note 7 - § If time and memory are at a premium, linear models are often a better choice than random forests
  • 194. Gradient Boosting Gradient boosted regression trees § Also called gradient boosting machines § Another ensemble method - § combines multiple decision trees to create a more powerful model § Basic Idea - § Combine many simple models (weak learners) § Each weak learner (tree) can only provide good predictions on part of the data § More and more trees are added iteratively to improve performance § Despite the "regression" in the name, these models can be used for both regression and classification § Gradient boosting works by building trees in a serial manner - § where each tree tries to correct the mistakes of the previous one § By default, there is no randomization in gradient boosted regression trees § Instead, strong pre-pruning is used § Gradient boosted trees often use very shallow trees, of depth one to five
  • 195. Gradient Boosting Advantages of Gradient Boosted Regression Trees § Smaller in terms of memory (because of the shallow trees) § Make predictions faster § Gradient boosted trees are frequently winning entries in machine learning competitions § Widely used in industry § Main caveat: a bit more sensitive to parameter settings than random forests, but provide better accuracy if the parameters are set correctly
  • 196. Gradient Boosting Parameters of gradient boosting § Apart from pre-pruning and the number of trees (n_estimators) § Another important parameter of gradient boosting is the learning_rate § Controls how strongly each tree tries to correct the mistakes of the previous trees § Note 1 - § A higher learning_rate means each tree can make stronger corrections, allowing for more complex models § Note 2 - § Adding more trees to the ensemble, which can be accomplished by increasing n_estimators, also increases the model complexity
  • 198. Gradient Boosting § Training accuracy of 100% - overfitting § To reduce overfitting we can apply § Stronger pre-pruning (limiting the max depth) § A lower learning rate (see the sketch below)
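A sketch of the two remedies above, assuming the Breast Cancer dataset: default settings versus max_depth=1 versus learning_rate=0.01; the exact scores depend on the split and library version.

# Sketch: reducing overfitting in gradient boosting via max_depth and learning_rate.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

for kwargs in [{}, {"max_depth": 1}, {"learning_rate": 0.01}]:
    gbrt = GradientBoostingClassifier(random_state=0, **kwargs).fit(X_train, y_train)
    print(kwargs or "defaults",
          "train: {:.3f}".format(gbrt.score(X_train, y_train)),
          "test: {:.3f}".format(gbrt.score(X_test, y_test)))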
  • 201. Gradient Boosting § Feature Importance § Input – § Output –
  • 202. Gradient Boosting § Feature Importance - § Gradient boosting and random forests perform well on similar kinds of data § Note - § First try random forests, which work quite robustly; if prediction time is at a premium or the last bit of accuracy matters, moving to gradient boosting can help § Note - § If gradient boosting needs to be applied to a large-scale problem, it may be better to use the xgboost package
  • 203. Strengths, Weaknesses, and Parameters Strengths - § Gradient boosted decision trees are among the most powerful and widely used models for supervised learning § The algorithm works well without scaling and on a mixture of binary and continuous features
  • 204. Strengths, Weaknesses, and Parameters § Weaknesses - § They require careful tuning of the parameters § May take a long time to train § Does not work well on high-dimensional sparse data
  • 205. Strengths, Weaknesses, and Parameters Parameters § max_depth § used to reduce the complexity of each tree § Usually max_depth is set very low § n_estimators § Unlike in random forests, a higher n_estimators is not always better § increasing n_estimators in gradient boosting leads to a more complex model, which may lead to overfitting § Common practice: fit n_estimators depending on the time and memory budget, and then search over different learning_rates § learning_rate § Controls the degree to which each tree is allowed to correct the mistakes of the previous trees
  • 206. Kernelized SupportVector Machines The Kernelized Support Vector Machines The Kernel Trick Understanding SVMs Tuning SVM Parameters
  • 207. Kernelized SupportVector Machines Kernelized support vector machines § Kernelized support vector machines § Often just referred to as SVMs § Allows for more complex models that are not defined simply by hyperplanes in the input space § Classification and regression § SVC – Classification § SVR - Regression
  • 209. Kernelized SupportVector Machines § Terminology § Margin – the gap between the hyperplane and the support vectors § Hyperplane – hyperplanes are decision boundaries that aid in classifying the data points § Support Vectors – the data points that are on or nearest to the hyperplane and influence its position § Kernel function – the functions used to determine the shape of the hyperplane and decision boundary
  • 210. Kernelized SupportVector Machines Linear models and nonlinear features § Linear models can be quite limiting in low-dimensional spaces, as lines and hyperplanes have limited flexibility § One way to make a linear model more flexible is by adding more features
  • 212. Kernelized SupportVector Machines § A linear model for classification can only separate points using a line, and will not be able to do a very good job on this dataset § Input -
  • 214. Kernelized SupportVector Machines § Expand the set of input features § feature2 = feature1 ** 2 ---> (non-linear feature) § i.e., the square of the second feature is added as a new feature § Instead of representing each data point as a two-dimensional point (feature0, feature1) § We now represent it as a three-dimensional point (feature0, feature1, feature1 ** 2) (see the sketch below)
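A sketch of the feature-expansion idea, assuming a four-blob dataset collapsed to two classes so that a straight line cannot separate them well; adding feature1 ** 2 gives the linear classifier a third dimension to work with.

# Sketch: adding a squared feature so a linear classifier can separate the classes better.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import LinearSVC

X, y = make_blobs(centers=4, random_state=8)
y = y % 2                                   # collapse to two classes that a line separates poorly

X_new = np.hstack([X, X[:, 1:] ** 2])       # add feature1 ** 2 as a third feature
clf_2d = LinearSVC(max_iter=10000).fit(X, y)
clf_3d = LinearSVC(max_iter=10000).fit(X_new, y)
print("accuracy with 2 features: {:.3f}".format(clf_2d.score(X, y)))
print("accuracy with added squared feature: {:.3f}".format(clf_3d.score(X_new, y)))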
  • 219. Kernelized SupportVector Machines The kernel trick § Adding nonlinear features to the representation of our data can make linear models much more powerful § Drawbacks § Which features to add? § Adding many features might make computation very expensive § Kernel Trick - § It is a clever mathematical trick - § Allows us to learn a classifier in a higher-dimensional space without actually computing the new representation § Works by directly computing the distance of the data points for the expanded feature representation, without ever actually computing the expansion
  • 220. Kernelized SupportVector Machines § Two ways to map your data into a higher-dimensional space in SVMs (types of kernel) § Polynomial Kernel § Radial Basis Function (RBF) (or) Gaussian Kernel
  • 221. Kernelized SupportVector Machines § Polynomial kernel § Computes all possible polynomials up to a certain degree of the original features (like feature1 ** 2 * feature2 ** 5) § Radial Basis Function (RBF) § Also known as Gaussian Kernel § A bit harder to explain - § as it corresponds to an infinite dimensional feature space § It considers all possible polynomials of all degrees § But the importance of the features decreases for higher degrees
  • 222. Kernelized SupportVector Machines Understanding SVMs § During training, the SVM learns how important each of the training data points is to represent the decision boundary between the two classes § Typically only a subset of the training points matter for defining the decision boundary § Ones that lie on the border between the classes § These are called support vectors
  • 223. Kernelized SupportVector Machines § To make a prediction for a new point § The distance to each of the support vectors is measured § A classification decision is made based on the distances to the support vector and importance of the support vectors which is learned during training § importance of support vectors is stored in an attribute called dual_coef_ attribute of svc
  • 224. Kernelized SupportVector Machines § The distance between data points is measured by the Gaussian (RBF) kernel § k_rbf(x1, x2) = exp(-ɣ ‖x1 - x2‖²) § Here, x1 and x2 are data points § ‖x1 - x2‖ denotes Euclidean distance § ɣ (gamma) is a parameter that controls the width of the Gaussian kernel
  • 226. Kernelized SupportVector Machines § The SVM yields a very smooth and nonlinear boundary § Output –
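A sketch of an RBF-kernel SVC and its support vectors; the two_moons dataset and the C=10, gamma=0.1 settings are illustrative choices.

# Sketch: RBF-kernel SVC and its support vectors on the two_moons data.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
svm = SVC(kernel="rbf", C=10, gamma=0.1).fit(X, y)

print("training accuracy: {:.3f}".format(svm.score(X, y)))
print("number of support vectors per class:", svm.n_support_)
print("dual_coef_ shape:", svm.dual_coef_.shape)   # learned importance of each support vector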
  • 227. Kernelized SupportVector Machines Tuning SVM Parameters § Gamma parameter § Kernel coefficient § only used in the case of rbf, poly and sigmoid kernels § Corresponds to the inverse of the width of the Gaussian kernel (RBF) § The gamma parameter determines how far the influence of a single training example reaches, with low values corresponding to a far reach and high values to a limited reach § The wider the radius of the Gaussian kernel, the further the influence of each training example § C parameter § Regularization parameter § It limits the importance of each point
  • 229. Kernelized SupportVector Machines Explanation - § Left to Right (Gamma Parameter) § Increase the value of the parameter gamma from 0.1 to 10 § A small gamma means a large radius for the Gaussian kernel - § which means that many points are considered close by § Smooth boundaries on the left § Boundaries that focus more on single points towards the right § Gamma Value - § Low value - the decision boundary will vary slowly § High value - yields a more complex model
  • 230. Kernelized SupportVector Machines Explanation - § Top to Bottom (C Parameter) § Increase the C parameter from 0.1 to 1000 § C values - § Low Value - § Restricted model § Decision boundary is nearly linear § Each data point will have limited influence § High Value - § Decision boundary bends to classify the data points (non-linear) § Each data point has a stronger influence on the model
  • 231. Kernelized SupportVector Machines § Example - (Breast Cancer Dataset) § Input -
  • 232. Kernelized SupportVector Machines § SVMs often perform quite well § Very sensitive § to the settings of the parameters § to the scaling of the data § Require all the features to vary on a similar scale
  • 233. Kernelized SupportVector Machines Example - § The features of the Breast Cancer dataset are of completely different orders of magnitude § Input – § Output -
  • 234. Kernelized SupportVector Machines Problem with SVM - § The features of the Breast Cancer dataset are of completely different orders of magnitude § This can have devastating effects for the kernel SVM § Solutions - § Preprocess the data for SVMs § Rescale each feature so that they are all approximately on the same scale § A common rescaling method for kernel SVMs is to scale the data such that all features are between 0 and 1
  • 235. Kernelized SupportVector Machines § MinMaxScaler preprocessing method § Input - (Training Dataset) § Output -
  • 236. Kernelized SupportVector Machines § Input - (Test Data Set) § Input - § Output - § Scaling the data made a huge difference § It led to underfitting - § where training and test set performance are quite similar
  • 237. Kernelized SupportVector Machines § We can try increasing either C or gamma to fit a more complex model § Input - § Output - § Increasing C allows us to improve the model significantly, resulting in 97.2% accuracy
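A sketch of the rescaling-plus-larger-C workflow described above, assuming MinMaxScaler and the scikit-learn breast cancer loader; the exact accuracies depend on the split and library version.

# Sketch: rescaling features to [0, 1] for the kernel SVM, then increasing C.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

scaler = MinMaxScaler().fit(X_train)       # fit the scaler on the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

for C in [1, 1000]:
    svc = SVC(C=C).fit(X_train_scaled, y_train)
    print("C={:<5} train: {:.3f}  test: {:.3f}".format(
        C, svc.score(X_train_scaled, y_train), svc.score(X_test_scaled, y_test)))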
  • 238. Strengths, Weaknesses, and Parameters Strengths - § Kernelized support vector machines are powerful models § Perform well on a variety of datasets § Allow for complex decision boundaries, even if the data has only a few features § Work well on low-dimensional and high- dimensional data (i.e., few and many features)
  • 239. Strengths, Weaknesses, and Parameters Weaknesses - § Don't scale very well with the number of samples § Running an SVM on data with up to 10,000 samples might work well, but working with datasets of size 100,000 or more can become challenging in terms of runtime and memory usage § Require careful preprocessing of the data and tuning of the parameters § SVM models are hard to inspect - § It can be difficult to understand why a particular prediction was made § It is tricky to explain the model to a nonexpert
  • 240. Strengths, Weaknesses, and Parameters § Note - § Try SVMs particularly if all of your features represent measurements in similar units and they are on similar scales
  • 241. Strengths, Weaknesses, and Parameters Parameters § Regularization parameter C § Choice of the kernel (polynomial kernel or RBF kernel) § Kernel-specific parameters (e.g., gamma for the RBF kernel) § gamma and C both control the complexity of the model, with large values in either resulting in a more complex model
  • 243. Uncertainty Estimates from Classifiers Uncertainty Estimates from Classifiers § In scikit-learn - classifiers provide uncertainty estimates of predictions § We are not only interested in which class a classifier predicts for a certain test point, but also how certain it is that this is the right class § Different kinds of mistakes lead to very different outcomes in real-world applications § Testing for cancer § False positive prediction might lead to a patient undergoing additional tests § False negative prediction might lead to a serious disease not being treated
  • 244. Uncertainty Estimates from Classifiers § Two different functions used to obtain uncertainty estimates from classifiers: § decision_function § predict_proba § Most classifiers have at least one of them § Many classifiers have both
  • 245. Uncertainty Estimates from Classifiers § Example - GradientBoostingClassifier, which has both a decision_function and a predict_proba method:
  • 246. Uncertainty Estimates from Classifiers The Decision Function in Gradient Boosting § In Binary classification § Return value of decision_function is of shape (n_samples,), and it returns one floating-point number for each sample: § Input - § Output -
  • 247. Uncertainty Estimates from Classifiers § This value encodes how strongly the model believes a data point belongs to the "positive" class, in this case class 1 § Input - § Output - § Positive values indicate a preference for the positive class (class 1) § Negative values indicate a preference for the negative class (class 0)
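A sketch of decision_function for a binary problem; the two_moons dataset is an illustrative stand-in for the binary dataset used on the slides.

# Sketch: decision_function of a GradientBoostingClassifier on a binary problem.
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
scores = gbrt.decision_function(X_test)    # shape (n_samples,): one float per sample
print("decision_function shape:", scores.shape)
print("first few scores:", scores[:4])
print("positive score means class", gbrt.classes_[1], "is predicted:", scores[:4] > 0)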
  • 249. Uncertainty Estimates from Classifiers § Input – (the range of decision_function can be arbitrary and depends on the data and the model parameters) § Output - § Note 1 - § This arbitrary scaling makes the output of decision_function often hard to interpret
  • 251. Uncertainty Estimates from Classifiers Predicting Probabilities § The output of predict_proba is a probability for each class § Often more easily understood than the output of decision_function § It is always of shape (n_samples, 2) for binary classification: § Input - § Output -
  • 252. Uncertainty Estimates from Classifiers § The first entry in each row is the estimated probability of the first class, and the second entry is the estimated probability of the second class § Input - § Output -
  • 253. Uncertainty Estimates from Classifiers § Because the probabilities for the two classes sum to 1, exactly one of the classes will be above 50% certainty § That class is the one that is predicted § In the example above, the classifier is relatively certain for most points § How well the reported uncertainty actually reflects uncertainty in the data depends on the model and the parameters § Note 1 - § A model that is more overfitted tends to make more certain predictions, even if they might be wrong § A model with less complexity usually has more uncertainty in its predictions
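A matching sketch for predict_proba on the same kind of binary problem; each row holds the probabilities of the two classes and sums to 1.

# Sketch: predict_proba of a GradientBoostingClassifier on a binary problem.
from sklearn.datasets import make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=100, noise=0.25, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbrt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
proba = gbrt.predict_proba(X_test)          # shape (n_samples, 2)
print("predict_proba shape:", proba.shape)
print("first few probabilities:\n", proba[:4])
print("rows sum to 1:", proba[:4].sum(axis=1))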
  • 254. Uncertainty Estimates from Classifiers Calibrated model § A model is called calibrated if the reported uncertainty actually matches how correct it is — in a calibrated model, a prediction made with 70% certainty would be correct 70% of the time
  • 256. Uncertainty Estimates from Classifiers Uncertainty in Multiclass Classification § decision_function and predict_proba methods also work in the multiclass setting § In multiclass case, the shape of the decision_function is (n_samples, n_classes) § each column provides a “certainty score” for each class, where § large score means that a class is more likely § small score means the class is less likely § Example – § Iris dataset § Input -
  • 259. Uncertainty Estimates from Classifiers § Example - (predict_proba) § Has shape as (n_samples, n_classes) § Maximum probability value is the prediction value § The probabilities of the possible classes for each datapoint sum to 1 § Input – § Output -
  • 260. Uncertainty Estimates from Classifiers § Example - (predict_proba) § Input – § Output -
  • 261. Uncertainty Estimates from Classifiers § predict_proba and decision_function always have shape (n_samples, n_classes) - apart from decision_function in the binary case § In the binary case, decision_function only has one column, corresponding to the "positive" class, classes_[1]
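A sketch of the multiclass case on the Iris dataset, showing the (n_samples, n_classes) shapes and how taking the argmax of predict_proba (mapped through classes_) recovers the predictions.

# Sketch: multiclass uncertainty estimates on the Iris dataset.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42)

gbrt = GradientBoostingClassifier(learning_rate=0.01, random_state=0).fit(X_train, y_train)
print("decision_function shape:", gbrt.decision_function(X_test).shape)   # (n_samples, 3)
print("predict_proba shape:", gbrt.predict_proba(X_test).shape)           # (n_samples, 3)

# recover predictions by taking the class with the largest probability
argmax = np.argmax(gbrt.predict_proba(X_test), axis=1)
print("argmax matches predict:", np.all(gbrt.classes_[argmax] == gbrt.predict(X_test)))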
  • 262. Summary and Outlook Nearest neighbors § For small datasets § Good as a baseline § Easy to explain
  • 263. Summary and Outlook Linear models § Go-to as a first algorithm to try § Good for very large datasets § Good for very high-dimensional data
  • 264. Summary and Outlook Naive Bayes § Only for classification § Even faster than linear models § Good for very large datasets and high-dimensional data § Often less accurate than linear models
  • 265. Summary and Outlook Decision trees § Very fast § Don’t need scaling of the data § Can be visualized § Easily explained
  • 266. Summary and Outlook Random forests § Nearly always perform better than a single decision tree, very robust and powerful § Don’t need scaling of data § Not good for very high dimensional sparse data
  • 267. Summary and Outlook Gradient boosted decision trees § Often slightly more accurate than random forests § Slower to train but faster to predict than random forests § Smaller in memory § Need more parameter tuning than random forests
  • 268. Summary and Outlook Support vector machines § Powerful for medium-sized datasets of features with similar meaning § Require scaling of data § Sensitive to parameters
  • 269. Summary and Outlook Neural networks § Can build very complex models, particularly for large datasets § Sensitive to scaling of the data and to the choice of parameters § Large models need a long time to train