Machine Learning for Data Mining
Important Issues
Andres Mendez-Vazquez
July 3, 2015
Outline
1 Bias-Variance Dilemma
Introduction
Measuring the difference between optimal and learned
The Bias-Variance
“Extreme” Example
2 Confusion Matrix
The Confusion Matrix
3 K-Cross Validation
Introduction
How to choose K
Introduction
What have we seen so far?
The design of learning machines from two main points of view:
Statistical point of view
Linear algebra and optimization point of view
Going back to the probability models
We may think of the machine to be learned as a function g (x|D)...
Something like curve fitting...
Under a data set
$$D = \{(x_i, y_i) \mid i = 1, 2, \ldots, N\} \quad (1)$$
Remark: where the $x_i \sim p(x|\Theta)$!!!
Thus, we have that
Two main functions
A function g (x|D) obtained using some algorithm!!!
E [y|x], the optimal regression...
Important
The key factor here is the dependence of the approximation on D.
Why?
The approximation may be very good for a specific training data set but very bad for another.
This is the reason for studying fusion of information at the decision level...
How do we measure the difference
We have that
$$\mathrm{Var}(X) = E\left[(X - \mu)^2\right]$$
We can do that for our data
$$\mathrm{Var}_D\left(g(x|D)\right) = E_D\left[\left(g(x|D) - E[y|x]\right)^2\right]$$
Now, if we add and subtract
$$E_D\left[g(x|D)\right] \quad (2)$$
Remark: the expected output of the machine g (x|D).
Thus, we have that
Our original variance
$$\begin{aligned}
\mathrm{Var}_D\left(g(x|D)\right) &= E_D\left[\left(g(x|D) - E[y|x]\right)^2\right]\\
&= E_D\left[\left(g(x|D) - E_D[g(x|D)] + E_D[g(x|D)] - E[y|x]\right)^2\right]\\
&= E_D\left[\left(g(x|D) - E_D[g(x|D)]\right)^2\right]\\
&\quad + 2\,E_D\left[\left(g(x|D) - E_D[g(x|D)]\right)\left(E_D[g(x|D)] - E[y|x]\right)\right]\\
&\quad + \left(E_D[g(x|D)] - E[y|x]\right)^2
\end{aligned}$$
Finally
$$E_D\left[\left(g(x|D) - E_D[g(x|D)]\right)\left(E_D[g(x|D)] - E[y|x]\right)\right] = \;? \quad (3)$$
It is zero: the second factor is a constant with respect to D, and $E_D\left[g(x|D) - E_D[g(x|D)]\right] = 0$, so the cross term vanishes.
We have the Bias-Variance
Our final equation
$$E_D\left[\left(g(x|D) - E[y|x]\right)^2\right] = \underbrace{E_D\left[\left(g(x|D) - E_D[g(x|D)]\right)^2\right]}_{\text{VARIANCE}} + \underbrace{\left(E_D[g(x|D)] - E[y|x]\right)^2}_{\text{BIAS}}$$
Where the variance
It represents the error between our machine g (x|D) and the expected output of the machine under $x_i \sim p(x|\Theta)$.
Where the bias
It represents the quadratic error between the expected output of the machine under $x_i \sim p(x|\Theta)$ and the optimal regression $E[y|x]$.
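To see the two terms concretely, here is a minimal Monte Carlo sketch, assuming an illustrative target f(x) = sin(2πx), Gaussian noise, and numpy's polynomial fitting (all names and constants are made up for the demonstration): many training sets D are drawn, a cubic g(x|D) is fitted to each, and the VARIANCE and BIAS terms are estimated by averaging over trials.

```python
import numpy as np

# Monte Carlo sketch of the decomposition above: draw many training
# sets D, fit g(x|D) to each, and estimate VARIANCE and BIAS.
rng = np.random.default_rng(0)

def f(x):                                  # assumed true regression E[y|x]
    return np.sin(2 * np.pi * x)

N, sigma, trials, degree = 25, 0.3, 500, 3
x_train = np.linspace(0, 1, N)             # fixed design points x_i
x_test = np.linspace(0, 1, 200)            # points where g is evaluated

preds = np.empty((trials, x_test.size))
for t in range(trials):
    y = f(x_train) + rng.normal(0, sigma, N)   # y_i = f(x_i) + eps
    coeffs = np.polyfit(x_train, y, degree)    # g(x|D): cubic fit to D
    preds[t] = np.polyval(coeffs, x_test)

g_bar = preds.mean(axis=0)                     # E_D[g(x|D)]
variance = ((preds - g_bar) ** 2).mean()       # E_D[(g - E_D[g])^2]
bias_sq = ((g_bar - f(x_test)) ** 2).mean()    # (E_D[g] - E[y|x])^2
print(f"variance={variance:.4f}  bias^2={bias_sq:.4f}")
```

Raising `degree` pushes the bias term down and the variance term up, which is exactly the trade-off discussed next.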
Remarks
We have then
Even if the estimator is unbiased, it can still produce a large mean square error due to a large variance term.
The situation is more dire with a finite data set D
We then have a trade-off:
1 Increasing the bias decreases the variance and vice versa.
2 This is known as the bias–variance dilemma.
Similar to...
Curve fitting
If, for example, the adopted model is complex (many parameters involved) with respect to the number N, the model will fit the idiosyncrasies of the specific data set.
Thus
It will result in low bias but will yield high variance as we change from one data set to another.
Furthermore
If N grows, we can fit a more complex model, which reduces bias while keeping the variance low.
However, N is always finite!!!
Thus
You always need to compromise
However, you usually have some a priori knowledge about the data,
allowing you to impose restrictions,
lowering both the bias and the variance.
Nevertheless
We have the following example to better grasp the bothersome bias–variance dilemma.
For this
Assume
The data is generated by the following function
$$y = f(x) + \epsilon, \qquad \epsilon \sim N\left(0, \sigma_\epsilon^2\right)$$
We know that
The optimum regressor is $E[y|x] = f(x)$.
Furthermore
Assume that the randomness in the different training sets, D, is due to the $y_i$'s (affected by noise), while the respective points $x_i$ are fixed.
Sampling the Space
Imagine that the x-values of D lie in an interval $[x_1, x_2]$
For example, you can choose $x_i = x_1 + \frac{x_2 - x_1}{N - 1}(i - 1)$ with $i = 1, 2, \ldots, N$.
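As a quick check, this grid is exactly what numpy's linspace produces; a tiny sketch with illustrative endpoints x1 = 0, x2 = 1 and N = 5:

```python
import numpy as np

# Verify that x_i = x1 + (x2 - x1)/(N - 1) * (i - 1) matches np.linspace.
x1, x2, N = 0.0, 1.0, 5
i = np.arange(1, N + 1)
grid = x1 + (x2 - x1) / (N - 1) * (i - 1)
assert np.allclose(grid, np.linspace(x1, x2, N))
print(grid)   # [0.   0.25 0.5  0.75 1.  ]
```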
Case 1
Choose the estimate of f (x), g (x|D), to be independent of D
For example, $g(x) = w_1 x + w_0$.
[Figure: the training points are spread around (x, f(x)), while g(x) is a fixed line.]
Case 1
Since g (x) is fixed
$$E_D[g(x|D)] = g(x|D) \equiv g(x) \quad (4)$$
With
$$\mathrm{Var}_D[g(x|D)] = 0 \quad (5)$$
On the other hand
Because g (x) was chosen arbitrarily, the expected bias must be large:
$$\underbrace{\left(E_D[g(x|D)] - E[y|x]\right)^2}_{\text{BIAS}} \quad (6)$$
Case 2
On the other hand
Now, $g_1(x)$ corresponds to a polynomial of high degree, so it can pass through each training point in D.
[Figure: $g_1(x)$ interpolates every data point.]
Case 2
Due to the zero mean of the noise source
$$E_D[g_1(x|D)] = f(x) = E[y|x] \text{ for any } x = x_i \quad (7)$$
Remark: at the training points the bias is zero.
However, the variance increases
$$E_D\left[\left(g_1(x|D) - E_D[g_1(x|D)]\right)^2\right] = E_D\left[\left(f(x) + \epsilon - f(x)\right)^2\right] = \sigma_\epsilon^2, \quad \text{for } x = x_i,\ i = 1, 2, \ldots, N$$
In other words
The bias becomes zero (or approximately zero) but the variance is now equal to the variance of the noise source.
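The two extremes can be checked numerically. A toy sketch, assuming f(x) = sin(2πx) and Gaussian noise: Case 1 uses a fixed line chosen without looking at D; Case 2 is modeled by noting that an interpolating polynomial reproduces y_i exactly at the training points x_i, so its predictions there are simply the noisy y_i.

```python
import numpy as np

# Case 1: fixed g(x) = w1*x + w0 -> zero variance, large bias.
# Case 2: interpolant with g1(x_i|D) = y_i -> ~zero bias, variance = sigma^2.
rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
N, sigma, trials = 10, 0.3, 2000
x = np.linspace(0, 1, N)                     # fixed design points x_i

fixed = 0.5 * x + 0.1                        # Case 1: chosen independently of D
interp = np.empty((trials, N))
for t in range(trials):
    y = f(x) + rng.normal(0, sigma, N)
    interp[t] = y                            # Case 2 at the x_i: g1(x_i|D) = y_i

for name, preds in [("case 1", np.tile(fixed, (trials, 1))),
                    ("case 2", interp)]:
    g_bar = preds.mean(axis=0)               # E_D[g(x|D)]
    print(name,
          "bias^2 = %.4f" % ((g_bar - f(x)) ** 2).mean(),
          "variance = %.4f" % ((preds - g_bar) ** 2).mean())
```

Case 2's printed variance approaches σ² = 0.09, matching the derivation above.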
Observations
First
Everything that has been said so far applies to both the regression and the classification tasks.
However
Mean squared error is not the best way to measure the power of a classifier.
Think about
A classifier that sends everything far away from the hyperplane, away from the target values ±1!!!
Introduction
Something Notable
In evaluating the performance of a classification system, the probability of error is sometimes not the only quantity that assesses its performance sufficiently.
For this, assume an M-class classification task
An important issue is to know whether there are classes that exhibit a higher tendency for confusion.
Where the confusion matrix
The confusion matrix $A = [A_{ij}]$ is defined such that each element $A_{ij}$ is the number of data points whose true class was i but which were classified in class j.
Thus
We have that
From A, one can directly extract the recall and precision values for each class, along with the overall accuracy.
Recall - $R_i$
It is the percentage of data points with true class label i which were correctly classified in that class.
For example, in a two-class problem
The recall of the first class is calculated as
$$R_1 = \frac{A_{11}}{A_{11} + A_{12}} \quad (8)$$
More
Precision - $P_i$
It is the percentage of data points classified as class i whose true class is indeed i.
Therefore, again for a two-class problem
$$P_1 = \frac{A_{11}}{A_{11} + A_{21}} \quad (9)$$
Overall Accuracy (Ac)
The overall accuracy, Ac, is the percentage of data that has been correctly classified.
Thus, for an M-Class Problem
We have that
$$Ac = \frac{1}{N} \sum_{i=1}^{M} A_{ii} \quad (10)$$
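A small sketch of equations (8)-(10), assuming integer class labels 0..M−1; the label arrays below are made up purely for illustration.

```python
import numpy as np

# A[i, j] counts points with true class i that were classified as class j.
def confusion_matrix(y_true, y_pred, M):
    A = np.zeros((M, M), dtype=int)
    for t, p in zip(y_true, y_pred):
        A[t, p] += 1
    return A

y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])   # toy labels
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2, 1])
A = confusion_matrix(y_true, y_pred, M=3)

recall = np.diag(A) / A.sum(axis=1)       # R_i: diagonal over row sums, eq. (8)
precision = np.diag(A) / A.sum(axis=0)    # P_i: diagonal over column sums, eq. (9)
accuracy = np.trace(A) / A.sum()          # Ac = (1/N) * sum_i A_ii, eq. (10)
print(A, recall, precision, accuracy, sep="\n")
```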
What we want
We want a quality measure
A quality measure to compare different classifiers (for different parameter values).
We call this the risk
$$R(f) = E_D\left[L\left(y, f(x)\right)\right] \quad (11)$$
Example: $L\left(y, f(x)\right) = \left\| y - f(x) \right\|_2^2$
More precisely
For different values $\gamma_j$ of the parameter, we train a classifier $f(x|\gamma_j)$ on the training set.
Then, calculate the empirical risk
Do you have any ideas?
Give me your best shot!!!
Empirical Risk
We use the validation set to estimate
$$\hat{R}\left(f(x|\gamma_j)\right) = \frac{1}{N_v} \sum_{i=1}^{N_v} L\left(y_i, f(x_i|\gamma_j)\right) \quad (12)$$
Thus, you follow this procedure (sketched in code below):
1 Select the value $\gamma^*$ which achieves the smallest estimated error.
2 Re-train the classifier with parameter $\gamma^*$ on all data except the test set (i.e., train + validation data).
3 Report the error estimate $\hat{R}\left(f(x|\gamma^*)\right)$ computed on the test set.
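A minimal sketch of this procedure, assuming a toy 1-D regression task where the parameter γ is a polynomial degree and the loss is squared error; the split sizes and names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 150)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

# Random split into train (90), validation (30), test (30) index sets.
train, val, test = np.split(rng.permutation(x.size), [90, 120])

def fit(idx, gamma):                         # train f(x|gamma) on subset idx
    return np.polyfit(x[idx], y[idx], gamma)

def risk(coeffs, idx):                       # empirical risk, equation (12)
    return np.mean((y[idx] - np.polyval(coeffs, x[idx])) ** 2)

gammas = [1, 2, 3, 5, 9]
val_risks = [risk(fit(train, g), val) for g in gammas]
g_star = gammas[int(np.argmin(val_risks))]            # step 1: best gamma

final = fit(np.concatenate([train, val]), g_star)     # step 2: re-train
print("gamma* =", g_star, "test risk =", risk(final, test))   # step 3
```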
Idea
Something Notable
Each of the error estimates computed on the validation set comes from a single instance of a trained classifier. Can we improve the estimate?
K-fold Cross Validation
To estimate the risk of a classifier f (sketched in code below):
1 Split the data into K equally sized parts (called "folds").
2 Train an instance $f_k$ of the classifier using all folds except fold k as training data.
3 Compute the cross-validation (CV) estimate:
$$\hat{R}_{CV}\left(f(x|\gamma_j)\right) = \frac{1}{N_v} \sum_{i=1}^{N_v} L\left(y_i, f_{k(i)}\left(x_i|\gamma_j\right)\right) \quad (13)$$
where k (i) is the fold containing $x_i$.
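A minimal sketch of the CV estimate (13) under the same toy setup as before (γ is a polynomial degree, squared-error loss); here `np.array_split` plays the role of the folds.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 120)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

def cv_risk(gamma, K=5):
    folds = np.array_split(rng.permutation(x.size), K)   # K index folds
    losses = []
    for k in range(K):
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train], y[train], gamma)   # f_k: trained without fold k
        # each x_i in fold k is scored by f_{k(i)}, as in equation (13)
        losses.append((y[folds[k]] - np.polyval(coeffs, x[folds[k]])) ** 2)
    return np.mean(np.concatenate(losses))

for gamma in [1, 3, 5, 9]:
    print("gamma =", gamma, "CV risk = %.4f" % cv_risk(gamma))
```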
Example
K = 5, k = 3
Fold:   1       2       3            4       5
        Train   Train   Validation   Train   Train
Actually, we have
The cross-validation procedure does not involve the test data:
[ Train Data + Validation Data ] [ Test ]
  (split this part into the K folds)
How to choose K
Extremal cases
K = N, called leave-one-out cross validation (LOOCV)
K = 2
An often-cited problem with LOOCV is that we have to train many (= N) classifiers, but there is also a deeper problem.
Argument 1: K should be small, e.g., K = 2
1 Unless we have a lot of data, the variance between two distinct training sets may be considerable.
2 Important concept: by removing substantial parts of the sample in turn and at random, we can simulate this variance.
3 By removing a single point (LOOCV), we cannot make this variance visible.
How to choose K
Argument 2: K should be large, e.g., K = N
1 Classifiers generally perform better when trained on larger data sets.
2 A small K means we substantially reduce the amount of training data used to train each $f_k$, so we may end up with weaker classifiers.
3 This way, we will systematically overestimate the risk.
Common recommendation: K = 5 to K = 10
Intuition:
1 K = 10 means the number of samples removed from training is one order of magnitude below the training sample size.
2 This should not weaken the classifier considerably, but should be large enough to make variance effects measurable.
How to choose K
Argument 2: K should be large, e.g. K = N
1 Classifiers generally perform better when trained on larger data sets.
2 A small K means we substantially reduce the amount of training data
used to train each fk, so we may end up with weaker classifiers.
3 This way, we will systematically overestimate the risk.
Common recommendation: K = 5 to K = 10
Intuition:
1 K = 10 means number of samples removed from training is one order
of magnitude below training sample size.
2 This should not weaken the classifier considerably, but should be large
enough to make measure variance effects.
34 / 34
Images/cinvestav-
How to choose K
Argument 2: K should be large, e.g. K = N
1 Classifiers generally perform better when trained on larger data sets.
2 A small K means we substantially reduce the amount of training data
used to train each fk, so we may end up with weaker classifiers.
3 This way, we will systematically overestimate the risk.
Common recommendation: K = 5 to K = 10
Intuition:
1 K = 10 means number of samples removed from training is one order
of magnitude below training sample size.
2 This should not weaken the classifier considerably, but should be large
enough to make measure variance effects.
34 / 34
Images/cinvestav-
How to choose K
Argument 2: K should be large, e.g. K = N
1 Classifiers generally perform better when trained on larger data sets.
2 A small K means we substantially reduce the amount of training data
used to train each fk, so we may end up with weaker classifiers.
3 This way, we will systematically overestimate the risk.
Common recommendation: K = 5 to K = 10
Intuition:
1 K = 10 means number of samples removed from training is one order
of magnitude below training sample size.
2 This should not weaken the classifier considerably, but should be large
enough to make measure variance effects.
34 / 34
Images/cinvestav-
How to choose K
Argument 2: K should be large, e.g. K = N
1 Classifiers generally perform better when trained on larger data sets.
2 A small K means we substantially reduce the amount of training data
used to train each fk, so we may end up with weaker classifiers.
3 This way, we will systematically overestimate the risk.
Common recommendation: K = 5 to K = 10
Intuition:
1 K = 10 means number of samples removed from training is one order
of magnitude below training sample size.
2 This should not weaken the classifier considerably, but should be large
enough to make measure variance effects.
34 / 34
Images/cinvestav-
How to choose K
Argument 2: K should be large, e.g. K = N
1 Classifiers generally perform better when trained on larger data sets.
2 A small K means we substantially reduce the amount of training data
used to train each fk, so we may end up with weaker classifiers.
3 This way, we will systematically overestimate the risk.
Common recommendation: K = 5 to K = 10
Intuition:
1 K = 10 means number of samples removed from training is one order
of magnitude below training sample size.
2 This should not weaken the classifier considerably, but should be large
enough to make measure variance effects.
34 / 34
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning Syllabus
Andres Mendez-Vazquez
 
Schedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningSchedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine Learning
Andres Mendez-Vazquez
 
01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors01.04 orthonormal basis_eigen_vectors
01.04 orthonormal basis_eigen_vectors
Andres Mendez-Vazquez
 
01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues01.03 squared matrices_and_other_issues
01.03 squared matrices_and_other_issues
Andres Mendez-Vazquez
 
05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation05 backpropagation automatic_differentiation
05 backpropagation automatic_differentiation
Andres Mendez-Vazquez
 
01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning01 Introduction to Neural Networks and Deep Learning
01 Introduction to Neural Networks and Deep Learning
Andres Mendez-Vazquez
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
Andres Mendez-Vazquez
 
Neural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning SyllabusNeural Networks and Deep Learning Syllabus
Neural Networks and Deep Learning Syllabus
Andres Mendez-Vazquez
 
Introduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabusIntroduction to artificial_intelligence_syllabus
Introduction to artificial_intelligence_syllabus
Andres Mendez-Vazquez
 
Ideas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data SciencesIdeas about a Bachelor in Machine Learning/Data Sciences
Ideas about a Bachelor in Machine Learning/Data Sciences
Andres Mendez-Vazquez
 
Introduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems SyllabusIntroduction Mathematics Intelligent Systems Syllabus
Introduction Mathematics Intelligent Systems Syllabus
Andres Mendez-Vazquez
 
Introduction Machine Learning Syllabus
Introduction Machine Learning SyllabusIntroduction Machine Learning Syllabus
Introduction Machine Learning Syllabus
Andres Mendez-Vazquez
 
Schedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine LearningSchedule for Data Lab Community Path in Machine Learning
Schedule for Data Lab Community Path in Machine Learning
Andres Mendez-Vazquez
 

Recently uploaded (20)

Environment .................................
Environment .................................Environment .................................
Environment .................................
shadyozq9
 
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdf
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdfSmart City is the Future EN - 2024 Thailand Modify V1.0.pdf
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdf
PawachMetharattanara
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdfML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
rameshwarchintamani
 
ATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ATAL 6 Days Online FDP Scheme Document 2025-26.pdfATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ssuserda39791
 
Modeling the Influence of Environmental Factors on Concrete Evaporation Rate
Modeling the Influence of Environmental Factors on Concrete Evaporation RateModeling the Influence of Environmental Factors on Concrete Evaporation Rate
Modeling the Influence of Environmental Factors on Concrete Evaporation Rate
Journal of Soft Computing in Civil Engineering
 
Deepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber ThreatsDeepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber Threats
RaviKumar256934
 
vtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdfvtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdf
RaghavaGD1
 
Little Known Ways To 3 Best sites to Buy Linkedin Accounts.pdf
Little Known Ways To 3 Best sites to Buy Linkedin Accounts.pdfLittle Known Ways To 3 Best sites to Buy Linkedin Accounts.pdf
Little Known Ways To 3 Best sites to Buy Linkedin Accounts.pdf
gori42199
 
acid base ppt and their specific application in food
acid base ppt and their specific application in foodacid base ppt and their specific application in food
acid base ppt and their specific application in food
Fatehatun Noor
 
Artificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
Machine foundation notes for civil engineering students
Machine foundation notes for civil engineering studentsMachine foundation notes for civil engineering students
Machine foundation notes for civil engineering students
DYPCET
 
2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt
rakshaiya16
 
David Boutry - Specializes In AWS, Microservices And Python
David Boutry - Specializes In AWS, Microservices And PythonDavid Boutry - Specializes In AWS, Microservices And Python
David Boutry - Specializes In AWS, Microservices And Python
David Boutry
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
ijdmsjournal
 
Physical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A ReviewPhysical and Physic-Chemical Based Optimization Methods: A Review
Physical and Physic-Chemical Based Optimization Methods: A Review
Journal of Soft Computing in Civil Engineering
 
hypermedia_system_revisit_roy_fielding .
hypermedia_system_revisit_roy_fielding .hypermedia_system_revisit_roy_fielding .
hypermedia_system_revisit_roy_fielding .
NABLAS株式会社
 
Design of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 
Environment .................................
Environment .................................Environment .................................
Environment .................................
shadyozq9
 
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdf
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdfSmart City is the Future EN - 2024 Thailand Modify V1.0.pdf
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdf
PawachMetharattanara
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdfML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
ML_Unit_V_RDC_ASSOCIATION AND DIMENSIONALITY REDUCTION.pdf
rameshwarchintamani
 
ATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ATAL 6 Days Online FDP Scheme Document 2025-26.pdfATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ATAL 6 Days Online FDP Scheme Document 2025-26.pdf
ssuserda39791
 
Deepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber ThreatsDeepfake Phishing: A New Frontier in Cyber Threats
Deepfake Phishing: A New Frontier in Cyber Threats
RaviKumar256934
 
vtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdfvtc2018fall_otfs_tutorial_presentation_1.pdf
vtc2018fall_otfs_tutorial_presentation_1.pdf
RaghavaGD1
 
Little Known Ways To 3 Best sites to Buy Linkedin Accounts.pdf
Little Known Ways To 3 Best sites to Buy Linkedin Accounts.pdfLittle Known Ways To 3 Best sites to Buy Linkedin Accounts.pdf
Little Known Ways To 3 Best sites to Buy Linkedin Accounts.pdf
gori42199
 
acid base ppt and their specific application in food
acid base ppt and their specific application in foodacid base ppt and their specific application in food
acid base ppt and their specific application in food
Fatehatun Noor
 
Artificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
Machine foundation notes for civil engineering students
Machine foundation notes for civil engineering studentsMachine foundation notes for civil engineering students
Machine foundation notes for civil engineering students
DYPCET
 
2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt2.3 Genetically Modified Organisms (1).ppt
2.3 Genetically Modified Organisms (1).ppt
rakshaiya16
 
David Boutry - Specializes In AWS, Microservices And Python
David Boutry - Specializes In AWS, Microservices And PythonDavid Boutry - Specializes In AWS, Microservices And Python
David Boutry - Specializes In AWS, Microservices And Python
David Boutry
 
Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025Transport modelling at SBB, presentation at EPFL in 2025
Transport modelling at SBB, presentation at EPFL in 2025
Antonin Danalet
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
OPTIMIZING DATA INTEROPERABILITY IN AGILE ORGANIZATIONS: INTEGRATING NONAKA’S...
ijdmsjournal
 
hypermedia_system_revisit_roy_fielding .
hypermedia_system_revisit_roy_fielding .hypermedia_system_revisit_roy_fielding .
hypermedia_system_revisit_roy_fielding .
NABLAS株式会社
 
Design of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 

11 Machine Learning: Important Issues in Machine Learning

• Thus, we have that
  Two main functions: the function g(x|D) obtained by some algorithm, and E[y|x], the optimal regression.
  Important: the key factor here is the dependence of the approximation on D.
  Why? The approximation may be very good for a specific training data set but very bad for another. This is the reason for studying fusion of information at the decision level.
• How do we measure the difference?
  We have that Var(X) = E[(X − µ)²]. We can do the same for our data:
    Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])²]
  Now, add and subtract E_D[g(x|D)], the expected output of the machine g(x|D).   (2)
• Thus, we have the original variance
    Var_D(g(x|D)) = E_D[(g(x|D) − E[y|x])²]
                  = E_D[(g(x|D) − E_D[g(x|D)] + E_D[g(x|D)] − E[y|x])²]
                  = E_D[(g(x|D) − E_D[g(x|D)])²]
                    + 2 E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])]
                    + (E_D[g(x|D)] − E[y|x])²
  Finally, what is E_D[(g(x|D) − E_D[g(x|D)]) (E_D[g(x|D)] − E[y|x])]?   (3)
  It is zero: the factor E_D[g(x|D)] − E[y|x] is a constant with respect to D, and E_D[g(x|D) − E_D[g(x|D)]] = 0, so the cross term drops out.
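The answer to (3) can also be checked numerically. Below is a minimal sketch (not from the slides) using a hypothetical learner, the sample mean of N noisy targets at a fixed x; the empirical mean over trials plays the role of E_D[g(x|D)], so the cross term vanishes up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)

f_x = 2.0          # true regression value E[y|x] at a fixed point x
sigma = 1.0        # noise standard deviation
N, trials = 20, 100_000

# Hypothetical learner: estimate E[y|x] by the sample mean of N noisy targets.
g = np.array([rng.normal(f_x, sigma, N).mean() for _ in range(trials)])

Eg = g.mean()                           # Monte Carlo stand-in for E_D[g(x|D)]
cross = ((g - Eg) * (Eg - f_x)).mean()  # empirical cross term of Eq. (3)
print(f"cross term ≈ {cross:.2e}")      # ≈ 0, so only variance + bias² survive
```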
  • 25. Images/cinvestav- Outline 1 Bias-Variance Dilemma Introduction Measuring the difference between optimal and learned The Bias-Variance “Extreme” Example 2 Confusion Matrix The Confusion Matrix 3 K-Cross Validation Introduction How to choose K 9 / 34
• We have the Bias-Variance decomposition. Our final equation:
    E_D[(g(x|D) − E[y|x])²] = E_D[(g(x|D) − E_D[g(x|D)])²]   (VARIANCE)
                              + (E_D[g(x|D)] − E[y|x])²       (BIAS²)
  The variance term measures the error between our machine g(x|D) and the expected output of the machine under x_i ∼ p(x|Θ).
  The bias term is the squared error between the expected output of the machine under x_i ∼ p(x|Θ) and the optimal regression E[y|x].
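As an illustration of the decomposition, the following sketch estimates variance, squared bias, and their sum at a query point over many simulated data sets. The choices here are assumptions for the demo, not from the slides: a synthetic f(x) = sin(2πx), Gaussian noise, and a degree-3 polynomial fit.

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)    # assumed true regression function f(x)
sigma, N, trials, degree = 0.3, 15, 2000, 3

x = np.linspace(0, 1, N)               # fixed design points, as on the slides
x0 = 0.37                              # query point where we decompose the error

# Train one polynomial fit per simulated data set D.
preds = np.empty(trials)
for t in range(trials):
    y = f(x) + rng.normal(0, sigma, N)
    coeffs = np.polyfit(x, y, degree)
    preds[t] = np.polyval(coeffs, x0)

variance = preds.var()                 # E_D[(g - E_D[g])²]
bias_sq = (preds.mean() - f(x0))**2    # (E_D[g] - E[y|x])²
mse = ((preds - f(x0))**2).mean()      # E_D[(g - E[y|x])²]
print(f"variance={variance:.4f}  bias²={bias_sq:.4f}  "
      f"sum={variance + bias_sq:.4f}  mse={mse:.4f}")  # sum ≈ mse
```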
• Remarks
  Even if the estimator is unbiased, it can still produce a large mean squared error because of a large variance term.
  The situation is more dire for a finite data set D. We then face a trade-off:
    1. Increasing the bias decreases the variance, and vice versa.
    2. This is known as the bias–variance dilemma.
• This is similar to curve fitting
  If, for example, the adopted model is complex (many parameters involved) relative to the number N of training points, the model will fit the idiosyncrasies of the specific data set.
  Thus it will have low bias but high variance as we move from one data set to another.
  If N grows, we can fit a more complex model, which reduces the bias while keeping the variance low. However, N is always finite.
• Thus, you always need to compromise
  However, you always have some a priori knowledge about the data, which allows you to impose restrictions that lower both the bias and the variance.
  Nevertheless, the following example helps to grasp the bothersome bias–variance dilemma.
• For this, assume
  The data are generated by y = f(x) + ε, with ε ∼ N(0, σ²).
  We know that the optimum regressor is E[y|x] = f(x).
  Furthermore, assume that the randomness in the different training sets D is due to the y_i (affected by noise), while the respective points x_i are fixed.
• Sampling the space
  Imagine that D lies in an interval [x_1, x_2]. For example, you can choose the grid x_i = x_1 + ((x_2 − x_1)/(N − 1)) (i − 1), for i = 1, 2, ..., N.
• Case 1
  Choose the estimate g(x|D) of f(x) to be independent of D, for example the line g(x) = w_1 x + w_0.
  [Figure: the fixed line g(x), with the data points spread around the curve (x, f(x)).]
• Case 1 (continued)
  Since g(x) is fixed,
    E_D[g(x|D)] = g(x|D) ≡ g(x)   (4)
  with
    Var_D[g(x|D)] = 0             (5)
  On the other hand, because g(x) was chosen arbitrarily, the bias term
    (E_D[g(x|D)] − E[y|x])²       (6)
  must be large.
• Case 2
  On the other hand, let g_1(x) be a polynomial of high enough degree that it can pass through every training point in D.
  [Figure: the interpolating polynomial g_1(x) passing through the data points.]
• Case 2 (continued)
  Due to the zero mean of the noise source,
    E_D[g_1(x|D)] = f(x) = E[y|x] for any x = x_i   (7)
  Remark: at the training points the bias is zero. However, the variance increases:
    E_D[(g_1(x|D) − E_D[g_1(x|D)])²] = E_D[(f(x) + ε − f(x))²] = σ², for x = x_i, i = 1, 2, ..., N
  In other words, the bias becomes zero (or approximately zero), but the variance is now equal to the variance of the noise source.
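The two extreme cases can be reproduced numerically. Here is a sketch under the same assumed setup as before (synthetic f and Gaussian noise); Case 1 uses a constant g, a special case of the fixed line with w_1 = 0, and Case 2 evaluates the interpolant at the training points, where g_1(x_i) = y_i by construction:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(2 * np.pi * x)
sigma, N, trials = 0.3, 8, 5000
x = np.linspace(0, 1, N)               # fixed x_i; randomness enters via the y_i

fixed, interp = np.empty((trials, N)), np.empty((trials, N))
for t in range(trials):
    y = f(x) + rng.normal(0, sigma, N)
    fixed[t] = 0.5 * np.ones(N)        # Case 1: g(x) chosen independently of D
    interp[t] = y                      # Case 2: the interpolant passes through
                                       # every training point, so g1(x_i) = y_i

for name, g in [("Case 1 (fixed g)", fixed), ("Case 2 (interpolant)", interp)]:
    var = g.var(axis=0).mean()                    # averaged over the x_i
    bias2 = ((g.mean(axis=0) - f(x))**2).mean()
    print(f"{name}: variance ≈ {var:.3f}  bias² ≈ {bias2:.3f}")
# Case 1: variance 0, large bias²; Case 2: bias² ≈ 0, variance ≈ σ² = 0.09.
```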
• Observations
  First, everything said so far applies to both the regression and the classification tasks.
  However, the mean squared error is not the best way to measure the power of a classifier.
  Think of a classifier that sends everything far away from the hyperplane, away from the target values ±1: its squared error is huge even when every point is correctly classified.
• Introduction
  Something notable: in evaluating the performance of a classification system, the probability of error alone sometimes does not assess its performance sufficiently.
  For this, assume an M-class classification task. An important issue is to know whether there are classes that exhibit a higher tendency for confusion.
  The confusion matrix A = [A_ij] is defined such that each element A_ij is the number of data points whose true class was i but which were classified in class j.
• Thus
  From A, one can directly extract the recall and precision values for each class, along with the overall accuracy.
  Recall R_i is the percentage of data points with true class label i that were correctly classified in that class.
  For example, in a two-class problem the recall of the first class is
    R_1 = A_11 / (A_11 + A_12)   (8)
• More
  Precision P_i is the percentage of data points classified as class i whose true class is indeed i. Again, for a two-class problem,
    P_1 = A_11 / (A_11 + A_21)   (9)
  Overall accuracy Ac is the percentage of data that has been correctly classified.
• Thus, for an M-class problem we have
    Ac = (1/N) Σ_{i=1}^{M} A_ii   (10)
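A small sketch tying the three quantities together; the confusion_matrix helper and the toy label vectors below are illustrative, not part of the slides:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, M):
    """A[i, j] = number of points whose true class is i, classified as j."""
    A = np.zeros((M, M), dtype=int)
    for t, p in zip(y_true, y_pred):
        A[t, p] += 1
    return A

# Made-up labels for a 3-class problem.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 1])
y_pred = np.array([0, 1, 0, 1, 1, 2, 0, 2, 2, 1])

A = confusion_matrix(y_true, y_pred, M=3)
recall = np.diag(A) / A.sum(axis=1)     # R_i: diagonal over row sums, Eq. (8)
precision = np.diag(A) / A.sum(axis=0)  # P_i: diagonal over column sums, Eq. (9)
accuracy = np.trace(A) / A.sum()        # Ac = (1/N) Σ A_ii, Eq. (10)
print(A, recall, precision, accuracy, sep="\n")
```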
• What we want
  A quality measure to compare different classifiers (for different parameter values). We call it the risk,
    R(f) = E_D[L(y, f(x))]   (11)
  Example: L(y, f(x)) = ‖y − f(x)‖₂².
  More precisely, for different values γ_j of the parameter, we train a classifier f(x|γ_j) on the training set.
• Then, calculate the empirical risk
  Do you have any ideas? Give me your best shot!
  Empirical risk: we use the validation set to estimate
    R̂(f(·|γ_j)) = (1/N_v) Σ_{i=1}^{N_v} L(y_i, f(x_i|γ_j))   (12)
  Then follow this procedure:
    1. Select the value γ* that achieves the smallest estimated error.
    2. Re-train the classifier with parameter γ* on all data except the test set (i.e., train + validation data).
    3. Report the error estimate R̂(f(·|γ*)) computed on the test set.
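A sketch of this three-step procedure, assuming a hypothetical ridge-regression learner with γ as the regularization strength, squared loss as in Eq. (12), and synthetic data; none of these choices come from the slides:

```python
import numpy as np

rng = np.random.default_rng(3)

def fit_ridge(X, y, gamma):
    """Hypothetical stand-in for 'train f(x|gamma)': ridge regression."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(d), X.T @ y)

def risk(w, X, y):
    """Empirical risk with squared loss, as in Eq. (12)."""
    return np.mean((y - X @ w) ** 2)

# Synthetic data split into train / validation / test.
X = rng.normal(size=(300, 5)); w_true = rng.normal(size=5)
y = X @ w_true + rng.normal(0, 0.5, 300)
Xtr, ytr = X[:150], y[:150]
Xva, yva = X[150:225], y[150:225]
Xte, yte = X[225:], y[225:]

gammas = [0.01, 0.1, 1.0, 10.0]
# 1. Pick gamma* with the smallest estimated risk on the validation set.
g_star = min(gammas, key=lambda g: risk(fit_ridge(Xtr, ytr, g), Xva, yva))
# 2. Re-train on train + validation data with gamma*.
w = fit_ridge(np.vstack([Xtr, Xva]), np.concatenate([ytr, yva]), g_star)
# 3. Report the risk estimate on the held-out test set.
print(f"gamma* = {g_star}, test risk = {risk(w, Xte, yte):.4f}")
```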
• Idea
  Something notable: each of the error estimates computed on the validation set comes from a single instance of a trained classifier. Can we improve the estimate?
  K-fold cross validation. To estimate the risk of a classifier f:
    1. Split the data into K equally sized parts (called "folds").
    2. Train an instance f_k of the classifier, using all folds except fold k as training data.
    3. Compute the cross-validation (CV) estimate
         R̂_CV(f(·|γ_j)) = (1/N) Σ_{i=1}^{N} L(y_i, f_{k(i)}(x_i|γ_j))   (13)
       where k(i) is the fold containing x_i, so each point is scored exactly once, by the model that did not train on it.
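A sketch of the CV estimator of Eq. (13); the kfold_cv_risk helper and the ridge learner are illustrative stand-ins, reusing the same synthetic setup as above:

```python
import numpy as np

def kfold_cv_risk(X, y, K, train_fn, loss_fn, rng):
    """K-fold CV estimate of the risk, Eq. (13): each point is scored by
    the model f_k trained on all folds except the one containing it."""
    N = len(y)
    folds = rng.permutation(N) % K          # fold index k(i) for each point
    losses = np.empty(N)
    for k in range(K):
        tr, va = folds != k, folds == k
        model = train_fn(X[tr], y[tr])
        losses[va] = loss_fn(y[va], model(X[va]))
    return losses.mean()

def train_ridge(Xt, yt, gamma=0.1):
    """Hypothetical learner: returns a prediction function."""
    w = np.linalg.solve(Xt.T @ Xt + gamma * np.eye(Xt.shape[1]), Xt.T @ yt)
    return lambda Xq: Xq @ w

sq_loss = lambda y, yhat: (y - yhat) ** 2

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + rng.normal(0, 0.5, 200)
print(f"CV risk ≈ {kfold_cv_risk(X, y, 5, train_ridge, sq_loss, rng):.4f}")
```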
• Example: K = 5, k = 3
  [Figure: five folds laid out as Train | Train | Validation | Train | Train, numbered 1 to 5; fold 3 serves as the validation fold.]
  Note that the cross-validation procedure does not involve the test data: first split off the test set, then run cross validation on the remaining train + validation data.
• How to choose K
  Extremal cases: K = N, called leave-one-out cross validation (LOOCV), and K = 2.
  An often-cited problem with LOOCV is that we have to train many (= N) classifiers, but there is also a deeper problem.
  Argument 1: K should be small, e.g. K = 2.
    1. Unless we have a lot of data, the variance between two distinct training sets may be considerable.
    2. Important concept: by removing substantial parts of the sample in turn and at random, we can simulate this variance.
    3. By removing a single point (LOOCV), we cannot make this variance visible.
• How to choose K (continued)
  Argument 2: K should be large, e.g. K = N.
    1. Classifiers generally perform better when trained on larger data sets.
    2. A small K means we substantially reduce the amount of training data used to train each f_k, so we may end up with weaker classifiers.
    3. This way, we systematically overestimate the risk.
  Common recommendation: K = 5 to K = 10. Intuition:
    1. K = 10 means the number of samples removed from training is one order of magnitude below the training sample size.
    2. This should not weaken the classifier considerably, but should be large enough to make variance effects measurable.
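Continuing the sketch above, one can watch how the choice of K shifts the estimate; this is purely illustrative, and the exact numbers depend on the synthetic data:

```python
# Reuses X, y, kfold_cv_risk, train_ridge, sq_loss from the previous sketch.
for K in (2, 5, 10, len(y)):            # K = len(y) is leave-one-out CV
    est = kfold_cv_risk(X, y, K, train_ridge, sq_loss, np.random.default_rng(5))
    print(f"K = {K:>3}: CV risk ≈ {est:.4f}")
# Small K trains each f_k on less data, so the risk estimate tends to be
# pessimistic; K = N trains N models and cannot expose the variance between
# distinct training sets.
```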