Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
Some Fundamental Problems Which algorithm is “best”? Are there any reasons to favor one algorithm over another? Is “Occam’s razor” really so evident? Do simpler or “smoother” classifiers generalize better? If so, why? Are there fundamental “conservation” or “constraint” laws other than Bayes error rate?
Meaning of "Algorithm-Independent" Mathematical foundations that do not depend upon the particular classifier or learning algorithm used, e.g., the bias and variance concept Techniques that can be used in conjunction with different learning algorithms, or provide guidance in their use, e.g., cross-validation and resampling techniques
Roadmap No pattern classification method is inherently superior to any other Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses Estimation of accuracies and comparison of different classifiers with certain assumptions Methods for integrating component classifiers
Generalization Performance by Off-Training Set Error Consider a two-category problem Training set D Training patterns x_i with labels y_i = 1 or -1 for i = 1, . . ., n, generated by an unknown target function F(x) to be learned F(x) often has a random component The same input could lead to different categories Giving a non-zero Bayes error
Generalization Performance by Off-Training Set Error Let H be the (discrete) set of hypotheses, or sets of parameters to be learned A particular h belongs to H Quantized weights in a neural network Parameters θ in a functional model Sets of decisions in a tree P(h) : prior probability that the algorithm will produce hypothesis h after training
Generalization Performance by  Off-Training Set Error P ( h | D )  : probability the algorithm will yield h when trained on data  D Nearest-neighbor and decision tree: non-zero only for a single hypothesis Neural network: can be a broad distribution E : error for zero-one or other loss function
Generalization Performance by  Off-Training Set Error A natural measure Expected off-training-set classification error for the  k th candidate learning algorithm
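One standard way to write this measure (notation mine, not the slide's: δ is the Kronecker delta and P_k(h(x)|D) the probability that algorithm k yields hypothesis h when trained on D):

$$
\mathcal{E}_k(E \mid F, n) \;=\; \sum_{\mathbf{x} \notin D} P(\mathbf{x})\,\bigl[\,1-\delta\bigl(F(\mathbf{x}),\,h(\mathbf{x})\bigr)\bigr]\; P_k\bigl(h(\mathbf{x}) \mid D\bigr)
$$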
No Free Lunch Theorem
No Free Lunch Theorem For any two algorithms No matter how clever in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one There is at least one target function for which random guessing is a better algorithm
No Free Lunch Theorem Even if we know  D , averaged over all target functions no algorithm yields an off-training set error that is superior to any other
Example 1 No Free Lunch for Binary Data

    x     F    h1    h2
   000    1     1     1   } training
   001   -1    -1    -1   }  set D
   010    1     1     1   }
   011   -1     1    -1
   100    1     1    -1
   101   -1     1    -1
   110    1     1    -1
   111    1     1    -1
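Not part of the original slides: a minimal Python sketch that encodes the table above (reading the training set D as the first three rows, where both hypotheses agree with F) and checks the no-free-lunch claim that h1 and h2 are equally good off the training set once we average over every target function consistent with D.

```python
from itertools import product

# Rows correspond to x = 000, 001, 010, 011, 100, 101, 110, 111
F  = [ 1, -1,  1, -1,  1, -1,  1,  1]
h1 = [ 1, -1,  1,  1,  1,  1,  1,  1]   # agrees with D, predicts +1 elsewhere
h2 = [ 1, -1,  1, -1, -1, -1, -1, -1]   # agrees with D, predicts -1 elsewhere
ots = range(3, 8)                       # off-training-set patterns (D = first 3 rows)

def ots_error(h, target):
    """Fraction of off-training-set patterns that h misclassifies."""
    return sum(h[i] != target[i] for i in ots) / len(ots)

print(ots_error(h1, F), ots_error(h2, F))    # for this particular F: 0.4 vs 0.6

# Average over all 2^5 target functions consistent with D: both come out at 0.5
e1, e2 = [], []
for tail in product([-1, 1], repeat=len(ots)):
    target = F[:3] + list(tail)
    e1.append(ots_error(h1, target))
    e2.append(ots_error(h2, target))
print(sum(e1) / len(e1), sum(e2) / len(e2))  # 0.5 0.5
```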
No Free Lunch Theorem
Conservation in Generalization Can not achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems Can trade performance on problems we do not expect to encounter with those that we do expect to encounter It is the assumptions about the learning domains that are relevant
Ugly Duckling Theorem In the absence of assumptions there is no privileged or “best” feature representation Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
Venn Diagram Representation of Features as Predicates
Rank of a Predicate Number of the simplest or indivisible elements it contains Example: rank r = 1 x1: f1 AND NOT f2 x2: f1 AND f2 x3: f2 AND NOT f1 x4: NOT (f1 OR f2) C(4,1) = 4 predicates
Examples of Rank of a Predicate Rank r = 2 x1 OR x2: f1 x1 OR x3: f1 XOR f2 x1 OR x4: NOT f2 x2 OR x3: f2 x2 OR x4: (f1 AND f2) OR NOT (f1 OR f2) x3 OR x4: NOT f1 C(4, 2) = 6 predicates
Examples of Rank of a Predicate Rank r = 3 x1 OR x2 OR x3: f1 OR f2 x1 OR x2 OR x4: f1 OR NOT f2 x1 OR x3 OR x4: NOT (f1 AND f2) x2 OR x3 OR x4: f2 OR NOT f1 C(4, 3) = 4 predicates
Total Number of Predicates in Absence of Constraints Let  d  be the number of regions in the Venn diagrams (i.e., number of distinctive patterns, or number of possible values determined by combinations of the features)
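Summing over all ranks (presumably including the vacuous rank-0 predicate) gives the total number of predicates:

$$ \sum_{r=0}^{d} \binom{d}{r} \;=\; 2^{d} $$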
A Measure of Similarity in Absence of Prior Information Number of features or attributes shared by two patterns Conceptual difficulties e.g., with features blind_in_right_eye and blind_in_left_eye, (1,0) is more similar to (1,1) and (0,0) than to (0,1) There are always multiple ways to represent vectors of attributes e.g., blind_in_right_eye and same_in_both_eyes No principled reason to prefer one of these representations over another
A Plausible Measure of Similarity in Absence of Prior Information Number of predicates the patterns share Consider two distinct patterns No predicate of rank 1 is shared 1 predicate of rank 2 is shared C(d-2, 1) predicates of rank 3 are shared C(d-2, r-2) predicates of rank r are shared Total number of predicates shared
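The total shared, reconstructed from the counts above, is independent of which two distinct patterns were chosen:

$$ \sum_{r=2}^{d} \binom{d-2}{\,r-2\,} \;=\; \sum_{s=0}^{d-2} \binom{d-2}{s} \;=\; 2^{\,d-2} $$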
Ugly Duckling Theorem Given a finite set of predicates that enables us to distinguish any two patterns  The number of predicates shared by any two such patterns is constant and independent of the choice of those patterns If pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
Ugly Duckling Theorem No problem-independent or privileged or "best" set of features or feature attributes Also applies to continuous feature spaces
Minimum Description Length (MDL) Find some irreducible, smallest representation (“signal”) of all members of a category All variation among the individual patterns is then “noise” By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
Algorithm Complexity (Kolmogorov Complexity) Kolmogorov complexity of a binary string x Defined on an abstract computer (Turing machine) U As the length of the shortest program (binary string) y that, without additional data, computes the string x and halts
Algorithm Complexity Example Suppose  x  consists solely of  n   1 s Use some fixed number of bits  k  to specify a loop for printing a string of  1 s Need  log 2 n  more bits to specify the iteration number  n , the condition for halting Thus  K ( x ) =  O (log 2 n )
Algorithm Complexity Example Constant π = 11.001001000011111110110101010001… (binary) The shortest program is the one that can produce any arbitrarily large number of consecutive digits of π Thus K(x) = O(1)
Algorithm Complexity Example Assume that x is a "truly" random binary string It cannot be expressed as a shorter string Thus K(x) = O(|x|)
Minimum Description Length (MDL) Principle Minimize the sum of the model’s algorithmic complexity and the description of the training data with respect to that model
An Application of MDL Principle For decision-tree classifiers, a model  h  specifies the tree and the decisions at the nodes Algorithmic complexity of  h  is proportional to the number of nodes Complexity of data is expressed in terms of the entropy (in bits) of the data Tree-pruning based on entropy is equivalent to the MDL principle
Convergence of MDL Classifiers MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data Cannot prove that the MDL principle leads to superior performance in the finite-data case Still consistent with the no-free-lunch principle
Bayesian Perspective of  MDL Principle
Overfitting Avoidance Avoiding overfitting or minimizing description length is not inherently beneficial Amounts to a preference over the forms or parameters of classifiers Beneficial only if it matches the problem There are problems for which overfitting avoidance leads to worse performance
Explanation of Success of  Occam’s Razor Through evolution and strong selection pressure on our neurons Likely to ignore problems for which Occam’s razor does not hold Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology Principle of satisficing: creating an adequate though possibly nonoptimal solution
Bias and Variance Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem Bias Accuracy or quality of the match High bias implies a poor match Variance Precision or specificity of the match High variance implies a weak match
Bias and Variance for Regression
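The regression decomposition this slide refers to is the standard one for squared error; with g(x; D) the estimate learned from data set D and F(x) the target,

$$
\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D)-F(\mathbf{x})\bigr)^2\right]
\;=\; \underbrace{\bigl(\mathcal{E}_D[g(\mathbf{x};D)]-F(\mathbf{x})\bigr)^2}_{\text{bias}^2}
\;+\; \underbrace{\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D)-\mathcal{E}_D[g(\mathbf{x};D)]\bigr)^2\right]}_{\text{variance}}
$$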
Bias-Variance Dilemma
Bias-Variance Dilemma Given a target function Model has many parameters Generally low bias Fits data well Yields high variance Model has few parameters Generally high bias May not fit data well The fit does not change much for different data sets (low variance)
Bias-Variance Dilemma Best way to get low bias and low variance Have prior information about the target function Virtually never get zero bias and zero variance That would mean there is only one learning problem to solve and the answer is already known A large amount of training data will yield improved performance provided the model is sufficiently general
Bias and Variance for Classification Reference J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,”  Data Mining and Knowledge Discovery , vol. 1, no. 1, pp. 55-77, 1997.
Bias and Variance for Classification Two-category problem Target function F(x) = Pr[y = 1 | x] = 1 - Pr[y = 0 | x] Discriminant function y_d = F(x) + ε ε : zero-mean, centered binomially distributed random variable F(x) = E[y_d | x]
Bias and Variance for Classification Find estimate  g ( x ; D )  to minimize E D [( g ( x ; D )- y d ) 2 ] The estimated  g ( x ; D )  can be used to classify  x  The bias and variance concept for regression can be applied to  g ( x ; D )  as an estimate of  F ( x ) However, this is not related to classification error directly
Bias and Variance for Classification
Bias and Variance for Classification
Bias and Variance for Classification
Bias and Variance for Classification
Bias and Variance for Classification Sign of the boundary bias affects the role of variance in the error Low variance is generally important for accurate classification if the sign is positive Low boundary bias need not result in a lower error rate Simple methods often have lower variance and need not be inferior to more flexible methods
Error rates and optimal  K  vs.  N  for  d  = 20 in KNN
Estimation and classification error vs. d for N = 12800 in KNN
Boundary Bias-Variance Trade-Off
Leave-One-Out Method (Jackknife)
Generalization to Estimates of Other Statistics Estimators of other statistics: median, 25th percentile, mode, etc. Leave-one-out estimate Jackknife estimate and its related variance
Jackknife Bias Estimate
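Not from the original slides: a minimal sketch of the leave-one-out (jackknife) bias and variance estimates for an arbitrary statistic; function and argument names are my own.

```python
import numpy as np

def jackknife(data, statistic):
    """Leave-one-out (jackknife) estimates for any statistic of a 1-D sample
    (mean, median, 25th percentile, mode, ...)."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)                               # estimate on the full sample
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    theta_dot = loo.mean()                                    # mean of the n leave-one-out estimates
    bias_jack = (n - 1) * (theta_dot - theta_hat)             # jackknife bias estimate
    var_jack = (n - 1) / n * np.sum((loo - theta_dot) ** 2)   # jackknife variance estimate
    return theta_hat - bias_jack, bias_jack, var_jack         # bias-corrected estimate, bias, variance

# Example: jackknife the median of a small sample
rng = np.random.default_rng(0)
print(jackknife(rng.normal(size=30), np.median))
```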
Example 2 Jackknife for Mode
Bootstrap Randomly selecting n points from the training data set D, with replacement Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets Bootstrap estimate of a statistic θ
Bootstrap Bias and Variance Estimate
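A minimal sketch of the bootstrap bias and variance estimates (not from the slides); the bootstrap estimate of the statistic is the mean over the B bootstrap replicates.

```python
import numpy as np

def bootstrap(data, statistic, B=1000, seed=None):
    """Bootstrap estimates of a statistic, its bias, and its variance.
    B can be adjusted to the available computational resources."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)
    boot = np.array([statistic(rng.choice(data, size=n, replace=True)) for _ in range(B)])
    theta_star = boot.mean()                        # bootstrap estimate of the statistic
    bias_boot = theta_star - theta_hat              # bootstrap bias estimate
    var_boot = np.mean((boot - theta_star) ** 2)    # bootstrap variance estimate
    return theta_star, bias_boot, var_boot
```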
Properties of Bootstrap Estimates The larger the number  B  of bootstrap samples, the more satisfactory is the estimate of a statistic and its variance B  can be adjusted to the computational resources Jackknife estimate requires exactly  n  repetitions
Bagging Arcing: adaptive reweighting and combining Reusing or selecting data in order to improve classification, e.g., AdaBoost Bagging: bootstrap aggregation Multiple versions of D, each obtained by drawing n' < n samples from D with replacement Each set trains a component classifier Final decision is based on a vote of the component classifiers
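A minimal bagging sketch (not from the slides), assuming scikit-learn-style component classifiers and integer class labels 0, ..., c-1; names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base=DecisionTreeClassifier(), n_classifiers=25, n_prime=None, seed=None):
    """Train component classifiers on bootstrap replicates of D (drawing n' samples with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_prime = n_prime or n
    ensemble = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n_prime)          # bootstrap replicate of the training set
        ensemble.append(clone(base).fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Final decision by majority vote of the component classifiers."""
    votes = np.stack([clf.predict(X) for clf in ensemble])   # shape (B, n_samples), integer labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```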
Unstable Algorithm and Bagging Unstable algorithm “small” changes in training data lead to significantly different classifiers and relatively “large” changes in accuracy In general bagging improves recognition for unstable classifiers Effectively averages over such discontinuities No convincing theoretical derivations or simulation studies showing bagging helps for all unstable classifiers
Boosting Create the first classifier With accuracy on the training set greater than average (weak learner) Add a new classifier Form an ensemble Joint decision rule has much higher accuracy on the training set Classification performance has been “boosted”
Boosting Procedure Trains successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
Training Data and Weak Learner
"Most Informative" Set Given C1 Flip a fair coin Heads: select remaining samples from D and present them to C1 one by one until C1 misclassifies a pattern; add that misclassified pattern to D2 Tails: find a pattern that C1 classifies correctly and add it to D2
Third Data Set and Classifier  C 3 Randomly select a remaining training pattern Add the pattern if  C 1  and  C 2  disagree, otherwise ignore it
Classification of a Test Pattern If  C 1  and  C 2  agree, use their label If they disagree, trust  C 3
Choosing  n 1 For final vote,  n 1  ~  n 2  ~  n 3  ~  n /3  is desired Reasonable guess:  n /3 Simple problem:  n 2  <<  n 1 Difficult problem:  n 2  too large In practice we need to run the whole boosting procedure a few times  To use the full training set To get roughly equal partitions of the training set
AdaBoost  Adaptive boosting Most popular version of basic boosting Continue adding weak learners until some desired low training error has been achieved
AdaBoost Algorithm
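A minimal discrete-AdaBoost sketch (my own code, not the slide's algorithm box), using scikit-learn decision stumps as weak learners and labels y in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k_max=50):
    """Discrete AdaBoost: keep adding weak learners, reweighting the training set each round."""
    n = len(X)
    w = np.full(n, 1.0 / n)                  # start with uniform weights on the training set
    learners, alphas = [], []
    for _ in range(k_max):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y))        # weighted training error of this weak learner
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)       # emphasize the patterns this learner got wrong
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_decision(learners, alphas, X):
    """Final discriminant g(x) = sum_k alpha_k h_k(x); classify by its sign."""
    g = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(g)
```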
Final Decision Discriminant function
Ensemble Training Error
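The bound referred to here is presumably the standard one: if weak learner k has weighted training error ε_k = 1/2 - γ_k, the ensemble training error E satisfies

$$ E \;\le\; \prod_{k} 2\sqrt{\varepsilon_k (1-\varepsilon_k)} \;=\; \prod_{k} \sqrt{1-4\gamma_k^{2}} \;\le\; \exp\!\Bigl(-2\sum_{k}\gamma_k^{2}\Bigr), $$

so the training error falls exponentially fast as long as each component does better than chance.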
AdaBoost vs. No Free Lunch Theorem Boosting only improves classification if the component classifiers perform better than chance This cannot be guaranteed a priori Exponential reduction in error on the training set does not ensure reduction of the off-training-set (generalization) error Proven effective in many real-world applications
Learning with Queries Set of unlabeled patterns There exists some (possibly costly) way of labeling any pattern Goal: determine which unlabeled patterns would be most informative if they were presented as a query to an oracle Also called active learning or interactive learning Can be refined further as cost-based learning
Application Example Design a classifier for handwritten numerals Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern Human as the oracle
Learning with Queries Begin with a preliminary, weak classifier developed with a small set of labeled samples Two related methods for selecting an informative pattern Confidence-based query selection Voting-based or committee-based query selection
Selecting Most Informative Patterns Confidence-based query selection Pattern that two largest discriminant functions have nearly the same value i.e., patterns lie near the current decision boundaries Voting-based query selection Pattern that yields the greatest disagreement among the  k  resulting category labels
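An illustrative sketch of confidence-based query selection (not from the slides); the array layout and names are assumptions.

```python
import numpy as np

def most_informative(discriminants):
    """Confidence-based query selection.

    `discriminants` is an (n_unlabeled, n_classes) array of discriminant values
    from the current classifier; the most informative pattern is the one whose
    two largest discriminant values are closest, i.e. the one nearest the
    current decision boundary."""
    top2 = np.sort(discriminants, axis=1)[:, -2:]   # two largest values per pattern
    margin = top2[:, 1] - top2[:, 0]
    return int(np.argmin(margin))                   # index of the pattern to send to the oracle
```

The voting-based variant would instead query the pattern on which the k trained classifiers disagree most.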
Active Learning Example
Arcing and Active Learning vs. IID Sampling If we take a model of the true distribution and train it with the highly skewed distribution produced by active learning, the final classifier accuracy might be low Resampling methods generally do not attempt to model or fit the full category distributions Not fitting parameters in a model But instead seeking decision boundaries directly
Arcing and Active Learning As the number of component classifiers is increased, resampling, boosting and related methods effectively broaden the class of implementable functions Allow us to "match" the final classifier to the problem by indirectly adjusting the bias and variance Can be used with arbitrary classification techniques
Estimating the Generalization Rate See if the classifier performs well enough to be useful Compare its performance with that of a competing design Requires making assumptions about the classifier or the problem or both All the methods given here are heuristic
Parametric Models Compute from the assumed parametric model Example: two-class multivariate normal case Bhattacharyya or Chernoff bounds using estimated mean and covariance matrix Problems Overly optimistic Always suspect the model Error rate may be difficult to compute
Simple Cross-Validation Randomly split the set of labeled training samples D into a training set and a  validation  set
m -Fold Cross-Validation Training set is randomly divided into m disjoint sets of equal size  n/m The classifier is trained m times Each time with a different set held out as a validation set Estimated performance is the mean of these  m  errors When  m = n , it is in effect the leave-one-out approach
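A minimal m-fold cross-validation sketch (not from the slides), assuming a scikit-learn-style model with fit/predict; setting m = n gives leave-one-out.

```python
import numpy as np
from sklearn.base import clone

def m_fold_cv(model, X, y, m=10, seed=None):
    """m-fold cross-validation estimate of the error rate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, m)                      # m disjoint validation sets of ~equal size
    errors = []
    for k in range(m):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(m) if j != k])
        clf = clone(model).fit(X[train], y[train])      # train m times, each with one fold held out
        errors.append(np.mean(clf.predict(X[val]) != y[val]))
    return np.mean(errors)                              # estimated performance: mean of the m errors
```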
Forms of Learning for Cross-Validation Neural networks of fixed topology: number of epochs or presentations of the training set, number of hidden units Width of the Gaussian window in Parzen windows Optimal k in the k-nearest-neighbor classifier
Portion γ of D as a Validation Set Should be small The validation set is used merely to know when to stop adjusting parameters The training set is used to set the large number of parameters or degrees of freedom Traditional default: set γ = 0.1 Proven effective in many applications
Anti-Cross-Validation Cross-validation need not work on every problem Anti-cross-validation: halt when the validation error reaches its first local maximum Must explore different values of γ Possibly abandon the use of cross-validation if performance cannot be improved
Estimation of Error Rate Let  p  be the true and unknown error rate of the classifier Assume  k  of the  n’  independent, randomly drawn test samples are misclassified, then  k  has the binomial distribution Maximum-likelihood estimate for  p
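Written out, k follows the binomial law and the maximum-likelihood estimate is the observed error fraction:

$$ P(k) \;=\; \binom{n'}{k}\, p^{k} (1-p)^{\,n'-k}, \qquad \hat{p} \;=\; \frac{k}{n'} $$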
95% Confidence Intervals for a Given Estimated  p
Jackknife Estimation of Classification Accuracy Use leave-one-out approach Obtain Jackknife estimate for the mean and variance of the leave-one-out accuracies Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
Jackknife Estimation of Classification Accuracy
Bootstrap Estimation of Classification Accuracy Train B classifiers, each with a different bootstrap data set Test on other bootstrap data sets Bootstrap estimate is the mean of these bootstrap accuracies
Maximum-Likelihood Comparison (ML-II) Also called maximum-likelihood selection Find the maximum-likelihood parameters for each of the candidate models Calculate the resulting likelihoods (evidences) Choose the model with the largest likelihood
Maximum-Likelihood Comparison
Scientific Process *D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, 4(3), 415-447, 1992
Bayesian Model Comparison
Concept of Occam Factor
Concept of Occam Factor An inherent bias toward simple models (those with a small prior parameter volume) Models that are overly complex (large prior parameter volume) are automatically self-penalizing
Evidence for Gaussian Parameters
Bayesian Model Selection vs.  No Free Lunch Theorem Bayesian model selection  Ignore the prior over the space of models Effectively assume that it is uniform Not take into account how models correspond to underlying target functions Usually corresponds to non-uniform prior over target functions No Free Lunch Theorem Allows that for some particular non-uniform prior there may be an algorithm that gives better than chance, or even optimal, results
Error Rate as a Function of Number n of Training Samples Classifiers trained with a small number of samples will not perform well on new data Typical steps Estimate unknown parameters from samples Use these estimates to determine the classifier Calculate the error rate for the resulting classifier
Analytical Analysis Case of two categories having equal prior probabilities Partition feature space into m disjoint cells, C1, . . ., Cm Conditional probabilities p(x|ω1) and p(x|ω2) do not vary appreciably within any cell Need only know which cell x falls in
Analytical Analysis
Analytical Analysis
Analytical Analysis
Results of Simulation Experiments
Discussions on Error Rate for Given n For every curve involving finite n there is an optimal number of cells At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q If the number of cells becomes too large, there will not be enough training patterns to fill them Eventually the number of patterns in most cells will be zero
Discussions on Error Rate for Given n For n = 500, the minimal error rate occurs somewhere around m = 20 Form the cells by dividing each feature axis into l intervals With d features, m = l^d If l = 2, using more than four or five binary features will lead to worse rather than better performance
Test Errors vs.  Number of Training Patterns
Test and Training Error
Power Law
Sum and Difference of  Test and Training Error
Fraction of Dichotomies of  n  Points in  d  Dimensions That are Linear
One-Dimensional Case f(n = 4, d = 1) = 0.5

  Labels  Linearly separable?    Labels  Linearly separable?
   0000          X                1000          X
   0001          X                1001
   0010                           1010
   0011          X                1011
   0100                           1100          X
   0101                           1101
   0110                           1110          X
   0111          X                1111          X

(X = linearly separable; 8 of the 16 labelings, hence f = 0.5)
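Not from the slides: a short sketch of the fraction of linearly separable dichotomies of n points in general position in d dimensions (Cover's counting argument), which reproduces the value in the table above and the capacity result on the next slide.

```python
from math import comb

def f(n, d):
    """Fraction of the 2^n dichotomies of n points in general position in d
    dimensions that are linearly separable."""
    if n <= d + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** n

print(f(4, 1))   # 0.5, matching the table above
print(f(6, 2))   # 0.5 at n = 2(d + 1): half the dichotomies are still linear
```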
Capacity of a Separating Plane Not until n is a sizable fraction of 2(d+1) does the problem begin to become difficult Capacity of a hyperplane: at n = 2(d+1), half of the possible dichotomies are still linear Cannot expect a linear classifier to "match" a problem, on average, if the dimension of the feature space is greater than n/2 - 1
Mixture-of-Expert Models Classifiers whose decision is based on the outputs of component classifiers Also called Ensemble classifiers Modular classifiers Pooled classifiers Useful if each component classifier is highly trained (“expert”) in a different region of the feature space
Mixture Model for  Producing Patterns
Mixture-of-Experts Architecture
Ensemble Classifiers
Maximum-Likelihood Estimation
Final Decision Rule Choose the category corresponding to the maximum discriminant value after the pooling system Winner-take-all method Use the decision of the single component classifier that is the “most confident”, i.e., largest  g rj Suboptimal but simple Works well if the component classifiers are experts in separate regions
Component Classifiers without Discriminant Functions Example A KNN classifier (rank order) A decision tree (label) A neural network (analog value) A rule-based system (label)
Heuristics to Convert Outputs to Discriminant Values
Illustration Examples (outputs of three component classifiers converted to discriminant values g_i for c = 6 categories)

    Analog            Rank Order              One-of-c
  value    g_i       rank    g_i             value   g_i
   0.1    0.111       4th   3/21 = 0.143       0     0.0
   0.2    0.129       2nd   5/21 = 0.238       0     0.0
   0.3    0.143       1st   6/21 = 0.286       0     0.0
   0.9    0.260       5th   2/21 = 0.095       0     0.0
   0.6    0.193       6th   1/21 = 0.048       1     1.0
   0.4    0.158       3rd   4/21 = 0.190       0     0.0
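Not from the slides: small sketches of common conversion heuristics. The rank-order column above corresponds to assigning c, c-1, ..., 1 points and normalizing by c(c+1)/2 = 21; the analog column is roughly consistent with a softmax normalization, though the exact rule used in the table is not recoverable.

```python
import numpy as np

def analog_to_g(values):
    """Softmax conversion of analog outputs to discriminant values (one common heuristic)."""
    e = np.exp(np.asarray(values) - np.max(values))
    return e / e.sum()

def rank_to_g(ranks, c=6):
    """Rank-order conversion: 1st place gets c points, ..., last gets 1, then normalize.
    For c = 6 the denominator is 1 + 2 + ... + 6 = 21, as in the table above."""
    points = c + 1 - np.asarray(ranks)
    return points / (c * (c + 1) / 2)

def one_of_c_to_g(c, chosen):
    """One-of-c conversion: all discriminant value on the single chosen category."""
    g = np.zeros(c)
    g[chosen] = 1.0
    return g

print(rank_to_g([4, 2, 1, 5, 6, 3]))   # 3/21, 5/21, 6/21, 2/21, 1/21, 4/21 as in the table
```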
Ad

More Related Content

What's hot (20)

Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Trees
ananth
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
Marina Santini
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
butest
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Madhav Mishra
 
Covering (Rules-based) Algorithm
Covering (Rules-based) AlgorithmCovering (Rules-based) Algorithm
Covering (Rules-based) Algorithm
ZHAO Sam
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learning
butest
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
Abhishek Vijayvargia
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Marina Santini
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
Functional Imperative
 
Chapter 05 k nn
Chapter 05 k nnChapter 05 k nn
Chapter 05 k nn
Raman Kannan
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
BarryK88
 
Decision tree
Decision treeDecision tree
Decision tree
Ami_Surati
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
butest
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
Raman Kannan
 
Dealing with inconsistency
Dealing with inconsistencyDealing with inconsistency
Dealing with inconsistency
Rajat Sharma
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
Marc Garcia
 
Decision trees
Decision treesDecision trees
Decision trees
Rohit Srivastava
 
Download It
Download ItDownload It
Download It
butest
 
Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Trees
ananth
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
Marina Santini
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
butest
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Madhav Mishra
 
Covering (Rules-based) Algorithm
Covering (Rules-based) AlgorithmCovering (Rules-based) Algorithm
Covering (Rules-based) Algorithm
ZHAO Sam
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learning
butest
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
Abhishek Vijayvargia
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Marina Santini
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
Functional Imperative
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
BarryK88
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
butest
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
Raman Kannan
 
Dealing with inconsistency
Dealing with inconsistencyDealing with inconsistency
Dealing with inconsistency
Rajat Sharma
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
Marc Garcia
 
Download It
Download ItDownload It
Download It
butest
 

Viewers also liked (16)

MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.pptMSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
butest
 
Pseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning AlgorithmPseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning Algorithm
David Juboor
 
ppt.gif
ppt.gifppt.gif
ppt.gif
butest
 
Machine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’sMachine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’s
Siddharth Chakravarty
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Deepak George
 
Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法
Nagi Teramo
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence
Muhammad Ahad
 
Random forest
Random forestRandom forest
Random forest
Musa Hawamdah
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to Practice
Gilles Louppe
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
Edureka!
 
Azure Machine Learning tutorial
Azure Machine Learning tutorialAzure Machine Learning tutorial
Azure Machine Learning tutorial
Giacomo Lanciano
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
Max Lin
 
Introduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep LearningIntroduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep Learning
Terry Taewoong Um
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 
Deep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceDeep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial Intelligence
Lukas Masuch
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.pptMSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
butest
 
Pseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning AlgorithmPseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning Algorithm
David Juboor
 
ppt.gif
ppt.gifppt.gif
ppt.gif
butest
 
Machine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’sMachine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’s
Siddharth Chakravarty
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Deepak George
 
Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法
Nagi Teramo
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence
Muhammad Ahad
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to Practice
Gilles Louppe
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
Edureka!
 
Azure Machine Learning tutorial
Azure Machine Learning tutorialAzure Machine Learning tutorial
Azure Machine Learning tutorial
Giacomo Lanciano
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
Max Lin
 
Introduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep LearningIntroduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep Learning
Terry Taewoong Um
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 
Deep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceDeep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial Intelligence
Lukas Masuch
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
Ad

Similar to MachineLearning.ppt (20)

An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
butest
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
chenhm
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
butest
 
1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf
RaviBhuva13
 
Intro to Feature Selection
Intro to Feature SelectionIntro to Feature Selection
Intro to Feature Selection
chenhm
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
Vimal Gupta
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
AaryanArora10
 
Pattern Recognition and understanding patterns
Pattern Recognition and understanding patternsPattern Recognition and understanding patterns
Pattern Recognition and understanding patterns
gulhanep9
 
Pattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture NotesPattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture Notes
Akshaya821957
 
ppt
pptppt
ppt
butest
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
Sadia Zafar
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
Yogendra Singh
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiers
Krish_ver2
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Sherri Gunder
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptx
Shivakrishnan18
 
ppt
pptppt
ppt
butest
 
ppt
pptppt
ppt
butest
 
Image Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptxImage Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptx
ssuseracb8ba
 
Module 5: Decision Trees
Module 5: Decision TreesModule 5: Decision Trees
Module 5: Decision Trees
Sara Hooker
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
butest
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
chenhm
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
butest
 
1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf
RaviBhuva13
 
Intro to Feature Selection
Intro to Feature SelectionIntro to Feature Selection
Intro to Feature Selection
chenhm
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
Vimal Gupta
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
AaryanArora10
 
Pattern Recognition and understanding patterns
Pattern Recognition and understanding patternsPattern Recognition and understanding patterns
Pattern Recognition and understanding patterns
gulhanep9
 
Pattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture NotesPattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture Notes
Akshaya821957
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
Sadia Zafar
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
Yogendra Singh
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiers
Krish_ver2
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Sherri Gunder
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptx
Shivakrishnan18
 
Image Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptxImage Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptx
ssuseracb8ba
 
Module 5: Decision Trees
Module 5: Decision TreesModule 5: Decision Trees
Module 5: Decision Trees
Sara Hooker
 
Ad

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
PPT
PPTPPT
PPT
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
hier
hierhier
hier
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 
EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 

MachineLearning.ppt

  • 1. Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
  • 2. Some Fundamental Problems Which algorithm is “best”? Are there any reasons to favor one algorithm over another? Is “Occam’s razor” really so evident? Do simpler or “smoother” classifiers generalize better? If so, why? Are there fundamental “conservation” or “constraint” laws other than Bayes error rate?
  • 3. Meaning of “Algorithm-Independent” Mathematical foundations that do not depend upon the particular classifier or learning algorithm used e.g., bias and variance concept Techniques that can be used in conjunction with different learning algorithm, or provide guidance in their use e.g., cross-validation and resampling techniques
  • 4. Roadmap No pattern classification method is inherently superior to any other Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses Estimation of accuracies and comparison of different classifiers with certain assumptions Methods for integrating component classifiers
  • 5. Generalization Performance by Off-Training Set Error Consider a two-category problem Training set D Training patterns x i y i = 1 or -1 for i = 1, . . ., n is generated by unknown target function F ( x ) to be learned F ( x ) is often with a random component The same input could lead to different categories Giving non-zero Bayes error
  • 6. Generalization Performance by Off-Training Set Error Let H be the (discrete) set of hypotheses, or sets of parameters to be learned A particular h belongs to H quantized weights in neural network Parameters q in a functional model Sets of decisions in a tree P ( h ) : prior probability that the algorithm will produce hypothesis h after training
  • 7. Generalization Performance by Off-Training Set Error P ( h | D ) : probability the algorithm will yield h when trained on data D Nearest-neighbor and decision tree: non-zero only for a single hypothesis Neural network: can be a broad distribution E : error for zero-one or other loss function
  • 8. Generalization Performance by Off-Training Set Error A natural measure Expected off-training-set classification error for the k th candidate learning algorithm
  • 9. No Free Lunch Theorem
  • 10. No Free Lunch Theorem For any two algorithms No matter how clever in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one There is at least one target function for which random guessing is a better algorithm
  • 11. No Free Lunch Theorem Even if we know D , averaged over all target functions no algorithm yields an off-training set error that is superior to any other
  • 12. Example 1 No Free Lunch for Binary Data -1 1 1 111 -1 1 1 110 -1 1 -1 101 -1 1 1 100 -1 1 -1 011 1 1 1 010 -1 -1 -1 001 D 1 1 1 000 h 2 h 1 F x
  • 13. No Free Lunch Theorem
  • 14. Conservation in Generalization Can not achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems Can trade performance on problems we do not expect to encounter with those that we do expect to encounter It is the assumptions about the learning domains that are relevant
  • 15. Ugly Duckling Theorem In the absence of assumptions there is no privileged or “best” feature representation Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
  • 16. Venn Diagram Representation of Features as Predicates
  • 17. Rank of a Predicate Number of the simplest or indivisible elements it contains Example: rank r = 1 x 1 : f 1 AND NOT f 2 x 2 : f 1 AND f 2 x 3 : f 2 AND NOT f 1 x 4 : NOT ( f 1 OR f 2 ) C (4,1) = 4 predicates
  • 18. Examples of Rank of a Predicate Rank r = 2 x 1 OR x 2 : f 1 x 1 OR x 3 : f 1 XOR f 2 x 1 OR x 4 : NOT f 2 x 2 OR x 3 : f 2 x 2 OR x 4 : ( f 1 AND f 2 ) OR NOT ( f 1 OR f 2 ) x 3 OR x 4 : NOT f 1 C (4, 2) = 6 predicates
  • 19. Examples of Rank of a Predicate Rank r = 3 x 1 OR x 2 OR x 3 : f 1 OR f 2 x 1 OR x 2 OR x 4 : f 1 OR NOT f 2 x 1 OR x 3 OR x 4 : NOT ( f 1 AND f 2 ) x 2 OR x 3 OR x 4 : f 2 OR NOT f 1 C (4, 3) = 4 predicates
  • 20. Total Number of Predicates in Absence of Constraints Let d be the number of regions in the Venn diagrams (i.e., number of distinctive patterns, or number of possible values determined by combinations of the features)
  • 21. A Measure of Similarity in Absence of Prior Information Number of features or attributes shared by two patterns Concept difficulties e.g., blind_in_right_eyes and blind_in_left_eyes , (1,0) more similar to (1,1) and (0,0) than to (0,1) There are always multiple ways to represent vectors of attributes e.g. blind_in_right_eye and same_in_both_eyes No principled reason to prefer one of these representations over another
  • 22. A Plausible Measure of Similarity in Absence of Prior Information Number of predicates the patterns share Consider two distinct patterns no predicates of rank 1 is shared 1 predicate of rank 2 is shared C ( d -2, 1) predicates of rank 3 is shared C ( d -2, r -2) predicates of rank r is shared Total number of predicates shared
  • 23. Ugly Duckling Theorem Given a finite set of predicates that enables us to distinguish any two patterns The number of predicates shared by any two such patterns is constant and independent of the choice of those patterns If pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
  • 24. Ugly Duckling Theorem No problem-independent or privileged or “best” set of features or feature attributes Also applies to a continuous feature spaces
  • 25. Minimum Description Length (MDL) Find some irreducible, smallest representation (“signal”) of all members of a category All variation among the individual patterns is then “noise” By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
  • 26. Algorithm Complexity (Kolmogorov Complexity) Kolmogrov complexity of binary string x On an abstract computer (Turing machine) U As the shortest program (binary) string y Without additional data, computes the string x and halts
  • 27. Algorithm Complexity Example Suppose x consists solely of n 1 s Use some fixed number of bits k to specify a loop for printing a string of 1 s Need log 2 n more bits to specify the iteration number n , the condition for halting Thus K ( x ) = O (log 2 n )
  • 28. Algorithm Complexity Example Constant  =11.001001000011111110110101010001… The shortest program is the one that can produce any arbitrary large number of consecutive digits of  Thus K ( x ) = O ( 1 )
  • 29. Algorithm Complexity Example Assume that x is a “truly” random binary string Can not be expressed as a shorter string Thus K ( x ) = O (| x |)
  • 30. Minimum Description Length (MDL) Principle Minimize the sum of the model’s algorithmic complexity and the description of the training data with respect to that model
  • 31. An Application of MDL Principle For decision-tree classifiers, a model h specifies the tree and the decisions at the nodes Algorithmic complexity of h is proportional to the number of nodes Complexity of data is expressed in terms of the entropy (in bits) of the data Tree-pruning based on entropy is equivalent to the MDL principle
  • 32. Convergence of MDL Classifiers MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data Can not prove that the MDL principle leads to a superior performance in the finite data case Still consistent with the no free lunch principle
  • 33. Bayesian Perspective of MDL Principle
  • 34. Overfitting Avoidance Avoiding overfitting or minimizing description length are not inherently beneficial Amount to a preference over the forms or parameters of classifiers Beneficial only if they match their problems There are problems that overfitting avoidance leads to worse performances
  • 35. Explanation of Success of Occam’s Razor Through evolution and strong selection pressure on our neurons Likely to ignore problems for which Occam’s razor does not hold Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology Principle of satisficing: creating an adequate though possibly nonoptimal solution
  • 36. Bias and Variance Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem Bias Accuracy or quality of the match High bias implies a poor match Variance Precision or specificity of the match High variance implies a weak match
  • 37. Bias and Variance for Regression
  • 39. Bias-Variance Dilemma Given a target function Model has many parameters Generally low bias Fits data well Yields high variance Model has few parameters Generally high bias May not fit data well The fit does not change much for different data sets (low variance)
  • 40. Bias-Variance Dilemma Best way to get low bias and low variance Have prior information about the target function Virtually never get zero bias and zero variance Only one learning problem to solve and the answer is already known Large amount of training data will yield improved performance as the model is sufficiently general
  • 41. Bias and Variance for Classification Reference J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,” Data Mining and Knowledge Discovery , vol. 1, no. 1, pp. 55-77, 1997.
  • 42. Bias and Variance for Classification Two-category problem Target function F ( x )=Pr[ y =1| x ]=1-Pr[ y =0| x ] Discriminant function y d = F ( x ) +   : zero mean, centered binomialy distributed random variable F ( x ) = E [ y d | x ]
  • 43. Bias and Variance for Classification Find estimate g ( x ; D ) to minimize E D [( g ( x ; D )- y d ) 2 ] The estimated g ( x ; D ) can be used to classify x The bias and variance concept for regression can be applied to g ( x ; D ) as an estimate of F ( x ) However, this is not related to classification error directly
  • 44. Bias and Variance for Classification
• 48. Bias and Variance for Classification The sign of the boundary bias affects the role of variance in the error If the sign is positive, low variance is generally important for accurate classification Low boundary bias need not result in a lower error rate Simple methods often have lower variance and need not be inferior to more flexible methods
  • 49. Error rates and optimal K vs. N for d = 20 in KNN
• 50. Estimation and classification error vs. d for N = 12800 in KNN
  • 53. Generalization to Estimates of Other Statistics Estimator of other statistics median, 25th percentile, mode, etc. Leave-one-out estimate Jackknife estimate and its related variance
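The jackknife formulas themselves are not reproduced on these slides; the sketch below follows the usual convention in which the jackknife estimate of a statistic is the mean of the leave-one-out values and its variance estimate is scaled by (n - 1)/n. The function name jackknife and the sample data are purely illustrative.

```python
import numpy as np

def jackknife(stat, data):
    """Leave-one-out (jackknife) estimate of a statistic and of its variance."""
    data = np.asarray(data)
    n = len(data)
    # Recompute the statistic with the i-th sample deleted, for every i
    loo = np.array([stat(np.delete(data, i)) for i in range(n)])
    estimate = loo.mean()                                   # jackknife estimate of the statistic
    variance = (n - 1) / n * np.sum((loo - estimate) ** 2)  # jackknife estimate of its variance
    return estimate, variance

x = np.array([1.2, 3.4, 0.7, 2.9, 5.1, 2.2, 4.0])
print(jackknife(np.median, x))
```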
• 56. Bootstrap Randomly select n points from the training data set D, with replacement Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets The bootstrap estimate of a statistic θ is the mean of the estimates computed on the B bootstrap sets
  • 57. Bootstrap Bias and Variance Estimate
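The estimates shown on this slide were lost in extraction; the usual bootstrap bias and variance estimates, with θ̂ the statistic computed on the original set D and θ̂^{*(b)} its value on the b-th bootstrap set, are

$$
\hat{\theta}^{*(\cdot)}=\frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*(b)},\qquad
\widehat{\mathrm{Bias}}_{\mathrm{boot}}=\hat{\theta}^{*(\cdot)}-\hat{\theta},\qquad
\widehat{\mathrm{Var}}_{\mathrm{boot}}=\frac{1}{B}\sum_{b=1}^{B}\left(\hat{\theta}^{*(b)}-\hat{\theta}^{*(\cdot)}\right)^{2}
$$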
  • 58. Properties of Bootstrap Estimates The larger the number B of bootstrap samples, the more satisfactory is the estimate of a statistic and its variance B can be adjusted to the computational resources Jackknife estimate requires exactly n repetitions
• 59. Bagging Arcing (adaptive reweighting and combining): reusing or selecting data in order to improve classification, e.g., AdaBoost Bagging (bootstrap aggregation): multiple versions of D are created by drawing n' < n samples from D with replacement Each set trains a component classifier The final decision is based on a vote of the component classifiers
  • 60. Unstable Algorithm and Bagging Unstable algorithm “small” changes in training data lead to significantly different classifiers and relatively “large” changes in accuracy In general bagging improves recognition for unstable classifiers Effectively averages over such discontinuities No convincing theoretical derivations or simulation studies showing bagging helps for all unstable classifiers
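A minimal bagging sketch, assuming integer-coded class labels and using a scikit-learn decision tree as the unstable component classifier (any other learner with fit/predict would do); the function names and the 0.8 sampling fraction are arbitrary choices for illustration, not values from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_classifiers=25, sample_frac=0.8, seed=0):
    """Train component classifiers on bootstrap draws of n' < n samples from D."""
    rng = np.random.default_rng(seed)
    n_prime = int(sample_frac * len(X))
    models = []
    for _ in range(n_classifiers):
        idx = rng.choice(len(X), size=n_prime, replace=True)  # draw with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final decision by majority vote of the component classifiers."""
    votes = np.array([m.predict(X) for m in models])          # (n_classifiers, n_samples)
    # Majority vote per column; assumes labels are non-negative integers
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```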
• 61. Boosting Create a first classifier with accuracy on the training set greater than average (a weak learner) Add new component classifiers to form an ensemble The ensemble's joint decision rule has much higher accuracy on the training set The classification performance has been "boosted"
  • 62. Boosting Procedure Trains successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
  • 63. Training Data and Weak Learner
• 64. "Most Informative" Set Given C 1 Flip a fair coin Heads: select remaining samples from D and present them to C 1 one by one until C 1 misclassifies a pattern; add that misclassified pattern to D 2 Tails: find a pattern that C 1 classifies correctly and add it to D 2
  • 65. Third Data Set and Classifier C 3 Randomly select a remaining training pattern Add the pattern if C 1 and C 2 disagree, otherwise ignore it
  • 66. Classification of a Test Pattern If C 1 and C 2 agree, use their label If they disagree, trust C 3
• 67. Choosing n 1 For the final vote, n 1 ≈ n 2 ≈ n 3 ≈ n/3 is desired A reasonable first guess is n 1 = n/3 If the problem is simple, n 2 << n 1; if it is difficult, n 2 becomes too large In practice we need to run the whole boosting procedure a few times To use the full training set To get roughly equal partitions of the training set
  • 68. AdaBoost Adaptive boosting Most popular version of basic boosting Continue adding weak learners until some desired low training error has been achieved
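The AdaBoost pseudocode from the intervening slides is not preserved here, so the following is a hedged reconstruction of the standard discrete AdaBoost loop, using decision stumps as weak learners and labels coded as +1/-1; treat it as a sketch rather than the slides' exact algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost with decision stumps; y must be coded as +1 / -1."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # start with uniform weights on the training data
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()               # weighted training error of this weak learner
        if err >= 0.5:                         # no better than chance: stop adding learners
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)         # emphasize the patterns this learner got wrong
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the alpha-weighted vote of the weak learners."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```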
  • 72. AdaBoost vs. No Free Lunch Theorem Boosting only improves classification if the component classifiers perform better than chance Can not be guaranteed a priori Exponential reduction in error on the training set does not ensure reduction of the off-training set error or generalization Proven effective in many real-world applications
  • 73. Learning with Queries Set of unlabeled patterns Exists some (possibly costly) way of labeling any pattern To determine which unlabeled patterns would be most informative if they were presented as a query to an oracle Also called active learning or interactive learning Can be refined further as cost-based learning
  • 74. Application Example Design a classifier for handwritten numerals Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern Human as the oracle
  • 75. Learning with Queries Begin with a preliminary, weak classifier developed with a small set of labeled samples Two related methods for selecting an informative pattern Confidence-based query selection Voting-based or committee-based query selection
  • 76. Selecting Most Informative Patterns Confidence-based query selection Pattern that two largest discriminant functions have nearly the same value i.e., patterns lie near the current decision boundaries Voting-based query selection Pattern that yields the greatest disagreement among the k resulting category labels
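A sketch of confidence-based query selection, assuming we already have the discriminant values g_j(x) for every unlabeled pattern (the array layout and function name are illustrative only): the pattern whose two largest discriminants are closest is the one queried next.

```python
import numpy as np

def select_query(discriminants):
    """Return the index of the unlabeled pattern lying closest to the decision boundary.

    discriminants: array of shape (n_unlabeled, n_classes) holding g_j(x)
    for each unlabeled pattern.
    """
    top2 = np.sort(discriminants, axis=1)[:, -2:]   # two largest discriminants per pattern
    margin = top2[:, 1] - top2[:, 0]                # small margin = low confidence
    return int(np.argmin(margin))

# Toy example: three unlabeled patterns, three classes
g = np.array([[0.90, 0.05, 0.05],
              [0.40, 0.38, 0.22],    # nearly tied top two: most informative query
              [0.70, 0.20, 0.10]])
print(select_query(g))               # -> 1
```

Voting-based selection is analogous: train k classifiers on subsets of D and query the unlabeled pattern on which their predicted labels disagree the most.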
• 78. Arcing and Active Learning vs. IID Sampling If we take a model of the true distribution and train it with the highly skewed distribution produced by active learning, the final classifier accuracy might be low Resampling methods generally do not attempt to model or fit the full category distributions They do not fit parameters in a model, but instead seek decision boundaries directly
• 79. Arcing and Active Learning As the number of component classifiers is increased, resampling, boosting and related methods effectively broaden the class of implementable functions They allow us to try to "match" the final classifier to the problem by indirectly adjusting the bias and variance They can be used with arbitrary classification techniques
  • 80. Estimating the Generalization Rate See if the classifier performs well enough to be useful Compare its performance with that of a competing design Requires making assumptions about the classifier or the problem or both All the methods given here are heuristic
  • 81. Parametric Models Compute from the assumed parametric model Example: two-class multivariate normal case Bhattacharyya or Chernoff bounds using estimated mean and covariance matrix Problems Overly optimistic Always suspect the model Error rate may be difficult to compute
  • 82. Simple Cross-Validation Randomly split the set of labeled training samples D into a training set and a validation set
  • 83. m -Fold Cross-Validation Training set is randomly divided into m disjoint sets of equal size n/m The classifier is trained m times Each time with a different set held out as a validation set Estimated performance is the mean of these m errors When m = n , it is in effect the leave-one-out approach
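A minimal m-fold cross-validation sketch, assuming a classifier factory with scikit-learn-style fit/predict methods and NumPy arrays for X and y (the make_classifier interface is a convenience assumed here, not something from the slides); setting m = n recovers leave-one-out.

```python
import numpy as np

def m_fold_cv_error(make_classifier, X, y, m=5, seed=0):
    """Estimate the error rate by m-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), m)   # m disjoint sets of roughly n/m samples
    errors = []
    for i in range(m):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(m) if j != i])
        clf = make_classifier().fit(X[train], y[train])  # train with one fold held out
        errors.append(np.mean(clf.predict(X[val]) != y[val]))
    return float(np.mean(errors))                        # mean of the m validation errors
```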
• 84. Forms of Learning for Cross-Validation In neural networks of fixed topology: the number of epochs or presentations of the training set, and the number of hidden units The width of the Gaussian window in Parzen windows The optimal k in the k-nearest-neighbor classifier
• 85. Portion γ of D as a Validation Set Should be small The validation set is used merely to know when to stop adjusting parameters The training set is used to set the large number of parameters or degrees of freedom Traditional default: set γ = 0.1 Proven effective in many applications
• 86. Anti-Cross-Validation Cross-validation need not work on every problem Anti-cross-validation: halt when the validation error reaches its first local maximum One must explore different values of γ, and possibly abandon the use of cross-validation altogether if performance cannot be improved
  • 87. Estimation of Error Rate Let p be the true and unknown error rate of the classifier Assume k of the n’ independent, randomly drawn test samples are misclassified, then k has the binomial distribution Maximum-likelihood estimate for p
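The formulas referred to here are the standard ones: if the true error rate is p, the number of misclassified test samples k follows

$$
P(k)=\binom{n'}{k}\,p^{k}(1-p)^{\,n'-k},\qquad \hat{p}=\frac{k}{n'}
$$

and for reasonably large n' an approximate 95% interval is $\hat{p}\pm 1.96\sqrt{\hat{p}(1-\hat{p})/n'}$; the figure on the following slide presumably plots such 95% limits as a function of the estimated p and n'.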
  • 88. 95% Confidence Intervals for a Given Estimated p
  • 89. Jackknife Estimation of Classification Accuracy Use leave-one-out approach Obtain Jackknife estimate for the mean and variance of the leave-one-out accuracies Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
  • 91. Bootstrap Estimation of Classification Accuracy Train B classifiers, each with a different bootstrap data set Test on other bootstrap data sets Bootstrap estimate is the mean of these bootstrap accuracies
  • 92. Maximum-Likelihood Comparison (ML-II) Also called maximum-likelihood selection Find the maximum-likelihood parameters for each of the candidate models Calculate the resulting likelihoods (evidences) Choose the model with the largest likelihood
• 94. Scientific Process Reference D. J. C. MacKay, "Bayesian interpolation," Neural Computation, vol. 4, no. 3, pp. 415-447, 1992.
• 97. Concept of Occam Factor An inherent bias toward simple models, i.e., those with a small prior volume of parameters Models that are overly complex, with a large prior parameter volume, are automatically self-penalizing
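Following the MacKay paper cited above, the evidence for a model M can be written (roughly; the slide's exact notation is not preserved) as

$$
p(D\mid M)=\int p(D\mid \boldsymbol{\theta},M)\,p(\boldsymbol{\theta}\mid M)\,d\boldsymbol{\theta}
\;\approx\;\underbrace{p(D\mid \hat{\boldsymbol{\theta}},M)}_{\text{best-fit likelihood}}\times
\underbrace{\frac{\Delta\theta_{\text{posterior}}}{\Delta\theta_{\text{prior}}}}_{\text{Occam factor}}
$$

A complex model spreads its prior over a large parameter volume, so its Occam factor is small and it is penalized unless the improved fit compensates.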
  • 98. Evidence for Gaussian Parameters
  • 99. Bayesian Model Selection vs. No Free Lunch Theorem Bayesian model selection Ignore the prior over the space of models Effectively assume that it is uniform Not take into account how models correspond to underlying target functions Usually corresponds to non-uniform prior over target functions No Free Lunch Theorem Allows that for some particular non-uniform prior there may be an algorithm that gives better than chance, or even optimal, results
• 100. Error Rate as a Function of Number n of Training Samples Classifiers trained with a small number of samples will not perform well on new data Typical steps Estimate the unknown parameters from samples Use these estimates to determine the classifier Calculate the error rate for the resulting classifier
• 101. Analytical Analysis Case of two categories having equal prior probabilities Partition the feature space into some m disjoint cells, C 1 , . . ., C m The conditional probabilities p(x|ω1) and p(x|ω2) do not vary appreciably within any cell We need only know which cell x falls into
  • 105. Results of Simulation Experiments
• 106. Discussions on Error Rate for Given n For every curve involving finite n there is an optimal number of cells At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q If the number of cells becomes too large, there will not be enough training patterns to fill them Eventually the number of patterns in most cells will be zero
• 107. Discussions on Error Rate for Given n For n = 500, the minimal error rate occurs somewhere around m = 20 Form the cells by dividing each feature axis into l intervals; with d features, m = l^d If l = 2, using more than four or five binary features will lead to worse rather than better performance
  • 108. Test Errors vs. Number of Training Patterns
  • 111. Sum and Difference of Test and Training Error
  • 112. Fraction of Dichotomies of n Points in d Dimensions That are Linear
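The counting result these slides quote (Cover, 1965) gives the fraction of the 2^n possible dichotomies of n points in general position in d dimensions that are linearly separable:

$$
f(n,d)=\begin{cases}
1, & n\le d+1,\\[4pt]
\dfrac{2}{2^{n}}\displaystyle\sum_{i=0}^{d}\binom{n-1}{i}, & n>d+1.
\end{cases}
$$

For example, f(4, 1) = (2/16)(1 + 3) = 0.5, matching the one-dimensional case on the next slide.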
• 113. One-Dimensional Case f(n = 4, d = 1) = 0.5 Of the 16 possible labelings of four points on a line, the 8 that are linearly separable are 0000, 0001, 0011, 0111, 1111, 1110, 1100, 1000 The remaining 8 (0010, 0100, 0101, 0110, 1001, 1010, 1011, 1101) are not linearly separable
• 114. Capacity of a Separating Plane Not until n is a sizable fraction of 2(d + 1) does the problem begin to become difficult The capacity of a hyperplane is n = 2(d + 1) sample points, at which half of the possible dichotomies are still linear We can not expect a linear classifier to "match" a problem, on average, if the dimension of the feature space is greater than n/2 - 1
  • 115. Mixture-of-Expert Models Classifiers whose decision is based on the outputs of component classifiers Also called Ensemble classifiers Modular classifiers Pooled classifiers Useful if each component classifier is highly trained (“expert”) in a different region of the feature space
  • 116. Mixture Model for Producing Patterns
  • 120. Final Decision Rule Choose the category corresponding to the maximum discriminant value after the pooling system Winner-take-all method Use the decision of the single component classifier that is the “most confident”, i.e., largest g rj Suboptimal but simple Works well if the component classifiers are experts in separate regions
  • 121. Component Classifiers without Discriminant Functions Example A KNN classifier (rank order) A decision tree (label) A neural network (analog value) A rule-based system (label)
• 122. Heuristics to Convert Outputs to Discriminant Values
• 123. Illustration Examples (each row corresponds to one of the c = 6 categories; each component classifier's raw output is converted to a discriminant value g_i)
Analog   g_i      Rank Order   g_i            One-of-c   g_i
0.1      0.111    4th          3/21 = 0.143   0          0.0
0.2      0.129    2nd          5/21 = 0.238   0          0.0
0.3      0.143    1st          6/21 = 0.286   0          0.0
0.9      0.260    5th          2/21 = 0.095   0          0.0
0.6      0.193    6th          1/21 = 0.048   1          1.0
0.4      0.158    3rd          4/21 = 0.190   0          0.0
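A sketch of one common set of conversion heuristics consistent with the table above: a softmax map for analog outputs, weights proportional to c + 1 - rank for rank-order outputs, and the one-of-c vector used as is. The rank-order column is reproduced exactly; the analog column comes out close but not identical, so the slide may use a slightly different analog heuristic.

```python
import numpy as np

def analog_to_g(values):
    """Softmax-style conversion of analog outputs to discriminant values."""
    e = np.exp(np.asarray(values, dtype=float))
    return e / e.sum()

def rank_to_g(ranks):
    """Rank-order conversion: best rank gets weight c, worst gets 1, then normalize."""
    ranks = np.asarray(ranks)
    c = len(ranks)
    weights = c + 1 - ranks            # rank 1 -> c, ..., rank c -> 1
    return weights / weights.sum()     # denominator is c(c+1)/2, i.e. 21 for c = 6

def one_of_c_to_g(outputs):
    """A one-of-c output is already a valid set of discriminant values."""
    return np.asarray(outputs, dtype=float)

print(rank_to_g(np.array([4, 2, 1, 5, 6, 3])))       # 3/21, 5/21, 6/21, 2/21, 1/21, 4/21
print(analog_to_g([0.1, 0.2, 0.3, 0.9, 0.6, 0.4]))   # ~0.117, 0.129, 0.143, 0.260, 0.193, 0.158
```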