Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
Some Fundamental Problems Which algorithm is “best”? Are there any reasons to favor one algorithm over another? Is “Occam’s razor” really so evident? Do simpler or “smoother” classifiers generalize better? If so, why? Are there fundamental “conservation” or “constraint” laws other than Bayes error rate?
Meaning of "Algorithm-Independent" Mathematical foundations that do not depend upon the particular classifier or learning algorithm used, e.g., the bias and variance concept Techniques that can be used in conjunction with different learning algorithms, or provide guidance in their use, e.g., cross-validation and resampling techniques
Roadmap No pattern classification method is inherently superior to any other Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses Estimation of accuracies and comparison of different classifiers with certain assumptions Methods for integrating component classifiers
Generalization Performance by Off-Training Set Error Consider a two-category problem Training set D Training patterns x_i with labels y_i = 1 or -1 for i = 1, . . ., n, generated by an unknown target function F(x) to be learned F(x) often has a random component The same input could lead to different categories Giving a non-zero Bayes error
Generalization Performance by Off-Training Set Error Let H be the (discrete) set of hypotheses, or sets of parameters to be learned A particular h belongs to H Quantized weights in a neural network Parameters θ in a functional model Sets of decisions in a tree P(h) : prior probability that the algorithm will produce hypothesis h after training
Generalization Performance by  Off-Training Set Error P ( h | D )  : probability the algorithm will yield h when trained on data  D Nearest-neighbor and decision tree: non-zero only for a single hypothesis Neural network: can be a broad distribution E : error for zero-one or other loss function
Generalization Performance by  Off-Training Set Error A natural measure Expected off-training-set classification error for the  k th candidate learning algorithm
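One standard way to write this measure (notation mine, not the slide's: δ is the Kronecker delta and P_k(h(x)|D) the probability that algorithm k yields hypothesis h when trained on D):

$$
\mathcal{E}_k(E \mid F, n) \;=\; \sum_{\mathbf{x} \notin D} P(\mathbf{x})\,\bigl[\,1-\delta\bigl(F(\mathbf{x}),\,h(\mathbf{x})\bigr)\bigr]\; P_k\bigl(h(\mathbf{x}) \mid D\bigr)
$$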
No Free Lunch Theorem
No Free Lunch Theorem For any two algorithms No matter how clever in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one There is at least one target function for which random guessing is a better algorithm
No Free Lunch Theorem Even if we know  D , averaged over all target functions no algorithm yields an off-training set error that is superior to any other
Example 1 No Free Lunch for Binary Data

    x     F    h1    h2
   000    1     1     1   } training
   001   -1    -1    -1   }  set D
   010    1     1     1   }
   011   -1     1    -1
   100    1     1    -1
   101   -1     1    -1
   110    1     1    -1
   111    1     1    -1
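Not part of the original slides: a minimal Python sketch that encodes the table above (reading the training set D as the first three rows, where both hypotheses agree with F) and checks the no-free-lunch claim that h1 and h2 are equally good off the training set once we average over every target function consistent with D.

```python
from itertools import product

# Rows correspond to x = 000, 001, 010, 011, 100, 101, 110, 111
F  = [ 1, -1,  1, -1,  1, -1,  1,  1]
h1 = [ 1, -1,  1,  1,  1,  1,  1,  1]   # agrees with D, predicts +1 elsewhere
h2 = [ 1, -1,  1, -1, -1, -1, -1, -1]   # agrees with D, predicts -1 elsewhere
ots = range(3, 8)                       # off-training-set patterns (D = first 3 rows)

def ots_error(h, target):
    """Fraction of off-training-set patterns that h misclassifies."""
    return sum(h[i] != target[i] for i in ots) / len(ots)

print(ots_error(h1, F), ots_error(h2, F))    # for this particular F: 0.4 vs 0.6

# Average over all 2^5 target functions consistent with D: both come out at 0.5
e1, e2 = [], []
for tail in product([-1, 1], repeat=len(ots)):
    target = F[:3] + list(tail)
    e1.append(ots_error(h1, target))
    e2.append(ots_error(h2, target))
print(sum(e1) / len(e1), sum(e2) / len(e2))  # 0.5 0.5
```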
No Free Lunch Theorem
Conservation in Generalization Can not achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems Can trade performance on problems we do not expect to encounter with those that we do expect to encounter It is the assumptions about the learning domains that are relevant
Ugly Duckling Theorem In the absence of assumptions there is no privileged or “best” feature representation Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
Venn Diagram Representation of Features as Predicates
Rank of a Predicate Number of the simplest or indivisible elements it contains Example: rank r = 1 x1: f1 AND NOT f2 x2: f1 AND f2 x3: f2 AND NOT f1 x4: NOT (f1 OR f2) C(4,1) = 4 predicates
Examples of Rank of a Predicate Rank r = 2 x1 OR x2: f1 x1 OR x3: f1 XOR f2 x1 OR x4: NOT f2 x2 OR x3: f2 x2 OR x4: (f1 AND f2) OR NOT (f1 OR f2) x3 OR x4: NOT f1 C(4, 2) = 6 predicates
Examples of Rank of a Predicate Rank r = 3 x1 OR x2 OR x3: f1 OR f2 x1 OR x2 OR x4: f1 OR NOT f2 x1 OR x3 OR x4: NOT (f1 AND f2) x2 OR x3 OR x4: f2 OR NOT f1 C(4, 3) = 4 predicates
Total Number of Predicates in Absence of Constraints Let  d  be the number of regions in the Venn diagrams (i.e., number of distinctive patterns, or number of possible values determined by combinations of the features)
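Summing over all ranks (presumably including the vacuous rank-0 predicate) gives the total number of predicates:

$$ \sum_{r=0}^{d} \binom{d}{r} \;=\; 2^{d} $$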
A Measure of Similarity in Absence of Prior Information Number of features or attributes shared by two patterns Conceptual difficulties e.g., with features blind_in_right_eye and blind_in_left_eye, (1,0) is more similar to (1,1) and (0,0) than to (0,1) There are always multiple ways to represent vectors of attributes e.g., blind_in_right_eye and same_in_both_eyes No principled reason to prefer one of these representations over another
A Plausible Measure of Similarity in Absence of Prior Information Number of predicates the patterns share Consider two distinct patterns No predicate of rank 1 is shared 1 predicate of rank 2 is shared C(d-2, 1) predicates of rank 3 are shared C(d-2, r-2) predicates of rank r are shared Total number of predicates shared
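The total shared, reconstructed from the counts above, is independent of which two distinct patterns were chosen:

$$ \sum_{r=2}^{d} \binom{d-2}{\,r-2\,} \;=\; \sum_{s=0}^{d-2} \binom{d-2}{s} \;=\; 2^{\,d-2} $$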
Ugly Duckling Theorem Given a finite set of predicates that enables us to distinguish any two patterns  The number of predicates shared by any two such patterns is constant and independent of the choice of those patterns If pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
Ugly Duckling Theorem No problem-independent or privileged or "best" set of features or feature attributes Also applies to continuous feature spaces
Minimum Description Length (MDL) Find some irreducible, smallest representation (“signal”) of all members of a category All variation among the individual patterns is then “noise” By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
Algorithm Complexity (Kolmogorov Complexity) Kolmogorov complexity of a binary string x Defined on an abstract computer (Turing machine) U As the length of the shortest program (binary string) y that, without additional data, computes the string x and halts
Algorithm Complexity Example Suppose  x  consists solely of  n   1 s Use some fixed number of bits  k  to specify a loop for printing a string of  1 s Need  log 2 n  more bits to specify the iteration number  n , the condition for halting Thus  K ( x ) =  O (log 2 n )
Algorithm Complexity Example Constant π = 11.001001000011111110110101010001… (binary) The shortest program is the one that can produce any arbitrarily large number of consecutive digits of π Thus K(x) = O(1)
Algorithm Complexity Example Assume that x is a "truly" random binary string It cannot be expressed as a shorter string Thus K(x) = O(|x|)
Minimum Description Length (MDL) Principle Minimize the sum of the model’s algorithmic complexity and the description of the training data with respect to that model
An Application of MDL Principle For decision-tree classifiers, a model  h  specifies the tree and the decisions at the nodes Algorithmic complexity of  h  is proportional to the number of nodes Complexity of data is expressed in terms of the entropy (in bits) of the data Tree-pruning based on entropy is equivalent to the MDL principle
Convergence of MDL Classifiers MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data Cannot prove that the MDL principle leads to superior performance in the finite-data case Still consistent with the no-free-lunch principle
Bayesian Perspective of  MDL Principle
Overfitting Avoidance Avoiding overfitting or minimizing description length is not inherently beneficial Amounts to a preference over the forms or parameters of classifiers Beneficial only if it matches the problem There are problems for which overfitting avoidance leads to worse performance
Explanation of Success of  Occam’s Razor Through evolution and strong selection pressure on our neurons Likely to ignore problems for which Occam’s razor does not hold Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology Principle of satisficing: creating an adequate though possibly nonoptimal solution
Bias and Variance Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem Bias Accuracy or quality of the match High bias implies a poor match Variance Precision or specificity of the match High variance implies a weak match
Bias and Variance for Regression
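The regression decomposition this slide refers to is the standard one for squared error; with g(x; D) the estimate learned from data set D and F(x) the target,

$$
\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D)-F(\mathbf{x})\bigr)^2\right]
\;=\; \underbrace{\bigl(\mathcal{E}_D[g(\mathbf{x};D)]-F(\mathbf{x})\bigr)^2}_{\text{bias}^2}
\;+\; \underbrace{\mathcal{E}_D\!\left[\bigl(g(\mathbf{x};D)-\mathcal{E}_D[g(\mathbf{x};D)]\bigr)^2\right]}_{\text{variance}}
$$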
Bias-Variance Dilemma
Bias-Variance Dilemma Given a target function Model has many parameters Generally low bias Fits data well Yields high variance Model has few parameters Generally high bias May not fit data well The fit does not change much for different data sets (low variance)
Bias-Variance Dilemma Best way to get low bias and low variance Have prior information about the target function Virtually never get zero bias and zero variance That would mean there is only one learning problem to solve and the answer is already known A large amount of training data will yield improved performance provided the model is sufficiently general
Bias and Variance for Classification Reference J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,”  Data Mining and Knowledge Discovery , vol. 1, no. 1, pp. 55-77, 1997.
Bias and Variance for Classification Two-category problem Target function F(x) = Pr[y = 1 | x] = 1 - Pr[y = 0 | x] Discriminant function y_d = F(x) + ε ε : zero-mean, centered binomially distributed random variable F(x) = E[y_d | x]
Bias and Variance for Classification Find estimate  g ( x ; D )  to minimize E D [( g ( x ; D )- y d ) 2 ] The estimated  g ( x ; D )  can be used to classify  x  The bias and variance concept for regression can be applied to  g ( x ; D )  as an estimate of  F ( x ) However, this is not related to classification error directly
Bias and Variance for Classification
Bias and Variance for Classification
Bias and Variance for Classification
Bias and Variance for Classification
Bias and Variance for Classification Sign of the boundary bias affects the role of variance in the error Low variance is generally important for accurate classification if the sign is positive Low boundary bias need not result in a lower error rate Simple methods often have lower variance and need not be inferior to more flexible methods
Error rates and optimal  K  vs.  N  for  d  = 20 in KNN
Estimation and classification error vs. d for N = 12800 in KNN
Boundary Bias-Variance Trade-Off
Leave-One-Out Method (Jackknife)
Generalization to Estimates of Other Statistics Estimators of other statistics: median, 25th percentile, mode, etc. Leave-one-out estimate Jackknife estimate and its related variance
Jackknife Bias Estimate
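Not from the original slides: a minimal sketch of the leave-one-out (jackknife) bias and variance estimates for an arbitrary statistic; function and argument names are my own.

```python
import numpy as np

def jackknife(data, statistic):
    """Leave-one-out (jackknife) estimates for any statistic of a 1-D sample
    (mean, median, 25th percentile, mode, ...)."""
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)                               # estimate on the full sample
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    theta_dot = loo.mean()                                    # mean of the n leave-one-out estimates
    bias_jack = (n - 1) * (theta_dot - theta_hat)             # jackknife bias estimate
    var_jack = (n - 1) / n * np.sum((loo - theta_dot) ** 2)   # jackknife variance estimate
    return theta_hat - bias_jack, bias_jack, var_jack         # bias-corrected estimate, bias, variance

# Example: jackknife the median of a small sample
rng = np.random.default_rng(0)
print(jackknife(rng.normal(size=30), np.median))
```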
Example 2 Jackknife for Mode
Bootstrap Randomly selecting n points from the training data set D, with replacement Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets Bootstrap estimate of a statistic θ
Bootstrap Bias and Variance Estimate
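A minimal sketch of the bootstrap bias and variance estimates (not from the slides); the bootstrap estimate of the statistic is the mean over the B bootstrap replicates.

```python
import numpy as np

def bootstrap(data, statistic, B=1000, seed=None):
    """Bootstrap estimates of a statistic, its bias, and its variance.
    B can be adjusted to the available computational resources."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    n = len(data)
    theta_hat = statistic(data)
    boot = np.array([statistic(rng.choice(data, size=n, replace=True)) for _ in range(B)])
    theta_star = boot.mean()                        # bootstrap estimate of the statistic
    bias_boot = theta_star - theta_hat              # bootstrap bias estimate
    var_boot = np.mean((boot - theta_star) ** 2)    # bootstrap variance estimate
    return theta_star, bias_boot, var_boot
```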
Properties of Bootstrap Estimates The larger the number  B  of bootstrap samples, the more satisfactory is the estimate of a statistic and its variance B  can be adjusted to the computational resources Jackknife estimate requires exactly  n  repetitions
Bagging Arcing: adaptive reweighting and combining Reusing or selecting data in order to improve classification, e.g., AdaBoost Bagging: bootstrap aggregation Multiple versions of D, each obtained by drawing n' < n samples from D with replacement Each set trains a component classifier Final decision is based on a vote of the component classifiers
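A minimal bagging sketch (not from the slides), assuming scikit-learn-style component classifiers and integer class labels 0, ..., c-1; names are illustrative.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, base=DecisionTreeClassifier(), n_classifiers=25, n_prime=None, seed=None):
    """Train component classifiers on bootstrap replicates of D (drawing n' samples with replacement)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    n_prime = n_prime or n
    ensemble = []
    for _ in range(n_classifiers):
        idx = rng.integers(0, n, size=n_prime)          # bootstrap replicate of the training set
        ensemble.append(clone(base).fit(X[idx], y[idx]))
    return ensemble

def bagging_predict(ensemble, X):
    """Final decision by majority vote of the component classifiers."""
    votes = np.stack([clf.predict(X) for clf in ensemble])   # shape (B, n_samples), integer labels
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```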
Unstable Algorithm and Bagging Unstable algorithm “small” changes in training data lead to significantly different classifiers and relatively “large” changes in accuracy In general bagging improves recognition for unstable classifiers Effectively averages over such discontinuities No convincing theoretical derivations or simulation studies showing bagging helps for all unstable classifiers
Boosting Create the first classifier With accuracy on the training set greater than average (weak learner) Add a new classifier Form an ensemble Joint decision rule has much higher accuracy on the training set Classification performance has been “boosted”
Boosting Procedure Trains successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
Training Data and Weak Learner
"Most Informative" Set Given C1 Flip a fair coin Heads: select remaining samples from D and present them to C1 one by one until C1 misclassifies a pattern; add that misclassified pattern to D2 Tails: find a pattern that C1 classifies correctly and add it to D2
Third Data Set and Classifier  C 3 Randomly select a remaining training pattern Add the pattern if  C 1  and  C 2  disagree, otherwise ignore it
Classification of a Test Pattern If  C 1  and  C 2  agree, use their label If they disagree, trust  C 3
Choosing  n 1 For final vote,  n 1  ~  n 2  ~  n 3  ~  n /3  is desired Reasonable guess:  n /3 Simple problem:  n 2  <<  n 1 Difficult problem:  n 2  too large In practice we need to run the whole boosting procedure a few times  To use the full training set To get roughly equal partitions of the training set
AdaBoost  Adaptive boosting Most popular version of basic boosting Continue adding weak learners until some desired low training error has been achieved
AdaBoost Algorithm
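A minimal discrete-AdaBoost sketch (my own code, not the slide's algorithm box), using scikit-learn decision stumps as weak learners and labels y in {-1, +1}.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, k_max=50):
    """Discrete AdaBoost: keep adding weak learners, reweighting the training set each round."""
    n = len(X)
    w = np.full(n, 1.0 / n)                  # start with uniform weights on the training set
    learners, alphas = [], []
    for _ in range(k_max):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y))        # weighted training error of this weak learner
        if err >= 0.5:                       # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)       # emphasize the patterns this learner got wrong
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)
    return learners, np.array(alphas)

def adaboost_decision(learners, alphas, X):
    """Final discriminant g(x) = sum_k alpha_k h_k(x); classify by its sign."""
    g = sum(a * h.predict(X) for a, h in zip(alphas, learners))
    return np.sign(g)
```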
Final Decision Discriminant function
Ensemble Training Error
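The bound referred to here is presumably the standard one: if weak learner k has weighted training error ε_k = 1/2 - γ_k, the ensemble training error E satisfies

$$ E \;\le\; \prod_{k} 2\sqrt{\varepsilon_k (1-\varepsilon_k)} \;=\; \prod_{k} \sqrt{1-4\gamma_k^{2}} \;\le\; \exp\!\Bigl(-2\sum_{k}\gamma_k^{2}\Bigr), $$

so the training error falls exponentially fast as long as each component does better than chance.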
AdaBoost vs. No Free Lunch Theorem Boosting only improves classification if the component classifiers perform better than chance This cannot be guaranteed a priori Exponential reduction in error on the training set does not ensure reduction of the off-training-set (generalization) error Proven effective in many real-world applications
Learning with Queries Set of unlabeled patterns There exists some (possibly costly) way of labeling any pattern Goal: determine which unlabeled patterns would be most informative if they were presented as a query to an oracle Also called active learning or interactive learning Can be refined further as cost-based learning
Application Example Design a classifier for handwritten numerals Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern Human as the oracle
Learning with Queries Begin with a preliminary, weak classifier developed with a small set of labeled samples Two related methods for selecting an informative pattern Confidence-based query selection Voting-based or committee-based query selection
Selecting Most Informative Patterns Confidence-based query selection Pattern that two largest discriminant functions have nearly the same value i.e., patterns lie near the current decision boundaries Voting-based query selection Pattern that yields the greatest disagreement among the  k  resulting category labels
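An illustrative sketch of confidence-based query selection (not from the slides); the array layout and names are assumptions.

```python
import numpy as np

def most_informative(discriminants):
    """Confidence-based query selection.

    `discriminants` is an (n_unlabeled, n_classes) array of discriminant values
    from the current classifier; the most informative pattern is the one whose
    two largest discriminant values are closest, i.e. the one nearest the
    current decision boundary."""
    top2 = np.sort(discriminants, axis=1)[:, -2:]   # two largest values per pattern
    margin = top2[:, 1] - top2[:, 0]
    return int(np.argmin(margin))                   # index of the pattern to send to the oracle
```

The voting-based variant would instead query the pattern on which the k trained classifiers disagree most.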
Active Learning Example
Arcing and Active Learning vs. IID Sampling If we take a model of the true distribution and train it with the highly skewed distribution produced by active learning, the final classifier accuracy might be low Resampling methods generally do not attempt to model or fit the full category distributions Not fitting parameters in a model But instead seeking decision boundaries directly
Arcing and Active Learning As the number of component classifiers is increased, resampling, boosting and related methods effectively broaden the class of implementable functions Allow us to "match" the final classifier to the problem by indirectly adjusting the bias and variance Can be used with arbitrary classification techniques
Estimating the Generalization Rate See if the classifier performs well enough to be useful Compare its performance with that of a competing design Requires making assumptions about the classifier or the problem or both All the methods given here are heuristic
Parametric Models Compute from the assumed parametric model Example: two-class multivariate normal case Bhattacharyya or Chernoff bounds using estimated mean and covariance matrix Problems Overly optimistic Always suspect the model Error rate may be difficult to compute
Simple Cross-Validation Randomly split the set of labeled training samples D into a training set and a  validation  set
m -Fold Cross-Validation Training set is randomly divided into m disjoint sets of equal size  n/m The classifier is trained m times Each time with a different set held out as a validation set Estimated performance is the mean of these  m  errors When  m = n , it is in effect the leave-one-out approach
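A minimal m-fold cross-validation sketch (not from the slides), assuming a scikit-learn-style model with fit/predict; setting m = n gives leave-one-out.

```python
import numpy as np
from sklearn.base import clone

def m_fold_cv(model, X, y, m=10, seed=None):
    """m-fold cross-validation estimate of the error rate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, m)                      # m disjoint validation sets of ~equal size
    errors = []
    for k in range(m):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(m) if j != k])
        clf = clone(model).fit(X[train], y[train])      # train m times, each with one fold held out
        errors.append(np.mean(clf.predict(X[val]) != y[val]))
    return np.mean(errors)                              # estimated performance: mean of the m errors
```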
Forms of Learning for Cross-Validation Neural networks of fixed topology: number of epochs or presentations of the training set, number of hidden units Width of the Gaussian window in Parzen windows Optimal k in the k-nearest-neighbor classifier
Portion γ of D as a Validation Set Should be small The validation set is used merely to know when to stop adjusting parameters The training set is used to set the large number of parameters or degrees of freedom Traditional default: set γ = 0.1 Proven effective in many applications
Anti-Cross-Validation Cross-validation need not work on every problem Anti-cross-validation: halt when the validation error reaches its first local maximum Must explore different values of γ Possibly abandon the use of cross-validation if performance cannot be improved
Estimation of Error Rate Let  p  be the true and unknown error rate of the classifier Assume  k  of the  n’  independent, randomly drawn test samples are misclassified, then  k  has the binomial distribution Maximum-likelihood estimate for  p
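Written out, k follows the binomial law and the maximum-likelihood estimate is the observed error fraction:

$$ P(k) \;=\; \binom{n'}{k}\, p^{k} (1-p)^{\,n'-k}, \qquad \hat{p} \;=\; \frac{k}{n'} $$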
95% Confidence Intervals for a Given Estimated  p
Jackknife Estimation of Classification Accuracy Use leave-one-out approach Obtain Jackknife estimate for the mean and variance of the leave-one-out accuracies Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
Jackknife Estimation of Classification Accuracy
Bootstrap Estimation of Classification Accuracy Train B classifiers, each with a different bootstrap data set Test on other bootstrap data sets Bootstrap estimate is the mean of these bootstrap accuracies
Maximum-Likelihood Comparison (ML-II) Also called maximum-likelihood selection Find the maximum-likelihood parameters for each of the candidate models Calculate the resulting likelihoods (evidences) Choose the model with the largest likelihood
Maximum-Likelihood Comparison
Scientific Process *D. J. C. MacKay, “Bayesian interpolation,” Neural Computation, 4(3), 415-447, 1992
Bayesian Model Comparison
Concept of Occam Factor
Concept of Occam Factor An inherent bias toward simple models (those with a small prior parameter volume) Models that are overly complex (large prior parameter volume) are automatically self-penalizing
Evidence for Gaussian Parameters
Bayesian Model Selection vs.  No Free Lunch Theorem Bayesian model selection  Ignore the prior over the space of models Effectively assume that it is uniform Not take into account how models correspond to underlying target functions Usually corresponds to non-uniform prior over target functions No Free Lunch Theorem Allows that for some particular non-uniform prior there may be an algorithm that gives better than chance, or even optimal, results
Error Rate as a Function of Number n of Training Samples Classifiers trained with a small number of samples will not perform well on new data Typical steps Estimate unknown parameters from samples Use these estimates to determine the classifier Calculate the error rate for the resulting classifier
Analytical Analysis Case of two categories having equal prior probabilities Partition feature space into m disjoint cells, C1, . . ., Cm Conditional probabilities p(x|ω1) and p(x|ω2) do not vary appreciably within any cell Need only know which cell x falls in
Analytical Analysis
Analytical Analysis
Analytical Analysis
Results of Simulation Experiments
Discussions on Error Rate for Given n For every curve involving finite n there is an optimal number of cells At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q If the number of cells becomes too large, there will not be enough training patterns to fill them Eventually the number of patterns in most cells will be zero
Discussions on Error Rate for Given n For n = 500, the minimal error rate occurs somewhere around m = 20 Form the cells by dividing each feature axis into l intervals With d features, m = l^d If l = 2, using more than four or five binary features will lead to worse rather than better performance
Test Errors vs.  Number of Training Patterns
Test and Training Error
Power Law
Sum and Difference of  Test and Training Error
Fraction of Dichotomies of  n  Points in  d  Dimensions That are Linear
One-Dimensional Case f(n = 4, d = 1) = 0.5

  Labels  Linearly separable?    Labels  Linearly separable?
   0000          X                1000          X
   0001          X                1001
   0010                           1010
   0011          X                1011
   0100                           1100          X
   0101                           1101
   0110                           1110          X
   0111          X                1111          X

(X = linearly separable; 8 of the 16 labelings, hence f = 0.5)
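Not from the slides: a short sketch of the fraction of linearly separable dichotomies of n points in general position in d dimensions (Cover's counting argument), which reproduces the value in the table above and the capacity result on the next slide.

```python
from math import comb

def f(n, d):
    """Fraction of the 2^n dichotomies of n points in general position in d
    dimensions that are linearly separable."""
    if n <= d + 1:
        return 1.0
    return 2 * sum(comb(n - 1, i) for i in range(d + 1)) / 2 ** n

print(f(4, 1))   # 0.5, matching the table above
print(f(6, 2))   # 0.5 at n = 2(d + 1): half the dichotomies are still linear
```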
Capacity of a Separating Plane Not until n is a sizable fraction of 2(d+1) does the problem begin to become difficult Capacity of a hyperplane: at n = 2(d+1), half of the possible dichotomies are still linear Cannot expect a linear classifier to "match" a problem, on average, if the dimension of the feature space is greater than n/2 - 1
Mixture-of-Expert Models Classifiers whose decision is based on the outputs of component classifiers Also called Ensemble classifiers Modular classifiers Pooled classifiers Useful if each component classifier is highly trained (“expert”) in a different region of the feature space
Mixture Model for  Producing Patterns
Mixture-of-Experts Architecture
Ensemble Classifiers
Maximum-Likelihood Estimation
Final Decision Rule Choose the category corresponding to the maximum discriminant value after the pooling system Winner-take-all method Use the decision of the single component classifier that is the “most confident”, i.e., largest  g rj Suboptimal but simple Works well if the component classifiers are experts in separate regions
Component Classifiers without Discriminant Functions Example A KNN classifier (rank order) A decision tree (label) A neural network (analog value) A rule-based system (label)
Heuristics to Convert Outputs to Discriminant Values
Illustration Examples (outputs of three component classifiers converted to discriminant values g_i for c = 6 categories)

    Analog            Rank Order              One-of-c
  value    g_i       rank    g_i             value   g_i
   0.1    0.111       4th   3/21 = 0.143       0     0.0
   0.2    0.129       2nd   5/21 = 0.238       0     0.0
   0.3    0.143       1st   6/21 = 0.286       0     0.0
   0.9    0.260       5th   2/21 = 0.095       0     0.0
   0.6    0.193       6th   1/21 = 0.048       1     1.0
   0.4    0.158       3rd   4/21 = 0.190       0     0.0
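Not from the slides: small sketches of common conversion heuristics. The rank-order column above corresponds to assigning c, c-1, ..., 1 points and normalizing by c(c+1)/2 = 21; the analog column is roughly consistent with a softmax normalization, though the exact rule used in the table is not recoverable.

```python
import numpy as np

def analog_to_g(values):
    """Softmax conversion of analog outputs to discriminant values (one common heuristic)."""
    e = np.exp(np.asarray(values) - np.max(values))
    return e / e.sum()

def rank_to_g(ranks, c=6):
    """Rank-order conversion: 1st place gets c points, ..., last gets 1, then normalize.
    For c = 6 the denominator is 1 + 2 + ... + 6 = 21, as in the table above."""
    points = c + 1 - np.asarray(ranks)
    return points / (c * (c + 1) / 2)

def one_of_c_to_g(c, chosen):
    """One-of-c conversion: all discriminant value on the single chosen category."""
    g = np.zeros(c)
    g[chosen] = 1.0
    return g

print(rank_to_g([4, 2, 1, 5, 6, 3]))   # 3/21, 5/21, 6/21, 2/21, 1/21, 4/21 as in the table
```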
Ad

More Related Content

What's hot (20)

Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Trees
ananth
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
Marina Santini
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
butest
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Madhav Mishra
 
Covering (Rules-based) Algorithm
Covering (Rules-based) AlgorithmCovering (Rules-based) Algorithm
Covering (Rules-based) Algorithm
ZHAO Sam
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learning
butest
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
Abhishek Vijayvargia
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Marina Santini
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
Functional Imperative
 
Chapter 05 k nn
Chapter 05 k nnChapter 05 k nn
Chapter 05 k nn
Raman Kannan
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
BarryK88
 
Decision tree
Decision treeDecision tree
Decision tree
Ami_Surati
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
butest
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
Raman Kannan
 
Dealing with inconsistency
Dealing with inconsistencyDealing with inconsistency
Dealing with inconsistency
Rajat Sharma
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
Marc Garcia
 
Decision trees
Decision treesDecision trees
Decision trees
Rohit Srivastava
 
Download It
Download ItDownload It
Download It
butest
 
Machine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision TreesMachine Learning Lecture 3 Decision Trees
Machine Learning Lecture 3 Decision Trees
ananth
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
Marina Santini
 
Machine Learning 1 - Introduction
Machine Learning 1 - IntroductionMachine Learning 1 - Introduction
Machine Learning 1 - Introduction
butest
 
Decision Tree - C4.5&CART
Decision Tree - C4.5&CARTDecision Tree - C4.5&CART
Decision Tree - C4.5&CART
Xueping Peng
 
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai UniversityMachine Learning Unit 3 Semester 3  MSc IT Part 2 Mumbai University
Machine Learning Unit 3 Semester 3 MSc IT Part 2 Mumbai University
Madhav Mishra
 
Covering (Rules-based) Algorithm
Covering (Rules-based) AlgorithmCovering (Rules-based) Algorithm
Covering (Rules-based) Algorithm
ZHAO Sam
 
Machine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree LearningMachine Learning 3 - Decision Tree Learning
Machine Learning 3 - Decision Tree Learning
butest
 
Decision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learningDecision tree, softmax regression and ensemble methods in machine learning
Decision tree, softmax regression and ensemble methods in machine learning
Abhishek Vijayvargia
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Marina Santini
 
Introduction to Machine Learning Classifiers
Introduction to Machine Learning ClassifiersIntroduction to Machine Learning Classifiers
Introduction to Machine Learning Classifiers
Functional Imperative
 
Data mining assignment 3
Data mining assignment 3Data mining assignment 3
Data mining assignment 3
BarryK88
 
Machine Learning for NLP
Machine Learning for NLPMachine Learning for NLP
Machine Learning for NLP
butest
 
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; KamberChapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
Chapter - 6 Data Mining Concepts and Techniques 2nd Ed slides Han &amp; Kamber
error007
 
M08 BiasVarianceTradeoff
M08 BiasVarianceTradeoffM08 BiasVarianceTradeoff
M08 BiasVarianceTradeoff
Raman Kannan
 
Dealing with inconsistency
Dealing with inconsistencyDealing with inconsistency
Dealing with inconsistency
Rajat Sharma
 
Understanding random forests
Understanding random forestsUnderstanding random forests
Understanding random forests
Marc Garcia
 
Download It
Download ItDownload It
Download It
butest
 

Viewers also liked (16)

MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.pptMSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
butest
 
Pseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning AlgorithmPseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning Algorithm
David Juboor
 
ppt.gif
ppt.gifppt.gif
ppt.gif
butest
 
Machine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’sMachine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’s
Siddharth Chakravarty
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Deepak George
 
Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法
Nagi Teramo
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence
Muhammad Ahad
 
Random forest
Random forestRandom forest
Random forest
Musa Hawamdah
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to Practice
Gilles Louppe
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
Edureka!
 
Azure Machine Learning tutorial
Azure Machine Learning tutorialAzure Machine Learning tutorial
Azure Machine Learning tutorial
Giacomo Lanciano
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
Max Lin
 
Introduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep LearningIntroduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep Learning
Terry Taewoong Um
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 
Deep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceDeep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial Intelligence
Lukas Masuch
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.pptMSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
MSShin-Machine_Learning_Algorithm_in_Period_Estimation.ppt
butest
 
Pseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning AlgorithmPseudo-Genetic Machine Learning Algorithm
Pseudo-Genetic Machine Learning Algorithm
David Juboor
 
ppt.gif
ppt.gifppt.gif
ppt.gif
butest
 
Machine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’sMachine learning algorithm for classification of activity of daily life’s
Machine learning algorithm for classification of activity of daily life’s
Siddharth Chakravarty
 
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting MachinesDecision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Decision Tree Ensembles - Bagging, Random Forest & Gradient Boosting Machines
Deepak George
 
Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法Rで学ぶ逆変換(逆関数)法
Rで学ぶ逆変換(逆関数)法
Nagi Teramo
 
Artificial Intelligence
Artificial Intelligence Artificial Intelligence
Artificial Intelligence
Muhammad Ahad
 
Understanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to PracticeUnderstanding Random Forests: From Theory to Practice
Understanding Random Forests: From Theory to Practice
Gilles Louppe
 
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
K-Means Clustering Algorithm - Cluster Analysis | Machine Learning Algorithm ...
Edureka!
 
Azure Machine Learning tutorial
Azure Machine Learning tutorialAzure Machine Learning tutorial
Azure Machine Learning tutorial
Giacomo Lanciano
 
Machine Learning on Big Data
Machine Learning on Big DataMachine Learning on Big Data
Machine Learning on Big Data
Max Lin
 
Introduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep LearningIntroduction to Machine Learning and Deep Learning
Introduction to Machine Learning and Deep Learning
Terry Taewoong Um
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Lior Rokach
 
Deep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial IntelligenceDeep Learning - The Past, Present and Future of Artificial Intelligence
Deep Learning - The Past, Present and Future of Artificial Intelligence
Lukas Masuch
 
Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests Data Science - Part V - Decision Trees & Random Forests
Data Science - Part V - Decision Trees & Random Forests
Derek Kane
 
Ad

Similar to MachineLearning.ppt (20)

An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
butest
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
chenhm
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
butest
 
1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf
RaviBhuva13
 
Intro to Feature Selection
Intro to Feature SelectionIntro to Feature Selection
Intro to Feature Selection
chenhm
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
Vimal Gupta
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
AaryanArora10
 
Pattern Recognition and understanding patterns
Pattern Recognition and understanding patternsPattern Recognition and understanding patterns
Pattern Recognition and understanding patterns
gulhanep9
 
Pattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture NotesPattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture Notes
Akshaya821957
 
ppt
pptppt
ppt
butest
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
Sadia Zafar
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
Yogendra Singh
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiers
Krish_ver2
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Sherri Gunder
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptx
Shivakrishnan18
 
ppt
pptppt
ppt
butest
 
ppt
pptppt
ppt
butest
 
Image Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptxImage Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptx
ssuseracb8ba
 
Module 5: Decision Trees
Module 5: Decision TreesModule 5: Decision Trees
Module 5: Decision Trees
Sara Hooker
 
An Introduction to boosting
An Introduction to boostingAn Introduction to boosting
An Introduction to boosting
butest
 
Intro to Model Selection
Intro to Model SelectionIntro to Model Selection
Intro to Model Selection
chenhm
 
lecture_mooney.ppt
lecture_mooney.pptlecture_mooney.ppt
lecture_mooney.ppt
butest
 
1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf1_2 Introduction to Machine Learning.pdf
1_2 Introduction to Machine Learning.pdf
RaviBhuva13
 
Intro to Feature Selection
Intro to Feature SelectionIntro to Feature Selection
Intro to Feature Selection
chenhm
 
A tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbiesA tour of the top 10 algorithms for machine learning newbies
A tour of the top 10 algorithms for machine learning newbies
Vimal Gupta
 
Sample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdfSample_Subjective_Questions_Answers (1).pdf
Sample_Subjective_Questions_Answers (1).pdf
AaryanArora10
 
Pattern Recognition and understanding patterns
Pattern Recognition and understanding patternsPattern Recognition and understanding patterns
Pattern Recognition and understanding patterns
gulhanep9
 
Pattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture NotesPattern Recognition- Basic Lecture Notes
Pattern Recognition- Basic Lecture Notes
Akshaya821957
 
Probability distribution Function & Decision Trees in machine learning
Probability distribution Function  & Decision Trees in machine learningProbability distribution Function  & Decision Trees in machine learning
Probability distribution Function & Decision Trees in machine learning
Sadia Zafar
 
L1 intro2 supervised_learning
L1 intro2 supervised_learningL1 intro2 supervised_learning
L1 intro2 supervised_learning
Yogendra Singh
 
2.7 other classifiers
2.7 other classifiers2.7 other classifiers
2.7 other classifiers
Krish_ver2
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Sherri Gunder
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
butest
 
Module III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptxModule III - Classification Decision tree (1).pptx
Module III - Classification Decision tree (1).pptx
Shivakrishnan18
 
Image Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptxImage Recognition of recognition pattern.pptx
Image Recognition of recognition pattern.pptx
ssuseracb8ba
 
Module 5: Decision Trees
Module 5: Decision TreesModule 5: Decision Trees
Module 5: Decision Trees
Sara Hooker
 
Ad

More from butest (20)

EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
PPT
PPTPPT
PPT
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
hier
hierhier
hier
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 
EL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBEEL MODELO DE NEGOCIO DE YOUTUBE
EL MODELO DE NEGOCIO DE YOUTUBE
butest
 
1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同1. MPEG I.B.P frame之不同
1. MPEG I.B.P frame之不同
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Timeline: The Life of Michael Jackson
Timeline: The Life of Michael JacksonTimeline: The Life of Michael Jackson
Timeline: The Life of Michael Jackson
butest
 
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
Popular Reading Last Updated April 1, 2010 Adams, Lorraine The ...
butest
 
LESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIALLESSONS FROM THE MICHAEL JACKSON TRIAL
LESSONS FROM THE MICHAEL JACKSON TRIAL
butest
 
Com 380, Summer II
Com 380, Summer IICom 380, Summer II
Com 380, Summer II
butest
 
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet JazzThe MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
butest
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
butest
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
butest
 
Facebook
Facebook Facebook
Facebook
butest
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
butest
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
butest
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
butest
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
butest
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
butest
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
butest
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
butest
 

MachineLearning.ppt

  • 1. Algorithm-Independent Machine Learning Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia, National Taiwan University
  • 2. Some Fundamental Problems Which algorithm is “best”? Are there any reasons to favor one algorithm over another? Is “Occam’s razor” really so evident? Do simpler or “smoother” classifiers generalize better? If so, why? Are there fundamental “conservation” or “constraint” laws other than Bayes error rate?
  • 3. Meaning of “Algorithm-Independent” Mathematical foundations that do not depend upon the particular classifier or learning algorithm used e.g., bias and variance concept Techniques that can be used in conjunction with different learning algorithm, or provide guidance in their use e.g., cross-validation and resampling techniques
  • 4. Roadmap No pattern classification method is inherently superior to any other Ways to quantify and adjust the “match” between a learning algorithm and the problem it addresses Estimation of accuracies and comparison of different classifiers with certain assumptions Methods for integrating component classifiers
  • 5. Generalization Performance by Off-Training Set Error Consider a two-category problem Training set D Training patterns x i y i = 1 or -1 for i = 1, . . ., n is generated by unknown target function F ( x ) to be learned F ( x ) is often with a random component The same input could lead to different categories Giving non-zero Bayes error
  • 6. Generalization Performance by Off-Training Set Error Let H be the (discrete) set of hypotheses, or sets of parameters to be learned A particular h belongs to H quantized weights in neural network Parameters q in a functional model Sets of decisions in a tree P ( h ) : prior probability that the algorithm will produce hypothesis h after training
  • 7. Generalization Performance by Off-Training Set Error P ( h | D ) : probability the algorithm will yield h when trained on data D Nearest-neighbor and decision tree: non-zero only for a single hypothesis Neural network: can be a broad distribution E : error for zero-one or other loss function
  • 8. Generalization Performance by Off-Training Set Error A natural measure Expected off-training-set classification error for the k th candidate learning algorithm
  • 9. No Free Lunch Theorem
  • 10. No Free Lunch Theorem For any two algorithms No matter how clever in choosing a “good” algorithm and a “bad” algorithm, if all target functions are equally likely, the “good” algorithm will not outperform the “bad” one There is at least one target function for which random guessing is a better algorithm
  • 11. No Free Lunch Theorem Even if we know D , averaged over all target functions no algorithm yields an off-training set error that is superior to any other
  • 12. Example 1 No Free Lunch for Binary Data -1 1 1 111 -1 1 1 110 -1 1 -1 101 -1 1 1 100 -1 1 -1 011 1 1 1 010 -1 -1 -1 001 D 1 1 1 000 h 2 h 1 F x
  • 13. No Free Lunch Theorem
  • 14. Conservation in Generalization Can not achieve positive performance on some problems without getting an equal and opposite amount of negative performance on other problems Can trade performance on problems we do not expect to encounter with those that we do expect to encounter It is the assumptions about the learning domains that are relevant
  • 15. Ugly Duckling Theorem In the absence of assumptions there is no privileged or “best” feature representation Even the notion of similarity between patterns depends implicitly on assumptions that may or may not be correct
  • 16. Venn Diagram Representation of Features as Predicates
  • 17. Rank of a Predicate Number of the simplest or indivisible elements it contains Example: rank r = 1 x 1 : f 1 AND NOT f 2 x 2 : f 1 AND f 2 x 3 : f 2 AND NOT f 1 x 4 : NOT ( f 1 OR f 2 ) C (4,1) = 4 predicates
  • 18. Examples of Rank of a Predicate Rank r = 2 x 1 OR x 2 : f 1 x 1 OR x 3 : f 1 XOR f 2 x 1 OR x 4 : NOT f 2 x 2 OR x 3 : f 2 x 2 OR x 4 : ( f 1 AND f 2 ) OR NOT ( f 1 OR f 2 ) x 3 OR x 4 : NOT f 1 C (4, 2) = 6 predicates
  • 19. Examples of Rank of a Predicate Rank r = 3 x 1 OR x 2 OR x 3 : f 1 OR f 2 x 1 OR x 2 OR x 4 : f 1 OR NOT f 2 x 1 OR x 3 OR x 4 : NOT ( f 1 AND f 2 ) x 2 OR x 3 OR x 4 : f 2 OR NOT f 1 C (4, 3) = 4 predicates
  • 20. Total Number of Predicates in Absence of Constraints Let d be the number of regions in the Venn diagrams (i.e., number of distinctive patterns, or number of possible values determined by combinations of the features)
  • 21. A Measure of Similarity in Absence of Prior Information Number of features or attributes shared by two patterns Concept difficulties e.g., blind_in_right_eyes and blind_in_left_eyes , (1,0) more similar to (1,1) and (0,0) than to (0,1) There are always multiple ways to represent vectors of attributes e.g. blind_in_right_eye and same_in_both_eyes No principled reason to prefer one of these representations over another
  • 22. A Plausible Measure of Similarity in Absence of Prior Information Number of predicates the patterns share Consider two distinct patterns no predicates of rank 1 is shared 1 predicate of rank 2 is shared C ( d -2, 1) predicates of rank 3 is shared C ( d -2, r -2) predicates of rank r is shared Total number of predicates shared
  • 23. Ugly Duckling Theorem Given a finite set of predicates that enables us to distinguish any two patterns The number of predicates shared by any two such patterns is constant and independent of the choice of those patterns If pattern similarity is based on the total number of predicates shared, any two patterns are “equally similar”
  • 24. Ugly Duckling Theorem No problem-independent or privileged or “best” set of features or feature attributes Also applies to a continuous feature spaces
  • 25. Minimum Description Length (MDL) Find some irreducible, smallest representation (“signal”) of all members of a category All variation among the individual patterns is then “noise” By simplifying recognizers appropriately, the signal can be retained while the noise is ignored
  • 26. Algorithm Complexity (Kolmogorov Complexity) Kolmogrov complexity of binary string x On an abstract computer (Turing machine) U As the shortest program (binary) string y Without additional data, computes the string x and halts
  • 27. Algorithm Complexity Example Suppose x consists solely of n 1 s Use some fixed number of bits k to specify a loop for printing a string of 1 s Need log 2 n more bits to specify the iteration number n , the condition for halting Thus K ( x ) = O (log 2 n )
  • 28. Algorithm Complexity Example Constant  =11.001001000011111110110101010001… The shortest program is the one that can produce any arbitrary large number of consecutive digits of  Thus K ( x ) = O ( 1 )
  • 29. Algorithm Complexity Example Assume that x is a “truly” random binary string Can not be expressed as a shorter string Thus K ( x ) = O (| x |)
  • 30. Minimum Description Length (MDL) Principle Minimize the sum of the model’s algorithmic complexity and the description of the training data with respect to that model
  • 31. An Application of MDL Principle For decision-tree classifiers, a model h specifies the tree and the decisions at the nodes Algorithmic complexity of h is proportional to the number of nodes Complexity of data is expressed in terms of the entropy (in bits) of the data Tree-pruning based on entropy is equivalent to the MDL principle
  • 32. Convergence of MDL Classifiers MDL classifiers are guaranteed to converge to the ideal or true model in the limit of more and more data Can not prove that the MDL principle leads to a superior performance in the finite data case Still consistent with the no free lunch principle
  • 33. Bayesian Perspective of MDL Principle
  • 34. Overfitting Avoidance Avoiding overfitting or minimizing description length are not inherently beneficial Amount to a preference over the forms or parameters of classifiers Beneficial only if they match their problems There are problems that overfitting avoidance leads to worse performances
  • 35. Explanation of Success of Occam’s Razor Through evolution and strong selection pressure on our neurons Likely to ignore problems for which Occam’s razor does not hold Researchers naturally develop simple algorithms before more complex ones – a bias imposed by methodology Principle of satisficing: creating an adequate though possibly nonoptimal solution
  • 36. Bias and Variance Ways to measure the “match” or “alignment” of the learning algorithm to the classification problem Bias Accuracy or quality of the match High bias implies a poor match Variance Precision or specificity of the match High variance implies a weak match
  • 37. Bias and Variance for Regression
  • 39. Bias-Variance Dilemma Given a target function Model has many parameters Generally low bias Fits data well Yields high variance Model has few parameters Generally high bias May not fit data well The fit does not change much for different data sets (low variance)
  • 40. Bias-Variance Dilemma Best way to get low bias and low variance Have prior information about the target function Virtually never get zero bias and zero variance Only one learning problem to solve and the answer is already known Large amount of training data will yield improved performance as the model is sufficiently general
  • 41. Bias and Variance for Classification Reference J. H. Friedman, “On bias, variance, 0/1-loss, and the curse of dimensionality,” Data Mining and Knowledge Discovery , vol. 1, no. 1, pp. 55-77, 1997.
  • 42. Bias and Variance for Classification Two-category problem Target function F ( x )=Pr[ y =1| x ]=1-Pr[ y =0| x ] Discriminant function y d = F ( x ) +   : zero mean, centered binomialy distributed random variable F ( x ) = E [ y d | x ]
  • 43. Bias and Variance for Classification Find estimate g ( x ; D ) to minimize E D [( g ( x ; D )- y d ) 2 ] The estimated g ( x ; D ) can be used to classify x The bias and variance concept for regression can be applied to g ( x ; D ) as an estimate of F ( x ) However, this is not related to classification error directly
  • 44. Bias and Variance for Classification
• 48. Bias and Variance for Classification The sign of the boundary bias affects the role of variance in the error If the sign is positive, low variance is generally important for accurate classification Low boundary bias need not result in a lower error rate Simple methods often have lower variance and need not be inferior to more flexible methods
  • 49. Error rates and optimal K vs. N for d = 20 in KNN
• 50. Estimation and classification error vs. d for N = 12800 in KNN
  • 53. Generalization to Estimates of Other Statistics Estimator of other statistics median, 25th percentile, mode, etc. Leave-one-out estimate Jackknife estimate and its related variance
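The jackknife formulas themselves are not reproduced on these slides; the sketch below follows the usual convention in which the jackknife estimate of a statistic is the mean of the leave-one-out values and its variance estimate is scaled by (n - 1)/n. The function name jackknife and the sample data are purely illustrative.

```python
import numpy as np

def jackknife(stat, data):
    """Leave-one-out (jackknife) estimate of a statistic and of its variance."""
    data = np.asarray(data)
    n = len(data)
    # Recompute the statistic with the i-th sample deleted, for every i
    loo = np.array([stat(np.delete(data, i)) for i in range(n)])
    estimate = loo.mean()                                   # jackknife estimate of the statistic
    variance = (n - 1) / n * np.sum((loo - estimate) ** 2)  # jackknife estimate of its variance
    return estimate, variance

x = np.array([1.2, 3.4, 0.7, 2.9, 5.1, 2.2, 4.0])
print(jackknife(np.median, x))
```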
• 56. Bootstrap Randomly select n points from the training data set D, with replacement Repeat this process independently B times to yield B bootstrap data sets, treated as independent sets The bootstrap estimate of a statistic θ is the mean of the estimates computed on the B bootstrap sets
  • 57. Bootstrap Bias and Variance Estimate
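The estimates shown on this slide were lost in extraction; the usual bootstrap bias and variance estimates, with θ̂ the statistic computed on the original set D and θ̂^{*(b)} its value on the b-th bootstrap set, are

$$
\hat{\theta}^{*(\cdot)}=\frac{1}{B}\sum_{b=1}^{B}\hat{\theta}^{*(b)},\qquad
\widehat{\mathrm{Bias}}_{\mathrm{boot}}=\hat{\theta}^{*(\cdot)}-\hat{\theta},\qquad
\widehat{\mathrm{Var}}_{\mathrm{boot}}=\frac{1}{B}\sum_{b=1}^{B}\left(\hat{\theta}^{*(b)}-\hat{\theta}^{*(\cdot)}\right)^{2}
$$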
  • 58. Properties of Bootstrap Estimates The larger the number B of bootstrap samples, the more satisfactory is the estimate of a statistic and its variance B can be adjusted to the computational resources Jackknife estimate requires exactly n repetitions
• 59. Bagging Arcing (adaptive reweighting and combining): reusing or selecting data in order to improve classification, e.g., AdaBoost Bagging (bootstrap aggregation): multiple versions of D are created by drawing n' < n samples from D with replacement Each set trains a component classifier The final decision is based on a vote of the component classifiers
  • 60. Unstable Algorithm and Bagging Unstable algorithm “small” changes in training data lead to significantly different classifiers and relatively “large” changes in accuracy In general bagging improves recognition for unstable classifiers Effectively averages over such discontinuities No convincing theoretical derivations or simulation studies showing bagging helps for all unstable classifiers
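A minimal bagging sketch, assuming integer-coded class labels and using a scikit-learn decision tree as the unstable component classifier (any other learner with fit/predict would do); the function names and the 0.8 sampling fraction are arbitrary choices for illustration, not values from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_classifiers=25, sample_frac=0.8, seed=0):
    """Train component classifiers on bootstrap draws of n' < n samples from D."""
    rng = np.random.default_rng(seed)
    n_prime = int(sample_frac * len(X))
    models = []
    for _ in range(n_classifiers):
        idx = rng.choice(len(X), size=n_prime, replace=True)  # draw with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final decision by majority vote of the component classifiers."""
    votes = np.array([m.predict(X) for m in models])          # (n_classifiers, n_samples)
    # Majority vote per column; assumes labels are non-negative integers
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```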
• 61. Boosting Create a first classifier with accuracy on the training set greater than average (a weak learner) Add new component classifiers to form an ensemble The ensemble's joint decision rule has much higher accuracy on the training set The classification performance has been "boosted"
  • 62. Boosting Procedure Trains successive component classifiers with a subset of the training data that is most “informative” given the current set of component classifiers
  • 63. Training Data and Weak Learner
• 64. "Most Informative" Set Given C 1 Flip a fair coin Heads: select remaining samples from D and present them to C 1 one by one until C 1 misclassifies a pattern; add that misclassified pattern to D 2 Tails: find a pattern that C 1 classifies correctly and add it to D 2
  • 65. Third Data Set and Classifier C 3 Randomly select a remaining training pattern Add the pattern if C 1 and C 2 disagree, otherwise ignore it
  • 66. Classification of a Test Pattern If C 1 and C 2 agree, use their label If they disagree, trust C 3
• 67. Choosing n 1 For the final vote, n 1 ≈ n 2 ≈ n 3 ≈ n/3 is desired A reasonable first guess is n 1 = n/3 If the problem is simple, n 2 << n 1; if it is difficult, n 2 becomes too large In practice we need to run the whole boosting procedure a few times To use the full training set To get roughly equal partitions of the training set
  • 68. AdaBoost Adaptive boosting Most popular version of basic boosting Continue adding weak learners until some desired low training error has been achieved
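The AdaBoost pseudocode from the intervening slides is not preserved here, so the following is a hedged reconstruction of the standard discrete AdaBoost loop, using decision stumps as weak learners and labels coded as +1/-1; treat it as a sketch rather than the slides' exact algorithm.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_rounds=50):
    """Discrete AdaBoost with decision stumps; y must be coded as +1 / -1."""
    n = len(X)
    w = np.full(n, 1.0 / n)                    # start with uniform weights on the training data
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum()               # weighted training error of this weak learner
        if err >= 0.5:                         # no better than chance: stop adding learners
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)         # emphasize the patterns this learner got wrong
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def adaboost_predict(stumps, alphas, X):
    """Sign of the alpha-weighted vote of the weak learners."""
    return np.sign(sum(a * s.predict(X) for s, a in zip(stumps, alphas)))
```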
  • 72. AdaBoost vs. No Free Lunch Theorem Boosting only improves classification if the component classifiers perform better than chance Can not be guaranteed a priori Exponential reduction in error on the training set does not ensure reduction of the off-training set error or generalization Proven effective in many real-world applications
  • 73. Learning with Queries Set of unlabeled patterns Exists some (possibly costly) way of labeling any pattern To determine which unlabeled patterns would be most informative if they were presented as a query to an oracle Also called active learning or interactive learning Can be refined further as cost-based learning
  • 74. Application Example Design a classifier for handwritten numerals Using unlabeled pixel images scanned from documents from a corpus too large to label every pattern Human as the oracle
  • 75. Learning with Queries Begin with a preliminary, weak classifier developed with a small set of labeled samples Two related methods for selecting an informative pattern Confidence-based query selection Voting-based or committee-based query selection
  • 76. Selecting Most Informative Patterns Confidence-based query selection Pattern that two largest discriminant functions have nearly the same value i.e., patterns lie near the current decision boundaries Voting-based query selection Pattern that yields the greatest disagreement among the k resulting category labels
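A sketch of confidence-based query selection, assuming we already have the discriminant values g_j(x) for every unlabeled pattern (the array layout and function name are illustrative only): the pattern whose two largest discriminants are closest is the one queried next.

```python
import numpy as np

def select_query(discriminants):
    """Return the index of the unlabeled pattern lying closest to the decision boundary.

    discriminants: array of shape (n_unlabeled, n_classes) holding g_j(x)
    for each unlabeled pattern.
    """
    top2 = np.sort(discriminants, axis=1)[:, -2:]   # two largest discriminants per pattern
    margin = top2[:, 1] - top2[:, 0]                # small margin = low confidence
    return int(np.argmin(margin))

# Toy example: three unlabeled patterns, three classes
g = np.array([[0.90, 0.05, 0.05],
              [0.40, 0.38, 0.22],    # nearly tied top two: most informative query
              [0.70, 0.20, 0.10]])
print(select_query(g))               # -> 1
```

Voting-based selection is analogous: train k classifiers on subsets of D and query the unlabeled pattern on which their predicted labels disagree the most.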
• 78. Arcing and Active Learning vs. IID Sampling If we take a model of the true distribution and train it with the highly skewed distribution produced by active learning, the final classifier accuracy might be low Resampling methods generally do not attempt to model or fit the full category distributions They do not fit parameters in a model, but instead seek decision boundaries directly
• 79. Arcing and Active Learning As the number of component classifiers is increased, resampling, boosting and related methods effectively broaden the class of implementable functions They allow us to try to "match" the final classifier to the problem by indirectly adjusting the bias and variance They can be used with arbitrary classification techniques
  • 80. Estimating the Generalization Rate See if the classifier performs well enough to be useful Compare its performance with that of a competing design Requires making assumptions about the classifier or the problem or both All the methods given here are heuristic
  • 81. Parametric Models Compute from the assumed parametric model Example: two-class multivariate normal case Bhattacharyya or Chernoff bounds using estimated mean and covariance matrix Problems Overly optimistic Always suspect the model Error rate may be difficult to compute
  • 82. Simple Cross-Validation Randomly split the set of labeled training samples D into a training set and a validation set
  • 83. m -Fold Cross-Validation Training set is randomly divided into m disjoint sets of equal size n/m The classifier is trained m times Each time with a different set held out as a validation set Estimated performance is the mean of these m errors When m = n , it is in effect the leave-one-out approach
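A minimal m-fold cross-validation sketch, assuming a classifier factory with scikit-learn-style fit/predict methods and NumPy arrays for X and y (the make_classifier interface is a convenience assumed here, not something from the slides); setting m = n recovers leave-one-out.

```python
import numpy as np

def m_fold_cv_error(make_classifier, X, y, m=5, seed=0):
    """Estimate the error rate by m-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), m)   # m disjoint sets of roughly n/m samples
    errors = []
    for i in range(m):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(m) if j != i])
        clf = make_classifier().fit(X[train], y[train])  # train with one fold held out
        errors.append(np.mean(clf.predict(X[val]) != y[val]))
    return float(np.mean(errors))                        # mean of the m validation errors
```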
• 84. Forms of Learning for Cross-Validation In neural networks of fixed topology: the number of epochs or presentations of the training set, and the number of hidden units The width of the Gaussian window in Parzen windows The optimal k in the k-nearest-neighbor classifier
• 85. Portion γ of D as a Validation Set Should be small The validation set is used merely to know when to stop adjusting parameters The training set is used to set the large number of parameters or degrees of freedom Traditional default: set γ = 0.1 Proven effective in many applications
• 86. Anti-Cross-Validation Cross-validation need not work on every problem Anti-cross-validation: halt when the validation error reaches its first local maximum One must explore different values of γ, and possibly abandon the use of cross-validation altogether if performance cannot be improved
  • 87. Estimation of Error Rate Let p be the true and unknown error rate of the classifier Assume k of the n’ independent, randomly drawn test samples are misclassified, then k has the binomial distribution Maximum-likelihood estimate for p
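The formulas referred to here are the standard ones: if the true error rate is p, the number of misclassified test samples k follows

$$
P(k)=\binom{n'}{k}\,p^{k}(1-p)^{\,n'-k},\qquad \hat{p}=\frac{k}{n'}
$$

and for reasonably large n' an approximate 95% interval is $\hat{p}\pm 1.96\sqrt{\hat{p}(1-\hat{p})/n'}$; the figure on the following slide presumably plots such 95% limits as a function of the estimated p and n'.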
  • 88. 95% Confidence Intervals for a Given Estimated p
  • 89. Jackknife Estimation of Classification Accuracy Use leave-one-out approach Obtain Jackknife estimate for the mean and variance of the leave-one-out accuracies Use traditional hypothesis testing to see if one classifier is superior to another with statistical significance
  • 91. Bootstrap Estimation of Classification Accuracy Train B classifiers, each with a different bootstrap data set Test on other bootstrap data sets Bootstrap estimate is the mean of these bootstrap accuracies
  • 92. Maximum-Likelihood Comparison (ML-II) Also called maximum-likelihood selection Find the maximum-likelihood parameters for each of the candidate models Calculate the resulting likelihoods (evidences) Choose the model with the largest likelihood
• 94. Scientific Process Reference D. J. C. MacKay, "Bayesian interpolation," Neural Computation, vol. 4, no. 3, pp. 415-447, 1992.
• 97. Concept of Occam Factor An inherent bias toward simple models, i.e., those with a small prior volume of parameters Models that are overly complex, with a large prior parameter volume, are automatically self-penalizing
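Following the MacKay paper cited above, the evidence for a model M can be written (roughly; the slide's exact notation is not preserved) as

$$
p(D\mid M)=\int p(D\mid \boldsymbol{\theta},M)\,p(\boldsymbol{\theta}\mid M)\,d\boldsymbol{\theta}
\;\approx\;\underbrace{p(D\mid \hat{\boldsymbol{\theta}},M)}_{\text{best-fit likelihood}}\times
\underbrace{\frac{\Delta\theta_{\text{posterior}}}{\Delta\theta_{\text{prior}}}}_{\text{Occam factor}}
$$

A complex model spreads its prior over a large parameter volume, so its Occam factor is small and it is penalized unless the improved fit compensates.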
  • 98. Evidence for Gaussian Parameters
  • 99. Bayesian Model Selection vs. No Free Lunch Theorem Bayesian model selection Ignore the prior over the space of models Effectively assume that it is uniform Not take into account how models correspond to underlying target functions Usually corresponds to non-uniform prior over target functions No Free Lunch Theorem Allows that for some particular non-uniform prior there may be an algorithm that gives better than chance, or even optimal, results
• 100. Error Rate as a Function of Number n of Training Samples Classifiers trained with a small number of samples will not perform well on new data Typical steps Estimate the unknown parameters from samples Use these estimates to determine the classifier Calculate the error rate for the resulting classifier
• 101. Analytical Analysis Case of two categories having equal prior probabilities Partition the feature space into some m disjoint cells, C 1 , . . ., C m The conditional probabilities p(x|ω1) and p(x|ω2) do not vary appreciably within any cell We need only know which cell x falls into
  • 105. Results of Simulation Experiments
• 106. Discussions on Error Rate for Given n For every curve involving finite n there is an optimal number of cells At first, increasing the number of cells makes it easier to distinguish between the distributions represented by p and q If the number of cells becomes too large, there will not be enough training patterns to fill them Eventually the number of patterns in most cells will be zero
• 107. Discussions on Error Rate for Given n For n = 500, the minimal error rate occurs somewhere around m = 20 Form the cells by dividing each feature axis into l intervals; with d features, m = l^d If l = 2, using more than four or five binary features will lead to worse rather than better performance
  • 108. Test Errors vs. Number of Training Patterns
  • 111. Sum and Difference of Test and Training Error
  • 112. Fraction of Dichotomies of n Points in d Dimensions That are Linear
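The counting result these slides quote (Cover, 1965) gives the fraction of the 2^n possible dichotomies of n points in general position in d dimensions that are linearly separable:

$$
f(n,d)=\begin{cases}
1, & n\le d+1,\\[4pt]
\dfrac{2}{2^{n}}\displaystyle\sum_{i=0}^{d}\binom{n-1}{i}, & n>d+1.
\end{cases}
$$

For example, f(4, 1) = (2/16)(1 + 3) = 0.5, matching the one-dimensional case on the next slide.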
• 113. One-Dimensional Case f(n = 4, d = 1) = 0.5 Of the 16 possible labelings of four points on a line, the 8 that are linearly separable are 0000, 0001, 0011, 0111, 1111, 1110, 1100, 1000 The remaining 8 (0010, 0100, 0101, 0110, 1001, 1010, 1011, 1101) are not linearly separable
• 114. Capacity of a Separating Plane Not until n is a sizable fraction of 2(d + 1) does the problem begin to become difficult The capacity of a hyperplane is n = 2(d + 1) sample points, at which half of the possible dichotomies are still linear We can not expect a linear classifier to "match" a problem, on average, if the dimension of the feature space is greater than n/2 - 1
  • 115. Mixture-of-Expert Models Classifiers whose decision is based on the outputs of component classifiers Also called Ensemble classifiers Modular classifiers Pooled classifiers Useful if each component classifier is highly trained (“expert”) in a different region of the feature space
  • 116. Mixture Model for Producing Patterns
  • 120. Final Decision Rule Choose the category corresponding to the maximum discriminant value after the pooling system Winner-take-all method Use the decision of the single component classifier that is the “most confident”, i.e., largest g rj Suboptimal but simple Works well if the component classifiers are experts in separate regions
  • 121. Component Classifiers without Discriminant Functions Example A KNN classifier (rank order) A decision tree (label) A neural network (analog value) A rule-based system (label)
• 122. Heuristics to Convert Outputs to Discriminant Values
• 123. Illustration Examples (each row corresponds to one of the c = 6 categories; each component classifier's raw output is converted to a discriminant value g_i)
Analog   g_i      Rank Order   g_i            One-of-c   g_i
0.1      0.111    4th          3/21 = 0.143   0          0.0
0.2      0.129    2nd          5/21 = 0.238   0          0.0
0.3      0.143    1st          6/21 = 0.286   0          0.0
0.9      0.260    5th          2/21 = 0.095   0          0.0
0.6      0.193    6th          1/21 = 0.048   1          1.0
0.4      0.158    3rd          4/21 = 0.190   0          0.0
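A sketch of one common set of conversion heuristics consistent with the table above: a softmax map for analog outputs, weights proportional to c + 1 - rank for rank-order outputs, and the one-of-c vector used as is. The rank-order column is reproduced exactly; the analog column comes out close but not identical, so the slide may use a slightly different analog heuristic.

```python
import numpy as np

def analog_to_g(values):
    """Softmax-style conversion of analog outputs to discriminant values."""
    e = np.exp(np.asarray(values, dtype=float))
    return e / e.sum()

def rank_to_g(ranks):
    """Rank-order conversion: best rank gets weight c, worst gets 1, then normalize."""
    ranks = np.asarray(ranks)
    c = len(ranks)
    weights = c + 1 - ranks            # rank 1 -> c, ..., rank c -> 1
    return weights / weights.sum()     # denominator is c(c+1)/2, i.e. 21 for c = 6

def one_of_c_to_g(outputs):
    """A one-of-c output is already a valid set of discriminant values."""
    return np.asarray(outputs, dtype=float)

print(rank_to_g(np.array([4, 2, 1, 5, 6, 3])))       # 3/21, 5/21, 6/21, 2/21, 1/21, 4/21
print(analog_to_g([0.1, 0.2, 0.3, 0.9, 0.6, 0.4]))   # ~0.117, 0.129, 0.143, 0.260, 0.193, 0.158
```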