This document discusses classification and prediction techniques for data analysis. Classification predicts categorical labels, while prediction models continuous values. Common algorithms include decision tree induction and Naive Bayesian classification. Decision trees use measures like information gain to build classifiers by recursively partitioning training data. Naive Bayesian classifiers apply Bayes' theorem to estimate probabilities for classification. Both approaches are popular due to their accuracy, speed and interpretability.
2. Introduction
Classification is a form of data analysis that extracts models describing important data classes.
Such models, called classifiers, predict categorical (discrete, unordered) class labels.
For example, we can build a classification model to categorize bank loan applications as either
safe or risky.
Such analysis can help provide us with a better understanding of the data at large.
Many classification methods have been proposed by researchers in machine learning, pattern
recognition, and statistics.
Classification has numerous applications, including fraud detection, target marketing,
performance prediction, manufacturing, and medical diagnosis.
3. What Is Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are “safe” and which
are “risky” for the bank.
A marketing manager at AllElectronics needs data analysis to help guess whether a customer with
a given profile will buy a new computer.
In each of these examples, the data analysis task is classification, where a model or classifier is
constructed to predict class (categorical) labels, such as “safe” or “risky” for the loan application
data; “yes” or “no” for the marketing data.
Suppose that the marketing manager wants to predict how much a given customer will spend
during a sale at AllElectronics.
This data analysis task is an example of numeric prediction, where the model constructed predicts
a continuous-valued function, or ordered value, as opposed to a class label.
This model is a predictor.
Regression analysis is a statistical methodology that is most often used for numeric prediction.
Classification and numeric prediction are the two major types of prediction problems.
4. General Approach to Classification
“How does classification work?”
Data classification is a two-step process, consisting of a learning step (where a classification
model is constructed) and a classification step (where the model is used to predict class labels
for given data).
The process is shown for the loan application data of Figure below.
6. In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the classifier
by analyzing or “learning from” a training set made up of database tuples and their associated
class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n
measurements made on the tuple from n database attributes, A1, A2,..., An.
Each tuple, X, is assumed to belong to a predefined class as determined by another database
attribute called the class label attribute.
The class label attribute is discrete-valued and unordered.
It is categorical in that each value serves as a category or class.
The individual tuples making up the training set are referred to as training tuples and are
randomly sampled from the database under analysis.
7. Because the class label of each training tuple is provided, this step is also known as supervised
learning.
This first step of the classification process can also be viewed as the learning of a function, y = f
(X), that can predict the associated class label y of a given tuple X.
In this view, we wish to learn a function that separates the data classes.
Typically, this mapping is represented in the form of classification rules, decision trees, or
mathematical formulae.
In our example, the mapping is represented as classification rules that identify loan applications
as being either safe or risky (Figure a).
The rules can be used to categorize future data tuples, as well as provide deeper insight into the
data contents.
8. “What about classification accuracy?”
In the second step (Figure b), the model is used for classification.
First, the predictive accuracy of the classifier is estimated.
If we were to use the training set to measure the classifier’s accuracy, this estimate would likely be
optimistic, because the classifier tends to overfit the data .
Therefore, a test set is used, made up of test tuples and their associated class labels.
They are independent of the training tuples, meaning that they were not used to construct the
classifier.
9. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.
The associated class label of each test tuple is compared with the learned classifier’s class
prediction for that tuple.
If the accuracy of the classifier is considered acceptable, the classifier can be used to classify
future data tuples for which the class label is not known.
For example, the classification rules learned in Figure (a) from the analysis of data from previous
loan applications can be used to approve or reject new or future loan applicants.
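The two-step process above (learning on a training set, then estimating accuracy on an independent test set before deployment) maps directly onto common library workflows. Below is a minimal sketch assuming scikit-learn is available; the tiny loan table and its column values are hypothetical, purely for illustration.

```python
# Minimal sketch of the learning step and the classification step (scikit-learn assumed).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical training tuples: (income level, has_job) -> loan decision.
X = [["low", "no"], ["low", "yes"], ["medium", "yes"], ["high", "yes"],
     ["high", "no"], ["medium", "no"], ["low", "no"], ["high", "yes"]]
y = ["risky", "safe", "safe", "safe", "risky", "risky", "risky", "safe"]

# Hold out an independent test set so the accuracy estimate is not optimistic.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learning step: build the classifier from the training tuples and their labels.
clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"), DecisionTreeClassifier())
clf.fit(X_train, y_train)

# Classification step: predict labels for tuples not used to construct the classifier.
print(accuracy_score(y_test, clf.predict(X_test)))
```

If the estimated accuracy is acceptable, the fitted model would then be applied to future tuples whose class label is unknown, exactly as described above.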
10. Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal
node) holds a class label.
The topmost node in a tree is the root node.
A typical decision tree is shown in Figure below.
11. It represents the concept buys computer, that is, it predicts whether a customer at AllElectronics is
likely to purchase a computer.
Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
Some decision tree algorithms produce only binary trees (where each internal node branches to
exactly two other nodes), whereas others can produce nonbinary trees.
“How are decision trees used for classification?”
Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple
are tested against the decision tree.
A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
12. “Why are decision tree classifiers so popular?”
The construction of decision tree classifiers does not require any domain knowledge or parameter
setting, and therefore is appropriate for exploratory knowledge discovery.
Decision trees can handle multidimensional data.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy.
However, successful use may depend on the data at hand.
Decision tree induction algorithms have been used for classification in many application areas such
as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
13. Decision Tree Induction
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser).
This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin,
and P. T. Stone.
Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer
supervised learning algorithms are often compared.
In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the
book Classification and Regression Trees (CART), which described the generation of binary decision
trees.
ID3 and CART were invented independently of one another at around the same time, yet follow a
similar approach for learning decision trees from training tuples.
14. ID3, C4.5, and CART adopt a greedy approach in which decision trees are constructed in a top-
down recursive divide-and-conquer manner.
Most algorithms for decision tree induction follow a top-down approach, which starts with a
training set of tuples and their associated class labels.
The training set is recursively partitioned into smaller subsets as the tree is being built.
15. A basic decision tree algorithm is summarized in Figure below.
16. The strategy is as follows.
1. The algorithm is called with three parameters: D (data partition), attribute list, and Attribute selection
method.
Initially, D, is the complete set of training tuples and their associated class labels.
The parameter attribute list is a list of attributes describing the tuples.
Attribute selection method specifies a heuristic procedure for selecting the attribute that “best”
discriminates the given tuples according to class.
This procedure employs an attribute selection measure such as information gain or the Gini index.
Whether the tree is strictly binary is generally driven by the attribute selection measure.
Some attribute selection measures, such as the Gini index, enforce the resulting tree to be binary.
Others, like information gain, do not, therein allowing multiway splits (i.e., two or more branches to
be grown from a node).
17. 2. The tree starts as a single node, N, representing the training tuples in D (step 1).
3. If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that
class (steps 2 and 3).
4. Otherwise, the algorithm calls Attribute selection method to determine the splitting criterion.
The splitting criterion tells us which attribute to test at node N by determining the “best” way to
separate or partition the tuples in D into individual classes (step 6).
The splitting criterion also tells us which branches to grow from node N with respect to the
outcomes of the chosen test.
The splitting criterion is determined so that, ideally, the resulting partitions at each branch are as
“pure” as possible.
A partition is pure if all the tuples in it belong to the same class.
18. 5. The node N is labeled with the splitting criterion, which serves as a test at the node (step 7).
A branch is grown from node N for each of the outcomes of the splitting criterion.
The tuples in D are partitioned accordingly (steps 10 to 11).
There are three possible scenarios, as illustrated in Figure below.
19. Let A be the splitting attribute.
A has v distinct values, {a1, a2,..., av }, based on the training data.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to the
known values of A. A branch is created for each known value, aj , of A and labeled with that value
(Figure a). Partition Dj is the subset of class-labeled tuples in D having value aj of A. (steps 8 and 9).
2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding
to the conditions A ≤ split point and A > split point, respectively, where split point is the split-point
returned by Attribute selection method as part of the splitting criterion. Two branches are grown
from N and labeled according to the previous outcomes (Figure b). The tuples are partitioned such
that D1 holds the subset of class-labeled tuples in D for which A ≤ split point, while D2 holds the
rest.
20. 3. A is discrete-valued and a binary tree must be produced: The test at node N is of the form
“A ∈ SA?,” where SA is the splitting subset for A, returned by Attribute selection method as part of
the splitting criterion. It is a subset of the known values of A. If a given tuple has value aj of A
and if aj ∈ SA, then the test at node N is satisfied. Two branches are grown from N (Figure c). By
convention, the left branch out of N is labeled yes so that D1 corresponds to the subset of class-
labeled tuples in D that satisfy the test. The right branch out of N is labeled no so that D2
corresponds to the subset of class-labeled tuples from D that do not satisfy the test.
21. The algorithm uses the same process recursively to form a decision tree for the tuples at each
resulting partition, Dj , of D (step 14).
The recursive partitioning stops only when any one of the following terminating conditions is
true:
1. All the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may be further partitioned (step 4). In
this case, majority voting is employed (step 5). This involves converting node N into a leaf and
labeling it with the most common class in D. Alternatively, the class distribution of the node
tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a
leaf is created with the majority class in D (step 13).
The resulting decision tree is returned (step 15).
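To make the strategy concrete, here is a compact Python sketch of this top-down induction for discrete-valued attributes, using information gain as the Attribute selection method. It is a simplified rendering of the pseudocode above, not the book's exact algorithm: there is no pruning, splits are always multiway, and the function and variable names are illustrative. Terminating condition 3 (an empty partition Dj) does not arise here because branches are grown only for values present in the current partition.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy) needed to classify a tuple in this partition."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain obtained by partitioning on the discrete attribute at index attr."""
    n = len(labels)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    info_a = sum(len(p) / n * info(p) for p in parts.values())
    return info(labels) - info_a

def generate_tree(rows, labels, attrs):
    # Terminating condition 1: all tuples in the partition belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Terminating condition 2: no remaining attributes -> majority voting.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Attribute selection method: choose the attribute with the highest information gain.
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attrs if a != best]
    # Grow one branch per known value of the splitting attribute (multiway split),
    # then recurse on each resulting partition Dj.
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = generate_tree([r for r, _ in subset],
                                                [l for _, l in subset],
                                                remaining)
    return node
```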
23. Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.
If we were to split D into smaller partitions according to the outcomes of the splitting criterion,
ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong
to the same class).
Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing the given training
tuples.
The attribute having the best score for the measure is chosen as the splitting attribute for the
given tuples.
24. There are three popular attribute selection measures—information gain, gain ratio, and Gini
index.
The notation used herein is as follows.
Let D, the data partition, be a training set of class-labeled tuples.
Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1,...,
m).
Let Ci,D be the set of tuples of class Ci in D.
Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively
25. Information Gain
ID3 uses information gain as its attribute selection measure.
Let node N represent or hold the tuples of partition D.
The attribute with the highest information gain is chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the tuples in the resulting partitions
and reflects the least randomness or “impurity” in these partitions.
Such an approach minimizes the expected number of tests needed to classify a given tuple and
guarantees that a simple tree is found.
26. The expected information needed to classify a tuple in D is given by
Info(D) = − Σi=1..m pi log2(pi),    (8.1)
where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|.
Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
How much more information would we still need (after the partitioning) to arrive at an exact
classification?
27. This amount is measured by
InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj),    (8.2)
where partitioning on attribute A splits D into v partitions, D1, D2,..., Dv.
InfoA(D) is the expected information required to classify a tuple from D based on the partitioning
by A.
The smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (i.e.,
based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning
on A).
That is,
Gain(A) = Info(D) − InfoA(D).
The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at
node N.
28. Example 8.1 Induction of a decision tree using information gain.
Table below presents a training set, D, of class-labeled tuples randomly selected from the
AllElectronics customer database.
29. The class label attribute, buys computer, has two distinct values (namely, {yes, no}); therefore, there
are two distinct classes (i.e., m = 2).
Let class C1 correspond to yes and class C2 correspond to no.
There are nine tuples of class yes and five tuples of class no.
A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must
compute the information gain of each attribute.
We first use Eq. (8.1) to compute the expected information needed to classify a tuple in D:
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits.
30. Next, we need to compute the expected information requirement for each attribute.
Let’s start with the attribute age.
We need to look at the distribution of yes and no tuples for each category of age.
For the age category “youth,” there are two yes tuples and three no tuples.
For the category “middle aged,” there are four yes tuples and zero no tuples.
For the category “senior,” there are three yes tuples and two no tuples.
Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are
partitioned according to age is
Infoage(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × 0 + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits,
where the middle term is 0 because the middle aged partition is pure.
31. Hence, the gain in information from such a partitioning would be
Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits.
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit
rating) = 0.048 bits.
Because age has the highest information gain among the attributes, it is selected as the splitting
attribute.
Node N is labeled with age, and branches are grown for each of the attribute’s values. The tuples
are then partitioned accordingly, as shown in Figure below.
32. Notice that the tuples falling into the partition for age = middle aged all belong to the same
class.
Because they all belong to class “yes,” a leaf should therefore be created at the end of this
branch and labeled “yes.”
The final decision tree returned by the algorithm was shown earlier.
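The figures quoted in this example are easy to verify with a few lines of arithmetic. The sketch below recomputes Info(D), Infoage(D), and Gain(age) directly from the class counts stated above (9 yes / 5 no overall; 2/3, 4/0, and 3/2 yes/no tuples for youth, middle aged, and senior).

```python
import math

def info(counts):
    """Expected information, Eq. (8.1), for a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

info_d = info([9, 5])                      # overall: 9 yes, 5 no  -> ~0.940 bits

# yes/no counts per age category: youth, middle aged, senior (from the text above).
age_partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in age_partitions)     # ~0.694 bits

print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# ~0.940, ~0.694, ~0.247 exactly; with the rounded values used in the text,
# Gain(age) = 0.940 - 0.694 = 0.246 bits, the largest gain among the four attributes.
```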
33. Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio.
It applies a kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as
SplitInfoA(D) = − Σj=1..v (|Dj|/|D|) log2(|Dj|/|D|).    (8.5)
This value represents the potential information generated by splitting the training data set, D,
into v partitions, corresponding to the v outcomes of a test on attribute A.
Note that, for each outcome, it considers the number of tuples having that outcome with
respect to the total number of tuples in D.
34. It differs from information gain, which measures the information with respect to classification
that is acquired based on the same partitioning.
The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D).
The attribute with the maximum gain ratio is selected as the splitting attribute.
35. Example 8.2 Computation of gain ratio for the attribute income.
A test on income splits the data of Table 8.1 into three partitions, namely low, medium, and high,
containing four, six, and four tuples, respectively. To compute the gain ratio of income, we first
use Eq. (8.5) to obtain
SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557.
From Example 8.1, we have Gain(income) = 0.029.
Therefore, GainRatio(income) = 0.029/1.557 = 0.019.
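A quick numerical check of this example, reusing the partition sizes given above (four, six, and four tuples out of fourteen) and the Gain(income) value from Example 8.1:

```python
import math

sizes, total = [4, 6, 4], 14   # income = low, medium, high partition sizes

# SplitInfo_income(D), Eq. (8.5): the potential information of the split itself.
split_info = -sum(s / total * math.log2(s / total) for s in sizes)

gain_income = 0.029            # from Example 8.1
print(round(split_info, 3), round(gain_income / split_info, 3))   # 1.557 0.019
```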
36. Gini Index
The Gini index is used in CART.
Using the notation previously described, the Gini index measures the impurity of D, a data
partition or set of training tuples, as
Gini(D) = 1 − Σi=1..m pi²,    (8.7)
where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|.
The sum is computed over m classes.
The Gini index considers a binary split for each attribute.
37. Let’s first consider the case where A is a discrete-valued attribute having v distinct values, {a1, a2,...,
av }, occurring in D.
To determine the best binary split on A, we examine all the possible subsets that can be formed
using known values of A.
For example, if income has three possible values, namely {low, medium, high}, then the possible
subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high},
and {}.
We exclude the full set, {low, medium, high}, and the empty set from consideration since,
conceptually, they do not represent a split.
Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D, based on a binary
split on A.
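The candidate subsets are easy to enumerate programmatically; a small sketch for the three income values used in the example:

```python
from itertools import combinations

values = ["low", "medium", "high"]   # the known values of attribute A
# Every non-empty proper subset of the values defines one side of a candidate binary split.
subsets = [set(c) for r in range(1, len(values)) for c in combinations(values, r)]
print(len(subsets))                  # 2^3 - 2 = 6 (each two-way split appears twice)
for s in subsets:
    print(s, "vs", set(values) - s)
```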
38. When considering a binary split, we compute a weighted sum of the impurity of each resulting
partition.
For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that
partitioning is
GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2).
For each attribute, each of the possible binary splits is considered.
For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is
selected as its splitting subset.
For continuous-valued attributes, each possible split-point must be considered.
39. The reduction in impurity that would be incurred by a binary split on a discrete- or
continuous-valued attribute A is
ΔGini(A) = Gini(D) − GiniA(D).
The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini
index) is selected as the splitting attribute.
This attribute and either its splitting subset (for a discrete-valued splitting attribute) or split-
point (for a continuous-valued splitting attribute) together form the splitting criterion.
40. Example 8.3 Induction of a decision tree using the Gini index.
Let D be the training data shown earlier in Table 8.1, where there are nine tuples belonging to
the class buys computer = yes and the remaining five tuples belong to the class buys
computer = no.
A (root) node N is created for the tuples in D.
We first use Eq. (8.7) for the Gini index to compute the impurity of D:
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459.
To find the splitting criterion for the tuples in D, we need to compute the Gini index for each
attribute.
Let’s start with the attribute income and consider each of the possible splitting subsets.
Consider the subset {low, medium}. This would result in 10 tuples in partition D1 satisfying the
condition “income ∈ {low, medium}.”
The remaining four tuples of D would be assigned to partition D2.
41. The Gini index value computed based on this partitioning is
Giniincome ∈ {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443.
42. Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low,
high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}).
Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it
minimizes the Gini index.
Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best split for age with a Gini
index of 0.375; the attributes student and credit rating are both binary, with Gini index values of
0.367 and 0.429, respectively.
The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index
overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.
The binary split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the
tuples in D and is returned as the splitting criterion.
Node N is labeled with the criterion, two branches are grown from it, and the tuples are
partitioned accordingly.
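These Gini values can be reproduced from the class counts per income value, assuming the usual AllElectronics distribution from Table 8.1 (yes/no counts: low 3/1, medium 4/2, high 2/2, giving 9/5 overall); those per-value counts are taken as an assumption here since the table itself is not reproduced in the text.

```python
def gini(counts):
    """Gini index, Eq. (8.7), for a class distribution given as counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(f"{gini([9, 5]):.3f}")                       # 0.459 = Gini(D)

# Assumed yes/no counts per income value (low, medium, high) from Table 8.1.
income = {"low": [3, 1], "medium": [4, 2], "high": [2, 2]}

def gini_split(side):
    """Weighted Gini index of the binary split 'side' vs. the remaining income values."""
    d1 = [sum(income[v][0] for v in side), sum(income[v][1] for v in side)]
    d2 = [9 - d1[0], 5 - d1[1]]
    return sum(d1) / 14 * gini(d1) + sum(d2) / 14 * gini(d2)

print(f"{gini_split({'low', 'medium'}):.3f}")      # 0.443
print(f"{gini_split({'low', 'high'}):.3f}")        # 0.458
print(f"{gini_split({'medium', 'high'}):.3f}")     # 0.450
```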
43. Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers.
Tree pruning methods address this problem of overfitting the data.
Such methods typically use statistical measures to remove the least-reliable branches.
An unpruned tree and a pruned version of it are shown in Figure below.
Pruned trees tend to be smaller and less complex and, thus, easier to comprehend.
They are usually faster and better at correctly classifying independent test data (i.e., of
previously unseen tuples) than unpruned trees.
“How does tree pruning work?”
There are two common approaches to tree pruning: prepruning and postpruning
44. In the prepruning approach, a tree is “pruned” by halting its construction early.
Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the subset tuples or the probability
distribution of those tuples.
When constructing a tree, measures such as statistical significance, information gain, Gini
index, and so on, can be used to assess the goodness of a split.
The second and more common approach is postpruning, which removes subtrees from a
“fully grown” tree.
A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
The leaf is labeled with the most frequent class among the subtree being replaced.
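In practice, both pruning styles are often exposed as library parameters rather than written by hand. For example, scikit-learn's DecisionTreeClassifier offers prepruning controls such as max_depth and min_samples_leaf, and minimal cost-complexity postpruning via ccp_alpha. A hedged sketch, with an illustrative dataset and arbitrary parameter values:

```python
# Sketch of prepruning vs. postpruning with scikit-learn (values are illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: halt tree construction early by limiting depth and leaf size.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_train, y_train)

# Postpruning: grow the tree, then prune subtrees by cost complexity (ccp_alpha > 0).
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```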
45. Bayes Classification Methods
“What are Bayesian classifiers?”
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities such as the probability that a given tuple belongs
to a particular class.
Bayesian classification is based on Bayes’ theorem.
Studies comparing classification algorithms have found a simple Bayesian classifier known as the
naïve Bayesian classifier to be comparable in performance with decision tree and selected neural
network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
46. Bayes’ Theorem
Bayes’ theorem is named after Thomas Bayes, who did early work in probability and decision
theory during the 18th century.
Let X be a data tuple.
In Bayesian terms, X is considered “evidence.”
As usual, it is described by measurements made on a set of n attributes.
Let H be some hypothesis such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis H
holds given the “evidence” or observed data tuple X.
In other words, we are looking for the probability that tuple X belongs to class C, given that we
know the attribute description of X.
47. P(H|X) is the posterior probability of H conditioned on X.
For example, suppose our world of data tuples is confined to customers described by the
attributes age and income, respectively, and that X is a 35-year-old customer with an income of
$40,000.
Suppose that H is the hypothesis that our customer will buy a computer.
Then P(H|X) reflects the probability that customer X will buy a computer given that we know the
customer’s age and income.
In contrast, P(H) is the prior probability of H.
For our example, this is the probability that any given customer will buy a computer, regardless
of age, income, or any other information.
Similarly, P(X|H) is the posterior probability of X conditioned on H.
That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we
know the customer will buy a computer.
48. P(X) is the prior probability of X.
Using our example, it is the probability that a person from our set of customers is 35 years
old and earns $40,000.
“How are these probabilities estimated?”
P(H), P(X|H), and P(X) may be estimated from the given data.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X).
Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X).    (8.10)
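As a tiny numeric illustration of the theorem (the probabilities below are invented for the example, not taken from the AllElectronics data):

```python
# Hypothetical values: prior, likelihood, and evidence chosen purely for illustration.
p_h = 0.3           # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.5   # P(X|H): probability of this customer profile among buyers
p_x = 0.2           # P(X): probability of this customer profile overall

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem, Eq. (8.10)
print(round(p_h_given_x, 2))            # 0.75
```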
49. Naïve Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented
by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n measurements made on the tuple
from n attributes, respectively, A1, A2,..., An.
2. Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the classifier will predict that X
belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve
Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori
hypothesis. By Bayes’ theorem (Eq. 8.10),
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
50. 3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely, that is,
P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize
P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D|/|D|, where |Ci,D| is
the number of training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally expensive to compute
P(X|Ci). To reduce computation in evaluating P(X|Ci), the naïve assumption of class-conditional
independence is made. This presumes that the attributes’ values are conditionally independent of one
another, given the class label of the tuple (i.e., that there are no dependence relationships among the
attributes). Thus,
P(X|Ci) = Πk=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ··· × P(xn|Ci).
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci),..., P(xn|Ci) from the training tuples. Recall that
here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the
attribute is categorical or continuous-valued.
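As a rough, minimal sketch of these steps for purely categorical attributes (the function and variable names below are illustrative, not from the text), the class priors P(Ci) and the likelihoods P(xk|Ci) can be estimated by simple counting and combined under the independence assumption:

from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    # Estimate P(Ci) = |Ci,D| / |D| and the per-attribute counts needed for P(xk | Ci).
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # value_counts[(Ci, k)][v] = count of value v for attribute Ak in class Ci
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            value_counts[(c, k)][v] += 1
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    return priors, value_counts, class_counts

def naive_bayes_classify(x, priors, value_counts, class_counts):
    # Return the class Ci maximizing P(X|Ci)P(Ci), with
    # P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci).
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, v in enumerate(x):
            score *= value_counts[(c, k)][v] / class_counts[c]   # relative-frequency estimate of P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

In practice a zero count for some P(xk|Ci) drives the whole product to zero, so a Laplacian (add-one) correction is commonly applied; it is omitted here for brevity.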
51. Example - Predicting a class label using naïve Bayesian classification.
We wish to predict the class label of a tuple using naïve Bayesian classification, given the
same training data as in decision tree induction.
The training data are shown in the table below.
52. The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute, buys computer, has two distinct values (namely, {yes, no}).
Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no.
The tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit rating = fair)
54. k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s.
The method is labor intensive when given large training sets, and did not gain popularity until
the 1960s when increased computing power became available.
It has since been widely used in the area of pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given
test tuple with training tuples that are similar to it.
The training tuples are described by n attributes.
Each tuple represents a point in an n-dimensional space.
In this way, all the training tuples are stored in an n-dimensional pattern space.
When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for
the k training tuples that are closest to the unknown tuple.
These k training tuples are the k “nearest neighbors” of the unknown tuple.
55. “Closeness” is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12,..., x1n) and X2 = (x21,
x22,..., x2n), is
dist(X1, X2) = √[(x11 − x21)² + (x12 − x22)² + ··· + (x1n − x2n)²].
In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root of this accumulated total is then taken.
Typically, we normalize the values of each attribute before using Eq. (9.22).
This helps prevent attributes with initially large ranges (e.g., income) from outweighing
attributes with initially smaller ranges (e.g., binary attributes).
56. Min-max normalization, for example, can be used to transform a value v of a numeric attribute A
to v′ in the range [0, 1] by computing
v′ = (v − minA) / (maxA − minA),
where minA and maxA are the minimum and maximum values of attribute A.
For k-nearest-neighbor classification, the unknown tuple is assigned the most common class
among its k-nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in
pattern space.
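A minimal sketch of these steps for purely numeric attributes (the names below are illustrative): min-max normalization of the training tuples, followed by a majority vote among the k nearest neighbors under Euclidean distance.

import math
from collections import Counter

def min_max_normalize(rows):
    # Rescale every numeric attribute A to [0, 1] via v' = (v - minA) / (maxA - minA).
    cols = list(zip(*rows))
    mins, maxs = [min(c) for c in cols], [max(c) for c in cols]
    scaled = [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                    for v, lo, hi in zip(row, mins, maxs))
              for row in rows]
    return scaled, mins, maxs

def euclidean(x1, x2):
    # dist(X1, X2) = sqrt( sum over i of (x1i - x2i)^2 )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(unknown, train_points, train_labels, k=3):
    # Find the k training tuples closest to the unknown tuple and return the most common class among them.
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: euclidean(unknown, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

Note that the unknown tuple should be rescaled with the same minA and maxA values used for the training data before its neighbors are searched for.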
57. “But how can distance be computed for attributes that are not numeric, but nominal (or
categorical), such as color?”
For nominal attributes, a simple method is to compare the corresponding value of the
attribute in tuple X1 with that in tuple X2.
If the two are identical (e.g., tuples X1 and X2 both have the color blue), then the difference
between the two is taken as 0.
If the two are different (e.g., tuple X1 is blue but tuple X2 is red), then the difference is
considered to be 1.
Other methods may incorporate more sophisticated schemes for differential grading (e.g.,
where a larger difference score is assigned, say, for blue and white than for blue and black).
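The simple 0/1 rule for nominal attributes can be folded directly into the distance computation. The short sketch below (illustrative names, not from the text) mixes squared numeric differences with a mismatch penalty for the attribute positions listed as nominal.

def mixed_distance(x1, x2, nominal_positions):
    # Numeric attributes contribute their squared difference; nominal attributes
    # contribute 0 if the two values are identical and 1 if they differ.
    total = 0.0
    for i, (a, b) in enumerate(zip(x1, x2)):
        if i in nominal_positions:
            total += 0.0 if a == b else 1.0
        else:
            total += (a - b) ** 2
    return total ** 0.5

# Example: attribute 2 holds a color, so identical colors add nothing to the distance.
# mixed_distance((0.2, 0.5, "blue"), (0.4, 0.1, "red"), nominal_positions={2})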
58. Model Evaluation and Selection
Metrics for Evaluating Classifier Performance
The classifier evaluation measures presented in this section are summarized in the figure
below.
They include accuracy (also known as recognition rate), sensitivity (or recall), specificity,
precision, F1, and Fβ.
59. Positive tuples and negative tuples
Given two classes, for example, the positive tuples may be buys computer = yes while the
negative tuples are buys computer = no.
Suppose we use our classifier on a test set of labeled tuples.
P is the number of positive tuples and N is the number of negative tuples.
60. There are four terms we need to know that are the “building blocks” used in computing many
evaluation measures.
True positives (TP): These refer to the positive tuples that were correctly labeled by the
classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the classifier.
Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive
(e.g., tuples of class buys computer = no for which the classifier predicted buys computer =
yes). Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g.,
tuples of class buys computer = yes for which the classifier predicted buys computer = no). Let FN be
the number of false negatives.
61. These terms are summarized in the confusion matrix below:
                    predicted yes    predicted no    total
actual class yes    TP               FN              P
actual class no     FP               TN              N
62. The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of
different classes.
TP and TN tell us when the classifier is getting things right, while FP and FN tell us when the
classifier is getting things wrong.
Now let’s look at the evaluation measures, starting with accuracy.
The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier. That is,
accuracy(M) = (TP + TN) / (P + N).
We can also speak of the error rate or misclassification rate of a classifier, M, which is simply 1 −
accuracy(M), where accuracy(M) is the accuracy of M.
This also can be computed as
error rate(M) = (FP + FN) / (P + N).
63. When the classes are imbalanced, accuracy alone can be misleading; the sensitivity and specificity
measures can be used, respectively, to assess how well the classifier recognizes the positive and the
negative tuples.
Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive
tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of
negative tuples that are correctly identified).
These measures are defined as
sensitivity = TP / P and specificity = TN / N.
64. The precision and recall measures are also widely used in classification.
Precision can be thought of as a measure of exactness (i.e., what percentage of tuples labeled as
positive are actually such), whereas recall is a measure of completeness (what percentage of
positive tuples are labeled as such).
If recall seems familiar, that’s because it is the same as sensitivity (or the true positive rate).
These measures can be computed as
precision = TP / (TP + FP) and recall = TP / (TP + FN) = TP / P.
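A small sketch (illustrative name) computing all of these measures from the four confusion-matrix counts; the F1 line, the harmonic mean of precision and recall, anticipates the Fβ family mentioned above.

def evaluation_measures(tp, tn, fp, fn):
    # P and N are the numbers of positive and negative test tuples.
    p, n = tp + fn, fp + tn
    precision = tp / (tp + fp)
    recall = tp / p                                # same as sensitivity / true positive rate
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),        # = 1 - accuracy
        "sensitivity": recall,
        "specificity": tn / n,                     # true negative rate
        "precision":   precision,
        "recall":      recall,
        "F1":          2 * precision * recall / (precision + recall),
    }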
65. Cross-Validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets
or “folds,” D1, D2,..., Dk , each of approximately equal size.
Training and testing are performed k times.
In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively
used to train the model.
That is, in the first iteration, subsets D2,..., Dk collectively serve as the training set to obtain a first
model, which is tested on D1; the second iteration is trained on subsets D1, D3,..., Dk and tested on
D2; and so on.
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial
tuples.
That is, only one sample is “left out” at a time for the test set.
In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in
each fold is approximately the same as that in the initial data.
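A minimal sketch of (unstratified) k-fold cross-validation; train_fn and score_fn are placeholders for whatever learner and accuracy measure are being evaluated and are not defined in the text above.

import random

def k_fold_cross_validation(tuples, labels, k, train_fn, score_fn, seed=0):
    # Randomly partition the tuple indices into k mutually exclusive folds of roughly equal size.
    indices = list(range(len(tuples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_idx = folds[i]                        # fold Di is reserved as the test set in iteration i
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = train_fn([tuples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        scores.append(score_fn(model,
                               [tuples[j] for j in test_idx],
                               [labels[j] for j in test_idx]))
    return sum(scores) / k                         # average score over the k iterations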
66. Bootstrap
The bootstrap method samples the given training tuples uniformly with replacement.
That is, each time a tuple is selected, it is equally likely to be selected again and re-added to
the training set.
For instance, imagine a machine that randomly selects tuples for our training set.
In sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods.
67. A commonly used one is the .632 bootstrap, which works as follows.
Suppose we are given a data set of d tuples.
The data set is sampled d times, with replacement, resulting in a bootstrap sample or training
set of d samples.
It is very likely that some of the original data tuples will occur more than once in this sample.
The data tuples that did not make it into the training set end up forming the test set.
Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original
data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set,
since for large d each tuple is missed by all d draws with probability (1 − 1/d)^d ≈ e⁻¹ ≈ 0.368.
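A minimal sketch of drawing one such bootstrap sample (illustrative names); repeating it over many runs confirms that roughly 63.2% of the distinct tuples land in the training set.

import random

def bootstrap_sample(data, seed=0):
    # Sample d tuples uniformly with replacement; the tuples never chosen form the test set.
    rng = random.Random(seed)
    d = len(data)
    chosen = [rng.randrange(d) for _ in range(d)]
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]                          # may contain duplicates
    test = [data[i] for i in range(d) if i not in chosen_set]  # "left out" tuples
    return train, test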
68. Techniques to Improve Classification Accuracy
Bagging, boosting, and random forests are examples of ensemble methods (Figure below).
An ensemble combines a series of k learned models (or base classifiers), M1, M2,..., Mk , with the
aim of creating an improved composite classification model, M∗.
A given data set, D, is used to create k training sets, D1, D2,..., Dk, where Di (1 ≤ i ≤ k) is
used to generate classifier Mi.
Given a new data tuple to classify, the base classifiers each vote by returning a class prediction.
The ensemble returns a class prediction based on the votes of the base classifiers.
69. Bagging
Given a set, D, of d tuples, bagging works as follows.
For iteration i (i = 1, 2,..., k), a training set, Di, of d tuples is sampled with replacement from the
original set of tuples, D.
Note that the term bagging stands for bootstrap aggregation.
Each training set is a bootstrap sample.
Because sampling with replacement is used, some of the original tuples of D may not be included
in Di , whereas others may occur more than once.
A classifier model, Mi , is learned for each training set, Di .
To classify an unknown tuple, X, each classifier, Mi , returns its class prediction, which counts as
one vote.
The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X.
Bagging can be applied to the prediction of continuous values by taking the average of the
individual predictions for a given test tuple.
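A minimal sketch of bagging around an arbitrary base learner; base_learner is a placeholder that is assumed to return a callable model and is not something defined in the text.

import random
from collections import Counter

def bagging(tuples, labels, k, base_learner, seed=0):
    # Learn one base classifier Mi per bootstrap sample Di of the original data D.
    rng = random.Random(seed)
    d = len(tuples)
    models = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]             # sample d tuples with replacement
        models.append(base_learner([tuples[i] for i in idx],
                                   [labels[i] for i in idx]))
    return models

def bagged_predict(models, x):
    # Each base classifier casts one vote; the class with the most votes wins.
    return Counter(m(x) for m in models).most_common(1)[0][0]

For continuous prediction, the final line would instead return the average of the individual model outputs.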
70. Boosting and AdaBoost
In boosting, a weight is assigned to each training tuple.
A series of k classifiers is iteratively learned. After a classifier, Mi , is learned, the weights are updated
to allow the subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were
misclassified by Mi .
The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of
each classifier’s vote is a function of its accuracy.
AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm.
Suppose we want to boost the accuracy of a learning method.
We are given D, a data set of d class-labeled tuples, (X1, y1),(X2, y2),...,(Xd, yd), where yi is the class
label of tuple Xi .
Initially, AdaBoost assigns each training tuple an equal weight of 1/d.
Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm.
In round i, the tuples from D are sampled to form a training set, Di , of size d.
71. Sampling with replacement is used—the same tuple may be selected more than once.
Each tuple’s chance of being selected is based on its weight.
A classifier model, Mi , is derived from the training tuples of Di .
Its error is then calculated using Di as a test set.
The weights of the training tuples are then adjusted according to how they were classified.
If a tuple was incorrectly classified, its weight is increased.
If a tuple was correctly classified, its weight is decreased.
A tuple’s weight reflects how difficult it is to classify: the higher the weight, the more often it has
been misclassified.
These weights will be used to generate the training samples for the classifier of the next round.
The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of
the previous round.
Some classifiers may be better at classifying some “difficult” tuples than others. In this way, we build a
series of classifiers that complement each other.
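A rough sketch of the sampling form of AdaBoost described above (illustrative names; base_learner is a placeholder that returns a callable model). The weight update shrinks the weights of correctly classified tuples and renormalizes, so misclassified tuples carry relatively more weight in the next round, and each classifier's vote weight grows with its accuracy.

import math
import random

def adaboost(tuples, labels, k, base_learner, seed=0):
    rng = random.Random(seed)
    d = len(tuples)
    weights = [1.0 / d] * d                        # every tuple starts with weight 1/d
    ensemble = []                                  # (model, vote weight) pairs
    for _ in range(k):
        # Draw the round-i training set Di of size d according to the tuple weights.
        idx = rng.choices(range(d), weights=weights, k=d)
        model = base_learner([tuples[i] for i in idx], [labels[i] for i in idx])
        # error(Mi) = sum of the weights of the tuples that Mi misclassifies.
        wrong = [model(tuples[i]) != labels[i] for i in range(d)]
        error = sum(w for w, bad in zip(weights, wrong) if bad)
        if error == 0.0 or error >= 0.5:
            weights = [1.0 / d] * d                # degenerate round: reset weights and move on
            continue
        # Decrease the weights of correctly classified tuples, then renormalize.
        weights = [w * error / (1.0 - error) if not bad else w
                   for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # A more accurate classifier gets a larger say in the final vote.
        ensemble.append((model, math.log((1.0 - error) / error)))
    return ensemble

def boosted_predict(ensemble, x):
    # Add up the vote weights of the classifiers predicting each class.
    votes = {}
    for model, alpha in ensemble:
        c = model(x)
        votes[c] = votes.get(c, 0.0) + alpha
    return max(votes, key=votes.get)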
72. Random Forests
Random forests can be built using bagging in tandem with random attribute selection.
A training set, D, of d tuples is given.
The general procedure to generate k decision trees for the ensemble is as follows.
For each iteration, i (i = 1, 2,..., k), a training set, Di, of d tuples is sampled with replacement from D.
That is, each Di is a bootstrap sample of D, so that some tuples may occur more than once in Di ,
while others may be excluded.
Let F be the number of attributes to be used to determine the split at each node, where F is much
smaller than the number of available attributes.
To construct a decision tree classifier, Mi , randomly select, at each node, F attributes as candidates
for the split at the node.
The CART methodology is used to grow the trees.
The trees are grown to maximum size and are not pruned.
Random forests formed this way, with random input selection, are called Forest-RI.
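A minimal sketch of the Forest-RI loop; build_cart_tree is a placeholder for a CART-style learner that accepts the number F of randomly chosen candidate attributes per node and grows an unpruned tree, and it is not defined in this text.

import random

def forest_ri(tuples, labels, k, F, build_cart_tree, seed=0):
    # Grow k unpruned decision trees, each from its own bootstrap sample Di,
    # considering only F randomly chosen attributes as split candidates at each node.
    rng = random.Random(seed)
    d = len(tuples)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]          # bootstrap sample of D
        forest.append(build_cart_tree([tuples[i] for i in idx],
                                      [labels[i] for i in idx],
                                      candidate_attrs_per_node=F))
    return forest

Classification then proceeds as in the bagging sketch above: each tree votes and the majority class is returned.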