This document discusses classification and prediction techniques for data analysis. Classification predicts categorical labels, while prediction models continuous values. Common algorithms include decision tree induction and Naive Bayesian classification. Decision trees use measures like information gain to build classifiers by recursively partitioning training data. Naive Bayesian classifiers apply Bayes' theorem to estimate probabilities for classification. Both approaches are popular due to their accuracy, speed and interpretability.
2. Introduction
Classification is a form of data analysis that extracts models describing important data classes.
Such models, called classifiers, predict categorical (discrete, unordered) class labels.
For example, we can build a classification model to categorize bank loan applications as either
safe or risky.
Such analysis can help provide us with a better understanding of the data at large.
Many classification methods have been proposed by researchers in machine learning, pattern
recognition, and statistics.
Classification has numerous applications, including fraud detection, target marketing,
performance prediction, manufacturing, and medical diagnosis.
3. What Is Classification?
A bank loans officer needs analysis of her data to learn which loan applicants are “safe” and which
are “risky” for the bank.
A marketing manager at AllElectronics needs data analysis to help guess whether a customer with
a given profile will buy a new computer.
In each of these examples, the data analysis task is classification, where a model or classifier is
constructed to predict class (categorical) labels, such as “safe” or “risky” for the loan application
data; “yes” or “no” for the marketing data.
Suppose that the marketing manager wants to predict how much a given customer will spend
during a sale at AllElectronics.
This data analysis task is an example of numeric prediction, where the model constructed predicts
a continuous-valued function, or ordered value, as opposed to a class label.
This model is a predictor.
Regression analysis is a statistical methodology that is most often used for numeric prediction.
Classification and numeric prediction are the two major types of prediction problems.
4. General Approach to Classification
“How does classification work?”
Data classification is a two-step process, consisting of a learning step (where a classification
model is constructed) and a classification step (where the model is used to predict class labels
for given data).
The process is shown for the loan application data of Figure below.
6. In the first step, a classifier is built describing a predetermined set of data classes or concepts.
This is the learning step (or training phase), where a classification algorithm builds the classifier
by analyzing or “learning from” a training set made up of database tuples and their associated
class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n
measurements made on the tuple from n database attributes, A1, A2,..., An.
Each tuple, X, is assumed to belong to a predefined class as determined by another database
attribute called the class label attribute.
The class label attribute is discrete-valued and unordered.
It is categorical in that each value serves as a category or class.
The individual tuples making up the training set are referred to as training tuples and are
randomly sampled from the database under analysis.
7. Because the class label of each training tuple is provided, this step is also known as supervised
learning.
This first step of the classification process can also be viewed as the learning of a function, y = f
(X), that can predict the associated class label y of a given tuple X.
In this view, we wish to learn a function that separates the data classes.
Typically, this mapping is represented in the form of classification rules, decision trees, or
mathematical formulae.
In our example, the mapping is represented as classification rules that identify loan applications
as being either safe or risky (Figure a).
The rules can be used to categorize future data tuples, as well as provide deeper insight into the
data contents.
8. “What about classification accuracy?”
In the second step (Figure b), the model is used for classification.
First, the predictive accuracy of the classifier is estimated.
If we were to use the training set to measure the classifier’s accuracy, this estimate would likely be
optimistic, because the classifier tends to overfit the data .
Therefore, a test set is used, made up of test tuples and their associated class labels.
They are independent of the training tuples, meaning that they were not used to construct the
classifier.
9. The accuracy of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier.
The associated class label of each test tuple is compared with the learned classifier’s class
prediction for that tuple.
If the accuracy of the classifier is considered acceptable, the classifier can be used to classify
future data tuples for which the class label is not known.
For example, the classification rules learned in Figure (a) from the analysis of data from previous
loan applications can be used to approve or reject new or future loan applicants.
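The two-step process above (learning on a training set, then estimating accuracy on an independent test set before deployment) maps directly onto common library workflows. Below is a minimal sketch assuming scikit-learn is available; the tiny loan table and its column values are hypothetical, purely for illustration.

```python
# Minimal sketch of the learning step and the classification step (scikit-learn assumed).
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical training tuples: (income level, has_job) -> loan decision.
X = [["low", "no"], ["low", "yes"], ["medium", "yes"], ["high", "yes"],
     ["high", "no"], ["medium", "no"], ["low", "no"], ["high", "yes"]]
y = ["risky", "safe", "safe", "safe", "risky", "risky", "risky", "safe"]

# Hold out an independent test set so the accuracy estimate is not optimistic.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Learning step: build the classifier from the training tuples and their labels.
clf = make_pipeline(OneHotEncoder(handle_unknown="ignore"), DecisionTreeClassifier())
clf.fit(X_train, y_train)

# Classification step: predict labels for tuples not used to construct the classifier.
print(accuracy_score(y_test, clf.predict(X_test)))
```

If the estimated accuracy is acceptable, the fitted model would then be applied to future tuples whose class label is unknown, exactly as described above.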
10. Decision Tree Induction
Decision tree induction is the learning of decision trees from class-labeled training tuples.
A decision tree is a flowchart-like tree structure, where each internal node (nonleaf node) denotes a
test on an attribute, each branch represents an outcome of the test, and each leaf node (or terminal
node) holds a class label.
The topmost node in a tree is the root node.
A typical decision tree is shown in Figure below.
11. It represents the concept buys computer, that is, it predicts whether a customer at AllElectronics is
likely to purchase a computer.
Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals.
Some decision tree algorithms produce only binary trees (where each internal node branches to
exactly two other nodes), whereas others can produce nonbinary trees.
“How are decision trees used for classification?”
Given a tuple, X, for which the associated class label is unknown, the attribute values of the tuple
are tested against the decision tree.
A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Decision trees can easily be converted to classification rules.
12. “Why are decision tree classifiers so popular?”
The construction of decision tree classifiers does not require any domain knowledge or parameter
setting, and therefore is appropriate for exploratory knowledge discovery.
Decision trees can handle multidimensional data.
The learning and classification steps of decision tree induction are simple and fast.
In general, decision tree classifiers have good accuracy.
However, successful use may depend on the data at hand.
Decision tree induction algorithms have been used for classification in many application areas such
as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
13. Decision Tree Induction
During the late 1970s and early 1980s, J. Ross Quinlan, a researcher in machine learning, developed a
decision tree algorithm known as ID3 (Iterative Dichotomiser).
This work expanded on earlier work on concept learning systems, described by E. B. Hunt, J. Marin,
and P. T. Stone.
Quinlan later presented C4.5 (a successor of ID3), which became a benchmark to which newer
supervised learning algorithms are often compared.
In 1984, a group of statisticians (L. Breiman, J. Friedman, R. Olshen, and C. Stone) published the
book Classification and Regression Trees (CART), which described the generation of binary decision
trees.
ID3 and CART were invented independently of one another at around the same time, yet follow a
similar approach for learning decision trees from training tuples.
14. ID3, C4.5, and CART adopt a greedy approach in which decision trees are constructed in a top-
down recursive divide-and-conquer manner.
Most algorithms for decision tree induction follow a top-down approach, which starts with a
training set of tuples and their associated class labels.
The training set is recursively partitioned into smaller subsets as the tree is being built.
15. A basic decision tree algorithm is summarized in Figure below.
16. The strategy is as follows.
1. The algorithm is called with three parameters: D (data partition), attribute list, and Attribute selection
method.
Initially, D, is the complete set of training tuples and their associated class labels.
The parameter attribute list is a list of attributes describing the tuples.
Attribute selection method specifies a heuristic procedure for selecting the attribute that “best”
discriminates the given tuples according to class.
This procedure employs an attribute selection measure such as information gain or the Gini index.
Whether the tree is strictly binary is generally driven by the attribute selection measure.
Some attribute selection measures, such as the Gini index, enforce the resulting tree to be binary.
Others, like information gain, do not, therein allowing multiway splits (i.e., two or more branches to
be grown from a node).
17. 2. The tree starts as a single node, N, representing the training tuples in D (step 1).
3. If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that
class (steps 2 and 3).
4. Otherwise, the algorithm calls Attribute selection method to determine the splitting criterion.
The splitting criterion tells us which attribute to test at node N by determining the “best” way to
separate or partition the tuples in D into individual classes (step 6).
The splitting criterion also tells us which branches to grow from node N with respect to the
outcomes of the chosen test.
The splitting criterion is determined so that, ideally, the resulting partitions at each branch are as
“pure” as possible.
A partition is pure if all the tuples in it belong to the same class.
18. 5. The node N is labeled with the splitting criterion, which serves as a test at the node (step 7).
A branch is grown from node N for each of the outcomes of the splitting criterion.
The tuples in D are partitioned accordingly (steps 10 to 11).
There are three possible scenarios, as illustrated in Figure below.
19. Let A be the splitting attribute.
A has v distinct values, {a1, a2,..., av }, based on the training data.
1. A is discrete-valued: In this case, the outcomes of the test at node N correspond directly to the
known values of A. A branch is created for each known value, aj , of A and labeled with that value
(Figure a). Partition Dj is the subset of class-labeled tuples in D having value aj of A. (steps 8 and 9).
2. A is continuous-valued: In this case, the test at node N has two possible outcomes, corresponding
to the conditions A ≤ split point and A > split point, respectively, where split point is the split-point
returned by Attribute selection method as part of the splitting criterion. Two branches are grown
from N and labeled according to the previous outcomes (Figure b). The tuples are partitioned such
that D1 holds the subset of class-labeled tuples in D for which A ≤ split point, while D2 holds the
rest.
20. 3. A is discrete-valued and a binary tree must be produced: The test at node N is of the form
“A ∈ SA?,” where SA is the splitting subset for A, returned by Attribute selection method as part of
the splitting criterion. It is a subset of the known values of A. If a given tuple has value aj of A
and if aj ∈ SA, then the test at node N is satisfied. Two branches are grown from N (Figure c). By
convention, the left branch out of N is labeled yes so that D1 corresponds to the subset of class-
labeled tuples in D that satisfy the test. The right branch out of N is labeled no so that D2
corresponds to the subset of class-labeled tuples from D that do not satisfy the test.
21. The algorithm uses the same process recursively to form a decision tree for the tuples at each
resulting partition, Dj , of D (step 14).
The recursive partitioning stops only when any one of the following terminating conditions is
true:
1. All the tuples in partition D (represented at node N) belong to the same class (steps 2 and 3).
2. There are no remaining attributes on which the tuples may be further partitioned (step 4). In
this case, majority voting is employed (step 5). This involves converting node N into a leaf and
labeling it with the most common class in D. Alternatively, the class distribution of the node
tuples may be stored.
3. There are no tuples for a given branch, that is, a partition Dj is empty (step 12). In this case, a
leaf is created with the majority class in D (step 13).
The resulting decision tree is returned (step 15).
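To make the strategy concrete, here is a compact Python sketch of this top-down induction for discrete-valued attributes, using information gain as the Attribute selection method. It is a simplified rendering of the pseudocode above, not the book's exact algorithm: there is no pruning, splits are always multiway, and the function and variable names are illustrative. Terminating condition 3 (an empty partition Dj) does not arise here because branches are grown only for values present in the current partition.

```python
import math
from collections import Counter

def info(labels):
    """Expected information (entropy) needed to classify a tuple in this partition."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, labels, attr):
    """Information gain obtained by partitioning on the discrete attribute at index attr."""
    n = len(labels)
    parts = {}
    for row, label in zip(rows, labels):
        parts.setdefault(row[attr], []).append(label)
    info_a = sum(len(p) / n * info(p) for p in parts.values())
    return info(labels) - info_a

def generate_tree(rows, labels, attrs):
    # Terminating condition 1: all tuples in the partition belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Terminating condition 2: no remaining attributes -> majority voting.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Attribute selection method: choose the attribute with the highest information gain.
    best = max(attrs, key=lambda a: gain(rows, labels, a))
    node = {"attribute": best, "branches": {}}
    remaining = [a for a in attrs if a != best]
    # Grow one branch per known value of the splitting attribute (multiway split),
    # then recurse on each resulting partition Dj.
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        node["branches"][value] = generate_tree([r for r, _ in subset],
                                                [l for _, l in subset],
                                                remaining)
    return node
```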
23. Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.
If we were to split D into smaller partitions according to the outcomes of the splitting criterion,
ideally each partition would be pure (i.e., all the tuples that fall into a given partition would belong
to the same class).
Attribute selection measures are also known as splitting rules because they determine how the
tuples at a given node are to be split.
The attribute selection measure provides a ranking for each attribute describing the given training
tuples.
The attribute having the best score for the measure is chosen as the splitting attribute for the
given tuples.
24. There are three popular attribute selection measures—information gain, gain ratio, and Gini
index.
The notation used herein is as follows.
Let D, the data partition, be a training set of class-labeled tuples.
Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1,...,
m).
Let Ci,D be the set of tuples of class Ci in D.
Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively
25. Information Gain
ID3 uses information gain as its attribute selection measure.
Let node N represent or hold the tuples of partition D.
The attribute with the highest information gain is chosen as the splitting attribute for node N.
This attribute minimizes the information needed to classify the tuples in the resulting partitions
and reflects the least randomness or “impurity” in these partitions.
Such an approach minimizes the expected number of tests needed to classify a given tuple and
guarantees that a simple tree is found.
26. The expected information needed to classify a tuple in D is given by
Info(D) = − Σi=1..m pi log2(pi),    (8.1)
where pi is the nonzero probability that an arbitrary tuple in D belongs to class Ci and is
estimated by |Ci,D|/|D|.
Info(D) is just the average amount of information needed to identify the class label of a tuple in D.
How much more information would we still need (after the partitioning) to arrive at an exact
classification?
27. This amount is measured by
InfoA(D) = Σj=1..v (|Dj|/|D|) × Info(Dj),    (8.2)
where partitioning on attribute A splits D into v partitions, D1, D2,..., Dv.
InfoA(D) is the expected information required to classify a tuple from D based on the partitioning
by A.
The smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (i.e.,
based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning
on A).
That is,
Gain(A) = Info(D) − InfoA(D).
The attribute A with the highest information gain, Gain(A), is chosen as the splitting attribute at
node N.
28. Example 8.1 Induction of a decision tree using information gain.
Table below presents a training set, D, of class-labeled tuples randomly selected from the
AllElectronics customer database.
29. The class label attribute, buys computer, has two distinct values (namely, {yes, no}); therefore, there
are two distinct classes (i.e., m = 2).
Let class C1 correspond to yes and class C2 correspond to no.
There are nine tuples of class yes and five tuples of class no.
A (root) node N is created for the tuples in D. To find the splitting criterion for these tuples, we must
compute the information gain of each attribute.
We first use Eq. (8.1) to compute the expected information needed to classify a tuple in D:
Info(D) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940 bits.
30. Next, we need to compute the expected information requirement for each attribute.
Let’s start with the attribute age.
We need to look at the distribution of yes and no tuples for each category of age.
For the age category “youth,” there are two yes tuples and three no tuples.
For the category “middle aged,” there are four yes tuples and zero no tuples.
For the category “senior,” there are three yes tuples and two no tuples.
Using Eq. (8.2), the expected information needed to classify a tuple in D if the tuples are
partitioned according to age is
Infoage(D) = (5/14) × (−(2/5) log2(2/5) − (3/5) log2(3/5)) + (4/14) × 0 + (5/14) × (−(3/5) log2(3/5) − (2/5) log2(2/5)) = 0.694 bits,
where the middle term is 0 because the middle aged partition is pure.
31. Hence, the gain in information from such a partitioning would be
Gain(age) = Info(D) − Infoage(D) = 0.940 − 0.694 = 0.246 bits.
Similarly, we can compute Gain(income) = 0.029 bits, Gain(student) = 0.151 bits, and Gain(credit
rating) = 0.048 bits.
Because age has the highest information gain among the attributes, it is selected as the splitting
attribute.
Node N is labeled with age, and branches are grown for each of the attribute’s values. The tuples
are then partitioned accordingly, as shown in Figure below.
32. Notice that the tuples falling into the partition for age = middle aged all belong to the same
class.
Because they all belong to class “yes,” a leaf should therefore be created at the end of this
branch and labeled “yes.”
The final decision tree returned by the algorithm was shown earlier.
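The figures quoted in this example are easy to verify with a few lines of arithmetic. The sketch below recomputes Info(D), Infoage(D), and Gain(age) directly from the class counts stated above (9 yes / 5 no overall; 2/3, 4/0, and 3/2 yes/no tuples for youth, middle aged, and senior).

```python
import math

def info(counts):
    """Expected information, Eq. (8.1), for a class distribution given as counts."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c > 0)

info_d = info([9, 5])                      # overall: 9 yes, 5 no  -> ~0.940 bits

# yes/no counts per age category: youth, middle aged, senior (from the text above).
age_partitions = [[2, 3], [4, 0], [3, 2]]
info_age = sum(sum(p) / 14 * info(p) for p in age_partitions)     # ~0.694 bits

print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
# ~0.940, ~0.694, ~0.247 exactly; with the rounded values used in the text,
# Gain(age) = 0.940 - 0.694 = 0.246 bits, the largest gain among the four attributes.
```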
33. Gain Ratio
C4.5, a successor of ID3, uses an extension to information gain known as gain ratio.
It applies a kind of normalization to information gain using a “split information” value defined
analogously with Info(D) as
SplitInfoA(D) = − Σj=1..v (|Dj|/|D|) log2(|Dj|/|D|).    (8.5)
This value represents the potential information generated by splitting the training data set, D,
into v partitions, corresponding to the v outcomes of a test on attribute A.
Note that, for each outcome, it considers the number of tuples having that outcome with
respect to the total number of tuples in D.
34. It differs from information gain, which measures the information with respect to classification
that is acquired based on the same partitioning.
The gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfoA(D).
The attribute with the maximum gain ratio is selected as the splitting attribute.
35. Example 8.2 Computation of gain ratio for the attribute income.
A test on income splits the data of Table 8.1 into three partitions, namely low, medium, and high,
containing four, six, and four tuples, respectively. To compute the gain ratio of income, we first
use Eq. (8.5) to obtain
SplitInfoincome(D) = −(4/14) log2(4/14) − (6/14) log2(6/14) − (4/14) log2(4/14) = 1.557.
From Example 8.1, we have Gain(income) = 0.029.
Therefore, GainRatio(income) = 0.029/1.557 = 0.019.
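A quick numerical check of this example, reusing the partition sizes given above (four, six, and four tuples out of fourteen) and the Gain(income) value from Example 8.1:

```python
import math

sizes, total = [4, 6, 4], 14   # income = low, medium, high partition sizes

# SplitInfo_income(D), Eq. (8.5): the potential information of the split itself.
split_info = -sum(s / total * math.log2(s / total) for s in sizes)

gain_income = 0.029            # from Example 8.1
print(round(split_info, 3), round(gain_income / split_info, 3))   # 1.557 0.019
```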
36. Gini Index
The Gini index is used in CART.
Using the notation previously described, the Gini index measures the impurity of D, a data
partition or set of training tuples, as
Gini(D) = 1 − Σi=1..m pi²,    (8.7)
where pi is the probability that a tuple in D belongs to class Ci and is estimated by |Ci,D|/|D|.
The sum is computed over m classes.
The Gini index considers a binary split for each attribute.
37. Let’s first consider the case where A is a discrete-valued attribute having v distinct values, {a1, a2,...,
av }, occurring in D.
To determine the best binary split on A, we examine all the possible subsets that can be formed
using known values of A.
For example, if income has three possible values, namely {low, medium, high}, then the possible
subsets are {low, medium, high}, {low, medium}, {low, high}, {medium, high}, {low}, {medium}, {high},
and {}.
We exclude the full set, {low, medium, high}, and the empty set from consideration since,
conceptually, they do not represent a split.
Therefore, there are 2^v − 2 possible ways to form two partitions of the data, D, based on a binary
split on A.
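The candidate subsets are easy to enumerate programmatically; a small sketch for the three income values used in the example:

```python
from itertools import combinations

values = ["low", "medium", "high"]   # the known values of attribute A
# Every non-empty proper subset of the values defines one side of a candidate binary split.
subsets = [set(c) for r in range(1, len(values)) for c in combinations(values, r)]
print(len(subsets))                  # 2^3 - 2 = 6 (each two-way split appears twice)
for s in subsets:
    print(s, "vs", set(values) - s)
```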
38. When considering a binary split, we compute a weighted sum of the impurity of each resulting
partition.
For example, if a binary split on A partitions D into D1 and D2, the Gini index of D given that
partitioning is
GiniA(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2).
For each attribute, each of the possible binary splits is considered.
For a discrete-valued attribute, the subset that gives the minimum Gini index for that attribute is
selected as its splitting subset.
For continuous-valued attributes, each possible split-point must be considered.
39. The reduction in impurity that would be incurred by a binary split on a discrete- or
continuous-valued attribute A is
ΔGini(A) = Gini(D) − GiniA(D).
The attribute that maximizes the reduction in impurity (or, equivalently, has the minimum Gini
index) is selected as the splitting attribute.
This attribute and either its splitting subset (for a discrete-valued splitting attribute) or split-
point (for a continuous-valued splitting attribute) together form the splitting criterion.
40. Example 8.3 Induction of a decision tree using the Gini index.
Let D be the training data shown earlier in Table 8.1, where there are nine tuples belonging to
the class buys computer = yes and the remaining five tuples belong to the class buys
computer = no.
A (root) node N is created for the tuples in D.
We first use Eq. (8.7) for the Gini index to compute the impurity of D:
Gini(D) = 1 − (9/14)² − (5/14)² = 0.459.
To find the splitting criterion for the tuples in D, we need to compute the Gini index for each
attribute.
Let’s start with the attribute income and consider each of the possible splitting subsets.
Consider the subset {low, medium}. This would result in 10 tuples in partition D1 satisfying the
condition “income ∈ {low, medium}.”
The remaining four tuples of D would be assigned to partition D2.
41. The Gini index value computed based on this partitioning is
Giniincome ∈ {low,medium}(D) = (10/14) Gini(D1) + (4/14) Gini(D2) = 0.443.
42. Similarly, the Gini index values for splits on the remaining subsets are 0.458 (for the subsets {low,
high} and {medium}) and 0.450 (for the subsets {medium, high} and {low}).
Therefore, the best binary split for attribute income is on {low, medium} (or {high}) because it
minimizes the Gini index.
Evaluating age, we obtain {youth, senior} (or {middle aged}) as the best split for age with a Gini
index of 0.375; the attributes student and credit rating are both binary, with Gini index values of
0.367 and 0.429, respectively.
The attribute age and splitting subset {youth, senior} therefore give the minimum Gini index
overall, with a reduction in impurity of 0.459 − 0.357 = 0.102.
The binary split “age ∈ {youth, senior}?” results in the maximum reduction in impurity of the
tuples in D and is returned as the splitting criterion.
Node N is labeled with the criterion, two branches are grown from it, and the tuples are
partitioned accordingly.
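These Gini values can be reproduced from the class counts per income value, assuming the usual AllElectronics distribution from Table 8.1 (yes/no counts: low 3/1, medium 4/2, high 2/2, giving 9/5 overall); those per-value counts are taken as an assumption here since the table itself is not reproduced in the text.

```python
def gini(counts):
    """Gini index, Eq. (8.7), for a class distribution given as counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(f"{gini([9, 5]):.3f}")                       # 0.459 = Gini(D)

# Assumed yes/no counts per income value (low, medium, high) from Table 8.1.
income = {"low": [3, 1], "medium": [4, 2], "high": [2, 2]}

def gini_split(side):
    """Weighted Gini index of the binary split 'side' vs. the remaining income values."""
    d1 = [sum(income[v][0] for v in side), sum(income[v][1] for v in side)]
    d2 = [9 - d1[0], 5 - d1[1]]
    return sum(d1) / 14 * gini(d1) + sum(d2) / 14 * gini(d2)

print(f"{gini_split({'low', 'medium'}):.3f}")      # 0.443
print(f"{gini_split({'low', 'high'}):.3f}")        # 0.458
print(f"{gini_split({'medium', 'high'}):.3f}")     # 0.450
```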
43. Tree Pruning
When a decision tree is built, many of the branches will reflect anomalies in the training data
due to noise or outliers.
Tree pruning methods address this problem of overfitting the data.
Such methods typically use statistical measures to remove the least-reliable branches.
An unpruned tree and a pruned version of it are shown in Figure below.
Pruned trees tend to be smaller and less complex and, thus, easier to comprehend.
They are usually faster and better at correctly classifying independent test data (i.e., of
previously unseen tuples) than unpruned trees.
“How does tree pruning work?”
There are two common approaches to tree pruning: prepruning and postpruning
44. In the prepruning approach, a tree is “pruned” by halting its construction early.
Upon halting, the node becomes a leaf.
The leaf may hold the most frequent class among the subset tuples or the probability
distribution of those tuples.
When constructing a tree, measures such as statistical significance, information gain, Gini
index, and so on, can be used to assess the goodness of a split.
The second and more common approach is postpruning, which removes subtrees from a
“fully grown” tree.
A subtree at a given node is pruned by removing its branches and replacing it with a leaf.
The leaf is labeled with the most frequent class among the subtree being replaced.
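In practice, both pruning styles are often exposed as library parameters rather than written by hand. For example, scikit-learn's DecisionTreeClassifier offers prepruning controls such as max_depth and min_samples_leaf, and minimal cost-complexity postpruning via ccp_alpha. A hedged sketch, with an illustrative dataset and arbitrary parameter values:

```python
# Sketch of prepruning vs. postpruning with scikit-learn (values are illustrative only).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Prepruning: halt tree construction early by limiting depth and leaf size.
pre = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5).fit(X_train, y_train)

# Postpruning: grow the tree, then prune subtrees by cost complexity (ccp_alpha > 0).
post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))
```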
45. Bayes Classification Methods
“What are Bayesian classifiers?”
Bayesian classifiers are statistical classifiers.
They can predict class membership probabilities such as the probability that a given tuple belongs
to a particular class.
Bayesian classification is based on Bayes’ theorem.
Studies comparing classification algorithms have found a simple Bayesian classifier known as the
naïve Bayesian classifier to be comparable in performance with decision tree and selected neural
network classifiers.
Bayesian classifiers have also exhibited high accuracy and speed when applied to large databases.
46. Bayes’ Theorem
Bayes’ theorem is named after Thomas Bayes, who did early work in probability and decision
theory during the 18th century.
Let X be a data tuple.
In Bayesian terms, X is considered “evidence.”
As usual, it is described by measurements made on a set of n attributes.
Let H be some hypothesis such as that the data tuple X belongs to a specified class C.
For classification problems, we want to determine P(H|X), the probability that the hypothesis H
holds given the “evidence” or observed data tuple X.
In other words, we are looking for the probability that tuple X belongs to class C, given that we
know the attribute description of X.
47. P(H|X) is the posterior probability of H conditioned on X.
For example, suppose our world of data tuples is confined to customers described by the
attributes age and income, respectively, and that X is a 35-year-old customer with an income of
$40,000.
Suppose that H is the hypothesis that our customer will buy a computer.
Then P(H|X) reflects the probability that customer X will buy a computer given that we know the
customer’s age and income.
In contrast, P(H) is the prior probability of H.
For our example, this is the probability that any given customer will buy a computer, regardless
of age, income, or any other information.
Similarly, P(X|H) is the posterior probability of X conditioned on H.
That is, it is the probability that a customer, X, is 35 years old and earns $40,000, given that we
know the customer will buy a computer.
48. P(X) is the prior probability of X.
Using our example, it is the probability that a person from our set of customers is 35 years
old and earns $40,000.
“How are these probabilities estimated?”
P(H), P(X|H), and P(X) may be estimated from the given data.
Bayes’ theorem is useful in that it provides a way of calculating the posterior probability,
P(H|X), from P(H), P(X|H), and P(X).
Bayes’ theorem is
P(H|X) = P(X|H) P(H) / P(X).    (8.10)
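As a tiny numeric illustration of the theorem (the probabilities below are invented for the example, not taken from the AllElectronics data):

```python
# Hypothetical values: prior, likelihood, and evidence chosen purely for illustration.
p_h = 0.3           # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.5   # P(X|H): probability of this customer profile among buyers
p_x = 0.2           # P(X): probability of this customer profile overall

p_h_given_x = p_x_given_h * p_h / p_x   # Bayes' theorem, Eq. (8.10)
print(round(p_h_given_x, 2))            # 0.75
```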
49. Naïve Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented
by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n measurements made on the tuple
from n attributes, respectively, A1, A2,..., An.
2. Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the classifier will predict that X
belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve
Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.
Thus, we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum a posteriori
hypothesis. By Bayes’ theorem (Eq. 8.10),
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
50. 3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely, that is,
P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|Ci). Otherwise, we maximize
P(X|Ci)P(Ci). Note that the class prior probabilities may be estimated by P(Ci) = |Ci,D|/|D|, where |Ci,D| is
the number of training tuples of class Ci in D.
4. Given data sets with many attributes, it would be extremely computationally expensive to compute
P(X|Ci). To reduce computation in evaluating P(X|Ci), the naïve assumption of class-conditional
independence is made. This presumes that the attributes’ values are conditionally independent of one
another, given the class label of the tuple (i.e., that there are no dependence relationships among the
attributes). Thus,
P(X|Ci) = Πk=1..n P(xk|Ci) = P(x1|Ci) × P(x2|Ci) × ··· × P(xn|Ci).
We can easily estimate the probabilities P(x1|Ci), P(x2|Ci),..., P(xn|Ci) from the training tuples. Recall that
here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the
attribute is categorical or continuous-valued.
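As a rough, minimal sketch of these steps for purely categorical attributes (the function and variable names below are illustrative, not from the text), the class priors P(Ci) and the likelihoods P(xk|Ci) can be estimated by simple counting and combined under the independence assumption:

from collections import Counter, defaultdict

def train_naive_bayes(tuples, labels):
    # Estimate P(Ci) = |Ci,D| / |D| and the per-attribute counts needed for P(xk | Ci).
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # value_counts[(Ci, k)][v] = count of value v for attribute Ak in class Ci
    for x, c in zip(tuples, labels):
        for k, v in enumerate(x):
            value_counts[(c, k)][v] += 1
    priors = {c: class_counts[c] / len(labels) for c in class_counts}
    return priors, value_counts, class_counts

def naive_bayes_classify(x, priors, value_counts, class_counts):
    # Return the class Ci maximizing P(X|Ci)P(Ci), with
    # P(X|Ci) = P(x1|Ci) * P(x2|Ci) * ... * P(xn|Ci).
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for k, v in enumerate(x):
            score *= value_counts[(c, k)][v] / class_counts[c]   # relative-frequency estimate of P(xk|Ci)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

In practice a zero count for some P(xk|Ci) drives the whole product to zero, so a Laplacian (add-one) correction is commonly applied; it is omitted here for brevity.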
51. Example - Predicting a class label using naïve Bayesian classification.
We wish to predict the class label of a tuple using naïve Bayesian classification, given the
same training data as in decision tree induction.
The training data are shown in the table below.
52. The data tuples are described by the attributes age, income, student, and credit rating.
The class label attribute, buys computer, has two distinct values (namely, {yes, no}).
Let C1 correspond to the class buys computer = yes and C2 correspond to buys computer = no.
The tuple we wish to classify is
X = (age = youth, income = medium, student = yes, credit rating = fair)
54. k-Nearest-Neighbor Classifiers
The k-nearest-neighbor method was first described in the early 1950s.
The method is labor intensive when given large training sets, and did not gain popularity until
the 1960s when increased computing power became available.
It has since been widely used in the area of pattern recognition.
Nearest-neighbor classifiers are based on learning by analogy, that is, by comparing a given
test tuple with training tuples that are similar to it.
The training tuples are described by n attributes.
Each tuple represents a point in an n-dimensional space.
In this way, all the training tuples are stored in an n-dimensional pattern space.
When given an unknown tuple, a k-nearest-neighbor classifier searches the pattern space for
the k training tuples that are closest to the unknown tuple.
These k training tuples are the k “nearest neighbors” of the unknown tuple.
55. “Closeness” is defined in terms of a distance metric, such as Euclidean distance.
The Euclidean distance between two points or tuples, say, X1 = (x11, x12,..., x1n) and X2 = (x21,
x22,..., x2n), is
dist(X1, X2) = √[(x11 − x21)² + (x12 − x22)² + ··· + (x1n − x2n)²].
In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root of this accumulated total is then taken.
Typically, we normalize the values of each attribute before using Eq. (9.22).
This helps prevent attributes with initially large ranges (e.g., income) from outweighing
attributes with initially smaller ranges (e.g., binary attributes).
56. Min-max normalization, for example, can be used to transform a value v of a numeric attribute A
to v′ in the range [0, 1] by computing
v′ = (v − minA) / (maxA − minA),
where minA and maxA are the minimum and maximum values of attribute A.
For k-nearest-neighbor classification, the unknown tuple is assigned the most common class
among its k-nearest neighbors.
When k = 1, the unknown tuple is assigned the class of the training tuple that is closest to it in
pattern space.
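A minimal sketch of these steps for purely numeric attributes (the names below are illustrative): min-max normalization of the training tuples, followed by a majority vote among the k nearest neighbors under Euclidean distance.

import math
from collections import Counter

def min_max_normalize(rows):
    # Rescale every numeric attribute A to [0, 1] via v' = (v - minA) / (maxA - minA).
    cols = list(zip(*rows))
    mins, maxs = [min(c) for c in cols], [max(c) for c in cols]
    scaled = [tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                    for v, lo, hi in zip(row, mins, maxs))
              for row in rows]
    return scaled, mins, maxs

def euclidean(x1, x2):
    # dist(X1, X2) = sqrt( sum over i of (x1i - x2i)^2 )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def knn_classify(unknown, train_points, train_labels, k=3):
    # Find the k training tuples closest to the unknown tuple and return the most common class among them.
    nearest = sorted(zip(train_points, train_labels),
                     key=lambda pair: euclidean(unknown, pair[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

Note that the unknown tuple should be rescaled with the same minA and maxA values used for the training data before its neighbors are searched for.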
57. “But how can distance be computed for attributes that are not numeric, but nominal (or
categorical), such as color?”
For nominal attributes, a simple method is to compare the corresponding value of the
attribute in tuple X1 with that in tuple X2.
If the two are identical (e.g., tuples X1 and X2 both have the color blue), then the difference
between the two is taken as 0.
If the two are different (e.g., tuple X1 is blue but tuple X2 is red), then the difference is
considered to be 1.
Other methods may incorporate more sophisticated schemes for differential grading (e.g.,
where a larger difference score is assigned, say, for blue and white than for blue and black).
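The simple 0/1 rule for nominal attributes can be folded directly into the distance computation. The short sketch below (illustrative names, not from the text) mixes squared numeric differences with a mismatch penalty for the attribute positions listed as nominal.

def mixed_distance(x1, x2, nominal_positions):
    # Numeric attributes contribute their squared difference; nominal attributes
    # contribute 0 if the two values are identical and 1 if they differ.
    total = 0.0
    for i, (a, b) in enumerate(zip(x1, x2)):
        if i in nominal_positions:
            total += 0.0 if a == b else 1.0
        else:
            total += (a - b) ** 2
    return total ** 0.5

# Example: attribute 2 holds a color, so identical colors add nothing to the distance.
# mixed_distance((0.2, 0.5, "blue"), (0.4, 0.1, "red"), nominal_positions={2})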
58. Model Evaluation and Selection
Metrics for Evaluating Classifier Performance
The classifier evaluation measures presented in this section are summarized in the figure
below.
They include accuracy (also known as recognition rate), sensitivity (or recall), specificity,
precision, F1, and Fβ.
59. Positive tuples and negative tuples
Given two classes, for example, the positive tuples may be buys computer = yes while the
negative tuples are buys computer = no.
Suppose we use our classifier on a test set of labeled tuples.
P is the number of positive tuples and N is the number of negative tuples.
60. There are four terms we need to know that are the “building blocks” used in computing many
evaluation measures.
True positives (TP): These refer to the positive tuples that were correctly labeled by the
classifier. Let TP be the number of true positives.
True negatives (TN): These are the negative tuples that were correctly labeled by the classifier.
Let TN be the number of true negatives.
False positives (FP): These are the negative tuples that were incorrectly labeled as positive
(e.g., tuples of class buys computer = no for which the classifier predicted buys computer =
yes). Let FP be the number of false positives.
False negatives (FN): These are the positive tuples that were mislabeled as negative (e.g.,
tuples of class buys computer = yes for which the classifier predicted buys computer = no). Let FN be
the number of false negatives.
61. These terms are summarized in the confusion matrix below:
                    predicted yes    predicted no    total
actual class yes    TP               FN              P
actual class no     FP               TN              N
62. The confusion matrix is a useful tool for analyzing how well your classifier can recognize tuples of
different classes.
TP and TN tell us when the classifier is getting things right, while FP and FN tell us when the
classifier is getting things wrong.
Now let’s look at the evaluation measures, starting with accuracy.
The accuracy of a classifier on a given test set is the percentage of test set tuples that are
correctly classified by the classifier. That is,
accuracy(M) = (TP + TN) / (P + N).
We can also speak of the error rate or misclassification rate of a classifier, M, which is simply 1 −
accuracy(M), where accuracy(M) is the accuracy of M.
This also can be computed as
error rate(M) = (FP + FN) / (P + N).
63. When the classes are imbalanced, accuracy alone can be misleading; the sensitivity and specificity
measures can be used, respectively, to assess how well the classifier recognizes the positive and the
negative tuples.
Sensitivity is also referred to as the true positive (recognition) rate (i.e., the proportion of positive
tuples that are correctly identified), while specificity is the true negative rate (i.e., the proportion of
negative tuples that are correctly identified).
These measures are defined as
sensitivity = TP / P and specificity = TN / N.
64. The precision and recall measures are also widely used in classification.
Precision can be thought of as a measure of exactness (i.e., what percentage of tuples labeled as
positive are actually such), whereas recall is a measure of completeness (what percentage of
positive tuples are labeled as such).
If recall seems familiar, that’s because it is the same as sensitivity (or the true positive rate).
These measures can be computed as
precision = TP / (TP + FP) and recall = TP / (TP + FN) = TP / P.
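A small sketch (illustrative name) computing all of these measures from the four confusion-matrix counts; the F1 line, the harmonic mean of precision and recall, anticipates the Fβ family mentioned above.

def evaluation_measures(tp, tn, fp, fn):
    # P and N are the numbers of positive and negative test tuples.
    p, n = tp + fn, fp + tn
    precision = tp / (tp + fp)
    recall = tp / p                                # same as sensitivity / true positive rate
    return {
        "accuracy":    (tp + tn) / (p + n),
        "error_rate":  (fp + fn) / (p + n),        # = 1 - accuracy
        "sensitivity": recall,
        "specificity": tn / n,                     # true negative rate
        "precision":   precision,
        "recall":      recall,
        "F1":          2 * precision * recall / (precision + recall),
    }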
65. Cross-Validation
In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets
or “folds,” D1, D2,..., Dk , each of approximately equal size.
Training and testing are performed k times.
In iteration i, partition Di is reserved as the test set, and the remaining partitions are collectively
used to train the model.
That is, in the first iteration, subsets D2,..., Dk collectively serve as the training set to obtain a first
model, which is tested on D1; the second iteration is trained on subsets D1, D3,..., Dk and tested on
D2; and so on.
Leave-one-out is a special case of k-fold cross-validation where k is set to the number of initial
tuples.
That is, only one sample is “left out” at a time for the test set.
In stratified cross-validation, the folds are stratified so that the class distribution of the tuples in
each fold is approximately the same as that in the initial data.
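A minimal sketch of (unstratified) k-fold cross-validation; train_fn and score_fn are placeholders for whatever learner and accuracy measure are being evaluated and are not defined in the text above.

import random

def k_fold_cross_validation(tuples, labels, k, train_fn, score_fn, seed=0):
    # Randomly partition the tuple indices into k mutually exclusive folds of roughly equal size.
    indices = list(range(len(tuples)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        test_idx = folds[i]                        # fold Di is reserved as the test set in iteration i
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = train_fn([tuples[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        scores.append(score_fn(model,
                               [tuples[j] for j in test_idx],
                               [labels[j] for j in test_idx]))
    return sum(scores) / k                         # average score over the k iterations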
66. Bootstrap
The bootstrap method samples the given training tuples uniformly with replacement.
That is, each time a tuple is selected, it is equally likely to be selected again and re-added to
the training set.
For instance, imagine a machine that randomly selects tuples for our training set.
In sampling with replacement, the machine is allowed to select the same tuple more than once.
There are several bootstrap methods.
67. A commonly used one is the .632 bootstrap, which works as follows.
Suppose we are given a data set of d tuples.
The data set is sampled d times, with replacement, resulting in a bootstrap sample or training
set of d samples.
It is very likely that some of the original data tuples will occur more than once in this sample.
The data tuples that did not make it into the training set end up forming the test set.
Suppose we were to try this out several times. As it turns out, on average, 63.2% of the original
data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set,
since for large d each tuple is missed by all d draws with probability (1 − 1/d)^d ≈ e⁻¹ ≈ 0.368.
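A minimal sketch of drawing one such bootstrap sample (illustrative names); repeating it over many runs confirms that roughly 63.2% of the distinct tuples land in the training set.

import random

def bootstrap_sample(data, seed=0):
    # Sample d tuples uniformly with replacement; the tuples never chosen form the test set.
    rng = random.Random(seed)
    d = len(data)
    chosen = [rng.randrange(d) for _ in range(d)]
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]                          # may contain duplicates
    test = [data[i] for i in range(d) if i not in chosen_set]  # "left out" tuples
    return train, test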
68. Techniques to Improve Classification Accuracy
Bagging, boosting, and random forests are examples of ensemble methods (Figure below).
An ensemble combines a series of k learned models (or base classifiers), M1, M2,..., Mk , with the
aim of creating an improved composite classification model, M∗.
A given data set, D, is used to create k training sets, D1, D2,..., Dk, where Di (1 ≤ i ≤ k) is
used to generate classifier Mi.
Given a new data tuple to classify, the base classifiers each vote by returning a class prediction.
The ensemble returns a class prediction based on the votes of the base classifiers.
69. Bagging
Given a set, D, of d tuples, bagging works as follows.
For iteration i (i = 1, 2,..., k), a training set, Di, of d tuples is sampled with replacement from the
original set of tuples, D.
Note that the term bagging stands for bootstrap aggregation.
Each training set is a bootstrap sample.
Because sampling with replacement is used, some of the original tuples of D may not be included
in Di , whereas others may occur more than once.
A classifier model, Mi , is learned for each training set, Di .
To classify an unknown tuple, X, each classifier, Mi , returns its class prediction, which counts as
one vote.
The bagged classifier, M∗, counts the votes and assigns the class with the most votes to X.
Bagging can be applied to the prediction of continuous values by taking the average of the
individual predictions for a given test tuple.
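A minimal sketch of bagging around an arbitrary base learner; base_learner is a placeholder that is assumed to return a callable model and is not something defined in the text.

import random
from collections import Counter

def bagging(tuples, labels, k, base_learner, seed=0):
    # Learn one base classifier Mi per bootstrap sample Di of the original data D.
    rng = random.Random(seed)
    d = len(tuples)
    models = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]             # sample d tuples with replacement
        models.append(base_learner([tuples[i] for i in idx],
                                   [labels[i] for i in idx]))
    return models

def bagged_predict(models, x):
    # Each base classifier casts one vote; the class with the most votes wins.
    return Counter(m(x) for m in models).most_common(1)[0][0]

For continuous prediction, the final line would instead return the average of the individual model outputs.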
70. Boosting and AdaBoost
In boosting, a weight is assigned to each training tuple.
A series of k classifiers is iteratively learned. After a classifier, Mi , is learned, the weights are updated
to allow the subsequent classifier, Mi+1, to “pay more attention” to the training tuples that were
misclassified by Mi .
The final boosted classifier, M∗, combines the votes of each individual classifier, where the weight of
each classifier’s vote is a function of its accuracy.
AdaBoost (short for Adaptive Boosting) is a popular boosting algorithm.
Suppose we want to boost the accuracy of a learning method.
We are given D, a data set of d class-labeled tuples, (X1, y1),(X2, y2),...,(Xd, yd), where yi is the class
label of tuple Xi .
Initially, AdaBoost assigns each training tuple an equal weight of 1/d.
Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm.
In round i, the tuples from D are sampled to form a training set, Di , of size d.
71. Sampling with replacement is used—the same tuple may be selected more than once.
Each tuple’s chance of being selected is based on its weight.
A classifier model, Mi , is derived from the training tuples of Di .
Its error is then calculated using Di as a test set.
The weights of the training tuples are then adjusted according to how they were classified.
If a tuple was incorrectly classified, its weight is increased.
If a tuple was correctly classified, its weight is decreased.
A tuple’s weight reflects how difficult it is to classify: the higher the weight, the more often it has
been misclassified.
These weights will be used to generate the training samples for the classifier of the next round.
The basic idea is that when we build a classifier, we want it to focus more on the misclassified tuples of
the previous round.
Some classifiers may be better at classifying some “difficult” tuples than others. In this way, we build a
series of classifiers that complement each other.
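A rough sketch of the sampling form of AdaBoost described above (illustrative names; base_learner is a placeholder that returns a callable model). The weight update shrinks the weights of correctly classified tuples and renormalizes, so misclassified tuples carry relatively more weight in the next round, and each classifier's vote weight grows with its accuracy.

import math
import random

def adaboost(tuples, labels, k, base_learner, seed=0):
    rng = random.Random(seed)
    d = len(tuples)
    weights = [1.0 / d] * d                        # every tuple starts with weight 1/d
    ensemble = []                                  # (model, vote weight) pairs
    for _ in range(k):
        # Draw the round-i training set Di of size d according to the tuple weights.
        idx = rng.choices(range(d), weights=weights, k=d)
        model = base_learner([tuples[i] for i in idx], [labels[i] for i in idx])
        # error(Mi) = sum of the weights of the tuples that Mi misclassifies.
        wrong = [model(tuples[i]) != labels[i] for i in range(d)]
        error = sum(w for w, bad in zip(weights, wrong) if bad)
        if error == 0.0 or error >= 0.5:
            weights = [1.0 / d] * d                # degenerate round: reset weights and move on
            continue
        # Decrease the weights of correctly classified tuples, then renormalize.
        weights = [w * error / (1.0 - error) if not bad else w
                   for w, bad in zip(weights, wrong)]
        total = sum(weights)
        weights = [w / total for w in weights]
        # A more accurate classifier gets a larger say in the final vote.
        ensemble.append((model, math.log((1.0 - error) / error)))
    return ensemble

def boosted_predict(ensemble, x):
    # Add up the vote weights of the classifiers predicting each class.
    votes = {}
    for model, alpha in ensemble:
        c = model(x)
        votes[c] = votes.get(c, 0.0) + alpha
    return max(votes, key=votes.get)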
72. Random Forests
Random forests can be built using bagging in tandem with random attribute selection.
A training set, D, of d tuples is given.
The general procedure to generate k decision trees for the ensemble is as follows.
For each iteration, i (i = 1, 2,..., k), a training set, Di, of d tuples is sampled with replacement from D.
That is, each Di is a bootstrap sample of D, so that some tuples may occur more than once in Di ,
while others may be excluded.
Let F be the number of attributes to be used to determine the split at each node, where F is much
smaller than the number of available attributes.
To construct a decision tree classifier, Mi , randomly select, at each node, F attributes as candidates
for the split at the node.
The CART methodology is used to grow the trees.
The trees are grown to maximum size and are not pruned.
Random forests formed this way, with random input selection, are called Forest-RI.
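A minimal sketch of the Forest-RI loop; build_cart_tree is a placeholder for a CART-style learner that accepts the number F of randomly chosen candidate attributes per node and grows an unpruned tree, and it is not defined in this text.

import random

def forest_ri(tuples, labels, k, F, build_cart_tree, seed=0):
    # Grow k unpruned decision trees, each from its own bootstrap sample Di,
    # considering only F randomly chosen attributes as split candidates at each node.
    rng = random.Random(seed)
    d = len(tuples)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]          # bootstrap sample of D
        forest.append(build_cart_tree([tuples[i] for i in idx],
                                      [labels[i] for i in idx],
                                      candidate_attrs_per_node=F))
    return forest

Classification then proceeds as in the bagging sketch above: each tree votes and the majority class is returned.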