Classification (Chapter 5, Data Mining the Web). By Hussain Ahmad, M.S. (Semantic Web), University of Peshawar, Pakistan
..Classification.. In clustering we use the document class labels only for evaluation purposes; in classification they are an essential part of the input to the learning system. The objective of the system is to create a mapping (also called a model or hypothesis) between a set of documents and a set of class labels. This mapping is then used to automatically determine the class of new (unlabeled) documents.
..Classification.. This mapping process is called classification. The general framework for classification includes the model-creation phase along with several other steps. Therefore, the general framework is usually called supervised learning (also, learning from examples or concept learning) and includes the following steps:
..Classification.. Step 1: Data collection and preprocessing. Documents are collected, cleaned, and properly organized, the terms (features) are identified, and a vector space representation is created. Documents are organized in classes (categories) based on their topic, user preference, or any other criterion. The data are divided into two subsets: the training set, the part of the data used to create the model, and the test set, the part of the data used for testing the model.
..Classification.. Step 2: Building the model. This is the actual learning (also called training) step, which includes the use of the learning algorithm. It is usually an iterative and interactive process that may include other steps and may be repeated several times so that the best model is created: feature selection, applying the learning algorithm, and validating the model (using the validation subset to tune some parameters of the learning algorithm).
..Classification.. Step 3: Testing and evaluating the model. At this step the model is applied to the documents from the test set, and their actual class labels are compared to the predicted labels. Step 4: Using the model to classify new documents (with unknown class labels).
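A minimal end-to-end sketch of these four steps is shown below. It assumes scikit-learn is available; the documents, labels, and choice of classifier are illustrative placeholders, not part of the original slides.

```python
# Sketch of the supervised-learning framework (assumption: scikit-learn is used;
# the corpus, labels, and classifier below are hypothetical).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

docs = ["history of music", "criminal justice program", "communication theory",
        "anthropology field work", "theatre and performance"]   # hypothetical corpus
labels = ["A", "B", "B", "A", "B"]                               # hypothetical class labels

# Step 1: preprocessing -> vector space (TFIDF) representation, then train/test split
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.4, random_state=0)

# Step 2: build (train) the model
model = KNeighborsClassifier(n_neighbors=1, metric="cosine").fit(X_train, y_train)

# Step 3: test and evaluate the model on the held-out documents
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Step 4: use the model to classify a new, unlabeled document
print("predicted class:", model.predict(vectorizer.transform(["modern theatre studies"])))
```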
Web Documents Exhibit Specific Properties. Web documents exhibit some specific properties that may require adjustment to, or careful choice of, the learning algorithm. The basic ones are: text and web documents include thousands of words; the document features inherit some of the properties of the natural language text from which they are derived; and documents are of different sizes.
Nearest-Neighbor Algorithm The nearest-neighbor algorithm is a straightforward application of similarity (or distance) for the purposes of classification. It predicts the class of a new document using the class label of the closest document from the training set. Because it uses just one instance from the training set, this basic version of the algorithm is called one-nearest-neighbor (1-NN).
..NN Algorithm.. The closeness is measured by minimal distance or maximal similarity. The most common approach is to use the TFIDF (term frequency–inverse document frequency) framework to represent both the test and training documents and to compute the cosine similarity between the document vectors.
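As a concrete illustration (not taken from the slides), cosine similarity between two TFIDF vectors can be computed directly; the vectors below are hypothetical stand-ins for rows of the table.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length TFIDF vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

# Hypothetical six-attribute TFIDF vectors (the actual table values are not reproduced here)
theatre = [0.0, 0.4, 0.0, 0.0, 0.7, 0.3]
other   = [0.0, 0.5, 0.0, 0.0, 0.6, 0.0]
print(cosine_similarity(theatre, other))
```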
..NN Algorithm.. Let us consider the department document collection, represented as TFIDF vectors with six attributes, along with the class labels for each document, as shown in the table. Assume that the class of the Theatre document is unknown. To determine the class of this document, we compute the cosine similarity between the Theatre vector and all other vectors.
[Table: the TFIDF document vectors (six attributes) with their class labels, and the cosine similarities between the Theatre vector and the other document vectors; not reproduced here.]
..NN Algorithm.. The 1-NN approach simply picks the most similar document, i.e., Criminal Justice, and uses its label B to predict the class of Theatre. However, if we look at the nearest neighbor of Theatre (Criminal Justice), we see only one nonzero attribute, which alone produces the prediction. This makes the algorithm extremely sensitive to noise and irrelevant attributes.
..NN Algorithm.. Therefore, using 1-NN, two assumptions are made: there is no noise, and all attributes are equally important for the classification. k-NN is a generalization of 1-NN in which the parameter k is selected to be a small odd number (usually 3 or 5). For example, 3-NN classifies Theatre as class B, because B is the majority label among the top three documents (B, A, B). 5-NN predicts class A, because the set of labels of the top five documents is {B, A, B, A, A}.
..NN Algorithm.. Distance-weighted k-NN weights each neighbor's vote by its closeness to the query. For example, the distance-weighted 3-NN with the simplest weighting scheme [sim(X, Y)] will predict class B for the Theatre document, because the weight for label B (documents Criminal Justice and Communication) is B = 0.967075 + 0.605667 = 1.572742, while the weight for A (document Anthropology) is A = 0.695979, and thus B > A. A sketch of both voting schemes is given below.
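A minimal sketch of plain majority voting and similarity-weighted voting, reusing the cosine_similarity helper from the earlier sketch; the training data passed in would be the (vector, label) rows of the table.

```python
from collections import defaultdict

def knn_predict(query, training, k=3, weighted=False):
    """training: list of (vector, label) pairs; closeness is measured by cosine similarity."""
    neighbors = sorted(training,
                       key=lambda item: cosine_similarity(query, item[0]),
                       reverse=True)[:k]
    votes = defaultdict(float)
    for vector, label in neighbors:
        # plain k-NN counts each neighbor once; weighted k-NN adds its similarity instead
        votes[label] += cosine_similarity(query, vector) if weighted else 1.0
    return max(votes, key=votes.get)

# With the similarities reported in the slides: majority-vote 3-NN gives B,
# 5-NN gives A, and similarity-weighted 3-NN gives B (1.572742 vs. 0.695979).
```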
FEATURE SELECTION The objective of feature selection is to find a subset of attributes that best describes a set of documents with respect to the classification task, i.e., the attributes with which the learning algorithm achieves maximal accuracy. A simple solution is to try all subsets and pick the one that maximizes accuracy. This solution is impractical due to the huge number of subsets that have to be investigated (2^n for n attributes).
Naive Bayes Algorithm There are two Bayesian classification approaches: one based on the Boolean document representation and another based on document representation by term counts. Consider the set of Boolean document vectors shown in the table.
[Table: the Boolean document vectors with their class labels; not reproduced here.]
..Naive Bayes Algorithm.. We classify the Theatre document given the rest of the documents with known class labels. The Bayesian approach determines the class of document x as the one that maximizes the conditional probability P(C | x), computed according to Bayes' rule. The naive assumption is that the attribute values are conditionally independent given the class. Given that x is a vector of n attribute values [i.e., x = (x1, x2, . . . , xn)], this assumption leads to the decomposition below.
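The formulas referred to here (the slide's own formula images are not reproduced) are the standard Bayes rule and the naive Bayes decomposition that follows from the conditional-independence assumption:

```latex
P(C \mid x) \;=\; \frac{P(x \mid C)\,P(C)}{P(x)},
\qquad
P(C \mid x) \;\propto\; P(C)\,\prod_{i=1}^{n} P(x_i \mid C)
```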
..Naive Bayes Algorithm.. Now, to find the class of the Theatre document, we compute the conditional probability of class A and of class B given that this document has already occurred. For class A we compute the product of the conditional probabilities of each of the Theatre attribute values given class A (the likelihood part of the formula).
..Naive Bayes Algorithm.. To calculate each of these probabilities, we take the proportion of the corresponding attribute value in class A. For example, in the science column we have 0's in four documents out of the 11 from class A; thus, P(science = 0 | A) = 4/11. For class B we obtain the analogous product over the class B documents.
..Naive Bayes Algorithm.. The probabilities of classes A and B are estimated with the proportion of documents in each class: P(A) = 11/19 = 0.578947 and P(B) = 8/19 = 0.421053. Putting all this into the Bayes formula, class A obtains the larger value, so at this point we can make the decision that Theatre belongs to class A.
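A minimal sketch of the Boolean naive Bayes computation just described. The training table itself is not reproduced, so the vectors and labels passed in are placeholders.

```python
def boolean_nb_score(x, train_vectors, train_labels, target_class):
    """Unnormalized P(C | x) = P(C) * prod_i P(x_i | C) for Boolean attribute vectors."""
    class_docs = [d for d, c in zip(train_vectors, train_labels) if c == target_class]
    score = len(class_docs) / len(train_vectors)          # prior P(C)
    for i, value in enumerate(x):
        matching = sum(1 for d in class_docs if d[i] == value)
        score *= matching / len(class_docs)               # P(x_i = value | C)
    return score

# Usage with the (not reproduced) training table:
# score_a = boolean_nb_score(theatre, vectors, labels, "A")
# score_b = boolean_nb_score(theatre, vectors, labels, "B")
# prediction = "A" if score_a > score_b else "B"
```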
..Naive Bayes Algorithm.. Although the Boolean naive Bayes algorithm uses all training documents, it ignores the term counts. A Bayesian model based on term counts can also classify our test document. Assume that there are m terms t1, t2, . . . , tm and n documents d1, d2, . . . , dn from class C. Let us denote the number of times that term ti occurs in document dj as nij.
..Naive Bayes Algorithm.. Denote the probability with which term ti occurs in the documents from class C as P(ti | C). This can be estimated as the number of times that ti occurs in all documents from class C divided by the total number of terms in the documents from class C, as written out below.
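Written out as a formula (the slide's image is missing), the estimate described above is

```latex
P(t_i \mid C) \;=\; \frac{\sum_{j=1}^{n} n_{ij}}{\sum_{k=1}^{m}\sum_{j=1}^{n} n_{kj}}
```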
..Naive Bayes Algorithm.. First we calculate the probabilities P(ti | C). A problem arises when a term never occurs in the documents of a class, because its estimated probability is then zero. For example, this happens with the term history and class A; that is, P(history | A) = 0. Consequently, the documents that have a nonzero count for history will have zero probability in class A; that is, P(History | A) = 0, P(Music | A) = 0, and P(Philosophy | A) = 0.
..Naive Bayes Algorithm.. A common approach to avoiding this problem is to use the Laplace estimator. The idea is to add 1 to the frequency count in the numerator and 2 (or the number of classes, if more than two) to the denominator. The Laplace estimator thus eliminates zero probabilities.
..Naive Bayes Algorithm.. Now we compute the probabilities of each term given each class using the Laplace estimator. For example, P(history | A) = (0 + 1)/(57 + 2) = 0.017 and P(history | B) = (9 + 1)/(29 + 2) = 0.323. Plugging all these probabilities into the formula results in P(A | Theatre) ≈ 0.0000354208 and P(B | Theatre) ≈ 0.00000476511, so the term-count model also assigns Theatre to class A.
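A sketch of the term-count (multinomial) model with the Laplace estimator exactly as the slides describe it (add 1 to the numerator and 2 to the denominator); the term-count dictionaries passed in are hypothetical.

```python
def laplace_term_prob(term, class_term_counts):
    """P(term | C) with the slides' Laplace estimator: (count + 1) / (total + 2)."""
    total = sum(class_term_counts.values())
    return (class_term_counts.get(term, 0) + 1) / (total + 2)

def term_count_score(doc_term_counts, class_term_counts, prior):
    """Unnormalized P(C | d) = P(C) * product over terms of P(t | C)^count(t, d)."""
    score = prior
    for term, count in doc_term_counts.items():
        score *= laplace_term_prob(term, class_term_counts) ** count
    return score

# Reproducing the slide's example: class A has 57 term occurrences in total and zero
# occurrences of "history", so P(history | A) = (0 + 1) / (57 + 2) ≈ 0.017; class B has
# 29 occurrences, 9 of them "history", so P(history | B) = (9 + 1) / (29 + 2) ≈ 0.323.
```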
Numerical Approaches In the TFIDF vector space framework, we use cosine similarity as a measure of document similarity. However, the same vector representation allows documents to be considered as points in a metric space. That is, given a set of points, the objective is to find a surface that divides the space into two parts, so that the points that fall in each part belong to a single class. Linear regression is the most popular approach based on this idea.
..Numerical Approaches.. Linear regression is a standard technique for numerical prediction. It works naturally with numerical attributes (including the class). The predicted class value C is computed as a linear combination of the attribute values xi, as written out below.
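The linear combination referred to here (the formula image is omitted in the slides) has the standard form

```latex
C \;=\; w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \;=\; w_0 + \sum_{i=1}^{n} w_i x_i
```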
..Numerical Approaches.. The objective is to find the coefficients wi given a number of training instances xi with their class values C. There are several approaches to the use of linear regression for classification (predicting class labels). One simple approach to binary classification is to substitute the class labels with the values −1 and 1; the predicted class is then determined by the sign of the linear combination. For example, consider our six-attribute document vectors (Table 5.1), and let us use −1 for class A and 1 for class B.
..Numerical Approaches.. Then the task is to find seven coefficients w0, w1, . . . , w6 that satisfy a system of 19 linear equations (one per training document); in practice the system is solved approximately, for example by least squares. For the Theatre vector the result is positive, and thus the class predicted for Theatre is B, which also agrees with the prediction of 1-NN.
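A minimal sketch of this approach using ordinary least squares (numpy is an assumption, not named in the slides); class A is encoded as −1, class B as 1, and the sign of the prediction gives the label.

```python
import numpy as np

def fit_linear_classifier(X, labels):
    """Least-squares fit of w0..wn with class A encoded as -1 and class B as 1."""
    y = np.array([1.0 if label == "B" else -1.0 for label in labels])
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a column of 1s for w0
    w, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return w

def predict_class(w, x):
    """The sign of the linear combination determines the predicted class."""
    return "B" if w[0] + np.dot(w[1:], x) > 0 else "A"

# With the 19 six-attribute training vectors of Table 5.1 this procedure would
# yield a positive value for the Theatre vector, i.e., the predicted class B.
```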
RELATIONAL LEARNING All classification methods that we have discussed so far are based solely on the document content, and more specifically on the bag-of-words model. Many additional document features, such as the internal HTML structure, language structure, and interdocument link structure, are ignored, yet all of this may be a valuable source of information for the classification task. The basic problem with incorporating this information into the classification algorithm is the need for a uniform representation.
..RELATIONAL LEARNING.. Relational learning extends the content-based approach to a relational representation. It allows various types of information to be represented in a uniform way and used for web document classification. In our domain we have documents d and terms t connected with the basic relation contains; that is, if term t occurs in document d, the relation contains(d, t) is true.
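One simple way to picture this relational representation (an illustration only, not the book's actual learner) is to store the contains(d, t) facts as tuples and query them:

```python
# Relational facts: contains(d, t) holds iff term t occurs in document d.
# The facts below are hypothetical, for illustration only.
contains = {
    ("Theatre", "program"),
    ("Criminal Justice", "justice"),
    ("Anthropology", "science"),
}

def terms_of(document):
    """All terms t for which contains(document, t) is true."""
    return {t for d, t in contains if d == document}

print(terms_of("Theatre"))   # -> {'program'}
```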