Data mining knowledge representation
1 What Defines a Data Mining Task?
• Task relevant data: where and how to retrieve the data to be used
for mining
• Background knowledge: Concept hierarchies
• Interestingness measures: informal and formal selection techniques
to be applied to the output knowledge
• Representing input data and output knowledge: the structures used
to represent the input and the output of the data mining techniques
• Visualization techniques: needed to best view and document the
results of the whole process
2 Task relevant data
• Database or data warehouse name: where to find the data
• Database tables or data warehouse cubes
• Condition for data selection, relevant attributes or dimensions and
data grouping criteria: all this is used in the SQL query to retrieve
the data
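As a minimal sketch of how these items combine into one retrieval query, the following uses Python's built-in sqlite3 module with an in-memory database. The table name `sales`, its columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

# Hypothetical in-memory database standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (city TEXT, item TEXT, year INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Hartford", "milk", 2003, 2.5),
        ("Hartford", "bread", 2003, 1.5),
        ("Boston", "milk", 2003, 3.0),
        ("Boston", "milk", 2002, 4.0),
    ],
)

# Relevant attributes (city, amount), a selection condition (year)
# and a grouping criterion (city) expressed in a single SQL query.
query = """
    SELECT city, SUM(amount) AS total
    FROM sales
    WHERE year = ?
    GROUP BY city
    ORDER BY city
"""
rows = conn.execute(query, (2003,)).fetchall()
conn.close()
```

Here `rows` holds one (city, total) pair per group, which is the kind of task-relevant table a mining step would then consume.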
3 Background knowledge: Concept hierarchies
The concept hierarchies are induced by a partial order¹ over the values
of a given attribute. Depending on the type of the ordering relation, we
distinguish several types of concept hierarchies.
3.1 Schema hierarchy
• Relating concept generality. The ordering reflects the generality of
the attribute values, e.g. street < city < state < country.
3.2 Set-grouping hierarchy
• The ordering relation is the subset relation (⊆). Applies to set
values.
• Example:
{13, ..., 39} = young; {13, ..., 19} = teenage;
{13, ..., 19} ⊆ {13, ..., 39} ⇒ teenage < young.
• Theory:
– power set 2^U: the set of all subsets of a set U.
– lattice (2^U, ⊆): for subsets X, Y ⊆ U, sup(X, Y) = X ∪ Y and inf(X, Y) = X ∩ Y.
– Hasse diagram: X ∪ Y at the top, X and Y in the middle, X ∩ Y at the bottom.
– top element ⊤ = U, bottom element ⊥ = {} (empty set).
¹ Consider a set A and an ordering relation R on A. R is a full (total) order if for any x, y ∈ A, either xRy or yRx holds.
R is only a partial order if some pairs x, y ∈ A may be incomparable, i.e. neither xRy nor yRx holds.
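The set-grouping example can be checked with a short sketch using Python's built-in sets. It assumes the subset ordering ⊆ as usually defined, where the least upper bound of two subsets is their union and the greatest lower bound is their intersection. The three-element base set and the helper `power_set` are hypothetical, introduced only for illustration.

```python
from itertools import combinations

young = set(range(13, 40))    # {13, ..., 39}
teenage = set(range(13, 20))  # {13, ..., 19}

# teenage is a subset of young, so teenage < young in the hierarchy.
teenage_below_young = teenage <= young

def power_set(base):
    """Return all subsets of `base` as frozensets."""
    items = sorted(base)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

subsets = power_set({1, 2, 3})

# Find the least upper bound and greatest lower bound by brute force.
a, b = frozenset({1, 2}), frozenset({2, 3})
upper_bounds = [s for s in subsets if a <= s and b <= s]
lower_bounds = [s for s in subsets if s <= a and s <= b]
least_upper = min(upper_bounds, key=len)
greatest_lower = max(lower_bounds, key=len)
```

The brute-force search is only practical for tiny base sets, since the power set has 2^n elements.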
3.3 Operation-derived hierarchy
Produced by applying an operation (encoding, decoding, information
extraction). For example:
markovz@cs.ccsu.edu
instantiates the hierarchy user-name < department < university <
usa-university.
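One way to sketch such an operation is a small decoding function that splits an e-mail address into hierarchy levels. The function name `email_hierarchy` is hypothetical, and the sketch assumes an address of the form user@department.university.tld, with ".edu" taken to denote a US university.

```python
def email_hierarchy(address):
    """Split an address such as user@dept.univ.edu into hierarchy levels,
    ordered from the most specific to the most general."""
    user, domain = address.split("@")
    department, university, top_level = domain.split(".")
    # Assumption for this sketch: the ".edu" domain denotes a US university.
    top_name = "usa-university" if top_level == "edu" else "top-level-domain"
    return [
        ("user-name", user),
        ("department", department),
        ("university", university),
        (top_name, top_level),
    ]

levels = email_hierarchy("markovz@cs.ccsu.edu")
```

Addresses with a different number of domain components would need extra handling, which is omitted here.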
3.4 Rule-based hierarchy
Using rules to define the partial order, for example:
if antecedent then consequent
defines the order antecedent < consequent.
4 Interestingness measures
Criteria to evaluate hypotheses (knowledge extracted from data when
applying data mining techniques). This issue will be discussed in more
detail in Lecture notes - Chapter 9: “Evaluating what’s been learned”.
4.1 Bayesian evaluation
• E: the data
• H = {H_1, H_2, ..., H_n}: the hypotheses
• H_best = argmax_i P(H_i | E)
• Bayes’ theorem:
$$P(H_i \mid E) = \frac{P(H_i)\,P(E \mid H_i)}{\sum_{j=1}^{n} P(H_j)\,P(E \mid H_j)}$$
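A minimal numeric sketch of this selection rule follows. The three hypotheses and their prior and likelihood values are hypothetical numbers chosen only to illustrate the computation.

```python
# Hypothetical priors P(H_i) and likelihoods P(E | H_i) for three hypotheses.
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihoods = {"H1": 0.1, "H2": 0.4, "H3": 0.3}

# Denominator of Bayes' theorem: the total probability of the data E.
evidence = sum(priors[h] * likelihoods[h] for h in priors)

# Posterior P(H_i | E) for every hypothesis, then the argmax.
posteriors = {h: priors[h] * likelihoods[h] / evidence for h in priors}
best = max(posteriors, key=posteriors.get)
```

With these numbers H2 wins even though H1 has the larger prior, because the data are much more likely under H2.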
4.2 Simplicity
Occam’s Razor
Consider, for example, association rule length, decision tree size, or the
number and length of classification rules. Intuition suggests that the
best hypothesis is the simplest (shortest) one. This is the so-called
Occam’s Razor Principle, also expressed as a mathematical theorem
(Occam’s Razor Theorem). Here is an example of applying this principle
to grammars:
• Data:
E = {0, 000, 00000, 0000000, 000000000}
• Hypotheses:
G1 : S → 0|000|00000|0000000|000000000
G2 : S → 00S|0
• Best hypothesis: G2 (fewer and simpler rules)
However, as simplicity is a subjective measure we need formal criteria
to define it.
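The grammar example can be made concrete with a small sketch. It enumerates the strings derivable from G2 up to a given length and compares the two grammars by a crude size measure, the total number of symbols on the right-hand sides. That measure is an assumption made for illustration, not a formal criterion from these notes.

```python
E = {"0", "000", "00000", "0000000", "000000000"}

def generate_g2(max_len):
    """Strings derivable from G2: S -> 00S | 0, up to length max_len."""
    results = set()
    prefix = ""
    while len(prefix) + 1 <= max_len:
        results.add(prefix + "0")  # finish the derivation with S -> 0
        prefix += "00"             # or expand once more with S -> 00S
    return results

g1_rules = ["0", "000", "00000", "0000000", "000000000"]  # alternatives of G1
g2_rules = ["00S", "0"]                                    # alternatives of G2

# G2 covers all the data, and its right-hand sides are much shorter.
covers_data = E <= generate_g2(9)
g1_size = sum(len(r) for r in g1_rules)
g2_size = sum(len(r) for r in g2_rules)
```

Note that G2 also generates longer strings of odd length, so it generalizes beyond the data, whereas G1 merely lists the examples.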
Formal criteria for simplicity
• Bayesian approach: needs a large volume of experimental results
(statistics) to define the prior probabilities.
• Algorithmic (Kolmogorov) complexity of an object (bit string): the
length of the shortest program for a universal Turing machine that
generates the string. Problem: it is not computable in general.
• Information-based approaches: the Minimum Description Length
Principle (MDL). Most often used in practice.
4.3 Minimum Description Length Principle (MDL)
• Bayes’ theorem:
$$P(H_i \mid E) = \frac{P(H_i)\,P(E \mid H_i)}{\sum_{j=1}^{n} P(H_j)\,P(E \mid H_j)}$$
• Take −log_2 of both sides of Bayes’ theorem (C is a constant that does not depend on i):
$$-\log_2 P(H_i \mid E) = -\log_2 P(H_i) - \log_2 P(E \mid H_i) + C$$
• I(A) is the information in message A and L(A) is the minimum length of A in bits:
$$-\log_2 P(A) = I(A) = L(A)$$
• Then:
$$L(H_i \mid E) = L(H_i) + L(E \mid H_i) + C$$
• MDL: the hypothesis must reduce the information needed to encode the data, i.e.
$$L(E) > L(H_i) + L(E \mid H_i)$$
• The best hypothesis must maximize information compression:
$$H_{best} = \arg\max_i \big( L(E) - L(H_i) - L(E \mid H_i) \big)$$
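Since L(E) is the same for every hypothesis, maximizing the compression is equivalent to minimizing L(H_i) + L(E|H_i). The sketch below illustrates this with code lengths taken as −log_2 of hypothetical probabilities, and checks that the MDL choice agrees with the hypothesis of highest posterior probability. All numbers are made up for illustration.

```python
import math

def code_length(p):
    """Ideal code length in bits of an event with probability p."""
    return -math.log2(p)

# Hypothetical priors P(H_i) and likelihoods P(E | H_i).
priors = {"H1": 0.5, "H2": 0.3, "H3": 0.2}
likelihoods = {"H1": 0.1, "H2": 0.4, "H3": 0.3}

# Two-part description length: L(H_i) + L(E | H_i).
total_length = {
    h: code_length(priors[h]) + code_length(likelihoods[h]) for h in priors
}

# L(E) does not depend on i, so the shortest total length gives the
# largest compression L(E) - L(H_i) - L(E | H_i).
mdl_best = min(total_length, key=total_length.get)

# The same hypothesis maximizes P(H_i) * P(E | H_i), i.e. the posterior.
map_best = max(priors, key=lambda h: priors[h] * likelihoods[h])
```

In practice the lengths usually come from an explicit encoding scheme for hypotheses and data rather than from known probabilities.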
4.4 Certainty
• Confidence of the association “if A then B”:
$$P(B \mid A) = \frac{\#\text{ of tuples containing both } A \text{ and } B}{\#\text{ of tuples containing } A}$$
• Classification accuracy: use a training set to generate the hypothesis,
then test it on a separate test set.
$$\text{Accuracy} = \frac{\#\text{ of correct classifications}}{\#\text{ of tuples in the test set}}$$
• Utility (support) of the association “if A then B”:
$$P(A, B) = \frac{\#\text{ of tuples containing both } A \text{ and } B}{\text{total } \#\text{ of tuples}}$$
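The three measures can be sketched directly from their definitions. The transactions, the rule “if bread then milk” and the class labels below are hypothetical data used only to illustrate the counting.

```python
# Hypothetical market-basket data: each tuple is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(a, b):
    """P(A, B): fraction of all tuples containing both A and B."""
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / len(transactions)

def confidence(a, b):
    """P(B | A): among tuples containing A, the fraction also containing B."""
    with_a = sum(1 for t in transactions if a <= t)
    both = sum(1 for t in transactions if a <= t and b <= t)
    return both / with_a

def accuracy(predicted, actual):
    """Fraction of test tuples whose predicted class equals the known class."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)

rule_support = support({"bread"}, {"milk"})
rule_confidence = confidence({"bread"}, {"milk"})
test_accuracy = accuracy(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"])
```

Here the rule holds in 2 of 4 tuples (support) and in 2 of the 3 tuples that contain bread (confidence).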
5 Representing input data and output knowledge
5.1 Concepts (classes, categories, hypotheses): things to be mined/learned
• Classification mining/learning: predicting a discrete class. A kind
of supervised learning; success is measured on new data for which
class labels are known (test data).
• Association mining/learning: detecting associations between
attributes. Can be used to predict the value of any attribute, or of
several attributes at once, so many more rules can be generated;
therefore we need constraints (minimum support and minimum confidence).
• Clustering: grouping similar instances into clusters. A kind of
unsupervised learning; success is measured subjectively or by objective
functions.
• Numeric prediction: predicting a numeric quantity. A kind of
supervised learning; success is measured on test data.
• Concept description: the output of the learning scheme.
5.2 Instances (examples, tuples, transactions)
• Things to be classified, associated, or clustered.
• Individual, independent examples of the concept to be learned (target
concept).
• Described by a predetermined set of attributes.
• Input to the learning scheme: a set of instances (dataset), represented
as a single relation (table).
• Independence assumption: no relationships between attributes.
• Positive and negative examples for a concept, Closed World
Assumption (CWA): {negative} = {all} \ {positive}.
• Relational (First Order Logic) descriptions:
– Using variables (more compact representation). For example:
<a, b, b>, <a, c, c>, <b, a, a> can be represented as one
relational tuple <X, Y, Y>.
– Multiple relation concepts (FOIL, Inductive Logic Programming,
see Lecture Notes - Chapter 11). Example:
grandfather(X, Z) ← father(X, Y) ∧ (father(Y, Z) ∨ mother(Y, Z))
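As a sketch of how the grandfather rule and the Closed World Assumption can be applied to concrete facts, the following represents each relation as a set of pairs. The family facts are hypothetical, and the set-based evaluation is only an illustration, not the way FOIL or an ILP system works internally.

```python
# Hypothetical facts: father(X, Y) and mother(X, Y) as (parent, child) pairs.
father = {("john", "mary"), ("john", "paul"), ("paul", "ann")}
mother = {("mary", "tom")}

def grandfathers(father, mother):
    """grandfather(X, Z) <- father(X, Y) and (father(Y, Z) or mother(Y, Z))."""
    parent = father | mother
    return {(x, z) for (x, y) in father for (y2, z) in parent if y == y2}

positive = grandfathers(father, mother)

# Closed World Assumption: every pair not derivable as positive is negative.
people = {name for pair in father | mother for name in pair}
all_pairs = {(x, z) for x in people for z in people}
negative = all_pairs - positive
```

The relational rule covers both positive tuples with a single clause, whereas an attribute-value table would need one row per pair.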
5.3 Attributes (features)
• Predefined set of features to describe an instance.
• Nominal (categorical, enumerated, discrete) attributes:
– Values are distinct symbols.
– No relation among nominal values.
– Only equality test can be performed.
– Special case: boolean attributes, transforming nominal to boolean.
• Structured:
– Partial order among nominal values
– Example: concept hierarchy
• Numeric:
– Continuous: full order (e.g. integer or real numbers).
– Interval: partial order.
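The transformation of a nominal attribute into boolean attributes can be sketched as follows. The attribute `outlook`, its values and the helper name `nominal_to_boolean` are hypothetical; note that only the equality test is used, as is appropriate for nominal values.

```python
def nominal_to_boolean(values):
    """Replace one nominal attribute by one boolean attribute per distinct symbol."""
    symbols = sorted(set(values))
    # Only an equality test is applied to the nominal values.
    return [{f"is_{s}": (v == s) for s in symbols} for v in values]

outlook = ["sunny", "rainy", "overcast", "sunny"]  # hypothetical nominal attribute
encoded = nominal_to_boolean(outlook)
```

Each instance ends up with exactly one boolean attribute set to true.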
5.4 Output knowledge representation
• Association rules
• Decision trees
• Classification rules
• Rules with relations
• Prediction schemes:
– Nearest neighbor
– Bayesian classification
– Neural networks
– Regression
• Clusters:
– Type of grouping: partitions/hierarchical
– Grouping or describing: agglomerative/conceptual
– Type of descriptions: statistical/structural
6 Visualization techniques: Why visualize data?
• Identifying problems:
– Histograms for nominal attributes: is the distribution consistent
with background knowledge?
– Graphs for numeric values: detecting outliers.
• Visualization shows dependencies
• Consulting domain experts
• If there is too much data, take a sample
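These checks can be sketched without a plotting library, using only the Python standard library. The data values are hypothetical, the text histogram stands in for a real chart, and the outlier threshold of 10 units from the median is an arbitrary assumption.

```python
import random
import statistics
from collections import Counter

# Hypothetical data: one nominal and one numeric attribute.
outlook = ["sunny", "rainy", "sunny", "overcast", "sunny", "rainy"]
temperature = [21.0, 19.5, 22.0, 20.5, 95.0, 21.5]

# Text histogram of the nominal attribute, one bar per value.
counts = Counter(outlook)
histogram = [f"{value:<10}{'#' * n}" for value, n in sorted(counts.items())]

# Simple outlier check: values far from the median
# (the threshold of 10 is an arbitrary assumption).
median = statistics.median(temperature)
outliers = [t for t in temperature if abs(t - median) > 10]

# If there is too much data, work with a random sample instead.
random.seed(0)
sample = random.sample(temperature, 3)
```

Printing the lines of `histogram` gives a quick view of the value distribution, and the single suspicious reading shows up in `outliers`.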