The document provides an overview of database management systems (DBMS). It defines DBMS as software that creates, organizes, and manages databases. It discusses key DBMS concepts like data models, schemas, instances, and database languages. Components of a database system including users, software, hardware, and data are described. Popular DBMS examples like Oracle, SQL Server, and MS Access are listed along with common applications of DBMS in various industries.
EVS project on the study of birds, insects and plants, by Raghu Roy
The document provides information about a student's environmental studies project on common birds, insects, and plants in West Bengal, India. It includes descriptions of 5 common birds: Baya Weaver, Common Bulbul, Blue Magpie-Robin, Indian Ring-Necked Parrot, and Rock Dove. It also describes 5 common insects: Indian Meal Moth, Mosquito, Dust Mite, Pill Bug, and Earwig. Finally, it discusses common plants such as the Margosa Tree, Aloe Vera, and Periwinkle. For each bird, insect, and plant described, it provides details about size, shape, color, habitat, diet, and impact. The purpose of the project was...
Robotic surgery systems allow surgeons to perform operations remotely or through minimally invasive procedures. The systems give surgeons improved vision, precision, and control over instruments through interfaces that filter out tremors. However, robotic surgery is still limited as the systems are very expensive, can take more time than manual surgery, and do not provide touch feedback. More research is needed to evaluate the long term safety, efficacy, and cost effectiveness of robotic surgery compared to conventional methods.
The document summarizes the history of artificial intelligence from 1943 to the present day in several periods:
1. The gestation of AI from 1943-1955, which led to the birth of AI in 1956 at the Dartmouth conference, where the field was named.
2. Early enthusiasm and great expectations from 1956-1969, followed by a dose of reality from 1966-1973, as programs had little knowledge and many problems proved difficult to solve.
3. Knowledge-based systems emerged from 1969-1979, allowing more advanced reasoning in narrow domains.
4. AI became an industry from 1980 onwards, with successful commercial systems, investment, and hundreds of companies, despite remaining limitations.
5. Neural networks reemerged in 1986, and scientific...
This document discusses social networking sites and provides information about popular sites such as Facebook, YouTube, and Twitter. It defines social networking as web-based services that focus on building online communities for people to share interests. Facebook allows photo sharing and status updates and has over 1.4 billion monthly active users. YouTube is a video sharing site owned by Google that averages 800 million unique visitors per month. Twitter is a microblogging service that sees over 500 million tweets per day. Social networking sites are popular because they are free, easy to access, and allow users to connect with friends globally. However, overuse can lead to addiction and negatively impact family relationships and academics.
Data & Information, Drawbacks of File Systems, What is a Database Management System, the need for a DBMS, Examples of DBMS, Database Types, Applications of DBMS, Advantages of DBMS over File Systems, Disadvantages of DBMS, DBMS vs. File System
The document discusses finite fields and related algebraic concepts. It begins by defining groups, rings, and fields. It then focuses on finite fields, particularly GF(p) fields consisting of integers modulo a prime p. It discusses finding multiplicative inverses in such fields using the extended Euclidean algorithm. As an example, it finds the inverse of 550 modulo 1759.
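As a sketch of that computation, here is the extended Euclidean algorithm in Python (the function names are my own); it reproduces the document's worked example, since 550 × 355 = 195250 = 111 × 1759 + 1:

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y = g = gcd(a, b)."""
    if b == 0:
        return a, 1, 0
    g, x, y = extended_gcd(b, a % b)
    # gcd(a, b) = gcd(b, a % b); back-substitute the Bezout coefficients.
    return g, y, x - (a // b) * y

def mod_inverse(a, m):
    """Multiplicative inverse of a modulo m, if gcd(a, m) == 1."""
    g, x, _ = extended_gcd(a, m)
    if g != 1:
        raise ValueError(f"{a} has no inverse modulo {m}")
    return x % m

print(mod_inverse(550, 1759))  # 355, since 550 * 355 = 111 * 1759 + 1
```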
This document discusses various techniques for data preprocessing, including data cleaning, integration and transformation, reduction, and discretization. It provides details on techniques for handling missing data, noisy data, and data integration issues. It also describes methods for data transformation such as normalization, aggregation, and attribute construction. Finally, it outlines various data reduction techniques including cube aggregation, attribute selection, dimensionality reduction, and numerosity reduction.
Information technology has led us into an era where the production, sharing, and use of information are part of everyday life, often without our being aware of it: it is now almost impossible not to leave a digital trail of many of the actions we perform every day, for example through digital content such as photos, videos, blog posts, and everything that revolves around social networks (Facebook and Twitter in particular). Added to this, with the "Internet of Things" we see a growing number of devices such as watches, bracelets, thermostats, and many other items that can connect to the network and therefore generate large data streams. This explosion of data justifies the birth of the term Big Data: data produced in large quantities, at remarkable speed, and in different formats, which requires processing technologies and resources that go far beyond conventional data management and storage systems. It is immediately clear that 1) storage models based on the relational model, and 2) processing systems based on stored procedures and computations on grids, are not applicable in these contexts. Regarding point 1, RDBMSs, widely used for a great variety of applications, run into problems when the amount of data grows beyond certain limits. Scalability and implementation cost are only part of the disadvantages: very often, when facing the management of big data, variability, that is, the lack of a fixed structure, also represents a significant problem. This has given a boost to the development of NoSQL databases. The website NoSQL Databases defines NoSQL databases as "Next Generation Databases mostly addressing some of the points: being non-relational, distributed, open source and horizontally scalable." These databases are distributed, open source, horizontally scalable, without a predetermined schema (key-value, column-oriented, document-based, and graph-based), easily replicable, without ACID guarantees, and able to handle large amounts of data. They are integrated with processing tools based on the MapReduce paradigm proposed by Google in 2004. MapReduce, together with the open source Hadoop framework, represents the new model for distributed processing of large amounts of data, supplanting techniques based on stored procedures and computational grids (point 2). The relational model, taught in basic database design courses, has many limitations compared to the demands posed by new applications based on Big Data, which use NoSQL databases to store data and MapReduce to process large amounts of data.
Course Website http://pbdmng.datatoknowledge.it/
Contact me for further information and to download the slides
Shivani Soni presented on data mining. Data mining involves using computational methods to discover patterns in large datasets, combining techniques from machine learning, statistics, artificial intelligence, and database systems. It is used to extract useful information from data and transform it into an understandable structure. Data mining has various applications, including in sales/marketing, banking/finance, healthcare/insurance, transportation, medicine, education, manufacturing, and research analysis. It enables businesses to understand customer purchasing patterns and maximize profits. Examples of its use include fraud detection, credit risk analysis, stock trading, customer loyalty analysis, distribution scheduling, claims analysis, risk profiling, detecting medical therapy patterns, education decision making, and aiding manufacturing process design and research.
Dr. Dipali Meher's document discusses data preprocessing techniques. It covers the need for data preprocessing to clean and transform raw data. Specific techniques discussed include data cleaning, integration, transformation, and reduction. Data cleaning involves handling missing values and noisy data. Data integration combines data from multiple sources. Data transformation techniques include smoothing, aggregation, discretization, and normalization. Data reduction techniques include attribute selection, cube aggregation, and dimensionality reduction.
Data Mining: What is Data Mining?
History
How does data mining work?
Data Mining Techniques
Data Mining Process (The Cross-Industry Standard Process)
Data Mining: Applications
Advantages and Disadvantages of Data Mining
Conclusion
This document provides an overview of key concepts related to data warehousing including what a data warehouse is, common data warehouse architectures, types of data warehouses, and dimensional modeling techniques. It defines key terms like facts, dimensions, star schemas, and snowflake schemas and provides examples of each. It also discusses business intelligence tools that can analyze and extract insights from data warehouses.
Data mining primitives include task-relevant data, the kind of knowledge to be mined, background knowledge such as concept hierarchies, interestingness measures, and methods for presenting discovered patterns. A data mining query specifies these primitives to guide the knowledge discovery process. Background knowledge like concept hierarchies allow mining patterns at different levels of abstraction. Interestingness measures estimate pattern simplicity, certainty, utility, and novelty to filter uninteresting results. Discovered patterns can be presented through various visualizations including rules, tables, charts, and decision trees.
This presentation gives an idea about data preprocessing in the field of data mining. Images, examples and other material are adopted from "Data Mining: Concepts and Techniques" by Jiawei Han, Micheline Kamber and Jian Pei.
Data mining (lectures 1 & 2): concepts and techniques, by Saif Ullah
This document provides an overview of data mining concepts from Chapter 1 of the textbook "Data Mining: Concepts and Techniques". It discusses the motivation for data mining due to increasing data collection, defines data mining as the extraction of useful patterns from large datasets, and outlines some common applications like market analysis, risk management, and fraud detection. It also introduces the key steps in a typical data mining process including data selection, cleaning, mining, and evaluation.
Data mining is the process of automatically discovering useful information from large data sets. It draws from machine learning, statistics, and database systems to analyze data and identify patterns. Common data mining tasks include classification, clustering, association rule mining, and sequential pattern mining. These tasks are used for applications like credit risk assessment, fraud detection, customer segmentation, and market basket analysis. Data mining aims to extract unknown and potentially useful patterns from large data sets.
This document introduces data mining. It defines data mining as the process of extracting useful information from large databases. It discusses technologies used in data mining like statistics and machine learning. It also covers data mining models and tasks such as classification, regression, clustering, and forecasting. Finally, it provides an overview of the data mining process and examples of data mining tools.
This document discusses data mining, including its components of knowledge discovery and prediction. It defines data mining as applying computer methods to infer new information from existing data. The document outlines different types of data mining like data dredging and relational vs. propositional data. It provides examples of how data mining is used in business, science, health, and other domains. Privacy concerns are raised, and controversies like Facebook's Beacon program are discussed.
Data mining involves extracting patterns from large data sets. It is used to uncover hidden information and relationships within data repositories like databases, text files, social networks, and computer simulations. The patterns discovered can be used by organizations to make better business decisions. Some common applications of data mining include credit card fraud detection, customer segmentation for marketing, and scientific research. The process involves data preparation, algorithm selection, model building, and interpretation. While useful, data mining also raises privacy, security, and ethical concerns if misused.
Machine learning models involve a bias-variance tradeoff, where increased model complexity can lead to overfitting training data (high variance) or underfitting (high bias). Bias measures how far model predictions are from the correct values on average, while variance captures differences between predictions on different training data. The ideal model has low bias and low variance, accurately fitting training data while generalizing to new examples.
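For reference, the decomposition behind this tradeoff is usually written, for squared error with irreducible noise variance \(\sigma^2\), as:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{Variance}}
  + \sigma^2
```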
3 pillars of big data: structured data, semi-structured data and unstructured data, by PROWEBSCRAPER
There are 3 pillars of Big Data:
1. Structured data
2. Unstructured data
3. Semi-structured data
Businesses worldwide construct their empire on these three pillars and capitalize on their limitless potential.
The KDD process involves several steps: data cleaning to remove noise, data integration of multiple sources, data selection of relevant data, data transformation into appropriate forms for mining, applying data mining techniques to extract patterns, evaluating patterns for interestingness, and representing mined knowledge visually. The KDD process aims to discover useful knowledge from various data types including databases, data warehouses, transactional data, time series, sequences, streams, spatial, multimedia, graphs, engineering designs, and web data.
This document provides an introduction to data mining. It defines data mining as the process of extracting knowledge from large amounts of data. The document outlines the typical steps in the knowledge discovery process including data cleaning, transformation, mining, and evaluation. It also describes some common challenges in data mining like dealing with large, high-dimensional, heterogeneous and distributed data. Finally, it summarizes several common data mining tasks like classification, association analysis, clustering, and anomaly detection.
This document outlines the course structure for a structured data analytics course. It is a 3 credit course that includes practice sessions, assessments like quizzes and assignments. The course covers topics like data wrangling, classification/regression algorithms, association analysis, time series, and recommender systems. It also discusses popular machine learning algorithms and reading resources. Finally, it provides an overview of the first unit which is on data analytics from a business perspective.
Data mining, Knowledge Discovery Process, Classification, by Dr. Abdul Ahad Abro
The document provides an overview of data mining techniques and processes. It discusses data mining as the process of extracting knowledge from large amounts of data. It describes common data mining tasks like classification, regression, clustering, and association rule learning. It also outlines popular data mining processes like CRISP-DM and SEMMA that involve steps of business understanding, data preparation, modeling, evaluation and deployment. Decision trees are presented as a popular classification technique that uses a tree structure to split data into nodes and leaves to classify examples.
This document discusses various data mining techniques, including artificial neural networks. It provides an overview of the knowledge discovery in databases process and the cross-industry standard process for data mining. It also describes techniques such as classification, clustering, regression, association rules, and neural networks. Specifically, it discusses how neural networks are inspired by biological neural networks and can be used to model complex relationships in data.
Study and Analysis of K-Means Clustering Algorithm Using RapidMiner, by IJERA Editor
An institution is a place where the teacher explains and the student understands and learns the lesson. Every student has their own notion of what is hard or easy, and there is no absolute scale for measuring knowledge, but examination scores indicate a student's performance. In this case study, data mining is combined with educational strategies to improve students' performance. Generally, data mining (sometimes called data or knowledge discovery) is the process of analysing data from different perspectives and summarizing it into useful information. Data mining software is one of a number of analytical tools for data: it allows users to analyse data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in a large relational database. Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense, to each other than to those in other groups (clusters). This project describes the use of the clustering data mining technique to improve the efficiency of academic performance in educational institutions. A live experiment was conducted on students of a computer science major: an exam was administered using MOODLE (an LMS), the generated data was analysed using RapidMiner (data mining software), and clustering was then performed on the data. This method helps to identify the students who need special advising or counselling from the teacher, in order to deliver a high quality of education.
Unit-V: Introduction to Data Mining.pptx, by Harsha Patel
Data mining involves extracting useful patterns from large data sets to help businesses make informed decisions. It allows organizations to obtain knowledge from data, make improvements, and aid decision making in a cost-effective manner. However, data mining tools can be difficult to use and may not always provide precise results. Knowledge discovery is the overall process of discovering useful information from data, which includes steps like data cleaning, integration, selection, transformation, and mining followed by pattern evaluation and presentation of knowledge.
This document outlines the objectives, content, evaluation, and prerequisites for a course on Knowledge Acquisition in Decision Making, which introduces students to data mining techniques and how to apply them to solve business problems using SAS Enterprise Miner and WEKA. The course covers topics such as data preprocessing, predictive modeling with decision trees and neural networks, descriptive modeling with clustering and association rules, and a project presentation. Students will be evaluated based on assignments, case studies, a project, quizzes, class participation, and a final exam.
This document provides an introduction to data mining techniques. It discusses how data mining emerged due to the problem of data explosion and the need to extract knowledge from large datasets. It describes data mining as an interdisciplinary field that involves methods from artificial intelligence, machine learning, statistics, and databases. It also summarizes some common data mining frameworks and processes like KDD, CRISP-DM and SEMMA.
This document outlines a course on knowledge acquisition in decision making, including the course objectives of introducing data mining techniques and enhancing skills in applying tools like SAS Enterprise Miner and WEKA to solve problems. The course content is described, covering topics like the knowledge discovery process, predictive and descriptive modeling, and a project presentation. Evaluation includes assignments, case studies, and a final exam.
Additional themes of data mining for MSc CS, by Thanveen
Data mining involves using computational techniques from machine learning, statistics, and database systems to discover patterns in large data sets. There are several theoretical foundations of data mining including data reduction, data compression, pattern discovery, probability theory, and inductive databases. Statistical techniques like regression, generalized linear models, analysis of variance, and time series analysis are also used for statistical data mining. Visual data mining integrates data visualization techniques with data mining to discover implicit knowledge. Audio data mining uses audio signals to represent data mining patterns and results. Collaborative filtering is commonly used for product recommendations based on opinions of other customers. Privacy and security of personal data are important social concerns of data mining.
This document discusses data mining and related topics. It begins by defining data mining as the process of discovering patterns in large datasets using methods from machine learning, statistics, and database systems. The document then discusses data warehouses, how they work, and their role in data mining. It describes different data mining functionalities and tasks such as classification, prediction, and clustering. The document outlines some common data mining applications and issues related to methodology, performance, and diverse data types. Finally, it discusses some social implications of data mining involving privacy, profiling, and unauthorized use of data.
1) The document discusses data mining, which is defined as extracting information from large datasets. It can be used for applications like market analysis, fraud detection, and customer retention.
2) It explains the basics of data mining including the KDD (Knowledge Discovery in Databases) process and various data mining tasks and techniques.
3) The KDD process is described as the organized procedure for discovering useful patterns from large, complex datasets through steps like data cleaning, integration, selection, transformation, mining, evaluation and presentation.
The document provides an introduction to data mining and knowledge discovery. It discusses how large amounts of data are extracted and transformed into useful information for applications like market analysis and fraud detection. The key steps in the knowledge discovery process are described as data cleaning, integration, selection, transformation, mining, pattern evaluation, and knowledge presentation. Common data sources, database architectures, and types of coupling between data mining systems and databases are also outlined.
presentation on recent data mining Techniques ,and future directions of research from the recent research papers made in Pre-master ,in Cairo University under supervision of Dr. Rabie
The document discusses data mining and knowledge discovery in databases. It defines data mining as extracting patterns from large amounts of data. The key steps in the knowledge discovery process are presented as data selection, preprocessing, data mining, and interpretation. Common data mining techniques include clustering, classification, and association rule mining. Clustering groups similar data objects, classification predicts categorical labels, and association rules find relationships between variables. Data mining has applications in many domains like market analysis, fraud detection, and bioinformatics.
Database Security: Introduction, Methods for database security
Discretionary access control
Mandatory access control
Role-based access control for multilevel security
Use of views in security enforcement
This document discusses version stamps, which are fields that change each time underlying data is updated. They are used to check for data changes in NoSQL databases that lack transactions. Various methods for creating version stamps are described, including counters, GUIDs, hashes, and timestamps. The best approach may be a composite stamp using multiple methods to leverage their advantages and avoid single points of failure. Version stamps can enable consistency when reading and updating data.
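As a rough illustration of the composite approach described there (the field names and structure are my own, not from any particular NoSQL product), a stamp might combine a counter, a timestamp, and a content hash so that no single method is a point of failure:

```python
import hashlib
import time

def version_stamp(counter: int, document: bytes) -> dict:
    """Composite version stamp combining several of the methods described."""
    return {
        "counter": counter,                                   # monotonic per-record counter
        "timestamp": time.time_ns(),                          # wall-clock ordering hint
        "content_hash": hashlib.sha256(document).hexdigest(), # detects silent divergence
    }

old = version_stamp(1, b'{"name": "Alice"}')
new = version_stamp(2, b'{"name": "Alice", "city": "Pune"}')
print(old["content_hash"] != new["content_hash"])  # True: the data changed
```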
Research design is the conceptual structure and plan for conducting research that allows the researcher to obtain answers to research questions. It includes what the researcher will study, why they are studying it, where and when they will conduct the study, what data they need, how they will collect and analyze the data, and how they will report their results. Key aspects of research design include identifying independent and dependent variables, controlling for extraneous variables, developing hypotheses, and determining whether the study will be experimental or non-experimental. Experimental research designs involve manipulating an independent variable, while non-experimental designs do not. Research design helps ensure a study is well planned and structured to efficiently obtain meaningful results.
This PPT covers types of research, the nature of qualitative and quantitative research, research in the functional areas of management, and the process of research.
This document provides an agenda for a one week faculty development program on research methodology and intellectual property rights. It includes an introduction to research concepts like problem definition, setting research objectives, research design, and sampling techniques. The document defines what research is, discusses the key components of a research process and different research strategies like surveys, experiments, case studies, etc. It also explains the difference between research methods and methodology, and highlights the importance of properly defining the research problem and setting clear objectives.
This document discusses Neo4j and provides an introduction and agenda for a slip solving session on graph databases. It includes information on using the online Neo4j console and sandbox, creating nodes and relationships in Neo4j, and firing Cypher queries. Two example slips are provided on modeling social relationships and university data as a graph database and answering queries using Cypher.
The document provides an introduction to NoSQL databases. It discusses that NoSQL databases provide a mechanism for storage and retrieval of data without using tabular relations like relational databases. NoSQL databases are used in real-time web applications and for big data. They also support SQL-like query languages. The document outlines different data modeling approaches, distribution models, consistency models and MapReduce in NoSQL databases.
1) Relational databases try to maintain strong consistency by avoiding inconsistencies, while NoSQL databases accept some inconsistencies due to the CAP theorem and eventual consistency.
2) Consistency in databases refers to only allowing valid data transactions according to the defined rules to prevent violations. NoSQL databases sacrifice some consistency for availability and partition tolerance.
3) Eventual consistency means replicas may show temporary inconsistencies but will eventually converge to the same state with further updates. This can cause problems for applications that require strong consistency.
NOSQL databases can scale horizontally by distributing data across multiple servers through techniques like replication and sharding. Replication copies data across servers so each piece can be found in multiple places, while sharding partitions data and stores different parts on different servers. There are two main types of replication: master-slave, where one server is the master and others are slaves that copy from the master; and peer-to-peer, where all servers can accept writes. Sharding improves performance by ensuring frequently accessed data is on the same server. Replication provides redundancy and availability, while sharding allows scaling write and read operations.
The document discusses schema migrations in NoSQL databases. It begins by defining database schemas and how they differ between SQL and NoSQL databases. In NoSQL databases, schemas are often dynamic or non-existent. The document then examines various techniques for handling schema changes in NoSQL databases including incremental migrations, handling multiple schemas during a transition period, and changing aggregate structures. It provides examples of schema migrations for each technique using MongoDB and graph databases.
- Polyglot persistence involves using multiple data storage technologies to handle different data storage needs within a single application. This allows using the right technology for the job rather than trying to solve all problems with a single database.
- For example, a key-value store may be better for transient session or shopping cart data before an order is placed, while relational databases are better for structured transactional data after an order is placed.
- Using services that abstract the direct usage of different data stores allows sharing of data between applications in an enterprise. This improves reuse of data across systems.
This document provides an introduction to naive Bayesian classification. It begins with an agenda that outlines key topics such as introduction, solved examples, advantages, and disadvantages. The introduction section defines naive Bayesian classification and provides the Bayes' theorem formula. Four examples are then shown applying naive Bayesian classification to classification problems involving predicting whether someone buys a computer, has the flu, has an item stolen, or plays golf. Each example calculates the probabilities and classifies a data sample. The document concludes by listing advantages such as being simple, scalable, and able to handle different data types, and disadvantages such as the independence assumption and data scarcity issues. References are also provided.
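As a sketch of the kind of calculation such examples walk through (the toy weather/golf records below are invented, not taken from the document), a naive Bayesian classifier multiplies per-attribute likelihoods by the class prior:

```python
from collections import Counter

# Toy play-golf data: (outlook, windy, play?)
data = [("sunny", "no", "yes"), ("sunny", "yes", "no"), ("rainy", "yes", "no"),
        ("overcast", "no", "yes"), ("rainy", "no", "yes"), ("sunny", "no", "yes")]

def naive_bayes(sample, data):
    classes = Counter(label for *_, label in data)
    scores = {}
    for c, n_c in classes.items():
        score = n_c / len(data)                    # prior P(c)
        rows = [row for row in data if row[-1] == c]
        for i, value in enumerate(sample):
            # Likelihood P(attribute_i = value | c), under the naive
            # independence assumption.
            score *= sum(1 for r in rows if r[i] == value) / n_c
        scores[c] = score
    return max(scores, key=scores.get), scores

print(naive_bayes(("sunny", "no"), data))  # classifies the sample as 'yes'
```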
*"Sensing the World: Insect Sensory Systems"*Arshad Shaikh
Insects' major sensory organs include compound eyes for vision, antennae for smell, taste, and touch, and ocelli for light detection, enabling navigation, food detection, and communication.
The role of wall art in interior design, by meghaark2110
Wall patterns are designs or motifs applied directly to the wall using paint, wallpaper, or decals. These patterns can be geometric, floral, abstract, or textured, and they add depth, rhythm, and visual interest to a space.
Wall art and wall patterns are not merely decorative elements, but powerful tools in shaping the identity, mood, and functionality of interior spaces. They serve as visual expressions of personality, culture, and creativity, transforming blank and lifeless walls into vibrant storytelling surfaces. Wall art, whether abstract, realistic, or symbolic, adds emotional depth and aesthetic richness to a room, while wall patterns contribute to structure, rhythm, and continuity in design. Together, they enhance the visual experience, making spaces feel more complete, welcoming, and engaging. In modern interior design, the thoughtful integration of wall art and patterns plays a crucial role in creating environments that are not only beautiful but also meaningful and memorable. As lifestyles evolve, so too does the art of wall decor—encouraging innovation, sustainability, and personalized expression within our living and working spaces.
Mental Health Assessment in 5th semester BSc Nursing (also used in 2nd year GNM nursing), by parmarjuli1412
Mental health assessment for 5th semester BSc Nursing, also used in 2nd year GNM nursing. It includes an introduction, definition, purpose, methods of psychiatric assessment, history taking, the mental status examination, psychological tests, and psychiatric investigations.
How to Configure Public Holidays & Mandatory Days in Odoo 18, by Celine George
In this slide, we’ll explore the steps to set up and manage Public Holidays and Mandatory Days in Odoo 18 effectively. Managing Public Holidays and Mandatory Days is essential for maintaining an organized and compliant work schedule in any organization.
History of the Monastery of Mor Gabriel Philoxenos Yuhanon Dolabani, by fruinkamel7m
Classification of mental disorders in 5th semester BSc Nursing (also used in 2nd year GNM nursing), by parmarjuli1412
Classification of mental disorders for 5th semester BSc Nursing, also used in 2nd year GNM Nursing. Topics included are ICD-11, DSM-5, the Indian classification, geriatric psychiatry, a review of personality development, different types of theory, defense mechanisms, etiology and bio-psycho-social factors, ethics and responsibility, the responsibilities of the mental health nurse, practice standards for MHN, conceptual models and the role of the nurse, preventive psychiatry, and psychiatric rehabilitation.
Data Mining: An Introduction
1.
Mrs. Dipali Meher
Modern College of Arts, Science and Commerce, Ganeshkhind, Pune 411016
Data Mining: An Introduction
2.
From Then till Now…
Bayes' Theorem (1763)
Regression (1805)
Neural Networks (1943)
Turing (1950)
Evolutionary Computation (1965)
Databases (1970)
Genetic Algorithms (1975)
KDD (1989)
Support Vector Machines (1992)
Data Science (2001)
Moneyball (2003)
Big Data
4.
Data mining deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large data sets.
5.
Examples of information extracted using a query language:
List customers who used a credit card to purchase more than Rs. 10,000 worth of groceries
List patients who have had at least one heart attack
List students who have at least one backlog
List employees who have taken home loans
6.
Examples of what data mining is used for:
Develop a general profile of credit card customers
Determine patients whose lifestyle makes them prone to a heart attack in the near future
Differentiate poor credit risk customers from good credit card customers
Differentiate students who have had backlogs in their academic record
Determine employees who have taken loans for any purpose
7. Data mining differs from usual query processing in many ways:

Aspect | Query Processing | Data Mining
Query | Well formed, as SELECT … FROM … WHERE … | Not well formed; what is found out is usually hidden
Data | Data from online transaction processing systems, generally in table format | Data integrated from various sources; huge amounts of data
Output | A subset of the database | Not only a subset, but analyzed results in terms of patterns
8.
Data mining (knowledge discovery from data): the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
Data mining: a misnomer?
Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
10. Knowledge discovery in databases (KDD) is a multistep process of finding useful information and patterns in data, while data mining is one of the steps of KDD: the use of algorithms for the extraction of patterns.
Steps of KDD:
1. Selection: data extraction, obtaining data from heterogeneous data sources such as databases, data warehouses, the World Wide Web, or other information repositories.
2. Preprocessing: data cleaning; incomplete, noisy, and inconsistent data must be cleaned. Missing data may be ignored or predicted; erroneous data may be deleted or corrected.
11. 3. Transformation: data integration combines data from multiple sources into a coherent store; data can be encoded in common formats, normalized, and reduced.
4. Data mining: apply algorithms to the transformed data and extract patterns.
5. Pattern interpretation/evaluation: evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter out discovered patterns. Knowledge presentation: present the mined knowledge; visualization techniques can be used.
12. Knowledge Discovery Process
KDD is the nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data.
[Diagram: Selection → Preprocessing → Transformation → Data Mining → Pattern Interpretation and Evaluation]
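To make the flow above concrete, here is a minimal, self-contained Python sketch of the five KDD steps on a toy record set; the records, the mean-imputation choice, and the 0.5 threshold are all invented for illustration:

```python
# Toy KDD pipeline: selection -> preprocessing -> transformation -> mining -> evaluation.
raw = [{"name": "A", "marks": 82}, {"name": "B", "marks": None},
       {"name": "C", "marks": 45}, {"name": "D", "marks": 91}, {"name": "E", "marks": 38}]

# 1. Selection: keep only the task-relevant attribute.
selected = [r["marks"] for r in raw]

# 2. Preprocessing: handle missing values (here, predicted as the mean of the rest).
known = [m for m in selected if m is not None]
cleaned = [m if m is not None else sum(known) / len(known) for m in selected]

# 3. Transformation: min-max normalization into [0, 1].
lo, hi = min(cleaned), max(cleaned)
normalized = [(m - lo) / (hi - lo) for m in cleaned]

# 4. Data mining: a trivial "pattern" -- split into high/low achievers at 0.5.
pattern = {"high": sum(m >= 0.5 for m in normalized),
           "low": sum(m < 0.5 for m in normalized)}

# 5. Interpretation/evaluation: present the mined pattern.
print(pattern)  # {'high': 2, 'low': 3}
```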
16. Why Data Mining? Potential Applications
Data analysis and decision support
Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
Fraud detection and detection of unusual patterns (outliers)
Other applications:
Text mining (news groups, email, documents) and Web mining
Stream data mining
Bioinformatics and bio-data analysis
17. Data mining algorithms: all algorithms attempt to fit a model closest to the data being examined.
The model is based on the analysis of attributes of a training data set.
The model is then evaluated using a test data set.
A data model can be:
Predictive: makes predictions regarding data values using the results found from the available data; thus it makes use of historical data to make predictions.
Descriptive: identifies patterns or relationships in data; it finds the properties of the existing data and does not predict new properties.
19. Classification: maps data into predefined groups or classes.
It uses supervised learning.
The algorithm uses a learning phase to build a classifier from a training data set containing data attributes and associated class labels.
Example: a student's result; into which class will the student's result fall?
Pattern recognition is a type of classification where an input pattern is classified into one of several classes based on its similarity to predefined classes.
Example: identifying terrorists among passengers. They are identified by basic patterns such as the distance between the eyes and the size and shape of the face; these patterns are then compared with entries in a database to see whether any match is found.
21.
Grade | Useful Heat Value (kcal/kg)
A | > 6200
B | 5601 - 6200
C | 4941 - 5600
D | 4201 - 4940
E | 3361 - 4200
F | 2401 - 3360
G | 1301 - 2400
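Read as a classification rule, the table maps a continuous attribute (useful heat value) onto a predefined class (the grade). A small sketch of that mapping, written directly from the table above:

```python
def coal_grade(heat_kcal_per_kg: float) -> str:
    """Classify coal by useful heat value (kcal/kg), per the table above."""
    bands = [(6200, "A"), (5600, "B"), (4940, "C"),
             (4200, "D"), (3360, "E"), (2400, "F"), (1300, "G")]
    for lower, grade in bands:
        if heat_kcal_per_kg > lower:
            return grade
    return "ungraded"

print(coal_grade(5800))  # B (5601 - 6200)
print(coal_grade(1500))  # G (1301 - 2400)
```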
22.
Regression: maps data onto a real-valued prediction variable.
The algorithm tries to find the best function (linear or non-linear) that fits the training data; it assumes the target data always fits some function.
Example: a college professor determines his retirement plan based on current savings and income. If the professor wants to save more, he can adjust the plan using a simple linear regression formula.
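As a sketch of the idea (the yearly savings figures below are invented, not from the slides), a least-squares fit of a straight line to past savings lets the professor extrapolate:

```python
# Simple linear regression: fit savings = a + b * year by least squares.
years = [1, 2, 3, 4, 5]
savings = [1.2, 2.1, 2.9, 4.2, 5.0]   # e.g., lakhs of rupees; toy numbers

n = len(years)
mean_x = sum(years) / n
mean_y = sum(savings) / n
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, savings)) \
    / sum((x - mean_x) ** 2 for x in years)
a = mean_y - b * mean_x

print(f"savings ~ {a:.2f} + {b:.2f} * year")
print(f"predicted savings in year 8: {a + b * 8:.2f}")
```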
23.
Time series analysis: the value of an attribute is examined as it varies over time.
It can be used to determine similarities, classify behavior, or predict future values.
Example: the share market.
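One common first step in such analysis is smoothing the series before comparing or predicting. A minimal sketch, with invented share prices:

```python
def moving_average(series, window):
    """Smooth a time series with a simple moving average."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

prices = [101, 103, 99, 104, 108, 107, 111]   # toy daily share prices
print(moving_average(prices, 3))  # [101.0, 102.0, 103.67, 106.33, 108.67] (rounded)
```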
24. Prediction: predicts future values using regression, time series analysis, or other approaches.
Example: flood prediction for a river based on water level, amount and timing of rainfall, and humidity. Sensors placed at different locations in the river area monitor conditions, from which a flood prediction can be made.
Weather analysis
Pollution analysis
25.
Clustering: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
Unsupervised learning: no predefined classes.
Interpretability and usability: results should be comprehensible and usable; a domain expert is required.
Example: students are clustered on various attributes such as academic performance, the area in which they live, age, height, weight, body mass index, and extracurricular activities.
Clusters do not have a specific size or shape.
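A compact k-means sketch over two of the attributes mentioned above, height and weight; the sample points and the choice of k = 2 are invented for illustration:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    random.seed(seed)
    centroids = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                            + (p[1] - centroids[i][1]) ** 2)
            clusters[i].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]          # keep the old centroid if a cluster empties
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# (height cm, weight kg) for a few students; two loose groups by construction.
students = [(150, 45), (152, 48), (155, 50), (170, 68), (172, 70), (175, 72)]
centroids, clusters = kmeans(students, k=2)
print(centroids)
```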
27.
Summarization: maps data into subsets with simple descriptions; it extracts or derives representative, summary-type information.
Example: a summary of student results giving the number of students who appeared for the exam, passed, and failed, broken down by class.
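The student-result summary above is essentially a group-and-count; a minimal sketch with invented results:

```python
from collections import Counter

results = ["pass", "fail", "pass", "pass", "fail", "pass", "pass"]
summary = Counter(results)  # groups the raw records into a simple description
print(f"appeared: {len(results)}, passed: {summary['pass']}, failed: {summary['fail']}")
# appeared: 7, passed: 5, failed: 2
```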
28. Association rules: discover relationships among data; used in market basket analysis to find items frequently purchased together.
Example: a person buying sugar in the mall also buys milk. Items that people buy together can be shelved together.
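The sugar-and-milk rule corresponds to computing support and confidence over a set of transactions. A minimal sketch, with invented baskets:

```python
baskets = [{"sugar", "milk", "bread"}, {"sugar", "milk"}, {"sugar", "tea"},
           {"milk", "bread"}, {"sugar", "milk", "tea"}]

n = len(baskets)
sugar = sum("sugar" in b for b in baskets)
both = sum({"sugar", "milk"} <= b for b in baskets)

# Rule: sugar -> milk
print(f"support    = {both / n:.2f}")     # fraction of baskets with sugar AND milk
print(f"confidence = {both / sugar:.2f}") # of baskets with sugar, fraction with milk
```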
29. Sequence discovery: discovers sequential patterns in data, i.e., the order in which items are purchased or data is accessed.
Example: when a customer purchases a TV set, the sales manager expects that the customer will later also buy some CDs and a music system.
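Unlike association rules, order matters here. A toy sketch (purchase histories invented) counting how often a music system purchase follows a TV purchase:

```python
histories = [["tv", "cd", "music_system"], ["tv", "music_system"],
             ["fridge", "tv"], ["tv", "fridge", "music_system"]]

def follows(history, first, later):
    """True if `later` is bought at some time after `first` in this history."""
    return first in history and later in history[history.index(first) + 1:]

tv_buyers = [h for h in histories if "tv" in h]
n_follow = sum(follows(h, "tv", "music_system") for h in tv_buyers)
print(f"{n_follow}/{len(tv_buyers)} TV buyers later bought a music system")  # 3/4
```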
30. Influence from many disciplines
[Diagram: Data Mining at the center, drawing on Artificial Intelligence, Information Technology, Database Technology, Machine Learning, Pattern Recognition, Statistics, Algorithms, Visualization, and Mathematical Modeling]
31. Depending on the data mining approach, techniques from other disciplines may be applied, such as:
• Information retrieval
• Artificial intelligence
• Neural networks
• Fuzzy set theory
• Knowledge representation
• Logic programming
• High performance computing
32. Data Mining Issues
Human interaction: to analyze the output and find the correct inference after the data mining step, interfaces with both domain and technical experts are required.
Overfitting: occurs when the model fits the current data exactly but does not fit future data; if the training dataset is flawed, overfitting occurs.
Outliers: the model may be distorted by the presence of outliers.
Interpretation of results: experts are required due to interpretability problems.
Visualization of results: visualization helps to display the analyzed data, but for multidimensional data visualization becomes problematic.
33. Data Mining Issues continued…
Large datasets: scalability problems arise, as algorithms do not scale well to massive real-world datasets; sampling and parallelization are effective tools for this problem.
High dimensionality: a conventional database may contain many different attributes, not all of which are relevant; some increase complexity and reduce efficiency. This is known as the dimensionality curse; data reduction can be applied so that dimensionality is reduced as well.
Multimedia data: found in GIS databases, it renders conventional data mining algorithms ineffective.
Missing data: it is not always possible to ignore missing data, but during preprocessing, data mining algorithms can be used to replace missing data with estimates.
34. Data Mining Issues continued…
Irrelevant data: data is reduced by removing irrelevant data.
Noisy data: invalid, incorrect data leads to poor quality data mining.
Changing data: data warehouses contain non-volatile data; dynamic data is uploaded and the algorithms are reapplied to check that they still work correctly.
Integration: KDD requests were once one-time needs; data mining functions are now integrated into traditional database systems.
Applications: effective use of the output of a mining algorithm is a greater challenge than the complexity of the mining algorithm.
35. Data Mining Metrics
How do we measure the effectiveness of the data mining process?
The KDD process is expensive; the return on investment is the saving due to decisions made using the results.
This is difficult to measure and quantify.
Social implications of data mining: there are two sides to the coin.
Data mining can be used to improve customer service and satisfaction.
Data mining can also be used to confront one's right to privacy.
Omnipresent, invisible data mining affects everyone.
36. Data mining should follow certain guidelines:
Purpose specification and use limitation
Openness
Security safeguards
Individual participation
Privacy-preserving data mining:
- secure multiparty computation
- data obscuration
37. Applications of Data Mining
Security: finding terrorists using classification techniques
Weather: predicting weather and pollution
Finance: the share market
E-commerce: market basket analysis
Education: student result preparation
Banking: analysis of customers applying for loans
Research: data analysis
Fraud detection
Marketing: targeting customers
Molecular biology
Astronomy
Health: detecting diseases in people
38. Books for Reference
Data Mining: Introductory and Advanced Topics, by Margaret H. Dunham and S. Sridhar, Pearson Education, ISBN 81-7758-785-4
Data Mining: Concepts and Techniques, by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 81-312-0535-5