Data mining techniques unit 2

DATA MINING
TECHNIQUES
UNIT-II

Data objects and Attributes Types
• What is data object?
• What is an attribute?
-observations
-attribute vector or feature vector
-univariate, bivariate
• Types of attributes
-Nominal
-Binary
-Ordinal
-Numeric
-Discrete Vs Continuous Attributes

Nominal Attributes
• Relating to name
• Symbols or name of things
• Categorical or nominal
• Enumerations
• Example

Binary Attribute
• What is binary attribute ?
• What is Boolean attribute?
• Symmetric
• Asymmetric
• Example

Ordinal Attribute
• What is ordinal attribute?
• Example

Numeric Attributes
• What are numeric attributes?
• Interval-scaled Attributes
• Ratio-scaled Attributes
• Example

Discrete Vs Continuous Attributes
• What are discrete and continuous attributes?
• Example

Basic statistical description of data
-Purpose of basic statistical description of data:
-Three areas of basic statistical descriptions
1. Measuring the central tendency: Mean, Median and Mode
2. Measuring the dispersion of data: Range, Quartiles, Variance, Standard Deviation and Interquartile Range
-Five-Number Summary, Boxplots and Outliers
-Variance and Standard Deviation
3. Graphic Displays of Basic Statistical Descriptions of Data
-Quantile plot
-Quantile-Quantile Plot
-Histogram
-Scatter plots and Data correlation

Measuring the central Tendency: Mean, Median and Mode
-These measures locate the middle or center of a data distribution.
-Measures of central tendency include the mean, median, mode and midrange.
-The most common and effective numeric measure is the ARITHMETIC MEAN, which helps to find the center of a set of data.
-The mean of a set of N values x1, x2, ..., xN is: mean = (x1 + x2 + ... + xN) / N

Weighted Arithmetic Mean or Weighted Average
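
The mean and the weighted mean can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function names and the sample values are made up for the example.

```python
def mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


scores = [70, 80, 90]   # made-up values
weights = [1, 1, 2]     # made-up weights: the last value counts twice
print(mean(scores))                    # 80.0
print(weighted_mean(scores, weights))  # 82.5
```

With equal weights the weighted mean reduces to the ordinary mean.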

Median
-Trimmed mean: the mean obtained after chopping off values at the high and low extremes
-For asymmetric (skewed) data, the median is a better measure of the center of the data
-The median is expensive to compute when the number of observations is large
-For numeric attributes, its value can easily be approximated
-Median interval: the interval that contains the median frequency

Mode
-Another measure of central tendency
-The value that occurs most frequently in the set
-It can be determined for both qualitative and quantitative attributes
-Data sets with one, two or three modes are called unimodal, bimodal and trimodal
-A data set with two or more modes is multimodal

Midrange
-It is the average of the largest and smallest values in the set
-This measure is easy to compute using the SQL aggregate functions max() and min()
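
As a rough sketch, the median, mode and midrange can be computed with Python's standard `statistics` module. The data values are illustrative only, and `midrange` is a helper name chosen for this example.

```python
from statistics import median, multimode


def midrange(values):
    """Average of the largest and smallest values."""
    return (max(values) + min(values)) / 2


data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative values

print(median(data))     # 54.0 (average of the two middle values, 52 and 56)
print(multimode(data))  # [52, 70] -> two modes, so the data are bimodal
print(midrange(data))   # 70.0
```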

Measuring the dispersion of data: range, quartiles, variance, standard deviation and interquartile range
-Range, quantiles, quartiles, percentiles and the interquartile range are measures of data dispersion
-Range: the difference between the largest (max()) and smallest (min()) values
-Quantiles: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
-Quartiles: the three points (Q1, Q2, Q3) that split the data distribution into four equal parts
-Percentiles: the 100-quantiles, which divide the distribution into 100 equal-size consecutive sets
-Interquartile range: the distance between the first and third quartiles, IQR = Q3 - Q1

Five-Number Summary, Boxplots and Outliers
-The five-number summary: Minimum, Q1, Median, Q3, Maximum
-Boxplot: a popular way of visualizing a distribution
-It incorporates the five-number summary
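
A minimal sketch of the five-number summary follows. It assumes the "inclusive" quartile convention of Python's `statistics.quantiles` (other conventions give slightly different quartiles), and it flags outliers with the common rule of thumb of 1.5 x IQR beyond the quartiles, which is an assumption added here rather than something stated on the slide. The data are made up.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


data = [1, 2, 3, 4, 5, 6, 7, 8, 40]  # made-up values with one extreme point
print(five_number_summary(data))     # (1, 3.0, 5.0, 7.0, 40)
print(iqr_outliers(data))            # [40]
```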

Variance and Standard Deviation
-A low standard deviation means that the data observations tend to be very close to the mean
-A high standard deviation indicates that the data are spread out over a large range of values
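
The sketch below computes the population variance (the mean squared deviation from the mean) and its square root, the standard deviation. The function names and the data are illustrative; a sample variance would divide by N - 1 instead of N.

```python
from math import sqrt


def variance(values):
    """Population variance: average squared distance from the mean."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def std_dev(values):
    """Population standard deviation: square root of the variance."""
    return sqrt(variance(values))


tight = [4, 5, 5, 5, 6]            # made-up values close to the mean
spread = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values that are more spread out
print(variance(spread))  # 4.0
print(std_dev(spread))   # 2.0
print(std_dev(tight) < std_dev(spread))  # True
```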

Graphic displays of basic statistical descriptions of data
-Quantile plot: shows a univariate data distribution
-It displays all of the data for the given attribute
-Quantile-quantile plot (q-q plot): it graphs the quantiles of one univariate distribution against the corresponding quantiles of another.

Histograms, Scatter Plots and Data Correlation
-Histogram: literally a "chart of poles"; it summarizes the distribution of a given attribute
-When the attribute is nominal, the resulting graph is more commonly known as a bar chart
-Scatter plot: one of the most effective graphical methods for determining whether there is a relationship, pattern or trend between two numeric attributes
-It provides a first look at bivariate data
-Correlation: two attributes are correlated if one attribute implies the other
-Positive correlation, negative correlation, null correlation (uncorrelated)

Data Visualization
• Aims to communicate data clearly and effectively through graphical
representation
• Pixel oriented techniques
• Geometric projection visualization techniques
• Icon based visualization techniques
• Hierarchical visualization techniques
• Visualizing complex data and relations

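Pixel oriented techniques
• To visualize the value of a dimension, one window is created per dimension
• The colour of the pixel reflects the dimension’s value
• The data records can also be ordered in a query-dependent way
• Problem with filling a window row by row: a pixel is next to the one above it in the window, even though the two are not next to each other in the global order
• To solve this problem, the data records can be laid out along a space-filling curve to fill the window
• Circle segment technique: the windows are shaped as segments of a circle, which eases the comparison of dimensions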

Geometric projection visualization techniques
• Drawback of pixel oriented techniques: they cannot help much in understanding the distribution of data in a multidimensional space.
• Geometric projection: helps users find interesting projections of multidimensional data sets
• Scatter plot: displays 2D data points; a third dimension can be added using different colours or shapes to represent different data points
• Scatter plot matrix technique: for data sets with more than four dimensions
• Parallel coordinates: can handle higher dimensionality

Icon-based visualization techniques
• Small icons are used to represent multidimensional data values
• Two popular icon-based techniques: Chernoff faces and stick figures
• Chernoff faces: display multidimensional data of up to 18 variables or dimensions as cartoon faces
• Viewing large tables of data is tedious; Chernoff faces condense the data into a form that is easier to digest
• Asymmetric Chernoff faces: double the number of facial characteristics, allowing up to 36 dimensions to be displayed
• Stick figure: maps multidimensional data to five-piece stick figures

Hierarchical visualization techniques
• They partition all dimensions into subsets (i.e., subspaces). The subspaces are visualized in a hierarchical manner
• Tree maps: display hierarchical data as a set of nested rectangles.

Visualizing complex data and relations
• Visualization techniques are also available for non-numeric data such as text and social networks.
• Tag cloud: a visualization of statistics of user-generated tags.
• A tag cloud can be built for a single item or for multiple items
• A disease influence graph visualizes the correlations between diseases

Measuring data similarity and
dissimilarity
• Measures of proximity
• Two data structures: the data matrix and dissimilarity matrix
• Data matrix Vs Dissimilarity matrix
• Proximity measures for nominal attributes
• Proximity measures for binary attributes
• Dissimilarity of numeric data: Minkowski distance
• Proximity measures for ordinal attributes
• Dissimilarity for attributes of mixed types
• Cosine similarity

Data matrix Vs Dissimilarity matrix
• Data Matrix(or object-by-attribute structure)
• Dissimilarity matrix ( or object-by- object structure)
• Sim(i , j)=1-d(i , j)
• Two - mode matrix
• One - mode matrix

Proximity measures for nominal attributes
• The dissimilarity between two objects i and j can be computed based
on the ratio of mismatches:
d(i,j) = (p - m) / p
m - the number of matches (i.e., the number of attributes for which i and j are in the same state)
p - the total number of attributes describing the objects
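
The ratio of mismatches can be sketched as below. Objects are represented as tuples of nominal values; the function name and the sample objects are made up for illustration.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                      # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p


x = ("red", "small", "round")  # made-up objects
y = ("red", "large", "round")
print(nominal_dissimilarity(x, y))      # one mismatch out of three attributes
print(1 - nominal_dissimilarity(x, y))  # the corresponding similarity
```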

Proximity measures for binary attributes
• Symmetric binary attributes
• Asymmetric binary dissimilarity
• Asymmetric binary similarity
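
The three binary measures can be sketched from the usual 2 x 2 contingency counts: q (both 1), r (1 in i, 0 in j), s (0 in i, 1 in j) and t (both 0). The letters and function names are conventions assumed for this example, and the vectors are made up. The asymmetric measures ignore t, because negative matches are considered unimportant; the asymmetric similarity is the Jaccard coefficient.

```python
def binary_counts(x, y):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t


def symmetric_dissimilarity(x, y):
    q, r, s, t = binary_counts(x, y)
    return (r + s) / (q + r + s + t)


def asymmetric_dissimilarity(x, y):
    q, r, s, _ = binary_counts(x, y)  # negative matches (t) are ignored
    return (r + s) / (q + r + s)


def jaccard_similarity(x, y):
    q, r, s, _ = binary_counts(x, y)
    return q / (q + r + s)


i = [1, 0, 1, 0, 0, 0]  # made-up binary objects
j = [1, 1, 0, 0, 0, 0]
print(symmetric_dissimilarity(i, j))
print(asymmetric_dissimilarity(i, j))
print(jaccard_similarity(i, j))
```

The sketch assumes at least one attribute is 1 in either object; otherwise the asymmetric measures divide by zero.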

Dissimilarity of numeric data: Minkowski
distance
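
A short sketch of the Minkowski distance of order h: h = 1 gives the Manhattan distance, h = 2 the Euclidean distance, and the limit for large h is the supremum distance. Function names and points are illustrative.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)


def supremum(x, y):
    """Limit of the Minkowski distance as h grows: the largest attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))


p1, p2 = (1, 2), (3, 5)      # made-up points
print(minkowski(p1, p2, 1))  # Manhattan distance: 5.0
print(minkowski(p1, p2, 2))  # Euclidean distance: square root of 13
print(supremum(p1, p2))      # 3
```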

Proximity measures for ordinal attributes
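
One common treatment, sketched here as an assumption since the slide body is empty, replaces each ordinal value by its rank r in 1..M and maps it onto [0, 1] with z = (r - 1) / (M - 1). The resulting z values can then be compared with any numeric distance. The state names are made up.

```python
def ordinal_to_unit_interval(value, ordered_states):
    """Map an ordinal value to z = (rank - 1) / (M - 1), with ranks 1..M."""
    m = len(ordered_states)
    rank = ordered_states.index(value) + 1
    return (rank - 1) / (m - 1)


states = ["fair", "good", "excellent"]  # made-up states, lowest to highest
z = [ordinal_to_unit_interval(v, states) for v in ["excellent", "fair", "good"]]
print(z)                 # [1.0, 0.0, 0.5]
print(abs(z[0] - z[1]))  # dissimilarity between "excellent" and "fair": 1.0
```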

Dissimilarity for attributes of mixed types
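
A hedged sketch of one way to combine attribute types: each attribute contributes a dissimilarity in [0, 1], and the contributions are averaged over the attributes that can be compared. Missing values (None) are skipped, as are 0/0 matches on asymmetric binary attributes. The type labels, the `ranges` argument and the sample objects are all assumptions made for this example; ordinal attributes are assumed to have been converted to numbers in [0, 1] beforehand.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Average per-attribute dissimilarity over the comparable attributes."""
    total, counted = 0.0, 0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # missing value: attribute not comparable
        if kind == "asymmetric_binary" and a == 0 and b == 0:
            continue  # negative match carries no information
        if kind == "numeric":
            d = abs(a - b) / ranges[f]  # normalized by the attribute's range
        else:  # nominal or binary: 0 on a match, 1 otherwise
            d = 0.0 if a == b else 1.0
        total += d
        counted += 1
    return total / counted if counted else 0.0


kinds = ("nominal", "symmetric_binary", "numeric")
ranges = (None, None, 40.0)  # max - min, only needed for the numeric attribute
obj_i = ("red", 1, 20.0)     # made-up objects
obj_j = ("blue", 1, 30.0)
print(mixed_dissimilarity(obj_i, obj_j, kinds, ranges))  # (1 + 0 + 0.25) / 3
```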

Data Pre processing
• Need for data pre processing
• Inaccurate data
• Incomplete data
• Inconsistent data
• Timeliness
• Believability
• Interpretability

Major Tasks in Data Pre processing
• Data Cleaning
• Data Integration
• Data reduction
• Data Transformation

Data Cleaning
1. Missing Values, 2. Noisy Data, 3. Data Cleaning as a Process
MISSING VALUES
-Ignore the tuple
-Fill in the missing value manually
-Use a “global constant” to fill in the missing value
-Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
-Use the attribute mean or median for all samples belonging to the same class as the given tuple
-Use the most probable value to fill in the missing value

2.Noisy data-
(i)Binning,(ii)Regression,(iii)Outlier analysis
NOISY DATA
• Binning
Equal-frequency bins
Smoothing by bin means
Smoothing by bin boundaries
Ex. sorted data for price(in dollars)
4,8,15,21,21,24,25,28,34
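
The binning steps can be sketched on the price data above. The function names are chosen for this example, the bins are assumed to divide the data evenly, and ties in smoothing by bin boundaries are resolved toward the lower boundary, which is an assumption.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted data into n_bins bins holding the same number of values."""
    size = len(sorted_values) // n_bins  # assumes the data divide evenly
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```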

2.Noisy data-(ii)Regression,(iii)Outlier
Analysis
REGRESSION
• Linear Regression
• Multiple Regression
OUTLIER ANALYSIS

3.Data Cleaning as a Process(i)Discrepancy
detection (ii)Data Transformation
DISCREPANCY DETECTION
• Caused by several factors
• How can we proceed with discrepancy detection?
• Field overloading
• Rules to examine data
-Unique rule
-Consecutive rule
-Null rule
Tools to aid in the discrepancy detection step:
(i) Data scrubbing tools
(ii) Data auditing tools

3.Data Cleaning as a Process (ii)Data
Transformation
DATA TRANSFORMATION
• Commercial tools for data transformation:
(i)Data migration tools
(ii)ETL(Extraction/Transformation/Loading)tools
• What is Potter’s Wheel?
• What are declarative languages for data cleaning?

Data Integration-1.Entity Identification
Problem ,2.Redundancy and correlation
analysis,3.Tuple Duplication
• Merging of data from multiple data sources
• Helps to reduce and avoid inconsistencies
• Helps to improve accuracy and speed of the datamining process

Data Integration-1.Entity Identification
Problem
• Customer_id in DB1 vs. Customer_number in DB2
• How can the data analyst be sure that both refer to the same attribute?
• Metadata for each attribute include the name, meaning, data type and range of values
• Metadata help to avoid errors in schema integration
• Pay type coded as H/S in DB1 but as 1/2 in DB2

Data Integration- 2.Redundancy and
correlation analysis
• REDUNDANCY (e.g., DOB and AGE)
An attribute may be redundant if it can be derived from another attribute or set of attributes
• CORRELATION ANALYSIS
Some redundancies can be detected by correlation analysis
Given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data.
Types of analysis: (i) Nominal data: χ² (chi-square) test
(ii) Numeric data: correlation coefficient, covariance
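
For nominal data, the χ² statistic compares the observed counts in a contingency table with the counts expected if the two attributes were independent. The sketch below is illustrative only; the function name and the counts are made up, and judging significance would additionally need the degrees of freedom and a χ² table.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2


# Made-up 2 x 2 table: rows and columns are the values of two nominal attributes
table = [[250, 200],
         [50, 1000]]
print(round(chi_square(table), 2))  # 507.94
```

A large statistic relative to the critical value suggests that the two attributes are correlated, so one of them may be redundant.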

Data Integration- 2.Redundancy and
correlation analysis
• Nominal attribute-eg.,Hair color-black,brown,red etc.,
eg.,Marital status-single,married,divorced,widowed
• Numeric attribute-Integer or real values

Correlation Coefficient for Numeric Data
• The correlation between two attributes A and B is evaluated by computing the
correlation coefficient (also known as Pearson's product moment coefficient)

Covariance of numeric data
• Correlation and covariance are two simple measures for assessing how
much two attributes change together.
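
A minimal sketch of both measures, using population formulas: the covariance is the mean product of deviations, and the correlation coefficient divides it by the product of the two standard deviations. The function names and the data are illustrative.

```python
from math import sqrt


def covariance(a, b):
    """Population covariance: mean of (a_i - mean_a) * (b_i - mean_b)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / len(a)


def correlation(a, b):
    """Pearson correlation coefficient, always between -1 and +1."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))


A = [6, 5, 4, 3, 2]      # made-up values of attribute A
B = [20, 10, 14, 5, 5]   # made-up values of attribute B
print(covariance(A, B))   # positive: the attributes tend to rise together
print(correlation(A, B))  # close to +1 means strong positive correlation
```

A coefficient near 0 indicates no linear correlation, and a negative value means one attribute tends to decrease as the other increases.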

Tuple duplication
• Duplication should also be detected at the tuple level
• Denormalized tables are often a source of data redundancy
• Data value conflict detection and resolution
• E.g., the same weight, hotel price or school grade attribute may be represented differently in different sources

Data Reduction
• Why data reduction?
• Data reduction strategies:
1)Dimensionality reduction
Wavelet Transforms
Principal Component Analysis(PCA)
Attribute Subset Selection
2)Numerosity Reduction
Parametric-Regression and Log linear models
Non-Parametric-Histograms, Clustering, Sampling, Data cube aggregation
3)Data Compression

Dimensionality Reduction
• What is dimensionality reduction?
• What is Curse of dimensionality?
• What is Wavelet Transform?
1.Discrete Wavelet Transform(DWT)
• DWT in data reduction
2.Discrete Fourier Transform(DFT)
DWT is closely related to DFT, a signal processing technique involving
sines and cosines

Difference between DWT and DFT
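DWT | DFT
1. More accurate approximation for the same number of retained coefficients | Less accurate approximation
2. Requires less space for an equivalent approximation | Requires more space
3. Several families of DWTs (Haar-2, Daubechies-4, Daubechies-6, etc.) | Only one DFT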

Hierarchical pyramid Algorithm
• Procedure for applying a discrete wavelet transform that halves the data at each iteration,
resulting in fast computational speed.
Method:
1.Length L of input data vector must be an integral power of 2.
2.Each transform involves applying two functions.
(i)Applies data smoothing
(ii)Performs weighted difference
3.The two functions are applied to pairs of data points in X
4.The two functions are recursively applied to the data sets obtained in the
previous loop, until the resulting data sets obtained are of length 2.
5.Selected values from the data sets obtained in the previous iterations are the
wavelet coefficients of the transformed data.
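
The pyramid idea can be sketched with the simplest wavelet, the Haar wavelet. This is a simplified, unnormalized variant and an assumption of this example: the smoothing function is the pairwise average, the difference function is half the pairwise difference, and the recursion continues down to a single overall average. The input values are made up.

```python
def haar_dwt(data):
    """Unnormalized Haar transform: [overall average, coarse-to-fine details]."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2")
    approx = [float(v) for v in data]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        smooth = [(a + b) / 2 for a, b in pairs]  # (i) data smoothing
        diff = [(a - b) / 2 for a, b in pairs]    # (ii) weighted difference
        details = diff + details                  # coarser details go first
        approx = smooth                           # data halve at each iteration
    return approx + details


coefficients = haar_dwt([2, 2, 0, 2, 3, 5, 4, 4])  # made-up input of length 8
print(coefficients)  # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
```

For data reduction, coefficients close to zero can be dropped and only the strongest ones stored.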

Matrix
• A matrix multiplication can be applied to the input data in order to
obtain the wavelet coefficients.
• Matrix must be orthogonal.
• Multidimensional Data.

Principal Components Analysis (also called
the Karhunen-Loeve or K-L method)
• What is Principal Component Analysis?
• Procedure for PCA
1.Normalize input data
2.Compute “K” orthonormal vectors
3.Principal components are sorted in order of decreasing “significance”
or “strength”
4.The data size can be reduced by eliminating the weaker components.

Attribute Subset Selection
• Another way to reduce dimensionality of data.
• Irrelevant attributes
• GOAL:
1.To find a minimum set of attributes
2.Reduces the no. of attributes appearing in the discovered pattern
• How can we find a ‘good’ subset of the original attributes?

Greedy Methods
• What is Heuristic search?
• Basic heuristic methods:
1.Stepwise forward selection
Initial attribute set:{a1,a2,a3,a4,a5,a6}
Initial reduced set:{}
=>{a1}
=>{a1,a4}
=>{a1,a4,a6} Reduced attribute set
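
Stepwise forward selection can be sketched as a greedy loop: start from the empty set, repeatedly add the attribute that most improves an evaluation score, and stop when nothing improves it. The `score` function is hypothetical; in practice it could be a statistical significance test or a classifier's accuracy. The toy scores below are made up so that the run reproduces the example above.

```python
def forward_selection(attributes, score):
    """Greedy forward selection driven by a user-supplied score(subset)."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Pick the attribute whose addition gives the highest score
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        new_score = score(selected + [candidate])
        if new_score <= best:
            break  # no remaining attribute improves the subset
        selected.append(candidate)
        remaining.remove(candidate)
        best = new_score
    return selected


# Hypothetical usefulness of each attribute; unlisted attributes hurt the score
usefulness = {"a1": 3, "a4": 2, "a6": 1}


def toy_score(subset):
    return sum(usefulness.get(a, -1) for a in subset)


print(forward_selection(["a1", "a2", "a3", "a4", "a5", "a6"], toy_score))
# ['a1', 'a4', 'a6']
```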

Basic Heuristic Method
2.Stepwise backward elimination:
Initial attribute set:{a1,a2,a3,a4,a5,a6}
=>{a1,a3,a4,a5,a6}
=>{a1,a3,a5,a6}
=>{a1,a3,a6}
3.Combination of forward selection and backward elimination
4.Decision tree induction
• Discrete transform in attribute subset selection

Numerosity Reduction-(i)parametric data
reduction
• Regression &log-linear models
• Regression-can be used to approximate the given data.
1.Linear (simple)Regression:
Data are modelled to fit a straight line.
y = wx + b, where x and y are numeric database attributes and w and b are the regression coefficients. These
coefficients are solved for by the method of least squares.
2.Multiple Linear Regression:
Allows a response variable ’y’ to be modelled as a linear function of two or more
predictor variables.
Y=b0+b1x1+b2x2
• Log-Linear Models: Approximate discrete multidimensional probability
distributions.
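
For simple linear regression, the least-squares coefficients have a closed form: w is the covariance of x and y divided by the variance of x, and b = mean(y) - w * mean(x). The sketch below uses made-up points and an illustrative function name. Storing only w and b instead of the data is what makes this a parametric reduction.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the line y = w * x + b."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    w = s_xy / s_xx
    b = mean_y - w * mean_x
    return w, b


xs = [1, 2, 3, 4]  # made-up points lying exactly on y = 2x + 1
ys = [3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(w, b)       # 2.0 1.0
print(w * 5 + b)  # estimate of y at x = 5: 11.0
```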

Numerosity Reduction-(ii)Non-parametric
data reduction
• Histograms use binning to approximate data distributions and are a
popular form of data reduction.
(i)Equal –width: the width of each bucket range is uniform
(ii)Equal-frequency(or equal-depth):the frequency of each bucket is
constant

Histogram-Equal-frequency, Equal-width
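
The two bucketing rules can be contrasted with a small sketch. Equal-width buckets cover value ranges of the same width and so may hold different numbers of values; equal-frequency buckets hold the same number of values and so may cover ranges of different widths. Function names and data are illustrative, and the sketch assumes the values are not all identical and divide evenly into k buckets.

```python
def equal_width_counts(values, k):
    """Count values in k buckets of uniform width between min and max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    counts = [0] * k
    for v in values:
        index = min(int((v - lo) / width), k - 1)  # the max value goes in the last bucket
        counts[index] += 1
    return counts


def equal_frequency_ranges(values, k):
    """(low, high) range of each of k buckets holding the same number of values."""
    ordered = sorted(values)
    size = len(ordered) // k
    return [(ordered[i * size], ordered[(i + 1) * size - 1]) for i in range(k)]


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # made-up values
print(equal_width_counts(prices, 3))      # [2, 3, 4]
print(equal_frequency_ranges(prices, 3))  # [(4, 15), (21, 24), (25, 34)]
```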

Numerosity Reduction-(ii)Non-parametric
data reduction
• Clustering: considers data tuples as objects and groups similar objects together
- Centroid distance is an alternative measure of cluster quality and is defined as the
average distance of each cluster object from the cluster centroid
• Sampling: allows a large data set D to be represented by a much smaller random data
sample (or subset)
- Simple random sample without replacement of size s (SRSWOR): all tuples are
equally likely to be sampled
- Simple random sample with replacement of size s (SRSWR): similar to SRSWOR, except that
after a tuple is drawn, it is placed back in D so that it may be drawn again.
- Cluster sample: the tuples in D are grouped into M mutually disjoint “clusters”, and a simple random sample of clusters is drawn
- Stratified sample: D is divided into mutually disjoint parts called strata, and a sample is drawn from each stratum. This helps
to ensure a representative sample, especially when the data are skewed.
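
A rough sketch of three of these sampling schemes with Python's standard `random` module. The function names, the fixed seed, the strata and the proportional allocation in the stratified sample are assumptions made for this example.

```python
import random


def srswor(data, s, rng):
    """Simple random sample WITHOUT replacement: no tuple is drawn twice."""
    return rng.sample(data, s)


def srswr(data, s, rng):
    """Simple random sample WITH replacement: a tuple may be drawn again."""
    return rng.choices(data, k=s)


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction from every stratum (at least one tuple each)."""
    sample = []
    for tuples in strata.values():
        k = max(1, round(len(tuples) * fraction))
        sample.extend(rng.sample(tuples, k))
    return sample


rng = random.Random(0)  # fixed seed so the sketch is repeatable
D = list(range(100))    # made-up data set of 100 tuple ids
strata = {"young": D[:80], "senior": D[80:]}  # made-up, skewed strata

print(srswor(D, 10, rng))
print(srswr(D, 10, rng))
print(stratified_sample(strata, 0.1, rng))  # 8 young tuples and 2 senior tuples
```

Sampling each stratum separately keeps the small "senior" group represented, which a plain random sample of a skewed data set might miss.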

Data Cube
Aggregation
• The cube created at the
lowest abstraction level
is referred to as the base
cuboid.
• A cube at the highest
level of abstraction is
the apex cuboid.

Data Compression
• Transformations are applied so as to obtain a reduced or
“compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data
without any information loss, the data reduction is called lossless.
• If we can reconstruct only an approximation of the original data, the
data reduction is called lossy.
• The dimensionality reduction and numerosity reduction techniques can
also be considered forms of data compression

Data transformation and Data
Discretization
• The data are transformed or consolidated so that the resulting mining
process may be more efficient and the patterns found may be easier to
understand.
• Data transformation strategies overview:
Smoothing
Attribute Construction
Aggregation
Normalization
Discretization
Concept hierarchy generation for nominal data

Data Transformation by Normalization
• Min-Max Normalization
• Z-score Normalization
• Decimal Scaling
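
The three normalization methods can be sketched as follows. Min-max maps values linearly onto a new range, z-score expresses each value in standard deviations from the mean (the population standard deviation is assumed here), and decimal scaling divides by the smallest power of 10 that brings every absolute value below 1. Function names and data are illustrative.

```python
from statistics import mean, pstdev


def min_max(values, new_min=0.0, new_max=1.0):
    """Linear mapping of [min, max] onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """(v - mean) / standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


print(min_max([10, 20, 30]))         # [0.0, 0.5, 1.0]
print(z_score([10, 20, 30]))         # symmetric around 0.0
print(decimal_scaling([-986, 917]))  # j = 3, so the values are divided by 1000
```

Min-max and z-score assume the values are not all identical; otherwise the denominators are zero.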

Discretization
• Discretization by binning
• Discretization by histogram analysis
• Discretization by cluster, decision tree and correlation analyses

Concept hierarchy generation for nominal data
• Specification of a partial ordering of attributes explicitly at the schema
level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes but not of their partial ordering
• Specification of only a partial set of attributes
