DATA MINING TECHNIQUES
UNIT-II
Data objects and Attributes Types
• What is a data object?
• What is an attribute?
-observations
-attribute vector or feature vector
-univariate, bivariate
• Types of attributes
-Nominal
-Binary
-Ordinal
-Numeric
-Discrete Vs Continuous Attributes
Nominal Attributes
• Relating to name
• Symbols or name of things
• Categorical or nominal
• Enumerations
• Example
Binary Attribute
• What is a binary attribute?
• What is a Boolean attribute?
• Symmetric
• Asymmetric
• Example
Ordinal Attribute
• What is an ordinal attribute?
• Example
Numeric Attributes
• What are numeric attributes?
• Interval-scaled Attributes
• Ratio-scaled Attributes
• Example
Discrete Vs Continuous Attributes
• What are discrete and continuous attributes?
• Example
Basic statistical description of data
-Purpose of basic statistical description of data:
-Three areas of basic statistical descriptions
1. Measuring the central tendency: mean, median and mode
2. Measuring the dispersion of data: range, quartiles, variance, standard deviation and interquartile range
-Five-number summary, boxplots and outliers
-Variance and standard deviation
3. Graphic displays of basic statistical descriptions of data
-Quantile plot
-Quantile-Quantile Plot
-Histogram
-Scatter plots and Data correlation
Measuring the central tendency: Mean, Median and Mode
-These measures locate the middle or center of a data distribution.
-Measures of central tendency include the mean, median, mode and midrange
-The most common and effective numeric measure is the ARITHMETIC MEAN, which helps to find the center of a set of data.
-The mean of a set of N values x1, x2, ..., xN: mean = (x1 + x2 + ... + xN) / N
Example: Mean
Weighted Arithmetic Mean or Weighted Average
Median
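-Trimmed mean: the mean obtained after chopping off values at the high and low extremes
-For asymmetric (skewed) data, a better measure of the center of the data is the median
-The median is expensive to compute when the number of observations is large
-For numeric attributes, its value can easily be approximated by interpolation
-Median interval: the interval that contains the median frequency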
Mode
-Another measure of central tendency
-The value that occurs most frequently in the set
-It can be used for both qualitative and quantitative attributes
-Unimodal, bimodal and trimodal: data sets with one, two and three modes
-Multimodal: a data set with two or more modes
Midrange
-It is the average of the largest and smallest values in the set
-This measure is easy to compute using the SQL aggregate functions max() and min()
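The measures of central tendency listed above can be tried out with a short Python sketch. The salary values are hypothetical illustration data, and `weighted_mean` and `midrange` are helper names introduced here, not part of the slides.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def midrange(values):
    """Average of the largest and smallest values in the set."""
    return (max(values) + min(values)) / 2

# Hypothetical salaries (in thousands), already sorted for readability.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

summary = {
    "mean": mean(salaries),
    "median": median(salaries),    # even count: average of the two middle values
    "mode": multimode(salaries),   # every most-frequent value; two here, so bimodal
    "midrange": midrange(salaries),
}
print(summary)
```

Note how the single large value 110 pulls the mean and midrange upward, while the median is less affected.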
Data mining techniques unit 2
Measuring the dispersion of data: range, quartiles, variance, standard deviation and interquartile range
-Range, quantiles, quartiles, percentiles and the interquartile range are measures of data dispersion
-Range: the difference between the largest (max()) and smallest (min()) values
-Quantiles: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets
-Quartiles: the three points (Q1, Q2, Q3) that split the data distribution into four equal parts
-Percentiles: the 100-quantiles, which divide the distribution into 100 equal-size sets
-Interquartile range: the distance between the first and third quartiles
IQR = Q3 - Q1
Five-Number Summary, Boxplots and Outliers
-The five-number summary: Minimum, Q1, Median, Q3, Maximum
-Boxplot: a popular way of visualizing a distribution
-It incorporates the five-number summary
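A minimal sketch of the five-number summary and IQR in Python. Quartile conventions differ between textbooks and tools; this one assumes the "inclusive" interpolation method of `statistics.quantiles`. The 1.5 x IQR outlier fence is a common rule of thumb and is an assumption here, since the slides do not state a rule. The data values are hypothetical.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, q2, q3 = quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3 (rule of thumb)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical salaries (in thousands).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(salaries))
print(iqr_outliers(salaries))
```

A boxplot draws the box from Q1 to Q3, marks the median inside it, and typically plots the flagged values individually.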
Variance and Standard Deviation
-A low standard deviation means that the data observations tend to be very close to the mean
-A high standard deviation indicates that the data are spread out over a large range of values
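As a worked illustration, the sketch below computes the population variance (the mean of squared deviations from the mean) and its square root, the standard deviation. The function names and both data sets are assumptions made for the example.

```python
from math import sqrt

def variance(values):
    """Population variance: (1/N) * sum((x_i - mean)^2)."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n

def std_dev(values):
    """Standard deviation: square root of the variance."""
    return sqrt(variance(values))

close_to_mean = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical, mean 5
spread_out = [-10, -5, 0, 5, 10, 15, 20]    # hypothetical, mean 5

print(variance(close_to_mean), std_dev(close_to_mean))
print(std_dev(spread_out))
```

Both sets have the same mean, but the second has a much larger standard deviation because its values lie farther from that mean.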
Graphic displays of basic statistical descriptions of data
-Quantile plot: shows a univariate data distribution
-It displays all of the data for the given attribute
-Quantile-quantile plot (q-q plot): it graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
Histograms, Scatter Plots and Data Correlation
-Histogram: it is a chart of poles (vertical bars), one per value or value range of the attribute
-When the attribute is nominal, the resulting graph is more commonly known as a bar chart
-Scatter plot: one of the most effective graphical methods for determining whether there is a relationship, pattern or trend between two numeric attributes
-It provides a first look at bivariate data
-Correlation: two attributes are correlated if one attribute implies the other
-Positive correlation, negative correlation, null correlation (uncorrelated)
Data Visualization
• Aims to communicate data clearly and effectively through graphical
representation
• Pixel oriented techniques
• Geometric projection visualization techniques
• Icon based visualization techniques
• Hierarchical visualization techniques
• Visualizing complex data and relations
Pixel-oriented techniques
• A simple way to visualize the value of a dimension
• The colour of the pixel reflects the dimension's value
• The data records can also be ordered in a query-dependent way
• Problem with a linear layout: a pixel is next to the one above it in the window, even though the two are not next to each other in the global order
• To solve this problem, the data records can be laid out along a space-filling curve to fill the window
• Circle segment technique: eases the comparison of dimensions
Geometric projection visualization techniques
• Drawback of pixel-oriented techniques: they cannot help much in understanding the distribution of data in a multidimensional space.
• Geometric projection: helps users find interesting projections of multidimensional data sets
• Scatter plot: displays 2-D data points; a third dimension can be added using different colours or shapes to represent different data points
• Scatter-plot matrix technique: for data sets with more than four dimensions
• Parallel coordinates: can handle higher dimensionality
Icon-based visualization techniques
• Small icons are used to represent multidimensional data values
• Two popular icon-based techniques: Chernoff faces and stick figures
• Chernoff faces: display multidimensional data of up to 18 variables or dimensions as cartoon faces
• Viewing large tables of data can be tedious
• Asymmetric Chernoff faces: double the number of facial characteristics, allowing up to 36 dimensions to be displayed
• Stick figure: maps multidimensional data to five-piece stick figures
Hierarchical visualization techniques
• They partition all dimensions into subsets (i.e., subspaces). The subspaces are visualized in a hierarchical manner
• Tree maps: display hierarchical data as a set of nested rectangles.
Visualizing complex data and relations
• Visualization techniques are also available for non-numeric data such as text and social networks.
• Tag cloud: a visualization of statistics of user-generated tags.
• A tag cloud can be built for a single item or for multiple items
• A disease influence graph visualizes the correlations between diseases
Measuring data similarity and dissimilarity
• Measures of proximity
• Two data structures: the data matrix and dissimilarity matrix
• Data matrix Vs Dissimilarity matrix
• Proximity measures for nominal attributes
• Proximity measures for binary attributes
• Dissimilarity of numeric data: Minkowski distance
• Proximity measures for ordinal attributes
• Dissimilarity for attributes of mixed types
• Cosine similarity
Data matrix Vs Dissimilarity matrix
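• Data matrix (or object-by-attribute structure): n objects described by p attributes
• Dissimilarity matrix (or object-by-object structure): the pairwise dissimilarities d(i, j) of the n objects
• sim(i, j) = 1 - d(i, j)
• Two-mode matrix: the data matrix, whose rows and columns represent different entities
• One-mode matrix: the dissimilarity matrix, whose rows and columns represent the same entity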
Proximity measures for nominal attributes
• The dissimilarity between two objects i and j can be computed based on the ratio of mismatches:
d(i, j) = (p - m) / p
m: the number of matches (i.e., the number of attributes for which i and j are in the same state)
p: the total number of attributes describing the objects
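The mismatch ratio d(i, j) = (p - m) / p can be sketched in a few lines of Python. The function name and the example objects are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects described by nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must have the same number of attributes")
    p = len(obj_i)                                        # total attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # matching attributes
    return (p - m) / p

# Hypothetical objects: (hair colour, marital status, grade)
alice = ("black", "single", "A")
bob = ("black", "married", "A")
print(nominal_dissimilarity(alice, bob))   # one mismatch out of three attributes
```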
Proximity measures for binary attributes
• Symmetric binary attributes
• Asymmetric binary dissimilarity
• Asymmetric binary similarity
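The three measures above are usually defined from a 2x2 contingency table of the two objects: q (both 1), r (1 in i, 0 in j), s (0 in i, 1 in j) and t (both 0). The sketch below uses these standard definitions; the function names and the example vectors are assumptions for illustration.

```python
def contingency(x, y):
    """Count q (1,1), r (1,0), s (0,1), t (0,0) over two binary vectors."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_binary_dissimilarity(x, y):
    q, r, s, t = contingency(x, y)
    return (r + s) / (q + r + s + t)

def asymmetric_binary_dissimilarity(x, y):
    # Negative matches (t) are treated as unimportant and left out.
    q, r, s, _ = contingency(x, y)
    return (r + s) / (q + r + s)

def asymmetric_binary_similarity(x, y):
    # Also known as the Jaccard coefficient: 1 - asymmetric dissimilarity.
    q, r, s, _ = contingency(x, y)
    return q / (q + r + s)

# Hypothetical test results for two patients (1 = positive, 0 = negative).
patient_a = [1, 0, 1, 0, 0, 0]
patient_b = [1, 0, 1, 0, 1, 0]
print(asymmetric_binary_dissimilarity(patient_a, patient_b))
```

The asymmetric variants are undefined when q + r + s is zero (two all-zero vectors); a real implementation would need to decide how to handle that case.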
Dissimilarity of numeric data: Minkowski
distance
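The Minkowski distance of order h generalizes the Manhattan (h = 1) and Euclidean (h = 2) distances, and the supremum distance is its limit as h grows. A minimal sketch, with hypothetical function names and points:

```python
def minkowski(x, y, h):
    """Minkowski distance: (sum |x_f - y_f|^h)^(1/h), for h >= 1."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """Supremum (Chebyshev) distance: the largest per-attribute difference."""
    return max(abs(a - b) for a, b in zip(x, y))

x1, x2 = (1, 2), (3, 5)    # hypothetical 2-D objects
print(minkowski(x1, x2, 1))   # Manhattan
print(minkowski(x1, x2, 2))   # Euclidean
print(supremum(x1, x2))
```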
Proximity measures for ordinal attributes
Dissimilarity for attributes of mixed types
Cosine similarity
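Cosine similarity compares two vectors by the angle between them: sim(x, y) = (x . y) / (||x|| ||y||). It is often used for term-frequency vectors of documents. The sketch below is illustrative; the two vectors are hypothetical term counts.

```python
from math import sqrt

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical term-frequency vectors for two documents.
doc1 = [5, 0, 3, 0, 2, 0, 0, 2, 0, 0]
doc2 = [3, 0, 2, 0, 1, 1, 0, 1, 0, 1]
print(round(cosine_similarity(doc1, doc2), 2))
```

A value close to 1 means the vectors point in nearly the same direction; 0 means they share no terms. The function is undefined for an all-zero vector.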
Data Preprocessing
• Need for data preprocessing
• Inaccurate data
• Incomplete data
• Inconsistent data
• Timeliness
• Believability
• Interpretability
Major Tasks in Data Preprocessing
• Data Cleaning
• Data Integration
• Data reduction
• Data Transformation
Data Cleaning
1. Missing values, 2. Noisy data, 3. Data cleaning as a process
MISSING VALUES
-Ignore the tuple
-Fill in the missing value manually
-Use a “global constant” to fill in the missing value
-Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value
-Use the attribute mean or median for all samples belonging to the same class as the given tuple
-Use the most probable value to fill in the missing value
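Two of the strategies above, filling with the attribute mean and filling with the mean of the tuple's own class, can be sketched as follows. Missing values are represented as `None`; the function names and data are hypothetical.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None with the mean of the known values of the attribute."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None with the mean of known values from the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_mean = {c: mean(vs) for c, vs in by_class.items()}
    return [class_mean[c] if v is None else v for v, c in zip(values, labels)]

income = [10, None, 30, None, 50]          # hypothetical attribute values
risk = ["low", "low", "high", "high", "high"]  # hypothetical class labels

print(fill_with_mean(income))
print(fill_with_class_mean(income, risk))
```

This sketch assumes every class has at least one known value; otherwise the class-mean lookup fails and another strategy would be needed.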
2. Noisy data: (i) Binning, (ii) Regression, (iii) Outlier analysis
NOISY DATA
• Binning
Equal-frequency bins
Smoothing by bin means
Smoothing by bin boundaries
Ex. sorted data for price (in dollars):
4, 8, 15, 21, 21, 24, 25, 28, 34
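The price example can be worked through in Python. This sketch assumes three equal-frequency bins of three values each; the function names are introduced here for illustration.

```python
def equal_frequency_bins(sorted_values, n_bins):
    """Split sorted data into n_bins bins holding the same number of values."""
    if len(sorted_values) % n_bins:
        raise ValueError("this sketch assumes the data divide evenly into bins")
    size = len(sorted_values) // n_bins
    return [sorted_values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = min(b), max(b)
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```

Ties between the two boundaries are resolved toward the lower boundary here; that choice is an assumption of the sketch.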
2. Noisy data: (ii) Regression, (iii) Outlier analysis
REGRESSION
• Linear Regression
• Multiple Regression
OUTLIER ANALYSIS
3. Data Cleaning as a Process: (i) Discrepancy detection, (ii) Data transformation
DISCREPANCY DETECTION
• Caused by several factors
• How can we proceed with discrepancy detection?
• Field overloading
• Rules to examine data
-Unique rule
-Consecutive rule
-Null rule
Tools to aid in the discrepancy detection step:
(i) Data scrubbing tools
(ii) Data auditing tools
3. Data Cleaning as a Process: (ii) Data transformation
DATA TRANSFORMATION
• Commercial tools for data transformation:
(i) Data migration tools
(ii) ETL (Extraction/Transformation/Loading) tools
• What is Potter's Wheel?
• What are declarative languages for data cleaning?
Data Integration: 1. Entity identification problem, 2. Redundancy and correlation analysis, 3. Tuple duplication
• Merging of data from multiple data sources
• Helps to reduce and avoid inconsistencies
• Helps to improve the accuracy and speed of the data mining process
Data Integration: 1. Entity identification problem
• Customer_id in DB1, Customer_number in DB2
• How can the data analyst be sure that they refer to the same attribute?
• Metadata includes the name, meaning, data type and range of values of each attribute
• Metadata helps to avoid errors in schema integration
• Pay type: coded as H/S in DB1 and as 1/2 in DB2
Data Integration: 2. Redundancy and correlation analysis
• REDUNDANT (e.g., DOB and AGE)
An attribute may be redundant if it can be derived from another attribute or set of attributes
• CORRELATION ANALYSIS
Some redundancies can be detected by correlation analysis
Given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data.
Types of analysis: (i) Nominal data: χ² (chi-square) test
(ii) Numeric data: correlation coefficient, covariance
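For two nominal attributes, the χ² statistic compares the observed counts in a contingency table with the counts expected if the attributes were independent: χ² = Σ (observed - expected)² / expected, where expected = (row total × column total) / n. The sketch below is illustrative; the survey counts are made up.

```python
def chi_square(observed):
    """Return (chi-square statistic, degrees of freedom) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected count under independence
            chi2 += (o - e) ** 2 / e
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Hypothetical survey: rows = preferred reading (fiction, non-fiction),
# columns = gender (male, female).
table = [[250, 200],
         [50, 1000]]
chi2, dof = chi_square(table)
print(round(chi2, 2), dof)
```

The statistic is then compared with the critical value from a chi-square distribution table for the given degrees of freedom and significance level; a larger statistic rejects the hypothesis that the attributes are independent.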
Data Integration: 2. Redundancy and correlation analysis
• Nominal attribute: e.g., hair color (black, brown, red, etc.); marital status (single, married, divorced, widowed)
• Numeric attribute: integer or real values
Chi-square distribution table
Correlation Coefficient for Numeric Data
• The correlation between two attributes A and B is evaluated by computing the correlation coefficient (also known as Pearson's product-moment coefficient)
Covariance of numeric data
• Correlation and covariance are two simple measures for assessing how
much two attributes change together.
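Both measures can be computed directly from their definitions: Cov(A, B) = E[(A - mean(A)) (B - mean(B))], and the correlation coefficient is the covariance divided by the product of the two standard deviations. The sketch below uses population (divide-by-n) formulas; the two price series are hypothetical.

```python
from math import sqrt

def covariance(a, b):
    """Population covariance of two equally long numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def correlation(a, b):
    """Pearson correlation coefficient, always in the range [-1, 1]."""
    return covariance(a, b) / sqrt(covariance(a, a) * covariance(b, b))

# Hypothetical weekly prices of two stocks.
stock_a = [6, 5, 4, 3, 2]
stock_b = [20, 10, 14, 5, 5]
print(covariance(stock_a, stock_b))
print(round(correlation(stock_a, stock_b), 3))
```

A positive covariance means the two attributes tend to rise together. Correlation rescales the covariance so that values near +1 or -1 signal a strong linear relationship and a possible redundancy.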
Tuple duplication
• Duplication should also be detected at the tuple level
• Denormalized tables are a source of data redundancy
• Data value conflict detection and resolution
• E.g., attributes for weight, hotels and schools may differ in representation, scaling or encoding across sources
Data Reduction
• Why data reduction?
• Data reduction strategies:
1)Dimensionality reduction
Wavelet Transforms
Principal Component Analysis(PCA)
Attribute Subset Selection
2)Numerosity Reduction
Parametric-Regression and Log linear models
Non-Parametric-Histograms, Clustering, Sampling, Data cube aggregation
3)Data Compression
Dimensionality Reduction
• What is dimensionality reduction?
• What is Curse of dimensionality?
• What is Wavelet Transform?
1.Discrete Wavelet Transform(DWT)
• DWT in data reduction
2.Discrete Fourier Transform(DFT)
DWT is closely related to DFT, a signal processing technique involving
sines and cosines
Difference between DWT and DFT
     DWT                                      DFT
1.   More accurate approximation              Less accurate approximation
2.   Requires less space                      Requires more space
3.   Several families of DWT (Haar-2,         Only one DFT
     Daubechies-4, Daubechies-6, etc.)
Hierarchical pyramid Algorithm
• Procedure for applying a discrete wavelet transform that halves the data at each iteration, resulting in fast computational speed.
Method:
1. The length L of the input data vector must be an integer power of 2.
2. Each transform involves applying two functions:
(i) The first applies some data smoothing (e.g., a sum or weighted average)
(ii) The second performs a weighted difference, which brings out the detailed features of the data
3. The two functions are applied to pairs of data points in X, resulting in two data sets of length L/2
4. The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2.
5. Selected values from the data sets obtained in the previous iterations are designated the wavelet coefficients of the transformed data.
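The pyramid idea can be illustrated with the simplest wavelet family, Haar. This is a sketch under stated assumptions, not a definitive implementation: it uses unnormalized pairwise averages as the smoothing function and pairwise half-differences as the difference function, and it recurses until a single overall average remains (one step further than the length-2 stopping point in the method above). The input vector is hypothetical.

```python
def haar_dwt(values):
    """Unnormalized Haar wavelet transform via repeated pairwise halving."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2")
    smooth = list(values)
    details = []
    while len(smooth) > 1:
        pairs = list(zip(smooth[0::2], smooth[1::2]))
        detail = [(a - b) / 2 for a, b in pairs]   # weighted difference
        smooth = [(a + b) / 2 for a, b in pairs]   # smoothing (average)
        details = detail + details                 # coarser details go first
    return smooth + details   # overall average, then detail coefficients

signal = [2, 2, 0, 2, 3, 5, 4, 4]   # hypothetical data vector, length 2^3
coefficients = haar_dwt(signal)
print(coefficients)
```

For data reduction, coefficients with small magnitude are set to zero and only the strongest ones are stored, giving a sparse approximation of the original vector.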
Matrix
• A matrix multiplication can be applied to the input data in order to
obtain the wavelet coefficients.
• Matrix must be orthogonal.
• Multidimensional Data.
Principal Components Analysis (also called the Karhunen-Loeve or K-L method)
• What is Principal Component Analysis?
• Procedure for PCA
1. Normalize the input data
2. Compute k orthonormal vectors (the principal components)
3. The principal components are sorted in order of decreasing “significance” or “strength”
4. The data size can be reduced by eliminating the weaker components.
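The procedure can be sketched in plain Python for the first (strongest) principal component only. This is a minimal illustration, assuming z-score normalization and power iteration on the covariance matrix; production code would normally rely on a linear-algebra library. All function names and the data are hypothetical.

```python
from math import sqrt

def standardize(column):
    """Step 1: normalize one attribute to zero mean and unit variance."""
    n = len(column)
    m = sum(column) / n
    sd = sqrt(sum((v - m) ** 2 for v in column) / n)  # assumes sd > 0
    return [(v - m) / sd for v in column]

def first_principal_component(rows, iterations=100):
    """Steps 2-3: the unit vector of greatest variance, found by power iteration."""
    cols = [standardize(list(c)) for c in zip(*rows)]
    n, d = len(rows), len(cols)
    cov = [[sum(a * b for a, b in zip(cols[i], cols[j])) / n for j in range(d)]
           for i in range(d)]
    v = [1.0] + [0.0] * (d - 1)   # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 2-D data in which the second attribute is twice the first.
data = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
pc1 = first_principal_component(data)
print(pc1)
```

Because the two attributes here are perfectly correlated, one component captures all of the variance, so the data could be reduced from two dimensions to one (step 4) without loss.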
Attribute Subset Selection
• Another way to reduce dimensionality of data.
• Irrelevant attributes
• GOAL:
1.To find a minimum set of attributes
2.Reduces the no. of attributes appearing in the discovered pattern
• How can we find a ‘good’ subset of the original attributes?
Greedy Methods
• What is Heuristic search?
• Basic heuristic methods:
1.Stepwise forward selection
Initial attribute set:{a1,a2,a3,a4,a5,a6}
Initial reduced set:{}
=>{a1}
=>{a1,a4}
=>{a1,a4,a6} Reduced attribute set
Basic Heuristic Method
2.Stepwise backward elimination:
Initial attribute set:{a1,a2,a3,a4,a5,a6}
=>{a1,a3,a4,a5,a6}
=>{a1,a3,a5,a6}
=>{a1,a3,a6}
3.Combination of forward selection and backward elimination
4.Decision tree induction
• Discrete transform in attribute subset selection
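Stepwise forward selection can be sketched as a greedy loop: start from an empty set and repeatedly add the attribute that improves an evaluation score the most, stopping when no attribute helps. The `score` callback and the relevance weights below are hypothetical stand-ins; in practice the score might come from a statistical significance test or a classifier's accuracy.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy stepwise forward selection driven by a user-supplied score function."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Evaluate each candidate when added to the current reduced set.
        candidates = [(score(selected + [a]), a) for a in remaining]
        top_score, top_attr = max(candidates, key=lambda pair: pair[0])
        if top_score - best <= min_gain:
            break   # no remaining attribute improves the score
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected

# Hypothetical relevance weights: only a1, a4 and a6 carry information.
relevance = {"a1": 3, "a2": 0, "a3": 0, "a4": 2, "a5": 0, "a6": 1}

def toy_score(subset):
    return sum(relevance[a] for a in subset)

reduced = forward_selection(["a1", "a2", "a3", "a4", "a5", "a6"], toy_score)
print(reduced)
```

Backward elimination follows the same pattern in reverse: start with the full set and repeatedly drop the attribute whose removal hurts the score the least.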
Numerosity Reduction: (i) Parametric data reduction
• Regression and log-linear models
• Regression can be used to approximate the given data.
1. Linear (simple) regression:
Data are modelled to fit a straight line.
y = wx + b, where x and y are numeric database attributes, and w and b are the regression coefficients. These coefficients can be solved for by the method of least squares.
2. Multiple linear regression:
Allows a response variable y to be modelled as a linear function of two or more predictor variables.
y = b0 + b1x1 + b2x2
• Log-linear models: approximate discrete multidimensional probability distributions.
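For simple linear regression, the least-squares coefficients have a closed form: w = Σ(x_i - mean(x))(y_i - mean(y)) / Σ(x_i - mean(x))², and b = mean(y) - w · mean(x). A minimal sketch with hypothetical data:

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the model y = w*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    w = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - w * mean_x
    return w, b

xs = [1, 2, 3, 4]   # hypothetical predictor values
ys = [3, 5, 7, 9]   # hypothetical responses, exactly y = 2x + 1
w, b = fit_line(xs, ys)
print(w, b)
```

For data reduction, only the two coefficients need to be stored in place of the original data points.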
Numerosity Reduction: (ii) Non-parametric data reduction
• Histograms use binning to approximate data distributions and are a popular form of data reduction.
(i) Equal-width: the width of each bucket range is uniform
(ii) Equal-frequency (or equal-depth): the frequency of each bucket is roughly constant
Histogram: equal-frequency, equal-width
Numerosity Reduction: (ii) Non-parametric data reduction
• Clustering: considers data tuples as objects and groups similar objects into clusters
- Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid
• Sampling: allows a large data set to be represented by a much smaller random data sample (or subset)
- Simple random sample without replacement (SRSWOR) of size s: all tuples are equally likely to be sampled
- Simple random sample with replacement (SRSWR) of size s: similar to SRSWOR, except that after a tuple is drawn, it is placed back in D so that it may be drawn again.
- Cluster sample: the tuples in D are grouped into M mutually disjoint “clusters”, and a simple random sample of the clusters is taken
- Stratified sample: D is divided into mutually disjoint parts called strata, and a simple random sample is drawn from each stratum. This helps to ensure a representative sample, especially when the data are skewed.
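Three of the sampling schemes above can be sketched with Python's standard `random` module. The function names, the data set D and the age-based strata are hypothetical; a seeded generator is passed in so that runs are repeatable.

```python
import random

def srswor(data, s, rng):
    """Simple random sample without replacement of size s."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample with replacement of size s (tuples may repeat)."""
    return rng.choices(data, k=s)

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw an SRSWOR of roughly the same fraction from every stratum."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))   # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)
D = list(range(100))   # hypothetical customer ages 0..99

def age_group(age):
    return "senior" if age >= 90 else "other"   # a small, skewed stratum

without = srswor(D, 10, rng)
with_repl = srswr(D, 10, rng)
stratified = stratified_sample(D, age_group, 0.1, rng)
print(without, with_repl, stratified, sep="\n")
```

The stratified sample is guaranteed to contain a member of the small "senior" group, which a plain random sample of the same size could easily miss.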
Data Cube Aggregation
• The cube created at the lowest abstraction level is referred to as the base cuboid.
• A cube at the highest level of abstraction is the apex cuboid.
Data Compression
• Transformations are applied so as to obtain a reduced or “compressed” representation of the original data.
• If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless.
• If we can reconstruct only an approximation of the original data, the data reduction is called lossy.
• The dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression
Data Transformation and Data Discretization
• The data are transformed or consolidated so that the resulting mining
process may be more efficient and the patterns found may be easier to
understand.
• Data transformation strategies overview:
Smoothing
Attribute Construction
Aggregation
Normalization
Discretization
Concept hierarchy generation for nominal data
Data Transformation by Normalization
• Min-Max Normalization
• Z-score Normalization
• Decimal Scaling
Min-Max Normalization
Min-Max Normalization
Z-score Normalization
Decimal Scaling
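The three normalization methods can be written as one-line formulas: min-max maps v to (v - min) / (max - min) × (new_max - new_min) + new_min; z-score maps v to (v - mean) / standard deviation; decimal scaling divides by the smallest power of 10 that brings every absolute value below 1. The sketch below is illustrative, and the income figures are hypothetical.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly rescale v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std):
    """Number of standard deviations that v lies from the mean."""
    return (v - mean) / std

def decimal_scaling(values):
    """Divide by 10^j, with j the smallest integer giving max(|v'|) < 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Hypothetical income attribute: min 12,000, max 98,000, mean 54,000, std 16,000.
print(round(min_max(73600, 12000, 98000), 3))
print(z_score(73600, 54000, 16000))
print(decimal_scaling([-986, 917]))
```

Min-max normalization preserves the relationships among the original values but fails on future values outside the original range; z-score normalization is useful when the actual minimum and maximum are unknown or when outliers dominate the range.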
Discretization
• Discretization by binning
• Discretization by histogram analysis
• Discretization by cluster, decision tree and correlation analyses
Concept hierarchy generation for nominal data
• Specification of a partial ordering of attributes explicitly at the schema level by users or experts
• Specification of a portion of a hierarchy by explicit data grouping
• Specification of a set of attributes but not of their partial ordering
• Specification of only a partial set of attributes
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
ajayrm685
 
Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control Monthly May 2025Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control Monthly May 2025
Water Industry Process Automation & Control
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...
JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...
JRR Tolkien’s Lord of the Rings: Was It Influenced by Nordic Mythology, Homer...
Reflections on Morality, Philosophy, and History
 
Agents chapter of Artificial intelligence
Agents chapter of Artificial intelligenceAgents chapter of Artificial intelligence
Agents chapter of Artificial intelligence
DebdeepMukherjee9
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 
Modeling the Influence of Environmental Factors on Concrete Evaporation Rate
Modeling the Influence of Environmental Factors on Concrete Evaporation RateModeling the Influence of Environmental Factors on Concrete Evaporation Rate
Modeling the Influence of Environmental Factors on Concrete Evaporation Rate
Journal of Soft Computing in Civil Engineering
 
Nanometer Metal-Organic-Framework Literature Comparison
Nanometer Metal-Organic-Framework  Literature ComparisonNanometer Metal-Organic-Framework  Literature Comparison
Nanometer Metal-Organic-Framework Literature Comparison
Chris Harding
 
SICPA: Fabien Keller - background introduction
SICPA: Fabien Keller - background introductionSICPA: Fabien Keller - background introduction
SICPA: Fabien Keller - background introduction
fabienklr
 
Artificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptxArtificial intelligence and machine learning.pptx
Artificial intelligence and machine learning.pptx
rakshanatarajan005
 
Slide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptxSlide share PPT of SOx control technologies.pptx
Slide share PPT of SOx control technologies.pptx
vvsasane
 
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdfML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
ML_Unit_VI_DEEP LEARNING_Introduction to ANN.pdf
rameshwarchintamani
 
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
01.คุณลักษณะเฉพาะของอุปกรณ์_pagenumber.pdf
PawachMetharattanara
 
introduction technology technology tec.pptx
introduction technology technology tec.pptxintroduction technology technology tec.pptx
introduction technology technology tec.pptx
Iftikhar70
 
acid base ppt and their specific application in food
acid base ppt and their specific application in foodacid base ppt and their specific application in food
acid base ppt and their specific application in food
Fatehatun Noor
 
Control Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptxControl Methods of Noise Pollutions.pptx
Control Methods of Noise Pollutions.pptx
vvsasane
 
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdf
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdfSmart City is the Future EN - 2024 Thailand Modify V1.0.pdf
Smart City is the Future EN - 2024 Thailand Modify V1.0.pdf
PawachMetharattanara
 
twin tower attack 2001 new york city
twin  tower  attack  2001 new  york citytwin  tower  attack  2001 new  york city
twin tower attack 2001 new york city
harishreemavs
 
Lecture - 7 Canals of the topic of the civil engineering
Lecture - 7  Canals of the topic of the civil engineeringLecture - 7  Canals of the topic of the civil engineering
Lecture - 7 Canals of the topic of the civil engineering
MJawadkhan1
 
Design of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdfDesign of Variable Depth Single-Span Post.pdf
Design of Variable Depth Single-Span Post.pdf
Kamel Farid
 
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
sss1.pptxsss1.pptxsss1.pptxsss1.pptxsss1.pptx
ajayrm685
 
Applications of Centroid in Structural Engineering
Applications of Centroid in Structural EngineeringApplications of Centroid in Structural Engineering
Applications of Centroid in Structural Engineering
suvrojyotihalder2006
 
Agents chapter of Artificial intelligence
Agents chapter of Artificial intelligenceAgents chapter of Artificial intelligence
Agents chapter of Artificial intelligence
DebdeepMukherjee9
 
Frontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend EngineersFrontend Architecture Diagram/Guide For Frontend Engineers
Frontend Architecture Diagram/Guide For Frontend Engineers
Michael Hertzberg
 

Data mining techniques unit 2

  • 2. Data objects and Attributes Types • What is data object? • What is an attribute? -observations -attribute vector or feature vector -univariate, bivariate • Types of attributes -Nominal -Binary -Ordinal -Numeric -Discrete Vs Continuous Attributes
  • 3. Nominal Attributes • Relating to name • Symbols or name of things • Categorical or nominal • Enumerations • Example
  • 4. Binary Attribute • What is binary attribute ? • What is Boolean attribute? • Symmetric • Asymmetric • Example
  • 5. Ordinal Attribute • What is ordinal attribute? • Example
  • 6. Numeric Attributes • What are numeric attributes? • Interval-scaled Attributes • Ratio-scaled Attributes • Example
  • 7. Discrete Vs Continuous Attributes • What are discrete and continuous attributes? • Example
  • 8. Basic statistical description of data -Purpose of basic statistical description of data: -Three areas of basic statistical descriptions 1.Measuring the central Tendency: Mean, Median and Mode 2.Measuring the dispersion of data: Range, Quartiles, Variance, Standard Deviation and Interquartile range -Five-Number Summary, Boxplots and Outliers -Variance and Standard Deviation 3.Graphic Displays of Basic Statistical Descriptions of Data -Quantile plot -Quantile-Quantile Plot -Histogram -Scatter plots and Data correlation
  • 9. Measuring the central Tendency: Mean, Median and Mode -Measures of central tendency locate the middle or center of a data distribution. -They include the mean, median, mode and midrange -The most common and effective numeric measure is the ARITHMETIC MEAN, which helps to find the center of a set of data. -The mean of a set of N values x1, x2, ..., xN: mean = (x1 + x2 + ... + xN) / N
  • 11. Weighted Arithmetic Mean or Weighted Average
  • 12. Median -Trimmed Mean -For asymmetric or skewed data, the median is a better measure of the center of the data -The median is expensive to compute for a large number of observations -For numeric attributes, the value can easily be approximated -Median Interval
  • 13. Mode -Another measure of central tendency -The value that occurs most frequently in the set -It can be used in qualitative and quantitative attributes -unimodal, bimodal and trimodal -multimodal
  • 14. Mid range -It is the average of the largest and smallest values in the set -This measure is easy to compute using the SQL aggregate functions, Max(),Min()
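The four measures of central tendency listed on these slides (mean, median, mode, midrange) can be sketched with Python's standard `statistics` module. The `central_tendency` helper and the `salaries` list are illustrative assumptions, not part of the slides.

```python
from statistics import mean, median, multimode

def central_tendency(values):
    """Return the mean, median, mode(s), and midrange of a numeric list."""
    return {
        "mean": mean(values),
        "median": median(values),
        "modes": multimode(values),  # every most-frequent value (bimodal data gives two)
        "midrange": (min(values) + max(values)) / 2,
    }

# Illustrative data (an assumption, not taken from the slides).
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
summary = central_tendency(salaries)
print(summary)
```

In this sample the single large value (110) pulls the mean above the median, which is the skew effect the Median slide refers to.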
  • 16. Measuring the dispersion of data: range, quartiles, variance, standard deviation and interquartile range -Range, quantiles, quartiles, percentiles and the interquartile range are measures of data dispersion -Range: the difference between the largest (max()) and smallest (min()) values -Quantiles: points taken at regular intervals of a data distribution, dividing it into essentially equal-size consecutive sets -Quartiles: the three points that split the data distribution into four equal parts, each holding one-fourth of the data -Percentiles: the 100-quantiles are commonly referred to as percentiles -Interquartile range: the distance between the first and third quartiles, IQR = Q3 - Q1
  • 18. Five –Number Summary,BoxPlots and Outliers -The Five-Number summary: Minimum,Q1,Median,Q3, Maximum -Box Plot: a popular way of visualizing a distribution -It incorporates the five – number summary
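A minimal sketch of the five-number summary and an IQR-based outlier check. Quartile conventions differ between textbooks and tools; this sketch assumes the "inclusive" interpolation method of `statistics.quantiles`, and it assumes the common rule of thumb that flags values more than 1.5 × IQR beyond Q1 or Q3. The function names and data are illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)

def iqr_outliers(values, k=1.5):
    """Values lying more than k * IQR below Q1 or above Q3 (assumed rule of thumb)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Illustrative data (an assumption, not taken from the slides).
data = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
print(five_number_summary(data))
print(iqr_outliers(data))
```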
  • 19. Variance and Standard Deviation -A low standard deviation means that the data observations tend to be very close to the mean -A high standard deviation indicates that the data are spread out over a large range of values
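To illustrate the low versus high standard deviation contrast, the sketch below compares two made-up samples that share the same mean. It uses the population formulas (`pvariance`, `pstdev`); whether the slides intend the population or sample version is not stated, so that choice is an assumption.

```python
from statistics import mean, pstdev, pvariance

# Two illustrative samples with the same mean but different spread.
tight = [9, 10, 11]
spread = [0, 10, 20]

for name, sample in (("tight", tight), ("spread", spread)):
    print(name, mean(sample), pvariance(sample), pstdev(sample))
```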
  • 20. Graphic displays of basic statistical descriptions of data -Quantile plot: univariate data distribution -It displays all of the data for the given attribute -Quantile-Quantile plot or q-q plot : It graphs the quantiles of one univariate distribution against the corresponding quantiles of another.
  • 21. Histograms, Scatter Plots and Data Correlation -Histogram: a chart of poles (bars) -When the attribute is nominal, the resulting graph is more commonly known as a bar chart -Scatter plot: one of the most effective graphical methods for determining whether there is a relationship, pattern or trend between two numeric attributes -It represents bivariate data -Correlation: whether one attribute implies the other -Positive correlation, Negative Correlation, Null Correlation
  • 22. Data Visualization • Aims to communicate data clearly and effectively through graphical representation • Pixel oriented techniques • Geometric projection visualization techniques • Icon based visualization techniques • Hierarchical visualization techniques • Visualizing complex data and relations
  • 23. Pixel oriented techniques • To visualize the value of a dimension • The colour of the pixel reflects the dimension’s value • The data records can also be ordered in a query-dependent way • Problem with a linear layout: a pixel is next to the one above it in the window, even though the two are not adjacent in the global order • To solve this problem, the data records can be laid out along a space-filling curve to fill the window • Circle segment: eases comparison of dimensions
  • 25. Geometric projection visualization techniques • Drawback of pixel oriented techniques: they cannot help much in understanding the distribution of data in multidimensional space. • Geometric projection: helps users find interesting projections of multidimensional data sets • Scatter plot: 2D data points, with a third dimension added using different colours or shapes to represent different data points • Scatter plot matrix technique: for datasets with more than four dimensions • Parallel coordinates: can handle higher dimensionality
  • 27. Icon - based visualization techniques • Small icons used to represent multidimensional data values • Two popular icon-based techniques: Chernoff faces and stick figures • Chernoff faces: it displays multidimensional data up to 18 variables or dimensions as cartoon faces • Viewing for large tables of data is tedious • Asymmetric Chernoff: double the no. of facial characteristics, allowing up to 36 dimensions to be displayed • Stick figure: maps multidimensional data to five-piece stick figures
  • 29. Hierarchical visualization techniques • They partition all dimensions into subsets (i.e., subspaces). The subspaces are visualized in a hierarchical manner • Tree maps: display hierarchical data as a set of nested rectangles.
  • 30. Visualizing complex data and relations • For non-numeric data such as text and social networks –visualization techniques are available. • Tag cloud: visualization of statistics of user generated tags. • Tag cloud can be done for single and multiple items • Disease influence graph provides the visualization of correlations between diseases
  • 32. Measuring data similarity and dissimilarity • Measures of proximity • Two data structures: the data matrix and dissimilarity matrix • Data matrix Vs Dissimilarity matrix • Proximity measures for nominal attributes • Proximity measures for binary attributes • Dissimilarity of numeric data: Minkowski distance • Proximity measures for ordinal attributes • Dissimilarity for attributes of mixed types • Cosine similarity
  • 33. Data matrix Vs Dissimilarity matrix • Data Matrix(or object-by-attribute structure) • Dissimilarity matrix ( or object-by- object structure) • Sim(i , j)=1-d(i , j) • Two - mode matrix • One - mode matrix
  • 34. Proximity measures for nominal attributes • The dissimilarity between two objects i and j can be computed based on the ratio of mismatches: d(i,j) = (p - m)/p, where m is the number of matches (the number of attributes for which i and j are in the same state) and p is the total number of attributes describing the objects.
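A small sketch of the mismatch ratio for nominal attributes, reading the formula as d(i, j) = (p − m) / p. The function name and the two example objects are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Ratio of mismatching attributes between two objects of equal length."""
    if len(obj_i) != len(obj_j):
        raise ValueError("objects must be described by the same attributes")
    p = len(obj_i)                                        # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # number of matches
    return (p - m) / p

# Hypothetical objects described by (hair colour, marital status, occupation).
person_a = ("black", "single", "teacher")
person_b = ("black", "married", "teacher")
d_ab = nominal_dissimilarity(person_a, person_b)
print(d_ab)
```

The corresponding similarity can be taken as 1 − d(i, j), as on the data matrix slide.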
  • 36. Proximity measures for binary attributes • Symmetric binary attributes • Asymmetric binary dissimilarity • Asymmetric binary similarity
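The slide only names the three binary measures. The sketch below assumes the standard contingency-count definitions: q counts attributes where both objects are 1, r where only i is 1, s where only j is 1, and t where both are 0. The function names and example vectors are illustrative, and the asymmetric measures assume at least one attribute is 1 in either object.

```python
def binary_counts(i, j):
    """Return (q, r, s, t) for two equal-length 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def symmetric_dissimilarity(i, j):
    q, r, s, t = binary_counts(i, j)
    return (r + s) / (q + r + s + t)

def asymmetric_dissimilarity(i, j):
    q, r, s, _ = binary_counts(i, j)      # 0/0 matches are ignored
    return (r + s) / (q + r + s)

def jaccard_similarity(i, j):
    q, r, s, _ = binary_counts(i, j)      # asymmetric binary similarity
    return q / (q + r + s)

# Hypothetical test results for two patients (1 = positive, 0 = negative).
patient_i = [1, 0, 1, 0, 0, 0]
patient_j = [1, 0, 1, 0, 1, 0]
print(symmetric_dissimilarity(patient_i, patient_j))
print(asymmetric_dissimilarity(patient_i, patient_j))
print(jaccard_similarity(patient_i, patient_j))
```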
  • 38. Dissimilarity of numeric data: Minkowski distance
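The Minkowski distance generalizes the Manhattan (h = 1) and Euclidean (h = 2) distances; the supremum distance is its limit as h grows. The sketch below is a plain implementation under those standard definitions, with made-up example points.

```python
def minkowski(x, y, h):
    """Minkowski distance of order h (h >= 1) between two numeric vectors."""
    return sum(abs(a - b) ** h for a, b in zip(x, y)) ** (1 / h)

def supremum(x, y):
    """Limit of the Minkowski distance as h goes to infinity."""
    return max(abs(a - b) for a, b in zip(x, y))

# Illustrative points (an assumption, not taken from the slides).
x1, x2 = (1, 2), (3, 5)
print(minkowski(x1, x2, 1))   # Manhattan
print(minkowski(x1, x2, 2))   # Euclidean
print(supremum(x1, x2))
```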
  • 41. Proximity measures for ordinal attributes
  • 47. Data Pre processing • Need for data pre processing • Data Inaccurate • Incomplete data • Inconsistent data • Timeliness • Believability • Interpretability
  • 48. Major Tasks in Data Pre processing • Data Cleaning • Data Integration • Data reduction • Data Transformation
  • 49. Data Cleaning 1.Missing Values, 2.Noisy Data, 3.Data Cleaning as a Process MISSING VALUES -Ignore the tuple -Fill in the missing value manually -Use a “global constant” to fill in the missing value -Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the missing value -Use the attribute mean or median for all samples belonging to the same class as the given tuple -Use the most probable value to fill in the missing value
  • 50. 2.Noisy data- (i)Binning,(ii)Regression,(iii)Outlier analysis NOISY DATA • Binning Equal-frequency bins Smoothing by bin means Smoothing by bin boundaries Ex. sorted data for price(in dollars) 4,8,15,21,21,24,25,28,34
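The binning steps can be sketched on the sorted price data from the slide. The bin size of 3 is an assumption (the slide does not state it), and the helper names are illustrative.

```python
def equal_frequency_bins(sorted_values, bin_size):
    """Split already-sorted values into consecutive bins of bin_size items."""
    return [sorted_values[i:i + bin_size]
            for i in range(0, len(sorted_values), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's min and max."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]              # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]    # sorted data from the slide
bins = equal_frequency_bins(prices, 3)          # bin size 3 is an assumption
print(bins)
print(smooth_by_means(bins))
print(smooth_by_boundaries(bins))
```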
  • 52. 2.Noisy data-(ii)Regression,(iii)Outlier Analysis REGRESSION • Linear Regression • Multiple Regression OUTLIER ANALYSIS
  • 53. 3.Data Cleaning as a Process(i)Discrepancy detection (ii)Data Transformation DISCREPANCY DETECTION • Caused by several factors • How can we proceed with discrepancy detection? • Field overloading • Rules to examine data -Unique rule -Consecutive rule -Null rule Tools to aid in the discrepancy detection step: (i) Data scrubbing tools (ii) Data auditing tools
  • 54. 3.Data Cleaning as a Process (ii)Data Transformation DATA TRANSFORMATION • Commercial tools for data transformation: (i)Data migration tools (ii)ETL(Extraction/Transformation/Loading)tools • What is potter’s wheel? • What are declarative languages for data cleaning?
  • 55. Data Integration-1.Entity Identification Problem ,2.Redundancy and correlation analysis,3.Tuple Duplication • Merging of data from multiple data sources • Helps to reduce and avoid inconsistencies • Helps to improve accuracy and speed of the datamining process
  • 56. Data Integration-1.Entity Identification Problem • Customer_id in DB1, Customer_number in DB2 • How can the data analyst be sure that both refer to the same attribute? • Metadata includes the name, meaning, data type and range of values of each attribute • Metadata helps to avoid errors in schema integration • Pay type: H/S in DB1, 1/2 in DB2
  • 57. Data Integration- 2.Redundancy and correlation analysis • REDUNDANT (DOB & AGE): an attribute may be redundant if it can be derived from another attribute or set of attributes • CORRELATION ANALYSIS: some redundancies can be detected by correlation analysis. Given two attributes, correlation analysis can measure how strongly one attribute implies the other, based on the available data. Types of analysis: (i) Nominal data: χ² (chi-square) test (ii) Numeric data: correlation coefficient, covariance
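For nominal data, the chi-square statistic compares observed counts in a contingency table with the counts expected if the two attributes were independent. The sketch below assumes the standard Pearson chi-square formula; the contingency table is illustrative and not taken from the slides.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n    # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2

# Illustrative 2x2 table: rows = preferred reading (fiction, non-fiction),
# columns = gender (male, female).
table = [[250, 200],
         [50, 1000]]
statistic = chi_square(table)
print(round(statistic, 2))
```

The statistic is then compared against a chi-square critical value for (rows − 1) × (columns − 1) degrees of freedom; a large value suggests the attributes are correlated.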
  • 58. Data Integration- 2.Redundancy and correlation analysis • Nominal attribute-eg.,Hair color-black,brown,red etc., eg.,Marital status-single,married,divorced,widowed • Numeric attribute-Integer or real values
  • 61. Correlation Coefficient for Numeric Data • The correlation between two attributes A and B is evaluated by computing the correlation coefficient (also known as Pearson's product moment coefficient)
  • 63. Covariance of numeric data • Correlation and covariance are two simple measures for assessing how much two attributes change together.
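Covariance and the Pearson correlation coefficient can be computed directly from their definitions. The sketch below uses the population (divide by n) form, which is an assumption since the slide formulas were not captured in the transcript; the two price series are illustrative.

```python
def covariance(a, b):
    """Population covariance of two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    return covariance(a, b) / (covariance(a, a) ** 0.5 * covariance(b, b) ** 0.5)

# Illustrative stock prices observed at five time points (an assumption).
stock_a = [6, 5, 4, 3, 2]
stock_b = [20, 10, 14, 5, 5]
print(covariance(stock_a, stock_b))
print(pearson(stock_a, stock_b))
```

A positive covariance means the two attributes tend to rise together; the correlation coefficient rescales it to the range −1 to 1.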
  • 65. Tuple duplication • Duplication should also be detected at the tuple level • Denormalized tables are a source of data redundancy • Data value conflict detection and resolution • E.g., weight, hotel, schools, attributes
  • 66. Data Reduction • Why data reduction? • Data reduction strategies: 1)Dimensionality reduction Wavelet Transforms Principal Component Analysis(PCA) Attribute Subset Selection 2)Numerosity Reduction Parametric-Regression and Log linear models Non-Parametric-Histograms, Clustering, Sampling, Data cube aggregation 3)Data Compression
  • 67. Dimensionality Reduction • What is dimensionality reduction? • What is Curse of dimensionality? • What is Wavelet Transform? 1.Discrete Wavelet Transform(DWT) • DWT in data reduction 2.Discrete Fourier Transform(DFT) DWT is closely related to DFT, a signal processing technique involving sines and cosines
  • 68. Difference between DWT and DFT 1.Accuracy: DWT gives a more accurate approximation; DFT is less accurate 2.Space: DWT requires less space; DFT requires more space 3.Families: several families of DWT (Haar-2, Daubechies-4, Daubechies-6, etc.); only one DFT
  • 69. Hierarchical pyramid Algorithm • Procedure for applying a discrete wavelet transform that halves the data at each iteration, resulting in fast computational speed. Method: 1.The length L of the input data vector must be an integer power of 2. 2.Each transform involves applying two functions: (i) one applies data smoothing (ii) the other performs a weighted difference 3.The two functions are applied to pairs of data points in X 4.The two functions are recursively applied to the data sets obtained in the previous loop, until the resulting data sets are of length 2. 5.Selected values from the data sets obtained in the previous iterations are the wavelet coefficients of the transformed data.
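A minimal sketch of the pyramid procedure using the Haar wavelet, where the smoothing function is the pairwise average and the weighted difference is half the pairwise difference. The choice of Haar and of unnormalized averages is an assumption, and this sketch keeps halving until a single overall average remains, a common convention that differs slightly from the stopping rule in step 4.

```python
def haar_pyramid(data):
    """Return [overall average] followed by detail coefficients, coarsest first."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("input length must be a power of 2")
    current = [float(v) for v in data]
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        smooth = [(a + b) / 2 for a, b in pairs]   # data smoothing
        detail = [(a - b) / 2 for a, b in pairs]   # weighted difference
        details = detail + details                 # coarser levels go in front
        current = smooth                           # recurse on the halved data
    return current + details

signal = [2, 2, 0, 2, 3, 5, 4, 4]    # illustrative input of length 8
coefficients = haar_pyramid(signal)
print(coefficients)
```

Data reduction then comes from keeping only the largest coefficients and setting the rest to zero.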
  • 70. Matrix • A matrix multiplication can be applied to the input data in order to obtain the wavelet coefficients. • Matrix must be orthogonal. • Multidimensional Data.
  • 71. Principal Components Analysis(also called as Karhunen-Loeve or K-L method) • What is Principal Component Analysis? • Procedure for PCA 1.Normalize input data 2.Compute “K” orthonormal vectors 3.Principal components are sorted in order of decreasing “significance” or “strength” 4.The data size can be reduced by eliminating the weaker components.
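The PCA procedure can be sketched without external libraries by finding only the strongest component. This sketch assumes that normalization means mean-centering, and it uses power iteration on the covariance matrix, which is one of several ways to obtain the leading orthonormal vector; a production implementation would normally rely on a linear algebra library. All names and data are illustrative.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (means, unit vector) for the direction of greatest variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    v = [1.0 / (j + 1) for j in range(d)]      # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                        # no variance along the current vector
            break
        v = [x / norm for x in w]
    return means, v

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]

points = [(1, 1), (2, 2), (3, 3), (4, 4)]      # illustrative 2-D data on a line
means, pc1 = first_principal_component(points)
reduced = project(points, means, pc1)
print(pc1, reduced)
```

Because the sample points lie on a line, one component captures all the variance and the second dimension can be dropped without loss.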
  • 72. Attribute Subset Selection • Another way to reduce dimensionality of data. • Irrelevant attributes • GOAL: 1.To find a minimum set of attributes 2.Reduces the no. of attributes appearing in the discovered pattern • How can we find a ‘good’ subset of the original attributes?
  • 73. Greedy Methods • What is Heuristic search? • Basic heuristic methods: 1.Stepwise forward selection Initial attribute set:{a1,a2,a3,a4,a5,a6} Initial reduced set:{} =>{a1} =>{a1,a4} =>{a1,a4,a6} Reduced attribute set
  • 74. Basic Heuristic Method 2.Stepwise backward elimination: Initial attribute set:{a1,a2,a3,a4,a5,a6} =>{a1,a3,a4,a5,a6} =>{a1,a3,a5,a6} =>{a1,a3,a6} 3.Combination of forward selection and backward elimination 4.Decision tree induction • Discrete transform in attribute subset selection
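Stepwise forward selection can be sketched as a greedy loop that keeps adding the attribute giving the best score until no attribute improves it. The scoring function is the part the slides leave open; `toy_score` and the `usefulness` values below are purely hypothetical stand-ins for a real measure such as information gain or model accuracy.

```python
def forward_selection(attributes, score):
    """Greedily grow an attribute subset while the score keeps improving."""
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best:          # no attribute improves the subset
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected

# Hypothetical per-attribute usefulness; each attribute also costs 1 to include.
usefulness = {"a1": 5.0, "a2": 0.5, "a3": 0.5, "a4": 4.0, "a5": 0.5, "a6": 3.0}

def toy_score(subset):
    return sum(usefulness[a] for a in subset) - len(subset)

reduced_set = forward_selection(["a1", "a2", "a3", "a4", "a5", "a6"], toy_score)
print(reduced_set)
```

With these made-up scores the loop reproduces the sequence shown on the slide. Backward elimination is the mirror image: start from the full set and repeatedly drop the attribute whose removal helps most.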
  • 75. Numerosity Reduction-(i)Parametric data reduction • Regression & log-linear models • Regression can be used to approximate the given data. 1.Linear (simple) Regression: data are modelled to fit a straight line, y = wx + b, where x and y are numeric database attributes and w and b are regression coefficients. These coefficients are solved for by the method of least squares. 2.Multiple Linear Regression: allows a response variable y to be modelled as a linear function of two or more predictor variables, y = b0 + b1x1 + b2x2 • Log-Linear Models: approximate discrete multidimensional probability distributions.
  • 76. Numerosity Reduction-(ii)Non-parametric data reduction • Histograms use binning to approximate data distributions and are a popular form of data reduction. (i)Equal-width: the width of each bucket range is uniform (ii)Equal-frequency (or equal-depth): the frequency of each bucket is constant
  • 78. Numerosity Reduction-(ii)Non-parametric data reduction • Clustering: considers data tuples as objects and groups similar objects - Centroid distance is an alternative measure of cluster quality and is defined as the average distance of each cluster object from the cluster centroid • Sampling: allows a large data set to be represented by a much smaller random data sample (or subset) - Simple random sample without replacement of size s (SRSWOR): all tuples are equally likely to be sampled - Simple random sample with replacement of size s (SRSWR): similar to SRSWOR, except that after a tuple is drawn, it is placed back in D so that it may be drawn again. - Cluster sample: the tuples in D are grouped into M mutually disjoint “clusters” - Stratified sample: D is divided into mutually disjoint parts called strata, and a sample is drawn from each stratum. This helps to ensure a representative sample when the data are skewed.
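A minimal sketch of simple linear regression solved by least squares, using the closed-form expressions for the slope w and intercept b. The function name and data points are illustrative.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w, b) for the model y = w * x + b."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w = sxy / sxx
    b = y_bar - w * x_bar
    return w, b

# Illustrative data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
w, b = fit_line(xs, ys)
print(w, b)
```

For data reduction, only the two coefficients need to be stored instead of the original points.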
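The sampling schemes can be sketched with Python's `random` module. The proportional allocation used for the stratified sample, the helper names, and the toy customer data are assumptions made for illustration.

```python
import random

def srswor(data, s, rng):
    """Simple random sample without replacement of size s."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample with replacement of size s."""
    return rng.choices(data, k=s)

def stratified_sample(data, stratum_of, fraction, rng):
    """Draw the same fraction (at least one tuple) from every stratum."""
    strata = {}
    for item in data:
        strata.setdefault(stratum_of(item), []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(0)   # fixed seed so the sketch is repeatable
# Hypothetical skewed data: 80 young customers and 20 senior customers.
customers = [(i, "young" if i < 80 else "senior") for i in range(100)]

without = srswor(customers, 10, rng)
with_repl = srswr(customers, 10, rng)
strat = stratified_sample(customers, lambda c: c[1], 0.1, rng)
print(len(without), len(with_repl), len(strat))
```

The stratified sample always contains senior customers, whereas a simple random sample of the same size could miss the smaller group entirely.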
  • 81. Data Cube Aggregation • The cube created at the lowest abstraction level is referred to as the base cuboid. • A cube at the highest level of abstraction is the apex cuboid.
  • 82. Data Compression • Transformations are applied so as to obtain a reduced or “compressed” representation of the original data. • If the original data can be reconstructed from the compressed data without any information loss, the data reduction is called lossless. • If only an approximation of the original data can be reconstructed, the data reduction is called lossy. • The dimensionality reduction and numerosity reduction techniques can also be considered forms of data compression
  • 83. Data transformation and Data Discretization • The data are transformed or consolidated so that the resulting mining process may be more efficient and the patterns found may be easier to understand. • Data transformation strategies overview: Smoothing Attribute Construction Aggregation Normalization Discretization Concept hierarchy generation for nominal data
  • 84. Data Transformation by Normalization • Min-Max Normalization • Z-score Normalization • Decimal Scaling
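The three normalization methods named on this slide can be sketched from their standard definitions. The function names and the income figures are illustrative assumptions.

```python
def min_max(v, old_min, old_max, new_min=0.0, new_max=1.0):
    """Linearly map v from [old_min, old_max] to [new_min, new_max]."""
    return (v - old_min) / (old_max - old_min) * (new_max - new_min) + new_min

def z_score(v, mean, std_dev):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean) / std_dev

def decimal_scaling(values):
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    max_abs = max(abs(v) for v in values)
    j = 0
    while max_abs / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]

# Illustrative income figures (assumptions, not taken from the slides).
print(min_max(73600, 12000, 98000))
print(z_score(73600, 54000, 16000))
print(decimal_scaling([-986, 917]))
```

Min-max normalization needs the attribute's minimum and maximum in advance, while z-score normalization is useful when those bounds are unknown or outliers dominate them.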
  • 89. Discretization • Discretization by binning • Discretization by histogram analysis • Discretization by cluster, decision tree and correlation analyses
  • 90. Concept hierarchy generation for nominal data • Specification of a partial ordering of attributes explicitly at the schema level by users or experts • Specification of a portion of a hierarchy by explicit data grouping • Specification of a set of attributes but not of their partial ordering • Specification of only a partial set of attributes