Data pre-processing
Data pre-processing is an important step in
the data mining process. It includes cleaning,
normalization, transformation, feature extraction
and selection, etc. The product of data
pre-processing is the final training set.
Data Preprocessing & Tools
Why preprocessing?
Real-world data are generally:
– Incomplete: lacking attribute values, lacking certain
attributes of interest, or containing only aggregate data
– Noisy: containing errors or outliers
– Inconsistent: containing discrepancies in codes or names
• Tasks in data preprocessing
– Data cleaning: fill in missing values, smooth noisy data,
identify or remove outliers, and resolve inconsistencies.
– Data integration: using multiple databases, data cubes, or
files.
– Data transformation: normalization and aggregation.
– Data reduction: reducing the volume but producing the
same or similar analytical results.
– Data discretization: part of data reduction, replacing
numerical attributes with nominal ones.
Data cleaning
• Fill in missing values (attribute or class value):
– Ignore the tuple: usually done when the class label is
missing.
– Use the attribute mean (or majority nominal value) to
fill in the missing value.
– Use the attribute mean (or majority nominal value) for
all samples belonging to the same class.
– Predict the missing value by using a learning
algorithm: consider the attribute with the missing
value as a dependent (class) variable and run a
learning algorithm (usually Bayes or decision tree) to
predict the missing value.
• Identify outliers and smooth out noisy data:
– Binning
• Sort the attribute values and partition them into bins
(see "Unsupervised discretization" below);
• Then smooth by bin means, bin median, or bin
boundaries.
– Clustering: group values in clusters and then detect and
remove outliers (automatic or manual)
– Regression: smooth by fitting the data to regression
functions.
• Correct inconsistent data:
Use domain knowledge or expert decision.
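Two of the cleaning steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the method of any particular tool: the function names, the use of None to mark a missing value, and the choice of equal-frequency bins are all assumptions made for the example. It also assumes every class has at least one known value for the attribute.

```python
from collections import defaultdict
from statistics import mean


def impute_class_mean(rows, attr, label):
    """Fill missing (None) values of `attr` with the mean of the same class.

    Hypothetical helper: assumes each class has at least one known value.
    """
    by_class = defaultdict(list)
    for row in rows:
        if row[attr] is not None:
            by_class[row[label]].append(row[attr])
    class_means = {cls: mean(vals) for cls, vals in by_class.items()}
    # Return new dicts so the input rows are left unchanged.
    return [
        {**row, attr: class_means[row[label]]} if row[attr] is None else dict(row)
        for row in rows
    ]


def smooth_by_bin_means(values, n_bins):
    """Sort values, split them into equal-frequency bins, and replace
    every value with the mean of its bin."""
    ordered = sorted(values)
    size = -(-len(ordered) // n_bins)  # ceiling division
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed


rows = [
    {"age": 20, "cls": "a"},
    {"age": None, "cls": "a"},
    {"age": 30, "cls": "a"},
    {"age": 50, "cls": "b"},
]
filled = impute_class_mean(rows, "age", "cls")

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)
```

Here the missing age in class "a" is filled with the mean of the known ages in that class, and the nine prices are smoothed within three bins of three values each.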
Data integration and transformation
• Normalization:
– Scaling attribute values to fall within a specified range.
• Example: to transform V in [Min, Max] to V' in [0, 1], apply
V' = (V - Min) / (Max - Min)
– Scaling by using mean and standard deviation (useful when
Min and Max are unknown or when there are
outliers): V' = (V - Mean) / StDev
• Aggregation: moving up in the concept hierarchy on
numeric attributes.
• Generalization: moving up in the concept hierarchy on
nominal attributes.
• Attribute construction: replacing or adding new
attributes inferred from existing attributes.
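The two normalization formulas above can be written out directly. The following is a minimal sketch using only the Python standard library; the function names are illustrative, and the use of the population standard deviation (pstdev) is an assumption, since the slides do not say which standard deviation is meant.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Apply V' = (V - Min) / (Max - Min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are equal; the formula is undefined, so return zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_scale(values):
    """Apply V' = (V - Mean) / StDev using the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


data = [2, 4, 4, 4, 5, 5, 7, 9]
scaled = min_max_scale(data)     # first value maps to 0.0, last to 1.0
z_scores = z_score_scale(data)   # mean is 5 and StDev is 2 for this data
```

Min-max scaling depends entirely on the two extreme values, so a single outlier compresses the rest of the range; z-score scaling is often preferred in that case.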
Data reduction
• Reducing the number of attributes
– Data cube aggregation: applying roll-up, slice or dice operations.
– Removing irrelevant attributes: attribute selection (filtering and
wrapper methods), searching the attribute space (see Lecture 5:
Attribute-oriented analysis).
– Principal component analysis (numeric attributes only): searching
for a lower dimensional space that can best represent the data.
• Reducing the number of attribute values
– Binning (histograms): reducing the number of attribute values by
grouping them into intervals (bins).
– Clustering: grouping values in clusters.
– Aggregation or generalization
• Reducing the number of tuples
– Sampling
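As a rough illustration of two of the reduction techniques listed above, the sketch below groups a numeric attribute into equal-width bins and draws a simple random sample of tuples without replacement. The function names, the equal-width binning scheme, and the fixed seed are assumptions chosen for the example; other binning and sampling schemes are equally valid.

```python
import random


def equal_width_bins(values, n_bins):
    """Replace each numeric value with the index of its equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def sample_tuples(rows, k, seed=0):
    """Draw k tuples without replacement; a fixed seed keeps runs repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, k)


ages = [13, 15, 22, 25, 33, 35, 40, 52, 70]
bins = equal_width_bins(ages, 3)     # nine distinct values become three bin labels

rows = list(range(100))              # stand-in for 100 tuples
subset = sample_tuples(rows, 10)     # 10 of the 100 tuples
```

With three bins over the range 13 to 70, each interval is 19 wide, so the ages fall into bins 0, 1 and 2 by position in that range.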
Data mining tools
Most data mining tools can be classified into one of three
categories:
• Traditional data mining tools,
• Dashboards,
• Text-mining tools.
Below is a description of each.
• Traditional Data Mining Tools. Traditional data mining programs
help companies establish data patterns and trends by using a
number of complex algorithms and techniques. Some of these tools
are installed on the desktop to monitor the data and highlight trends,
and others capture information residing outside a database. The
majority are available in both Windows and UNIX versions, although
some specialize in one operating system only. In addition, while
some may concentrate on one database type, most will be able to
handle any data using online analytical processing or a similar
technology.
• Dashboards. Installed on computers to monitor information in
a database, dashboards reflect data changes and updates
onscreen, often in the form of a chart or table, enabling
the user to see how the business is performing. Historical data
can also be referenced, enabling the user to see where things
have changed (e.g., increase in sales from the same period
last year). This functionality makes dashboards easy to use
and particularly appealing to managers who wish to have an
overview of the company's performance.
• Text-mining Tools. The third type of data mining tool
is sometimes called a text-mining tool because of its ability to
mine data from different kinds of text, from Microsoft Word
and Acrobat PDF documents to simple text files, for example.
These tools scan content and convert the selected data into a
format that is compatible with the tool's database, thus
providing users with an easy and convenient way of accessing
data without the need to open different applications. Scanned
content can be unstructured (i.e., information is scattered
almost randomly across the document, including e-mails,
Internet pages, audio and video data) or structured (i.e., the
data's form and purpose are known, such as content found in a
database). Capturing these inputs can provide organizations
with a wealth of information that can be mined to discover
trends, concepts, and attitudes.