Data Preprocessing: From Raw to Refined
Agenda
• Data preprocessing
• Need for data preprocessing
• Types of data
• Objectives of data preprocessing
• Hands-on
Data Preprocessing
• Data preprocessing refers to the steps and techniques used to prepare raw data for analysis
or further processing.
• It is a crucial phase in data analysis and machine learning workflows that aims to improve data
quality, consistency, and compatibility for effective data mining, modeling, and interpretation.
• Its main objectives are to clean noisy or incomplete data, transform data into a format suitable
for analysis, integrate data from multiple sources, and improve data quality so that the
resulting insights are accurate and meaningful.
• Overall, data preprocessing ensures that data is well organized, standardized, and ready for
exploration and modeling.
• The quality of the data should be checked before applying machine learning or data mining
algorithms.
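As a rough illustration of the transform, integrate, and clean steps listed above, here is a minimal pandas sketch. The two tables, their column names, and every value are made up for this example, and filling missing spend with 0 is only one possible cleaning decision.

```python
import pandas as pd

# Hypothetical raw data from two sources (all names and values are made up).
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "signup_date": ["2024-01-05", "2024-02-05", "2024-03-01"],  # dates stored as text
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": ["10.5", "20.0", "7.25"],  # numbers stored as text
})

# Transform: convert text columns into proper types.
customers["signup_date"] = pd.to_datetime(customers["signup_date"])
orders["amount"] = pd.to_numeric(orders["amount"])

# Integrate: total spend per customer, joined onto the customer table.
spend = orders.groupby("customer_id", as_index=False)["amount"].sum()
combined = customers.merge(spend, on="customer_id", how="left")

# Clean: customers without any orders get a spend of 0 instead of NaN.
combined["amount"] = combined["amount"].fillna(0.0)
print(combined)
```

The left join keeps every customer, so a customer with no orders still appears in the result and can be handled explicitly.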
Need for Data Preprocessing
A main purpose of preprocessing is to check data quality. Quality can be assessed along the
following dimensions.
Accuracy: Whether the data entered is correct.
Completeness: Whether all required data has been recorded and is available.
Consistency: Whether copies of the same data kept in different places match.
Interpretability: How easily the data can be understood.
Believability: Whether the data can be trusted.
Timeliness: Whether the data is kept up to date.
Real-world data are generally:
• Incomplete: Missing attribute values, missing certain attributes of importance, or having only aggregate data
• Noisy: Containing errors or outliers
• Inconsistent: Containing discrepancies in codes or names
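Some of these quality dimensions can be checked directly in code. The sketch below, on a made-up table, measures completeness, flags noisy values, and surfaces inconsistent codes. The column names and the plausible age range of 0 to 120 are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical records with typical real-world problems.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 240, 28],              # one missing value, one implausible value
    "country": ["IN", "India", "IN", "US", "in"],  # the same country coded three ways
})

# Completeness: fraction of missing values per column.
missing_share = df.isna().mean()

# Noise: flag recorded ages outside an assumed plausible range of 0 to 120.
implausible_age = df["age"].notna() & ~df["age"].between(0, 120)

# Consistency: list the distinct codes so discrepancies become visible.
country_codes = sorted(df["country"].unique())

print(missing_share)
print(df[implausible_age])
print(country_codes)
```

Accuracy, believability, and timeliness usually need outside knowledge (a trusted reference source or an update log), so they are not covered by checks like these.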
Advance Data_Preprocessing_and_Wrangling
Objectives of Data Preprocessing
The main objectives of data preprocessing are:
• To transform raw data into an understandable format
• To transform data into a usable format
• To eliminate inconsistencies in the data
• To remove duplicate records
• To provide more accurate data for subsequent analysis
• To detect and handle incorrect or missing values
• To reduce the dimensionality of the data
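Several of these objectives map onto one-line pandas operations. The following sketch uses a made-up table. Filling with the median and dropping constant columns are just simple stand-ins for missing-value handling and dimensionality reduction, not the only or the recommended techniques.

```python
import numpy as np
import pandas as pd

# Hypothetical table (all names and values are made up).
df = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "city": ["Pune", "pune ", "pune ", "Delhi", "Delhi"],
    "income": [50.0, 60.0, 60.0, np.nan, 80.0],
    "source": ["web", "web", "web", "web", "web"],  # constant column, carries no information
})

# Eliminate inconsistencies: normalize whitespace and capitalization.
df["city"] = df["city"].str.strip().str.title()

# Remove duplicates: drop rows that repeat an earlier row exactly.
df = df.drop_duplicates().reset_index(drop=True)

# Handle missing values: fill with the column median (one of several possible choices).
df["income"] = df["income"].fillna(df["income"].median())

# Reduce dimensionality (simplest form): drop columns with a single distinct value.
constant_cols = [col for col in df.columns if df[col].nunique() == 1]
df = df.drop(columns=constant_cols)
print(df)
```

The order matters here: normalizing the city names first makes the repeated row an exact duplicate and keeps spelling variants from being treated as different values later.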
Python Libraries Used for Data Preprocessing
• Pandas:
◦ Pandas is one of the most widely used libraries for data manipulation and analysis.
◦ It provides the DataFrame and Series data structures, which are well suited to handling structured (tabular) data.
◦ It includes functions for data cleaning, filtering, merging, reshaping, and more.
• NumPy:
◦ NumPy is the fundamental library for numerical computing in Python.
◦ It offers array operations and mathematical functions that many data preprocessing tasks rely on.
◦ NumPy arrays are used extensively together with Pandas DataFrames for data manipulation.
• Matplotlib and Seaborn:
◦ These libraries are used for data visualization, which helps during preprocessing to understand
the data's distribution and identify outliers.
◦ Matplotlib offers a wide range of plotting functions, while Seaborn provides high-level statistical
visualization capabilities.
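To show how Pandas and NumPy work together, the sketch below flags outliers numerically with the 1.5 × IQR rule, the same convention that Matplotlib's box plot uses by default for its whiskers. The readings are invented, with 95.0 planted as an outlier, and the rule is one common heuristic, not the only way to define an outlier.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements; 95.0 is a deliberately planted outlier.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 95.0, 11.0, 10.0], name="reading")

# NumPy computes the quartiles on the array underlying the pandas Series.
q1, q3 = np.percentile(values.to_numpy(), [25, 75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

print(values.describe())    # summary statistics of the distribution
print(values[is_outlier])   # the flagged readings
```

A box plot or histogram of the same Series drawn with Matplotlib or Seaborn would show the flagged reading visually. The numeric check is convenient when the result has to feed into later cleaning steps.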
Hands-on Notebook