SlideShare a Scribd company logo
Data Preprocessing Using Python
14620313
DATA MINING
Universitas 17 Agustus 1945 Teknik Informatika
PENGAMPU
Dr. Fajar Astuti Hermawati, S.Kom.,M.Kom.
Bagus Hardiansyah, S.Kom.,M.Si
Ir. Sugiono,
MT
Naufal Abdillah, S.Kom.,
M.Kom.
Siti Mutrofin, S.Kom., M.Kom.
Sub Capaian Pembelajaran
● Mampu mengidentifikasi jenis data dan teknik-teknik
mempersiapkan data agar sesuai untuk diaplikasikan dengan
pendekatan data mining tertentu [C2,A3]
Indikator
● 2.3 Ketepatan mengidentifikasi konsep dan melakukan data
preprocessing agar sesuai dengan teknik data mining
Outline
● Data Preprocessing Using Python
1- Acquire the dataset
● Acquiring the dataset is the first step in data preprocessing in
machine learning. To build and develop Machine Learning models,
you must first acquire the relevant dataset. This dataset will be
comprised of data gathered from multiple and disparate sources
which are then combined in a proper format to form a dataset.
Dataset formats differ according to use cases. For instance, a
business dataset will be entirely different from a medical dataset.
While a business dataset will contain relevant industry and
business data, a medical dataset will include healthcare-related
data.
2- Import all the crucial libraries
● The predefined Python libraries can perform specific data
preprocessing jobs. Importing all the crucial libraries is the second
step in data preprocessing in machine learning.
3- Import the dataset
● In this step, you need to import the dataset/s that you have
gathered for the ML project at hand. Importing the dataset is one of
the important steps in data preprocessing in machine learning.
4- Identifying and handling the missing values
● In data preprocessing, it is pivotal to identify and correctly handle the missing values, failing to do this, you might draw
inaccurate and faulty conclusions and inferences from the data. Needless to say, this will hamper your ML project.
some typical reasons why data is missing:
● A. User forgot to fill in a field.
● B. Data was lost while transferring manually from a legacy database.
● C. There was a programming error.
● D. Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted.
● Basically, there are two ways to handle missing data:
● Deleting a particular row – In this method, you remove a specific row that has a null value for a feature or a particular
column where more than 75% of the values are missing. However, this method is not 100% efficient, and it is
recommended that you use it only when the dataset has adequate samples. You must ensure that after deleting the
data, there remains no addition of bias. Calculating the mean – This method is useful for features having numeric
data like age, salary, year, etc. Here, you can calculate the mean, median, or mode of a particular feature or column
or row that contains a missing value and replace the result for the missing value. This method can add variance to the
dataset, and any loss of data can be efficiently negated. Hence, it yields better results compared to the first method
(omission of rows/columns). Another way of approximation is through the deviation of neighbouring values. However,
this works best for linear data.
4- Identifying and handling the missing values
4- Identifying and handling the missing values
● Solution 1 : Dropna
4- Identifying and handling the missing values
● Solution 1 : Dropna
4- Identifying and handling the missing values
● Solution 2 : Fillna
4- Identifying and handling the missing values
● Solution 3 : SciKit Learn
5- Encoding the categorical data
Categorical data refers to the information that has specific categories within the
dataset. In the dataset cited above, there are two categorical variables – country
and purchased.
Machine Learning models are primarily based on mathematical equations. Thus,
you can intuitively understand that keeping the categorical data in the equation
will cause certain issues since you would only need numbers in the equations.
5- Encoding the categorical data
Solution 1 : ColumnTransformer
5- Encoding the categorical data
Solution 2 : Pd.get_dummies()
5- Encoding the categorical data
Solution 3 : Label Encoder
6- Splitting Dataset
Splitting the dataset is the next step in data preprocessing in
machine learning. Every dataset for Machine Learning model must
be split into two separate sets – training set and test set
7- Feature Scaling
● Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to standardize the independent variables
of a dataset within a specific range. In other words, feature scaling limits the range of variables so that you can compare them on
common grounds. Another reason why feature scaling is applied is that few algorithms like gradient descent converge much faster
with feature scaling than without it
7- Feature Scaling
● Why Feature Scaling?
Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms
use Eucledian distance between two data points in their computations, this is a problem.If left alone, these algorithms only take in the magnitude of
features neglecting the units. The results would vary greatly between different units, 5kg and 5000gms. The features with high magnitudes will weigh in
a lot more in the distance calculations than features with low magnitudes.
7- Feature Scaling
● Min Max Scaler?
MinMax Scaler shrinks the data within the given range, usually of 0 to 1. It transforms data by scaling features to a given range. It scales
the values to a specific value range without changing the shape of the original distribution.
7- Feature Scaling
● Min Max Scaler
7- Feature Scaling
● Standard Scaler
StandardScaler follows Standard Normal Distribution (SND). Therefore, it makes mean = 0 and scales the data to unit variance.
7- Feature Scaling
When to Use Feature Scalling?
k-nearest neighbors with an Euclidean distance measure is sensitive to magnitudes and hence should be scaled for all features to weigh in equally.
Scaling is critical, while performing Principal Component Analysis(PCA). PCA tries to get the features with maximum variance and the variance is high for high magnitude features. This
skews the PCA towards high magnitude features.
We can speed up gradient descent by scaling. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum
when the variables are very uneven.
Tree based models are not distance based models and can handle varying ranges of features. Hence, Scaling is not required while modelling trees.
Algorithms like Linear Discriminant Analysis(LDA), Naive Bayes are by design equipped to handle this and gives weights to the features accordingly. Performing a features scaling in
these algorithms may not have much effect.
7- Feature Scaling
Normalization vs. Standardization
The two most discussed scaling methods are Normalization and Standardization. Normalization typically means rescales the values into a
range of [0,1]. Standardization typically means rescales data to have a mean of 0 and a standard deviation of 1 (unit variance).
Ad

More Related Content

Similar to 13_Data Preprocessing in Python.pptx (1).pdf (20)

Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
AnwarrChaudary
 
1234
12341234
1234
Komal Patil
 
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challenges
ijcnes
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
AmmarAhmedSiddiqui2
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Ahmed Elmalla
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
Subrat Panda, PhD
 
Machine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfMachine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdf
PhD Assistance
 
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdfAIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
MargiShah29
 
Deep learning(UNIT 3) BY Ms SURBHI SAROHA
Deep learning(UNIT 3) BY Ms SURBHI SAROHADeep learning(UNIT 3) BY Ms SURBHI SAROHA
Deep learning(UNIT 3) BY Ms SURBHI SAROHA
Dr. SURBHI SAROHA
 
Survey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy AlgorithmsSurvey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy Algorithms
IRJET Journal
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
Dr.Shweta
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction Techniques
IRJET Journal
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.ppt
Deadpool120050
 
data_preprocessingknnnaiveandothera.pptx
data_preprocessingknnnaiveandothera.pptxdata_preprocessingknnnaiveandothera.pptx
data_preprocessingknnnaiveandothera.pptx
nikhilguptha06
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
Akanksha Gohil
 
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET Journal
 
M5.pptx
M5.pptxM5.pptx
M5.pptx
MayuraD1
 
Data reduction
Data reductionData reduction
Data reduction
GowriLatha1
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
priyanka rajput
 
Intro to Data warehousing lecture 17
Intro to Data warehousing   lecture 17Intro to Data warehousing   lecture 17
Intro to Data warehousing lecture 17
AnwarrChaudary
 
Machine Learning Approaches and its Challenges
Machine Learning Approaches and its ChallengesMachine Learning Approaches and its Challenges
Machine Learning Approaches and its Challenges
ijcnes
 
Exploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdfExploratory Data Analysis - Satyajit.pdf
Exploratory Data Analysis - Satyajit.pdf
AmmarAhmedSiddiqui2
 
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Data Science  & AI Road Map by Python & Computer science tutor in MalaysiaData Science  & AI Road Map by Python & Computer science tutor in Malaysia
Data Science & AI Road Map by Python & Computer science tutor in Malaysia
Ahmed Elmalla
 
Machine Learning in the Financial Industry
Machine Learning in the Financial IndustryMachine Learning in the Financial Industry
Machine Learning in the Financial Industry
Subrat Panda, PhD
 
Machine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdfMachine Learning Algorithm for Business Strategy.pdf
Machine Learning Algorithm for Business Strategy.pdf
PhD Assistance
 
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdfAIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
AIML_UNIT 2 _PPT_HAND NOTES_MPS.pdf
MargiShah29
 
Deep learning(UNIT 3) BY Ms SURBHI SAROHA
Deep learning(UNIT 3) BY Ms SURBHI SAROHADeep learning(UNIT 3) BY Ms SURBHI SAROHA
Deep learning(UNIT 3) BY Ms SURBHI SAROHA
Dr. SURBHI SAROHA
 
Survey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy AlgorithmsSurvey paper on Big Data Imputation and Privacy Algorithms
Survey paper on Big Data Imputation and Privacy Algorithms
IRJET Journal
 
introduction to Statistical Theory.pptx
 introduction to Statistical Theory.pptx introduction to Statistical Theory.pptx
introduction to Statistical Theory.pptx
Dr.Shweta
 
Survey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction TechniquesSurvey on Feature Selection and Dimensionality Reduction Techniques
Survey on Feature Selection and Dimensionality Reduction Techniques
IRJET Journal
 
dimension reduction.ppt
dimension reduction.pptdimension reduction.ppt
dimension reduction.ppt
Deadpool120050
 
data_preprocessingknnnaiveandothera.pptx
data_preprocessingknnnaiveandothera.pptxdata_preprocessingknnnaiveandothera.pptx
data_preprocessingknnnaiveandothera.pptx
nikhilguptha06
 
Data Analytics Using R - Report
Data Analytics Using R - ReportData Analytics Using R - Report
Data Analytics Using R - Report
Akanksha Gohil
 
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET - Comparative Analysis of GUI based Prediction of Parkinson Disease usi...
IRJET Journal
 
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEYCLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
CLASSIFICATION ALGORITHM USING RANDOM CONCEPT ON A VERY LARGE DATA SET: A SURVEY
Editor IJMTER
 
Data Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data QualityData Cleaning and Preprocessing: Ensuring Data Quality
Data Cleaning and Preprocessing: Ensuring Data Quality
priyanka rajput
 

Recently uploaded (20)

How to Configure Scheduled Actions in odoo 18
How to Configure Scheduled Actions in odoo 18How to Configure Scheduled Actions in odoo 18
How to Configure Scheduled Actions in odoo 18
Celine George
 
Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)
Mohamed Rizk Khodair
 
puzzle Irregular Verbs- Simple Past Tense
puzzle Irregular Verbs- Simple Past Tensepuzzle Irregular Verbs- Simple Past Tense
puzzle Irregular Verbs- Simple Past Tense
OlgaLeonorTorresSnch
 
How to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 PurchaseHow to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 Purchase
Celine George
 
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptxANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
Mayuri Chavan
 
How to Clean Your Contacts Using the Deduplication Menu in Odoo 18
How to Clean Your Contacts Using the Deduplication Menu in Odoo 18How to Clean Your Contacts Using the Deduplication Menu in Odoo 18
How to Clean Your Contacts Using the Deduplication Menu in Odoo 18
Celine George
 
antiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidenceantiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidence
PrachiSontakke5
 
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptxTERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
PoojaSen20
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
Dr. Nasir Mustafa
 
Final Evaluation.docx...........................
Final Evaluation.docx...........................Final Evaluation.docx...........................
Final Evaluation.docx...........................
l1bbyburrell
 
Ancient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian HistoryAncient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian History
Virag Sontakke
 
Form View Attributes in Odoo 18 - Odoo Slides
Form View Attributes in Odoo 18 - Odoo SlidesForm View Attributes in Odoo 18 - Odoo Slides
Form View Attributes in Odoo 18 - Odoo Slides
Celine George
 
*"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"**"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"*
Arshad Shaikh
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
Drugs in Anaesthesia and Intensive Care,.pdf
Drugs in Anaesthesia and Intensive Care,.pdfDrugs in Anaesthesia and Intensive Care,.pdf
Drugs in Anaesthesia and Intensive Care,.pdf
crewot855
 
E-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26ASE-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26AS
Abinash Palangdar
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
Nguyen Thanh Tu Collection
 
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon DolabaniHistory Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
fruinkamel7m
 
How to Configure Scheduled Actions in odoo 18
How to Configure Scheduled Actions in odoo 18How to Configure Scheduled Actions in odoo 18
How to Configure Scheduled Actions in odoo 18
Celine George
 
Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)Myasthenia gravis (Neuromuscular disorder)
Myasthenia gravis (Neuromuscular disorder)
Mohamed Rizk Khodair
 
puzzle Irregular Verbs- Simple Past Tense
puzzle Irregular Verbs- Simple Past Tensepuzzle Irregular Verbs- Simple Past Tense
puzzle Irregular Verbs- Simple Past Tense
OlgaLeonorTorresSnch
 
How to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 PurchaseHow to Manage Amounts in Local Currency in Odoo 18 Purchase
How to Manage Amounts in Local Currency in Odoo 18 Purchase
Celine George
 
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptxANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
ANTI-VIRAL DRUGS unit 3 Pharmacology 3.pptx
Mayuri Chavan
 
How to Clean Your Contacts Using the Deduplication Menu in Odoo 18
How to Clean Your Contacts Using the Deduplication Menu in Odoo 18How to Clean Your Contacts Using the Deduplication Menu in Odoo 18
How to Clean Your Contacts Using the Deduplication Menu in Odoo 18
Celine George
 
antiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidenceantiquity of writing in ancient India- literary & archaeological evidence
antiquity of writing in ancient India- literary & archaeological evidence
PrachiSontakke5
 
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptxTERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
TERMINOLOGIES,GRIEF PROCESS AND LOSS AMD ITS TYPES .pptx
PoojaSen20
 
2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx2025 The Senior Landscape and SET plan preparations.pptx
2025 The Senior Landscape and SET plan preparations.pptx
mansk2
 
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
MCQ PHYSIOLOGY II (DR. NASIR MUSTAFA) MCQS)
Dr. Nasir Mustafa
 
Final Evaluation.docx...........................
Final Evaluation.docx...........................Final Evaluation.docx...........................
Final Evaluation.docx...........................
l1bbyburrell
 
Ancient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian HistoryAncient Stone Sculptures of India: As a Source of Indian History
Ancient Stone Sculptures of India: As a Source of Indian History
Virag Sontakke
 
Form View Attributes in Odoo 18 - Odoo Slides
Form View Attributes in Odoo 18 - Odoo SlidesForm View Attributes in Odoo 18 - Odoo Slides
Form View Attributes in Odoo 18 - Odoo Slides
Celine George
 
*"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"**"Sensing the World: Insect Sensory Systems"*
*"Sensing the World: Insect Sensory Systems"*
Arshad Shaikh
 
Botany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic ExcellenceBotany Assignment Help Guide - Academic Excellence
Botany Assignment Help Guide - Academic Excellence
online college homework help
 
Drugs in Anaesthesia and Intensive Care,.pdf
Drugs in Anaesthesia and Intensive Care,.pdfDrugs in Anaesthesia and Intensive Care,.pdf
Drugs in Anaesthesia and Intensive Care,.pdf
crewot855
 
E-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26ASE-Filing_of_Income_Tax.pptx and concept of form 26AS
E-Filing_of_Income_Tax.pptx and concept of form 26AS
Abinash Palangdar
 
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
BÀI TẬP BỔ TRỢ TIẾNG ANH 9 THEO ĐƠN VỊ BÀI HỌC - GLOBAL SUCCESS - CẢ NĂM (TỪ...
Nguyen Thanh Tu Collection
 
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon DolabaniHistory Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
History Of The Monastery Of Mor Gabriel Philoxenos Yuhanon Dolabani
fruinkamel7m
 
Ad

13_Data Preprocessing in Python.pptx (1).pdf

  • 1. Data Preprocessing Using Python 14620313 DATA MINING
  • 2. Universitas 17 Agustus 1945 Teknik Informatika PENGAMPU Dr. Fajar Astuti Hermawati, S.Kom.,M.Kom. Bagus Hardiansyah, S.Kom.,M.Si Ir. Sugiono, MT Naufal Abdillah, S.Kom., M.Kom. Siti Mutrofin, S.Kom., M.Kom.
  • 3. Sub Capaian Pembelajaran ● Mampu mengidentifikasi jenis data dan teknik-teknik mempersiapkan data agar sesuai untuk diaplikasikan dengan pendekatan data mining tertentu [C2,A3]
  • 4. Indikator ● 2.3 Ketepatan mengidentifikasi konsep dan melakukan data preprocessing agar sesuai dengan teknik data mining
  • 6. 1- Acquire the dataset ● Acquiring the dataset is the first step in data preprocessing in machine learning. To build and develop Machine Learning models, you must first acquire the relevant dataset. This dataset will be comprised of data gathered from multiple and disparate sources which are then combined in a proper format to form a dataset. Dataset formats differ according to use cases. For instance, a business dataset will be entirely different from a medical dataset. While a business dataset will contain relevant industry and business data, a medical dataset will include healthcare-related data.
  • 7. 2- Import all the crucial libraries ● The predefined Python libraries can perform specific data preprocessing jobs. Importing all the crucial libraries is the second step in data preprocessing in machine learning.
  • 8. 3- Import the dataset ● In this step, you need to import the dataset/s that you have gathered for the ML project at hand. Importing the dataset is one of the important steps in data preprocessing in machine learning.
  • 9. 4- Identifying and handling the missing values ● In data preprocessing, it is pivotal to identify and correctly handle the missing values, failing to do this, you might draw inaccurate and faulty conclusions and inferences from the data. Needless to say, this will hamper your ML project. some typical reasons why data is missing: ● A. User forgot to fill in a field. ● B. Data was lost while transferring manually from a legacy database. ● C. There was a programming error. ● D. Users chose not to fill out a field tied to their beliefs about how the results would be used or interpreted. ● Basically, there are two ways to handle missing data: ● Deleting a particular row – In this method, you remove a specific row that has a null value for a feature or a particular column where more than 75% of the values are missing. However, this method is not 100% efficient, and it is recommended that you use it only when the dataset has adequate samples. You must ensure that after deleting the data, there remains no addition of bias. Calculating the mean – This method is useful for features having numeric data like age, salary, year, etc. Here, you can calculate the mean, median, or mode of a particular feature or column or row that contains a missing value and replace the result for the missing value. This method can add variance to the dataset, and any loss of data can be efficiently negated. Hence, it yields better results compared to the first method (omission of rows/columns). Another way of approximation is through the deviation of neighbouring values. However, this works best for linear data.
  • 10. 4- Identifying and handling the missing values
  • 11. 4- Identifying and handling the missing values ● Solution 1 : Dropna
  • 12. 4- Identifying and handling the missing values ● Solution 1 : Dropna
  • 13. 4- Identifying and handling the missing values ● Solution 2 : Fillna
  • 14. 4- Identifying and handling the missing values ● Solution 3 : SciKit Learn
  • 15. 5- Encoding the categorical data Categorical data refers to the information that has specific categories within the dataset. In the dataset cited above, there are two categorical variables – country and purchased. Machine Learning models are primarily based on mathematical equations. Thus, you can intuitively understand that keeping the categorical data in the equation will cause certain issues since you would only need numbers in the equations.
  • 16. 5- Encoding the categorical data Solution 1 : ColumnTransformer
  • 17. 5- Encoding the categorical data Solution 2 : Pd.get_dummies()
  • 18. 5- Encoding the categorical data Solution 3 : Label Encoder
  • 19. 6- Splitting Dataset Splitting the dataset is the next step in data preprocessing in machine learning. Every dataset for Machine Learning model must be split into two separate sets – training set and test set
  • 20. 7- Feature Scaling ● Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to standardize the independent variables of a dataset within a specific range. In other words, feature scaling limits the range of variables so that you can compare them on common grounds. Another reason why feature scaling is applied is that few algorithms like gradient descent converge much faster with feature scaling than without it
  • 21. 7- Feature Scaling ● Why Feature Scaling? Most of the times, your dataset will contain features highly varying in magnitudes, units and range. But since, most of the machine learning algorithms use Eucledian distance between two data points in their computations, this is a problem.If left alone, these algorithms only take in the magnitude of features neglecting the units. The results would vary greatly between different units, 5kg and 5000gms. The features with high magnitudes will weigh in a lot more in the distance calculations than features with low magnitudes.
  • 22. 7- Feature Scaling ● Min Max Scaler? MinMax Scaler shrinks the data within the given range, usually of 0 to 1. It transforms data by scaling features to a given range. It scales the values to a specific value range without changing the shape of the original distribution.
  • 23. 7- Feature Scaling ● Min Max Scaler
  • 24. 7- Feature Scaling ● Standard Scaler StandardScaler follows Standard Normal Distribution (SND). Therefore, it makes mean = 0 and scales the data to unit variance.
  • 25. 7- Feature Scaling When to Use Feature Scalling? k-nearest neighbors with an Euclidean distance measure is sensitive to magnitudes and hence should be scaled for all features to weigh in equally. Scaling is critical, while performing Principal Component Analysis(PCA). PCA tries to get the features with maximum variance and the variance is high for high magnitude features. This skews the PCA towards high magnitude features. We can speed up gradient descent by scaling. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven. Tree based models are not distance based models and can handle varying ranges of features. Hence, Scaling is not required while modelling trees. Algorithms like Linear Discriminant Analysis(LDA), Naive Bayes are by design equipped to handle this and gives weights to the features accordingly. Performing a features scaling in these algorithms may not have much effect.
  • 26. 7- Feature Scaling Normalization vs. Standardization The two most discussed scaling methods are Normalization and Standardization. Normalization typically means rescales the values into a range of [0,1]. Standardization typically means rescales data to have a mean of 0 and a standard deviation of 1 (unit variance).
  翻译: