Important analytical steps to follow in Data Science problems
11 preprocessing tips for your data

More often than not, beginners don't understand the importance of analysing the data in a Data Science problem and start off by applying fancy algorithms without any preprocessing, hence not getting the expected results. They must understand that choosing the ML model is the last, not the first, step of any DS problem. Here I will try to highlight some crucial steps you must apply before feeding your data to an ML model.

1. KNOW YOUR DATA using info() & describe() (in Python). While info() highlights datatypes & null values, describe() shows summary statistics like min, max, std, etc. for all numeric columns. This helps us know which fields need filling, which fields are textual & hence must be converted to numeric form, and whether the distributions suggest scaling is needed.
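
A minimal sketch of this first look, assuming the data has been loaded into a pandas DataFrame called df ('train.csv' is a hypothetical file name):

```python
import pandas as pd

df = pd.read_csv("train.csv")   # hypothetical file name

df.info()             # dtypes & non-null counts: spot nulls and textual columns
print(df.describe())  # min, max, mean, std & quartiles for numeric columns
```
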
2. DROP fields that are constant (only one value) or that have a large share of missing values (roughly 70% or more), so as to avoid the CURSE OF DIMENSIONALITY.
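
One way to do this, sketched with pandas (the 0.7 threshold mirrors the ~70% rule of thumb above):

```python
# Columns with a single value carry no information
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]

# Columns where roughly 70% or more of the values are missing
sparse_cols = [c for c in df.columns if df[c].isna().mean() >= 0.7]

df = df.drop(columns=constant_cols + sparse_cols)
```
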
3. CHECK FOR OUTLIERS. You can use the z-score for this purpose: any data point with a z-score below -3 or above 3 is an outlier. Data visualization can also be handy here.
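
A quick z-score filter with scipy over the numeric columns; rows with NaNs are kept here, since missing values are handled in step 5:

```python
import numpy as np
from scipy import stats

numeric = df.select_dtypes(include=np.number)
z = np.abs(stats.zscore(numeric, nan_policy="omit"))

# Keep rows where every numeric value sits within 3 standard deviations
mask = (z < 3) | np.isnan(z)
df = df[mask.all(axis=1)]
```
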
4. CONVERT TEXTUAL TO NUMERIC using OneHotEncoder or LabelEncoding. Prefer OHE, though, as LabelEncoding cannot depict the fact that categories are equally different; it imposes an artificial ordering.

Example-> Suppose a field has 25 categories. Using LabelEncoding, let 'A' be converted to 1 & 'B' to 2. Following the same pattern, 'Y' would be converted to 25, so numerically A is closer to B than to Y, but this isn't the actual case: all categories are equally different. OHE eliminates this problem but also increases the number of dimensions.

Also perform this step before filling NAs, as you might have to find some logic to fill textual data with another textual value (you could fill with 'NA', but once converted to numeric it would be treated as just another category). It is easy to fill NA values using built-in functions like mean, median or mode when data is numeric, but these don't apply to textual data. Hence "first convert, then fill" would be my suggestion.
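
A sketch of both encoders; 'colour' is a hypothetical column name, and note how a missing value becomes the string 'nan', i.e. just another category:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# LabelEncoding: one integer per category (imposes a false ordering)
le = LabelEncoder()
df["colour_enc"] = le.fit_transform(df["colour"].astype(str))  # NaN -> "nan" category

# One-hot encoding: one binary column per category, no false ordering,
# at the cost of extra dimensions
df = pd.get_dummies(df, columns=["colour"])
```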

5. FILL MISSING VALUES is a crucial step. You can fill them in many ways, like NA, mean, mode or median, or even drop the field (as mentioned earlier). But do remember to use mode/median to fill a categorical field (converted to numeric), as using the mean may give you decimal values which don't actually represent any category (for label encoding).

Example-> Let there be 2 categories, 'A' & 'B'. Using LabelEncoding, they get converted to 1 & 2. Now the mean is 1.5, which doesn't represent any category. The same problem won't arise with OHE.
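
A fill sketch by column type; 'age' and 'colour_enc' are hypothetical columns, the latter label-encoded:

```python
# Continuous field: mean or median are both reasonable
df["age"] = df["age"].fillna(df["age"].median())

# Label-encoded categorical field: use the mode, never the mean,
# so the filled value is always an actual category
df["colour_enc"] = df["colour_enc"].fillna(df["colour_enc"].mode()[0])
```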

6. FEATURE ENGINEERING should never be skipped. For those unaware of the term, it refers to building new features from existing ones.

Eg-> total_area = area_1st_floor + area_ground_floor, where total_area is a new feature while the other two already exist in the dataset.

But do remember that a new feature shouldn't have high correlation with existing features, or it merely adds dimensions without adding information (CURSE OF DIMENSIONALITY).
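
The total-area example in pandas, followed by the correlation check this warning calls for:

```python
# New feature built from two existing ones
df["total_area"] = df["area_1st_floor"] + df["area_ground_floor"]

# If the new feature correlates ~1.0 with an existing one, it adds
# dimensions without adding information
print(df[["total_area", "area_1st_floor", "area_ground_floor"]].corr())
```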

7. For regression problems, do check the target for SKEWNESS/KURTOSIS and, if found guilty, apply a log transform or Box-Cox transform. This is important because many regression models (notably linear regression) assume the target, or its errors, is approximately normally distributed, and a heavily skewed target violates that.
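
A sketch with numpy and scipy, assuming a hypothetical 'target' column; Box-Cox needs strictly positive values, log1p only non-negative ones:

```python
import numpy as np
from scipy import stats

print(df["target"].skew(), df["target"].kurtosis())  # rough check for normality

df["target_log"] = np.log1p(df["target"])          # log(1 + x) transform
df["target_bc"], lam = stats.boxcox(df["target"])  # Box-Cox; target must be > 0
```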

8. SCALING is sometimes necessary: without it, some features dominate others, like 'Age' & 'Income'. Though both may be equally important, the difference in scale can make Income dominant (in some ML models only). For this, RobustScaler, StandardScaler & MinMaxScaler are available; RobustScaler is robust to outliers & hence my first choice.
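
A minimal scaling sketch with scikit-learn; X_train and X_test are hypothetical feature matrices, and the scaler is fitted on training data only:

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()                         # centres on median, scales by IQR
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics
```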

9. Check for CORRELATION BETWEEN FEATURES. If any two features are found to be highly correlated, either both can be retained or one of them can be chucked off (check the training results for both cases).
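
One way to surface highly correlated pairs with pandas; the 0.9 threshold is arbitrary, for illustration:

```python
corr = df.corr(numeric_only=True).abs()

# Print every distinct pair of features with |correlation| above 0.9
for i, a in enumerate(corr.columns):
    for b in corr.columns[i + 1:]:
        if corr.loc[a, b] > 0.9:
            print(a, b, round(corr.loc[a, b], 2))
```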

10. If the number of features exceeds the number of data rows, you might need PCA for dimensionality reduction (like reducing the number of features from 100 to 10 without any major loss of information) to avoid the CURSE OF DIMENSIONALITY.
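
A PCA sketch with scikit-learn, matching the 100-to-10 example above (scale the features first, since PCA is variance-based):

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=10)          # or a float like 0.95 to keep 95% of variance
X_reduced = pca.fit_transform(X_train_scaled)
print(pca.explained_variance_ratio_.sum())   # share of variance retained
```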

11. Check for IMBALANCE IN THE TARGET field, i.e., one target value being in heavy majority over the others.

Eg-> If the target has Yes & No, then Yes might make up about 95% and No only 5%.

This can be resolved using upsampling (replicating 'No' rows) or downsampling (dropping some 'Yes' rows) in the training dataset.
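
An upsampling sketch with sklearn.utils.resample, applied to a hypothetical 'train' DataFrame only (resampling before the split would leak duplicates into validation):

```python
import pandas as pd
from sklearn.utils import resample

majority = train[train["target"] == "Yes"]
minority = train[train["target"] == "No"]

# Replicate minority rows (with replacement) until the classes match
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)
train_balanced = pd.concat([majority, minority_up])
```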

Apart from this, using data visualization alongside these steps will help you get better results. Also, do split your data for validation purposes using the 80:20 rule. And don't forget to apply all transformations to both the training and testing datasets.
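
The 80:20 split in scikit-learn, assuming a feature matrix X and target y; stratify keeps the class ratio intact for classification (drop it for regression):

```python
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```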

These were some of the basic analytical steps that might get you better results in a classification or regression problem. Problems related to NLP, Time Series, etc. may require some additional steps, though the basic approach remains the same. But always remember, there is nothing like a rule in Data Science (you might get the best results without any of the above steps) & hence

Explore more, Learn more!!!
