Exploratory Data Analysis 101: Why Should You Use It and How Can You Benefit From It?

Whether you are an analyst, a scientist, an engineer, or a marketer, one thing is for sure: you have to know your data like the back of your hand. In a world awash with data, it is not enough to gather the pieces: if you do not understand the relationships between the elements, you will not be able to make the right decisions when the time comes. But how can you extract the necessary information? Of course, you can sit in front of your computer hunting for the right rows and columns, but getting the results you want that way takes too much time and energy.

Work smarter, not harder: take a step back, visualize the data, find the patterns, and “feel” the information without committing to any hypothesis up front. In other words, perform an Exploratory Data Analysis.

What Is Exploratory Data Analysis?

The concept of Exploratory Data Analysis, or EDA, was introduced by John W. Tukey in 1977. Although it has many similarities with classical analysis, its approach, or rather its philosophy, is very different. In both cases the process usually starts with a scientific problem or a business goal, and the aim is to draw the right conclusions from the collected data. The difference lies in the intermediate steps.

In classical analysis, data collection is followed by imposing a model (for example, assuming linearity or normality), and the subsequent analysis, estimation, and testing focus on the parameters of that chosen model. In EDA, data collection is followed immediately by analysis, and the goal of that analysis is to discover which model is appropriate before any conclusions are drawn. Technically, there are six main differences between the two approaches.

Classical Analysis vs EDA

Model

The classical approach imposes models, both deterministic and probabilistic, on the data. Deterministic models include, for example, regression models and ANOVA (analysis of variance) models. The most common probabilistic model assumes that the errors of the deterministic model are normally distributed, and this assumption affects the validity of tests such as the ANOVA F test.

The Exploratory Data Analysis approach, by contrast, does not impose deterministic or probabilistic models on the data. Instead, it lets the data suggest the models that fit it best.

Focus

The two approaches also differ significantly in their focus. Classical analysis concentrates on the model: it estimates the model's parameters and generates predicted values from it. EDA concentrates on the data itself: its structure, its outliers, and the models that the data suggest.

Techniques

Classical techniques are generally quantitative in nature, so they mostly produce numeric or tabular output. Hypothesis tests, confidence intervals, and ANOVA are the valuable, mainstream tools of classical analysis.
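To make the "numeric output" of a classical technique concrete, here is a minimal sketch of a confidence interval for a mean, written in Python with only the standard library. The function name `mean_confidence_interval` and the sample values are made up for illustration. The sketch also uses a normal approximation for simplicity; for a sample this small, a t-distribution would usually be the more appropriate choice.

```python
import math
import statistics


def mean_confidence_interval(sample, confidence=0.95):
    """Return (mean, lower, upper) for the sample mean.

    Assumption: a normal approximation is used for the critical value.
    For small samples a t-distribution would be more appropriate.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    sem = statistics.stdev(sample) / math.sqrt(n)
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean, mean - z * sem, mean + z * sem


# Hypothetical measurements, invented for this example.
sample = [4.9, 5.1, 5.0, 5.3, 4.8, 5.2, 5.0, 4.7]
mean, lower, upper = mean_confidence_interval(sample)
print(f"mean={mean:.3f}  95% CI=({lower:.3f}, {upper:.3f})")
```

The result is a handful of numbers about one parameter of an assumed model, which is exactly the kind of output the classical approach tends to produce.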

The EDA approach relies mostly on graphical techniques such as histograms, scatter plots, and box plots, which draw deeper insights from the data by exploiting the human brain's natural ability to recognize patterns. With statistical graphics, one can more easily find and validate the right model, detect outliers, identify relationships, and spot possible factor effects.
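As a rough illustration of what a box plot shows, the sketch below computes the quantities it draws: the quartiles, the whisker fences, and the points flagged as outliers. It uses only Python's standard library so that it runs without dependencies; in practice a plotting library would render the picture. The function name `box_plot_summary`, the sample data, and the common 1.5 × IQR fence rule are assumptions chosen for the example, not something prescribed by the article.

```python
import statistics


def box_plot_summary(values, whisker=1.5):
    """Return the quartiles, fences, and outliers a box plot would display.

    Assumption: outliers are the points lying more than `whisker` times
    the interquartile range (IQR) beyond the first or third quartile.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Hypothetical data with one suspicious value, invented for this example.
data = [12, 14, 15, 15, 16, 17, 18, 19, 21, 48]
summary = box_plot_summary(data)
for name, value in summary.items():
    print(f"{name}: {value}")
```

No model is assumed here: the summary simply describes the structure of the data and points at the value that deserves a closer look.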

This article continues on our website. For the other three main differences, and for the tools and practices of EDA, please visit the blog post.
