In the field of data science, statistics serves as the backbone, providing the essential tools and techniques for extracting meaningful insights from data. Understanding statistics is imperative for any data scientist, as it equips them with the necessary skills to make informed decisions, derive accurate predictions, and uncover hidden patterns within vast datasets.
This article explains the significance of statistics in data science, exploring its fundamental concepts and real-life applications.
What is Statistics?
Statistics is a branch of mathematics that is responsible for collecting, analyzing, interpreting, and presenting numerical data. It encompasses a wide array of methods and techniques used to summarize and make sense of complex datasets.
Key concepts in statistics include descriptive statistics, which involve summarizing and presenting data in a meaningful way, and inferential statistics, which allow us to make predictions or inferences about a population based on a sample of data. Probability theory, hypothesis testing, regression analysis, and Bayesian methods are among the many branches of statistics that find applications in data science.
Types of Statistics
Statistics is commonly divided into two types, discussed below:
- Descriptive Statistics: tools that help us simplify and organize large chunks of data, making vast amounts of information easier to understand.
- Inferential Statistics: techniques that allow us to make generalizations and predictions about a population based on a sample of data. They help us draw conclusions and make inferences about the larger group from which the sample was taken.
Descriptive Statistics
Measure of Central Tendency
- Mean: the sum of all values in the sample divided by the number of values. Mean (\mu) = \frac{\text{Sum of Values}}{\text{Number of Values}}
- Median: the middle value of the sample when arranged from lowest to highest; the data must be sorted before the median can be found. For an odd number of data points, the median is the (\frac{n+1}{2})^{th} value. For an even number of data points, it is the average of the (\frac{n}{2})^{th} value and the next value.
- Mode: the most frequently occurring value in the dataset.
Each of these measures offers distinct perspectives on the central tendency of a dataset. The mean is influenced by extreme values (outliers), whereas the median is more resilient when outliers are present. The mode is valuable for pinpointing the most frequently occurring value(s) in a dataset.
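As a quick illustration, all three measures can be computed with Python's built-in statistics module; the sample values below are made up for demonstration:

```python
import statistics

# Hypothetical sample with one outlier (99) to show how the mean reacts
data = [2, 3, 3, 5, 7, 9, 99]

print(statistics.mean(data))    # 18.28..., pulled upward by the outlier
print(statistics.median(data))  # 5, the middle value of the sorted data
print(statistics.mode(data))    # 3, the most frequent value
```

Note how a single outlier drags the mean far from the median, which is exactly the robustness difference described above.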
Measure of Dispersion
- Range: the difference between the maximum and minimum values of the sample.
- Mean Absolute Deviation (MAD): the average of the absolute differences between each data point and the mean; it provides a measure of the average deviation from the mean. \text{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| X_i - \bar{X} \right|, where X_i are the individual data points, \bar{X} is the mean of the data points, and n is the number of data points.
- Variance: a measure of how spread out values are around the mean, capturing the dispersion of the data. \sigma^2=\frac{\Sigma(X-\mu)^2}{n}
- Standard Deviation: the square root of the variance. Its unit is the same as the sample values' unit, so it indicates the average distance of data points from the mean and is widely used due to its intuitive interpretation. \sigma=\sqrt{\sigma^2}=\sqrt{\frac{\Sigma(X-\mu)^2}{n}}
- Interquartile Range (IQR): the range between the first quartile (Q_1) and the third quartile (Q_3); it is less sensitive to extreme values than the range. IQR = Q_3 - Q_1. To compute the IQR, arrange the data in ascending order and take the median of the lower and upper halves of the dataset.
- Coefficient of Variation (CV): the ratio of the standard deviation to the mean, expressed as a percentage; useful for comparing the relative variability of different datasets. CV = \frac{\sigma}{\mu} \times 100
- Z-score: the number of standard deviations a data point lies from the mean, providing a standardized way to assess its relative position within the distribution. Z=\frac{X-\mu}{\sigma}
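The dispersion measures above can be sketched in a few lines of numpy; the sample values are hypothetical, and the population formulas (dividing by n) are used to match the formulas above:

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 2, 8, 9, 2, 5])   # hypothetical sample

value_range = data.max() - data.min()              # Range
mad = np.mean(np.abs(data - data.mean()))          # Mean Absolute Deviation
variance = np.var(data)                            # population variance (divides by n)
std = np.std(data)                                 # population standard deviation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                                      # Interquartile Range
cv = std / data.mean() * 100                       # Coefficient of Variation (%)
z_scores = (data - data.mean()) / std              # Z-score of every point

print(value_range, mad, variance, std, iqr, cv)
```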
Quartiles
Quartiles divide the dataset into four equal parts:
- Q1 (25th percentile) is the median of the lower half of the data.
- Q2 (50th percentile) is the median of the whole dataset.
- Q3 (75th percentile) is the median of the upper half of the data.
Measure of Shape
Kurtosis
Kurtosis is a statistical measure that describes the shape of a distribution's tails in relation to its overall shape. It indicates whether data points are more or less concentrated in the tails compared to a normal distribution. High kurtosis signifies heavy tails and possibly outliers, while low kurtosis indicates light tails and a lack of outliers.
Skewness
Skewness is a statistical measure that describes the asymmetry of a probability distribution around its mean. A distribution can be:
- Positively skewed (right-skewed): The right tail (higher values) is longer or fatter than the left tail. Most of the data points are concentrated on the left, with a few larger values extending to the right.
- Negatively skewed (left-skewed): The left tail (lower values) is longer or fatter than the right tail. Most of the data points are concentrated on the right, with a few smaller values extending to the left.
Types of Skewness
Right Skew:
- Also known as positive skewness.
- Characteristics:
- Longer or fatter tail on the right-hand side (upper tail).
- More extreme values on the right side.
- Mean > Median.
- Indicates that the bulk of the data is concentrated on the left, with the tail extending to the right.
Left Skew:
- Also known as negative skewness.
- Characteristics:
- Longer or fatter tail on the left-hand side (lower tail).
- More extreme values on the left side.
- Mean < Median.
- Indicates that the bulk of the data is concentrated on the right, with the tail extending to the left.
Zero Skew:
- Also known as symmetrical distribution.
- Characteristics:
- Symmetric distribution.
- Left and right sides are mirror images of each other.
- Mean = Median.
- Indicates a distribution with no skewness.
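To see skewness and kurtosis numerically, here is a small sketch assuming scipy and numpy are available; an exponential sample stands in for any right-skewed data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=2.0, size=10_000)  # long right tail

print(stats.skew(right_skewed))      # > 0: positive (right) skew
print(stats.kurtosis(right_skewed))  # excess kurtosis; > 0 means heavier tails than normal

# For a right-skewed sample, the mean exceeds the median, as described above:
print(right_skewed.mean() > np.median(right_skewed))   # True
```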
Covariance and Correlation
- Covariance measures the degree to which two variables change together. Cov(X,Y) = \frac{\sum(X_i-\overline{X})(Y_i - \overline{Y})}{n}
- Correlation measures the strength and direction of the linear relationship between two variables. It is represented by the correlation coefficient, which ranges from -1 to 1. A positive correlation indicates a direct relationship, while a negative correlation implies an inverse relationship. Pearson's correlation coefficient is given by: \rho(X, Y) = \frac{Cov(X,Y)}{\sigma_X \sigma_Y}
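A minimal numpy sketch of both quantities, using made-up data and the population formulas above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])  # roughly 2*x, so a strong direct relationship

# Covariance, matching the formula above (divide by n)
cov = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson correlation: covariance scaled by both standard deviations
r = cov / (np.std(x) * np.std(y))
print(cov, r)  # r is close to +1
```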
Regression coefficient
Regression coefficient is a value that represents the relationship between a predictor variable and the response variable in a regression model. It quantifies the change in the response variable for a one-unit change in the predictor variable, holding all other predictors constant. In a simple linear regression model, the regression coefficient indicates the slope of the line that best fits the data. In multiple regression, each coefficient represents the impact of one predictor variable while accounting for the effects of other variables in the model.
The simple linear regression equation is y = \alpha + \beta x,
where
- y is the dependent variable,
- x is the independent variable,
- \alpha is the intercept,
- \beta is the regression coefficient.
- Regression coefficient, \beta = \frac{\sum(X_i-\overline{X})(Y_i - \overline{Y})}{\sum(X_i-\overline{X})^2}
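A short numpy sketch of these formulas, with hypothetical x and y values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9])

# Regression coefficient (slope) from the formula above, then the intercept
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()
print(alpha, beta)  # best-fit line y = alpha + beta * x
```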
Probability
Probability Functions
- Probability Mass Function (PMF): describes the probability distribution of a discrete random variable, giving the probability of each possible outcome.
- Probability Density Function (PDF): describes the likelihood of a continuous random variable falling within a particular range. It is the derivative of the cumulative distribution function (CDF).
- Cumulative Distribution Function (CDF): gives the probability that a random variable will take a value less than or equal to a given value. It is the integral of the probability density function (PDF).
- Empirical Distribution Function (EDF): a non-parametric estimator of the CDF based on observed data. For a given set of data points, the EDF represents the proportion of observations less than or equal to a specific value. It is constructed by sorting the data and assigning a cumulative probability to each data point.
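The sketch below, assuming scipy and numpy are installed, evaluates each function on a concrete distribution; the binomial and normal parameters are arbitrary:

```python
import numpy as np
from scipy import stats

# PMF of a discrete variable: Binomial(n=6, p=0.5)
print(stats.binom.pmf(3, 6, 0.5))   # P(X = 3)

# PDF and CDF of a continuous variable: standard normal
print(stats.norm.pdf(0.0))          # density at x = 0
print(stats.norm.cdf(1.96))         # P(X <= 1.96), about 0.975

# EDF: proportion of observations <= x, built from sorted sample data
sample = np.sort(np.array([2.0, 3.5, 3.5, 4.0, 7.0]))
edf = np.arange(1, len(sample) + 1) / len(sample)
print(list(zip(sample, edf)))
```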
Bayes Theorem
Bayes' Theorem is a fundamental principle in probability theory and statistics that describes how to update the probability of a hypothesis based on new evidence. The formula is as follows:
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}, where
- P(A∣B): The probability of event A given that event B has occurred (posterior probability).
- P(B∣A): The probability of event B given that event A has occurred (likelihood).
- P(A): The probability of event A occurring (prior probability).
- P(B): The probability of event B occurring.
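As a worked sketch, the hypothetical numbers below (a 1% prior, a 99% likelihood, and a 5% false-positive rate) show how the posterior is computed:

```python
# Hypothetical diagnostic-test numbers, purely for illustration
p_a = 0.01                       # P(A): prior probability
p_b_given_a = 0.99               # P(B|A): likelihood
p_b_given_not_a = 0.05           # false-positive rate

# P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = p_b_given_a * p_a / p_b   # P(A|B), Bayes' Theorem
print(posterior)                      # about 0.167: the prior matters a lot
```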
Probability Distributions
Discrete Distribution
- Uniform Distribution: represents a constant probability for all outcomes in a given range. f(X)=\frac{1}{b-a}. For example, if a bus arrives uniformly between 5 and 18 minutes, the probability of waiting less than 15 minutes is P(X<15)= \int_5^{15}\frac{1}{18-5}dx=\frac{10}{13}=0.7692
- Binomial Distribution: models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success (p). P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}. Assuming each trial is an independent event with a success probability of p=0.5, the probability of getting 3 successes in 6 trials is P(X=3)=\binom{6}{3}(0.5)^3(1-0.5)^3=0.3125
- Poisson Distribution: models the number of events that occur in a fixed interval of time or space, characterized by a single parameter (\lambda), the average rate of occurrence. P(X=k)=\frac{e^{-\lambda}\lambda^k}{k!}. Assuming an average rate of \lambda=10, the probability of exactly 12 events is P(X=12)=\frac{e^{-10} \cdot 10^{12}}{12!}\approx 0.0948
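Each of these worked examples can be checked with scipy's distribution objects; a sketch, assuming scipy is installed:

```python
from scipy import stats

# Uniform on [5, 18]: P(X < 15)
print(stats.uniform.cdf(15, loc=5, scale=13))  # ~0.7692

# Binomial: P(X = 3) with n = 6 trials, p = 0.5
print(stats.binom.pmf(3, 6, 0.5))              # 0.3125

# Poisson with lambda = 10: P(X = 12)
print(stats.poisson.pmf(12, mu=10))            # ~0.0948
```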
Continuous Distribution
Normal or Gaussian Distribution
The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetrical and bell-shaped. It is characterized by its mean (\mu) and standard deviation (\sigma).
Formula: f(X|\mu,\sigma)=\frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{X-\mu}{\sigma}\right)^2}
There is an empirical rule for the normal distribution, often referred to as the 68-95-99.7 rule, which states that:
- Approximately 68% of the data falls within one standard deviation (\sigma) of the mean in both directions.
- About 95% of the data falls within two standard deviations (2\sigma) of the mean.
- Approximately 99.7% of the data falls within three standard deviations (3\sigma) of the mean.
This rule is often used to detect outliers.
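A quick simulation, assuming numpy is available, recovers these proportions from synthetic standard-normal data:

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(loc=0.0, scale=1.0, size=100_000)

for k in (1, 2, 3):
    within = np.mean(np.abs(data) <= k)        # fraction within k standard deviations
    print(f"within {k} sigma: {within:.4f}")   # ~0.68, ~0.95, ~0.997
```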

Central Limit Theorem
The Central Limit Theorem (CLT) states that, regardless of the shape of the original population distribution, the sampling distribution of the sample mean will be approximately normally distributed, provided the sample size is sufficiently large.
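A small simulation sketch (numpy assumed) makes this concrete: even though the population below is heavily skewed, the means of repeated samples cluster into a roughly normal shape:

```python
import numpy as np

rng = np.random.default_rng(0)
population = rng.exponential(scale=1.0, size=1_000_000)  # skewed, far from normal

# Means of 2,000 samples of size 50 each
sample_means = [rng.choice(population, size=50).mean() for _ in range(2_000)]
print(np.mean(sample_means))  # close to the population mean, 1.0
print(np.std(sample_means))   # close to sigma / sqrt(n) = 1 / sqrt(50)
```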
Hypothesis Testing
Hypothesis testing makes inferences about a population parameter based on a sample statistic.

Null Hypothesis (H₀) and Alternative Hypothesis (H₁)
- The null hypothesis states that there is no significant difference or effect. It is assumed to be true unless proven false.
- The alternative hypothesis contradicts the null hypothesis, proposing that there is a significant difference or effect. It is what researchers aim to support with evidence.
Level of Significance
The level of significance, often denoted by the symbol (\alpha ), is a critical parameter in hypothesis testing and statistical significance testing. It defines the probability of making a Type I error, which occurs when a true null hypothesis is incorrectly rejected.
Type I Error and Type II Error
- Type I error: also known as an alpha (\alpha) error, it occurs when a true null hypothesis is incorrectly rejected, i.e. the given statement is true but the test says it is false. Its probability equals the significance level \alpha.
- Type II error: also known as a beta (\beta) error, it occurs when a false null hypothesis is not rejected, i.e. the given statement is false but the test says it is true. The power of a test, the probability of correctly rejecting a false null hypothesis, is 1-\beta.
Degrees of freedom
Degrees of freedom (df) in statistics represent the number of values or quantities in the final calculation of a statistic that are free to vary. For many common statistics it equals the sample size minus one (n-1).
Confidence Intervals
A confidence interval is a range of values that is used to estimate the true value of a population parameter with a certain level of confidence. It provides a measure of the uncertainty or margin of error associated with a sample statistic, such as the sample mean or proportion.
p-value
The p-value, short for probability value, is a fundamental concept in statistics that quantifies the evidence against a null hypothesis.
Example of Hypothesis testing:
Consider an e-commerce company that wants to assess whether a recent website redesign has a significant impact on the average time users spend on its website.
The company collects the following data:
- Data on user session durations before and after the redesign.
- Before redesign: Sample mean (\overline{x}) = 3.5 minutes, sample standard deviation (s) = 1.2 minutes, sample size (n) = 50.
- After redesign: Sample mean (\overline{x}) = 4.2 minutes, sample standard deviation (s) = 1.5 minutes, sample size (n) = 60.
The hypotheses are defined as:
Null Hypothesis (H0): The website redesign has no impact on the average user session duration: \mu_{after} - \mu_{before} = 0
Alternative Hypothesis (Ha): The website redesign has a positive impact on the average user session duration: \mu_{after} - \mu_{before} > 0
Significance Level:
Choose a significance level, \alpha = 0.05 (commonly used).
Test Statistic and P-Value:
- Conduct a test for the difference in means.
- Calculate the test statistic and p-value.
Result:
If the p-value is less than the chosen significance level, reject the null hypothesis.
If the p-value is greater than or equal to the significance level, fail to reject the null hypothesis.
Conclusion:
Based on the analysis, the company draws conclusions about whether the website redesign has a statistically significant impact on user session duration.
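One reasonable way to run this analysis is a Welch's two-sample t-test on the summary statistics above; the sketch below assumes scipy is available:

```python
from scipy import stats

# Summary statistics from the example (after vs. before the redesign)
result = stats.ttest_ind_from_stats(
    mean1=4.2, std1=1.5, nobs1=60,   # after redesign
    mean2=3.5, std2=1.2, nobs2=50,   # before redesign
    equal_var=False,                 # Welch's t-test (unequal variances)
    alternative="greater",           # Ha: mean_after - mean_before > 0
)
print(result.statistic, result.pvalue)
# A p-value below 0.05 would lead us to reject H0.
```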
Parametric Test
Parametric tests are statistical methods that assume the data follows a particular distribution, typically the normal distribution.
Z-test
The Z-test is used to determine whether there is a significant difference between a sample mean and a known population mean, or between the means of two independent samples.
It is useful when the sample size is large and the population standard deviation is known.
- One-sample Z-test: compares the mean of a single sample to a known population mean.
Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}
- Two-sample Z-test: compares the means of two independent samples to determine if they are significantly different.
Z = \frac{\overline{X_1} -\overline{X_2}}{\sqrt{\frac{\sigma_{1}^{2}}{n_1} + \frac{\sigma_{2}^{2}}{n_2}}}
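scipy does not ship a dedicated z-test helper, so a minimal hand-rolled sketch (using scipy only for the normal CDF, with hypothetical numbers) looks like this:

```python
import numpy as np
from scipy import stats

def one_sample_z_test(sample_mean, mu, sigma, n):
    """Two-sided one-sample Z-test; sigma is the known population standard deviation."""
    z = (sample_mean - mu) / (sigma / np.sqrt(n))
    p = 2 * stats.norm.sf(abs(z))   # two-sided p-value
    return z, p

# Hypothetical numbers: sample mean 103 vs. population mean 100, sigma = 15, n = 50
print(one_sample_z_test(103, 100, 15, 50))
```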
T-test
The t-test determines whether there is a significant difference between means when the population standard deviation is unknown and the sample standard deviation is used instead.
- One-sample t-test: compares the mean of a single sample to a known population mean.
t = \frac{\overline{X}- \mu}{\frac{s}{\sqrt{n}}}
- Two-sample t-test: compares the means of two independent samples.
t= \frac{\overline{X_1} - \overline{X_2}}{\sqrt{\frac{s_{1}^{2}}{n_1} + \frac{s_{2}^{2}}{n_2}}}
- Paired t-test: compares the means of two related or paired samples.
t = \frac{\overline{d}}{\frac{s_d}{\sqrt{n}}}
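All three variants are available in scipy; a sketch on synthetic data (the group parameters are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
group_a = rng.normal(5.0, 1.0, size=30)
group_b = rng.normal(5.5, 1.0, size=30)

# One-sample: is the mean of group_a different from 5?
print(stats.ttest_1samp(group_a, popmean=5.0))

# Two-sample: independent groups
print(stats.ttest_ind(group_a, group_b))

# Paired: the same subjects measured before and after
after = group_a + rng.normal(0.3, 0.2, size=30)
print(stats.ttest_rel(group_a, after))
```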
Tests About Variance
F-test:
The F-test is a statistical test that is used to compare the variances of two or more groups to assess whether they are significantly different.
F = \frac{s_{1}^{2}}{s_{2}^{2}}
ANOVA (Analysis of Variance)
One-way ANOVA:
Used to compare the means of three or more groups to determine if there are statistically significant differences among them.
Here, H0: The means of all groups are equal.
Ha: At least one group mean is different.
Two-way ANOVA:
It assesses the influence of two categorical independent variables on a dependent variable, examining the main effects of each variable and their interaction effect.
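A one-way ANOVA sketch with scipy, using three hypothetical groups of scores:

```python
from scipy import stats

group1 = [85, 90, 88, 75, 95]
group2 = [70, 65, 80, 72, 68]
group3 = [88, 92, 79, 91, 87]

f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs
```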
Confidence Interval
A confidence interval (CI) provides a range of values within which we can be reasonably confident that a population parameter lies. It is an interval estimate, and it is often used to quantify the uncertainty associated with a sample estimate of a population parameter.
| Confidence Interval for means (n \geq 30) | Confidence Interval for means (n < 30) | Confidence Interval for proportions |
|---|---|---|
| \overline{x} \pm z \frac{\sigma}{\sqrt{n}} | \overline{x} \pm t \frac{s}{\sqrt{n}} | \widehat{p} \pm z \sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}} |
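For the small-sample (t-based) case, a sketch using scipy; the data here are simulated:

```python
import numpy as np
from scipy import stats

data = np.random.default_rng(7).normal(50, 10, size=25)  # hypothetical small sample

mean = data.mean()
sem = stats.sem(data)   # standard error of the mean, s / sqrt(n)

# n < 30, so use the t-distribution with n - 1 degrees of freedom
lo, hi = stats.t.interval(0.95, df=len(data) - 1, loc=mean, scale=sem)
print(f"95% CI for the mean: ({lo:.2f}, {hi:.2f})")
```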
Non-Parametric Test
Non-parametric tests do not make assumptions about the distribution of the data. They are useful when the data does not meet the assumptions required for parametric tests.
- Chi-Squared Test (Test of Independence): used to determine if there is a significant association between two categorical variables. It compares the observed frequencies in a contingency table with the expected frequencies. X^2=\Sigma{\frac{(O_{ij}-E_{ij})^2}{E_{ij}}}. It can also be applied to large datasets with many observations.
- Mann-Whitney U Test: used to determine whether there is a difference between two independent groups when the dependent variable is ordinal or continuous; applicable when the assumptions for a t-test are not met. It ranks all data points, combines the ranks, and calculates the test statistic.
- Kruskal-Wallis Test: used to determine whether there are differences among three or more independent groups when the dependent variable is ordinal or continuous; it is the non-parametric alternative to one-way ANOVA.
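Sketches of all three tests with scipy, on made-up data:

```python
from scipy import stats

# Chi-squared test of independence on a 2x2 contingency table
observed = [[30, 10], [20, 40]]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print(chi2, p)

# Mann-Whitney U test: two independent groups
print(stats.mannwhitneyu([1, 3, 5, 7, 9], [2, 4, 6, 8, 10]))

# Kruskal-Wallis test: three or more independent groups
print(stats.kruskal([1, 3, 5], [2, 4, 6], [7, 8, 9]))
```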
A/B Testing or Split Testing
A/B testing, also known as split testing, is a method used to compare two versions (A and B) of a webpage, app, or marketing asset to determine which one performs better.
Example: a product manager changes a website's "Shop Now" button color from green to blue to improve the click-through rate (CTR). After formulating null and alternative hypotheses, users are divided into A and B groups and their CTRs are recorded. Statistical tests such as the chi-squared test or t-test are applied at a 5% significance level. If the p-value is below 0.05, the manager may conclude that changing the button color significantly affects CTR, informing the decision to implement it permanently.
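A sketch of the button-color example as a chi-squared test of independence, with hypothetical click counts (scipy assumed):

```python
from scipy import stats

# Hypothetical counts: [clicked, not clicked] for each button color
green = [120, 1880]   # version A
blue = [150, 1850]    # version B

chi2, p, dof, expected = stats.chi2_contingency([green, blue])
print(f"chi2 = {chi2:.2f}, p-value = {p:.4f}")
# A p-value below 0.05 would suggest the color change significantly affects CTR
```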