
Statistics Cheat Sheet

Last Updated : 06 Dec, 2024

In the field of data science, statistics serves as the backbone, providing the essential tools and techniques for extracting meaningful insights from data. Understanding statistics is imperative for any data scientist, as it equips them with the necessary skills to make informed decisions, derive accurate predictions, and uncover hidden patterns within vast datasets.

This article explains the significance of statistics in data science, exploring its fundamental concepts and real-life applications.




What is Statistics?

Statistics is the branch of mathematics concerned with collecting, analyzing, interpreting, and presenting numerical data. It encompasses a wide array of methods and techniques used to summarize and make sense of complex datasets.

Key concepts in statistics include descriptive statistics, which involve summarizing and presenting data in a meaningful way, and inferential statistics, which allow us to make predictions or inferences about a population based on a sample of data. Probability theory, hypothesis testing, regression analysis, and Bayesian methods are among the many branches of statistics that find applications in data science.

Types of Statistics

There are commonly two types of statistics, which are discussed below:

  • Descriptive Statistics - Descriptive statistics are tools that help us simplify and organize large chunks of data, making vast amounts of information easier to understand.
  • Inferential Statistics - Inferential statistics are techniques that allow us to make generalizations and predictions about a population based on a sample of data. They help us draw conclusions about the larger group from which the sample was taken.

Descriptive Statistics

Measure of Central Tendency

Mean

The mean is calculated by summing all values in the sample and dividing by the total number of values.

Mean (\mu) = \frac{Sum \, of \, Values}{Number \, of \, Values}

Median

The median is the middle value of a sample when it is arranged from lowest to highest (or highest to lowest). To find the median, the data must first be sorted.

For an odd number of data points:

Median = \left(\frac{n+1}{2}\right)^{th} \, value

For an even number of data points:

Median = Average \, of \, the \, \left(\frac{n}{2}\right)^{th} \, value \, and \, its \, next \, value

Mode

The mode is the most frequently occurring value in the dataset.


Each of these measures offers distinct perspectives on the central tendency of a dataset. The mean is influenced by extreme values (outliers), whereas the median is more resilient when outliers are present. The mode is valuable for pinpointing the most frequently occurring value(s) in a dataset.
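As a quick illustration, all three measures can be computed with Python's built-in statistics module. This is a minimal sketch; the sample values below are made up for demonstration.

```python
import statistics

data = [2, 3, 3, 5, 7, 10, 3]  # illustrative sample values

mean = statistics.mean(data)      # sum of values / number of values
median = statistics.median(data)  # middle value after sorting
mode = statistics.mode(data)      # most frequent value

print(mean, median, mode)  # 4.714..., 3, 3
```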

Measure of Dispersion

Range

Range is the difference between the maximum and minimum values of the sample.

Mean Absolute Deviation

Mean Absolute Deviation (MAD) is the average of the absolute differences between each data point and the mean. It provides a measure of the average deviation from the mean.

\text{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| X_i - \bar{X} \right|

Where:

X_i are the individual data points.

\bar{X} is the mean of the data points.

n is the number of data points.

Standard Deviation and Variance

Variance measures how spread out the values are around the mean.

\sigma^2=\frac{\sum(X_i-\mu)^2}{n}

Standard Deviation is the square root of the variance, so it has the same unit as the sample values. It indicates the average distance of data points from the mean and is widely used due to its intuitive interpretation.

\sigma=\sqrt{\sigma^2}=\sqrt{\frac{\sum(X_i-\mu)^2}{n}}

Interquartile Range (IQR)

Interquartile Range (IQR) is the range between the first quartile (Q1) and the third quartile (Q3). It is less sensitive to extreme values than the range.

IQR = Q_3 - Q_1

To compute the IQR, arrange the data in ascending order and find the quartiles: Q1 is the median of the lower half of the dataset and Q3 is the median of the upper half.

Coefficient of Variation (CV)

Coefficient of Variation (CV) is the ratio of the standard deviation to the mean, expressed as a percentage. It is useful for comparing the relative variability of different datasets.

CV = \frac{\sigma}{\mu} \times 100

Z-score

The Z-score measures the number of standard deviations a data point is from the mean of the dataset, providing a standardized way to assess its relative position within the distribution.

Z=\frac{X-\mu}{\sigma}
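The dispersion measures above can be sketched in Python with NumPy. The values are illustrative, and the population formulas (dividing by n) are used to match the equations above.

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 9, 7])  # illustrative sample values

data_range = data.max() - data.min()        # range
mad = np.mean(np.abs(data - data.mean()))   # mean absolute deviation
variance = np.var(data)                     # population variance (divides by n)
std = np.std(data)                          # standard deviation (sqrt of variance)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                               # interquartile range
cv = std / data.mean() * 100                # coefficient of variation (%)
z_scores = (data - data.mean()) / std       # z-score of each data point

print(data_range, mad, variance, std, iqr, cv)
```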


Quartiles

Quartiles divide the dataset into four equal parts:

  • Q1 (first quartile) is the value below which 25% of the data falls; it is the median of the lower half of the dataset.
  • Q2 is the median (50%).
  • Q3 (third quartile) is the value below which 75% of the data falls; it is the median of the upper half of the dataset.

Measure of Shape

Kurtosis

Kurtosis is a statistical measure that describes the shape of a distribution's tails in relation to its overall shape. It indicates whether data points are more or less concentrated in the tails compared to a normal distribution. High kurtosis signifies heavy tails and possibly outliers, while low kurtosis indicates light tails and a lack of outliers.


Skewness

Skewness is a statistical measure that describes the asymmetry of a probability distribution around its mean. A distribution can be:

  • Positively skewed (right-skewed): The right tail (higher values) is longer or fatter than the left tail. Most of the data points are concentrated on the left, with a few larger values extending to the right.
  • Negatively skewed (left-skewed): The left tail (lower values) is longer or fatter than the right tail. Most of the data points are concentrated on the right, with a few smaller values extending to the left.

Types of Skewness

Right Skew:

  • Also known as positive skewness.
  • Characteristics:
    • Longer or fatter tail on the right-hand side (upper tail).
    • More extreme values on the right side.
    • Mean > Median.
  • Indicates a distribution whose tail extends to the right, with most of the data concentrated on the left.

Left Skew:

  • Also known as negative skewness.
  • Characteristics:
    • Longer or fatter tail on the left-hand side (lower tail).
    • More extreme values on the left side.
    • Mean < Median.
  • Indicates a distribution whose tail extends to the left, with most of the data concentrated on the right.

Zero Skew:

  • Also known as symmetrical distribution.
  • Characteristics:
    • Symmetric distribution.
    • Left and right sides are mirror images of each other.
    • Mean = Median.
  • Indicates a distribution with no skewness.
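Both shape measures are available in SciPy. The sketch below uses randomly generated data for illustration; note that scipy.stats.kurtosis returns excess kurtosis by default (0 for a normal distribution).

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(0)
right_skewed = rng.exponential(scale=2.0, size=10_000)  # long right tail
symmetric = rng.normal(loc=0.0, scale=1.0, size=10_000)

print(skew(right_skewed), kurtosis(right_skewed))  # positive skew, positive excess kurtosis
print(skew(symmetric), kurtosis(symmetric))        # both close to 0
```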

Covariance and Correlation

Covariance

Covariance measures the degree to which two variables change together.

Cov(X,Y) = \frac{\sum(X_i-\overline{X})(Y_i - \overline{Y})}{n}

Correlation

Correlation measures the strength and direction of the linear relationship between two variables. It is represented by the correlation coefficient, which ranges from -1 to 1. A positive correlation indicates a direct relationship, while a negative correlation implies an inverse relationship.

Pearson's correlation coefficient is given by:

\rho(X, Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y}
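A short NumPy sketch with made-up paired observations: np.cov and np.corrcoef return 2×2 matrices whose off-diagonal entries are the covariance and Pearson correlation of the two variables.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])  # illustrative paired observations

cov_xy = np.cov(x, y, bias=True)[0, 1]   # population covariance (divides by n)
corr_xy = np.corrcoef(x, y)[0, 1]        # Pearson correlation coefficient

print(cov_xy, corr_xy)
```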

Regression coefficient

Regression coefficient is a value that represents the relationship between a predictor variable and the response variable in a regression model. It quantifies the change in the response variable for a one-unit change in the predictor variable, holding all other predictors constant. In a simple linear regression model, the regression coefficient indicates the slope of the line that best fits the data. In multiple regression, each coefficient represents the impact of one predictor variable while accounting for the effects of other variables in the model.

The equation is y = \alpha + \beta x,

where

  • y is the dependent variable,
  • x is the independent variable,
  • \alpha   is the intercept,
  • \beta   is the regression coefficient.
  • Regression coefficient, \beta = \frac{\sum(X_i-\overline{X})(Y_i - \overline{Y})}{\sum(X_i-\overline{X})^2}
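The slope and intercept can be sketched directly from this formula, or cross-checked with np.polyfit. The x and y values below are the same made-up data used in the covariance example.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 6])

# Regression coefficient (slope) from the formula
beta = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha = y.mean() - beta * x.mean()   # intercept

# Cross-check with NumPy's least-squares fit of a degree-1 polynomial
slope, intercept = np.polyfit(x, y, 1)

print(beta, alpha)        # 0.8 and 1.8 for this data
print(slope, intercept)   # should match
```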

Probability

Probability Functions

Probability Mass Function

The Probability Mass Function (PMF) describes the probability distribution of a discrete random variable. It gives the probability of each possible outcome of the variable.

Probability Density Function

The Probability Density Function (PDF) describes the likelihood of a continuous random variable falling within a particular range. It is the derivative of the cumulative distribution function (CDF).

Cumulative Distribution Function

The Cumulative Distribution Function (CDF) gives the probability that a random variable takes a value less than or equal to a given value. For continuous variables, it is the integral of the probability density function (PDF).

Empirical Distribution Function

The Empirical Distribution Function (EDF) is a non-parametric estimator of the CDF based on observed data. For a given set of data points, the EDF represents the proportion of observations less than or equal to a specific value. It is constructed by sorting the data and assigning a cumulative probability to each data point.

Bayes Theorem

Bayes' Theorem is a fundamental principle in probability theory and statistics that describes how to update the probability of a hypothesis based on new evidence. The formula is as follows:

P(A|B)=\frac{P(B|A) \cdot P(A)}{P(B)}, where

  • P(A|B): The probability of event A given that event B has occurred (posterior probability).
  • P(B|A): The probability of event B given that event A has occurred (likelihood).
  • P(A): The probability of event A occurring (prior probability).
  • P(B): The probability of event B occurring (evidence).
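A worked sketch of Bayes' Theorem in Python. The disease-testing numbers below are hypothetical and chosen only to illustrate the update from prior to posterior.

```python
# Hypothetical example: P(disease) = 1%, test sensitivity = 95%,
# false positive rate = 5%. What is P(disease | positive test)?

p_a = 0.01          # P(A): prior probability of the disease
p_b_given_a = 0.95  # P(B|A): likelihood of a positive test given the disease
p_b_given_not_a = 0.05

# P(B): total probability of a positive test
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # ≈ 0.161, the posterior probability
```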
     

Probability Distributions

Discrete Distribution

Uniform Distribution

The uniform distribution assigns equal probability to all outcomes in a given range [a, b]; in its continuous form, the density is:

f(X)=\frac{1}{b-a}

For example, if a bus arrives uniformly at random between 5 and 18 minutes, the probability of waiting less than 15 minutes is P(X<15)= \int_5^{15}\frac{1}{18-5}\,dx=\frac{10}{13}\approx 0.7692.

Binomial Distribution

The binomial distribution models the number of successes in a fixed number of independent Bernoulli trials, where each trial has the same probability of success (p).

Formula: P(X=k)=\binom{n}{k}p^k(1-p)^{n-k}

Assuming each trial is an independent event with a success probability of p = 0.5, the probability of getting 3 successes in 6 trials is P(X=3)=\binom{6}{3}(0.5)^3(1-0.5)^3=0.3125.

Poisson Distribution

The Poisson distribution models the number of events that occur in a fixed interval of time or space. It is characterized by a single parameter (λ), the average rate of occurrence.

P(X=k)=\frac{e^{-\lambda}\lambda^k}{k!}

For example, if events occur at an average rate of λ = 10 per interval, the probability of observing exactly 12 events is P(X=12)=\frac{e^{-10}\cdot 10^{12}}{12!}\approx 0.0948.
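The worked probabilities above can be reproduced with scipy.stats; this is a sketch using the same parameters.

```python
from scipy.stats import binom, poisson, uniform

# Binomial: 3 successes in 6 trials with p = 0.5
print(binom.pmf(3, n=6, p=0.5))         # 0.3125

# Poisson: exactly 12 events when the average rate is λ = 10
print(poisson.pmf(12, mu=10))           # ≈ 0.0948

# Continuous uniform on [5, 18]: P(X < 15)
print(uniform.cdf(15, loc=5, scale=13)) # ≈ 0.7692
```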

Continuous Distribution

Normal or Gaussian Distribution

The normal distribution, also known as the Gaussian distribution, is a continuous probability distribution that is symmetrical and bell-shaped. It is characterized by its mean (μ) and standard deviation (σ).

Formula: f(X|\mu,\sigma)=\frac{e^{-0.5\left(\frac{X-\mu}{\sigma}\right)^2}}{\sigma\sqrt{2\pi}}

The normal distribution follows an empirical rule, often called the 68-95-99.7 rule, which states that:

  • Approximately 68% of the data falls within one standard deviation (σ) of the mean in both directions.
  • About 95% of the data falls within two standard deviations (2σ) of the mean.
  • Approximately 99.7% of the data falls within three standard deviations (3σ) of the mean.

This rule is often used to detect outliers.

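A small check of the empirical rule with scipy.stats.norm. The standard normal is assumed here, but the percentages hold for any mean and standard deviation.

```python
from scipy.stats import norm

# Probability mass within 1, 2 and 3 standard deviations of the mean
for k in (1, 2, 3):
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} sigma: {p:.4f}")
# within 1 sigma: 0.6827, within 2 sigma: 0.9545, within 3 sigma: 0.9973
```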

Central Limit Theorem

The Central Limit Theorem (CLT) states that, regardless of the shape of the original population distribution, the sampling distribution of the sample mean becomes approximately normal as the sample size grows large.
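A quick simulation sketch of the CLT: sample means drawn from a heavily skewed exponential population become approximately normal (skewness near 0) as the sample size grows. The sample sizes and counts are arbitrary.

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(42)
population = rng.exponential(scale=2.0, size=100_000)  # strongly right-skewed

for n in (2, 30, 200):
    # Draw 5,000 samples of size n and record each sample mean
    sample_means = rng.choice(population, size=(5_000, n)).mean(axis=1)
    print(n, round(skew(sample_means), 3))  # skewness shrinks toward 0 as n grows
```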

Hypothesis Testing

Hypothesis testing makes inferences about a population parameter based on a sample statistic.


Null Hypothesis (H₀) and Alternative Hypothesis (H₁)

  • The null hypothesis states that there is no significant difference or effect. It is assumed to be true unless it is proven false.
  • The alternative hypothesis contradicts the null hypothesis, stating that there is a significant difference or effect. It is what researchers aim to support with evidence.

Level of Significance

The level of significance, often denoted by the symbol (\alpha   ), is a critical parameter in hypothesis testing and statistical significance testing. It defines the probability of making a Type I error, which occurs when a true null hypothesis is incorrectly rejected.

Type I Error and Type II Error

Type I error, also known as Alpha (\alpha) or the significance level, occurs when a true null hypothesis is incorrectly rejected: the given statement is true, but the test concludes it is false.

Type II error, also known as Beta (\beta), occurs when we fail to reject a false null hypothesis: the given statement is false, but the test concludes it is true. The power of a test is 1 - \beta.

Degrees of freedom

Degrees of freedom (df) in statistics represent the number of values in the final calculation of a statistic that are free to vary. For many common tests it equals the sample size minus one (n - 1).

Confidence Intervals

A confidence interval is a range of values that is used to estimate the true value of a population parameter with a certain level of confidence. It provides a measure of the uncertainty or margin of error associated with a sample statistic, such as the sample mean or proportion.

p-value

The p-value, short for probability value, quantifies the evidence against a null hypothesis: it is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.

Example of Hypothesis testing:

Consider an e-commerce company that wants to assess whether a recent website redesign has a significant impact on the average time users spend on its website.

The company collects the following data:

  • Data on user session durations before and after the redesign.
  • Before redesign: Sample mean (\overline{\rm x}) = 3.5 minutes, Sample standard deviation (s) = 1.2 minutes, Sample size (n) = 50.
  • After redesign: Sample mean (\overline{\rm x}) = 4.2 minutes, Sample standard deviation (s) = 1.5 minutes, Sample size (n) = 60.

The hypotheses are defined as:

Null Hypothesis (H0​): The website redesign has no impact on the average user session duration \mu_{after} -\mu_{before} = 0

Alternative Hypothesis (Ha​): The website redesign has a positive impact on the average user session duration \mu_{after} -\mu_{before} > 0

Significance Level:

Choose a significance level, α = 0.05 (commonly used).

Test Statistic and P-Value:

  • Conduct a test for the difference in means.
  • Calculate the test statistic and p-value.

Result:

If the p-value is less than the chosen significance level, reject the null hypothesis.

If the p-value is greater than or equal to the significance level, fail to reject the null hypothesis.

Conclusion:

Based on the analysis, the company draws conclusions about whether the website redesign has a statistically significant impact on user session duration.
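Using only the summary statistics above, the test can be sketched with scipy.stats.ttest_ind_from_stats (Welch's t-test for unequal variances; the one-sided alternative argument requires SciPy ≥ 1.6).

```python
from scipy.stats import ttest_ind_from_stats

# After redesign vs. before redesign, using the summary statistics above
t_stat, p_value = ttest_ind_from_stats(
    mean1=4.2, std1=1.5, nobs1=60,   # after redesign
    mean2=3.5, std2=1.2, nobs2=50,   # before redesign
    equal_var=False,                 # Welch's t-test (unequal variances)
    alternative="greater",           # H_a: mean_after - mean_before > 0
)

print(t_stat, p_value)
if p_value < 0.05:
    print("Reject H0: the redesign increased average session duration.")
else:
    print("Fail to reject H0.")
```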

Parametric Test

Parametric tests are statistical methods that assume the data follows a normal distribution.

Z-test

The Z-test is used to determine whether there is a significant difference between a sample mean and a known population mean, or between the means of two independent samples.

It is useful when the sample size is large and the population standard deviation is known.

  • One Sample Z-test: compares the mean of a single sample to a known population mean.

Z = \frac{\overline{X}-\mu}{\frac{\sigma}{\sqrt{n}}}

  • Two Sample Z-test: compares the means of two independent samples to determine if they are significantly different.

Z = \frac{\overline{X_1} -\overline{X_2}}{\sqrt{\frac{\sigma_{1}^{2}}{n_1} + \frac{\sigma_{2}^{2}}{n_2}}}

T-test

The t-test determines whether there is a significant difference between means; it is typically used when the population standard deviation is unknown and the sample size is small.

  • One-Sample t-test: compares the mean of a single sample to a known population mean.

t = \frac{\overline{X}- \mu}{\frac{s}{\sqrt{n}}}

  • Two-Sample t-Test: compares the means of two independent samples.

t= \frac{\overline{X_1} - \overline{X_2}}{\sqrt{\frac{s_{1}^{2}}{n_1} + \frac{s_{2}^{2}}{n_2}}}

  • Paired t-Test: compares the means of two related or paired samples.

t = \frac{\overline{d}}{\frac{s_d}{\sqrt{n}}}
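Each t-test variant has a direct SciPy counterpart. A minimal sketch with made-up measurements:

```python
import numpy as np
from scipy import stats

before = np.array([3.1, 3.8, 2.9, 4.2, 3.6, 3.3])  # illustrative paired data
after = np.array([3.6, 4.1, 3.2, 4.8, 3.9, 3.7])
other_group = np.array([4.0, 4.4, 3.9, 4.6, 4.2, 4.1])

print(stats.ttest_1samp(before, popmean=3.0))   # one-sample t-test
print(stats.ttest_ind(before, other_group))     # two-sample (independent) t-test
print(stats.ttest_rel(before, after))           # paired t-test
```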

About Variance

F-test:

The F-test is a statistical test that is used to compare the variances of two or more groups to assess whether they are significantly different.

F = \frac{s_{1}^{2}}{s_{2}^{2}}

ANOVA (Analysis of Variance)

One-way ANOVA:

Used to compare the means of three or more groups to determine if there are statistically significant differences among them.

Here, H₀: The means of all groups are equal.

Hₐ: At least one group mean is different.

Two-way ANOVA:

It assesses the influence of two categorical independent variables on a dependent variable, examining the main effects of each variable and their interaction effect.
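A one-way ANOVA sketch with scipy.stats.f_oneway, using three made-up groups of measurements:

```python
from scipy.stats import f_oneway

group_a = [23, 25, 27, 22, 26]   # illustrative measurements per group
group_b = [30, 31, 29, 32, 28]
group_c = [24, 26, 25, 27, 23]

f_stat, p_value = f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)  # a small p-value suggests at least one group mean differs
```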

Confidence Interval

A confidence interval (CI) provides a range of values within which we can be reasonably confident that a population parameter lies. It is an interval estimate, and it is often used to quantify the uncertainty associated with a sample estimate of a population parameter.

Confidence Interval for means (n \geq 30):

\overline{x} \pm z \frac{\sigma}{\sqrt{n}}

Confidence Interval for means (n < 30):

\overline{x} \pm t \frac{s}{\sqrt{n}}

Confidence Interval for proportions:

\widehat{p} \pm z \sqrt{\frac{\widehat{p}(1- \widehat{p})}{n}}
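A sketch of a 95% confidence interval for a mean, computed directly from the formulas above. The summary statistics are illustrative; the z-interval applies for large samples and the t-interval (with n - 1 degrees of freedom) for small ones.

```python
import numpy as np
from scipy.stats import norm, t

x_bar, s, n = 4.2, 1.5, 60          # illustrative sample mean, std dev, size
confidence = 0.95

# Large-sample (z) interval
z = norm.ppf(1 - (1 - confidence) / 2)          # ≈ 1.96
z_interval = (x_bar - z * s / np.sqrt(n), x_bar + z * s / np.sqrt(n))

# Small-sample (t) interval with n - 1 degrees of freedom
t_crit = t.ppf(1 - (1 - confidence) / 2, df=n - 1)
t_interval = (x_bar - t_crit * s / np.sqrt(n), x_bar + t_crit * s / np.sqrt(n))

print(z_interval, t_interval)
```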


Non-Parametric Test

Non-parametric tests do not make assumptions about the distribution of the data. They are useful when the data does not meet the assumptions required for parametric tests.

Chi-Squared Test

The chi-squared test is used to determine whether there is a significant association between two categorical variables (test of independence) or whether observed frequencies match an expected distribution (goodness of fit). It compares the observed frequencies in a contingency table with the expected frequencies.

\chi^2=\sum{\frac{(O_{ij}-E_{ij})^2}{E_{ij}}}

The test can also be applied to large datasets with many observations.

Mann-Whitney U Test

The Mann-Whitney U test is used to determine whether there is a difference between two independent groups when the dependent variable is ordinal or continuous. It is applicable when the assumptions for a t-test are not met. All data points from both groups are ranked together, and the test statistic is calculated from these ranks.

Kruskal-Wallis Test

The Kruskal-Wallis test is used to determine whether there are differences among three or more independent groups when the dependent variable is ordinal or continuous. It is the non-parametric alternative to one-way ANOVA.
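Both rank-based tests are available in SciPy. A minimal sketch with made-up ordinal scores:

```python
from scipy.stats import mannwhitneyu, kruskal

group_a = [3, 5, 4, 6, 7, 5]   # illustrative ordinal ratings
group_b = [6, 8, 7, 9, 8, 7]
group_c = [4, 5, 6, 5, 4, 6]

print(mannwhitneyu(group_a, group_b, alternative="two-sided"))  # two groups
print(kruskal(group_a, group_b, group_c))                       # three or more groups
```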

A/B Testing or Split Testing

A/B testing, also known as split testing, is a method used to compare two versions (A and B) of a webpage, app, or marketing asset to determine which one performs better.

Example: a product manager changes a website's "Shop Now" button color from green to blue to improve the click-through rate (CTR). After formulating null and alternative hypotheses, users are divided into A and B groups and their CTRs are recorded. A statistical test such as a chi-square or t-test is applied at a 5% significance level. If the p-value is below 5%, the manager may conclude that changing the button color significantly affects CTR, informing the decision about permanent implementation.
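The button-color experiment can be sketched with a chi-square test of independence; the click counts below are hypothetical.

```python
from scipy.stats import chi2_contingency

# Rows: variant A (green) and variant B (blue); columns: clicked, did not click
observed = [[120, 880],    # hypothetical counts for group A
            [150, 850]]    # hypothetical counts for group B

chi2, p_value, dof, expected = chi2_contingency(observed)
print(chi2, p_value)

if p_value < 0.05:
    print("Reject H0: button color affects the click-through rate.")
else:
    print("Fail to reject H0: no significant difference in CTR.")
```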

