Day 2 - Statistics in Machine Learning
▶️ Range
The range is the simplest measure of dispersion (variability) in a dataset. It is calculated by subtracting the lowest value from the highest value. A wide range indicates high variability, and a narrow range indicates low variability in the distribution.
Range = Highest value - Lowest value
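As a minimal sketch of the formula above, the range can be computed in Python with the built-in max and min functions. The function name value_range and the sample scores are illustrative assumptions, not part of the original article.

```python
def value_range(data):
    """Return the range: highest value minus lowest value."""
    if not data:
        raise ValueError("data must contain at least one value")
    return max(data) - min(data)


# Hypothetical data, used only for illustration.
scores = [4, 7, 9, 15, 21]
print(value_range(scores))  # 21 - 4 = 17
```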
▶️ Interquartile Range (IQR)
The interquartile range is the spread of the middle 50% of a dataset. In other words, it is the distance between the first quartile (Q1) and the third quartile (Q3).
👉 Here's how to find the IQR:
➡️ Step 1: Put the data in order from least to greatest.
➡️ Step 2: Find the median. If the number of data points is odd, the median is the middle data point. If the number of data points is even, the median is the average of the middle two data points.
➡️ Step 3: Find the first quartile (Q1). The first quartile is the median of the data points to the left of the median in the ordered list.
➡️ Step 4: Find the third quartile (Q3). The third quartile is the median of the data points to the right of the median in the ordered list.
➡️ Step 5: Calculate the IQR by subtracting Q1 from Q3: IQR = Q3 - Q1.
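The five steps above can be sketched in Python roughly as follows. The function name iqr_median_of_halves and the example data are assumptions made for illustration. For an odd number of data points, this sketch leaves the median itself out of both halves, which is one common reading of "to the left" and "to the right" of the median.

```python
from statistics import median


def iqr_median_of_halves(data):
    """Return (Q1, Q3, IQR) using the median-of-halves steps."""
    ordered = sorted(data)              # Step 1: order the data
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two data points")
    half = n // 2                       # Step 2: locate the median position
    lower = ordered[:half]              # points to the left of the median
    # For odd n, skip the middle point; for even n, take the upper half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    q1 = median(lower)                  # Step 3: first quartile
    q3 = median(upper)                  # Step 4: third quartile
    return q1, q3, q3 - q1              # Step 5: IQR = Q3 - Q1


# Hypothetical data: sorted it is [2, 4, 7, 9, 11, 15, 21].
print(iqr_median_of_halves([4, 7, 9, 15, 21, 2, 11]))  # (4, 15, 11)
```

Library routines such as numpy.percentile interpolate between data points by default, so they may report slightly different quartiles than this hand method on small datasets.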
▶️ Variance
Variance is a measure of dispersion that describes how far the values in a dataset lie from the mean. To compute it, first calculate the mean, then find each value's squared deviation from the mean, and finally average those squared deviations.
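Population Variance: $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$, where $\mu$ is the population mean and $N$ is the number of data points in the population.
Sample Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, where $\bar{x}$ is the sample mean and $n$ is the number of data points in the sample.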
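To make the difference between the two variances concrete, here is a small Python sketch. It assumes the standard definitions: the population variance divides the sum of squared deviations by N, and the sample variance divides it by n - 1. The function names and the dataset are illustrative only.

```python
def population_variance(data):
    """Average squared deviation from the mean (divide by N)."""
    n = len(data)
    if n == 0:
        raise ValueError("need at least one data point")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def sample_variance(data):
    """Squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


# Hypothetical data: mean is 5, squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_variance(values))        # 32 / 8 = 4.0
print(round(sample_variance(values), 3))  # 32 / 7, about 4.571
```

Python's standard library offers statistics.pvariance and statistics.variance for the same two quantities.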
▶️ Standard Deviation
Standard deviation measures the spread of a data distribution. The more spread out a data distribution is, the greater its standard deviation.
Overview of how to calculate Standard Deviation
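The formula for standard deviation (SD) is
$\text{SD} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
where $\sum$ means "sum of", $x$ is a value in the dataset, $\mu$ is the mean of the dataset, and $N$ is the number of data points in the population.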
👉 Here's a quick preview of the steps we're about to follow:
➡️ Step 1: Find the mean.
➡️ Step 2: For each data point, find the square of its distance to the mean.
➡️ Step 3: Sum the values from Step 2.
➡️ Step 4: Divide by the number of data points.
➡️ Step 5: Take the square root.
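The five steps above map almost line for line onto code. The sketch below is one possible Python version for a population; the function name population_sd and the example data are assumptions for illustration.

```python
import math


def population_sd(data):
    """Population standard deviation, following the five steps."""
    n = len(data)
    if n == 0:
        raise ValueError("need at least one data point")
    mean = sum(data) / n                                 # Step 1: mean
    squared_distances = [(x - mean) ** 2 for x in data]  # Step 2: squared distances
    total = sum(squared_distances)                       # Step 3: sum them
    mean_squared = total / n                             # Step 4: divide by N
    return math.sqrt(mean_squared)                       # Step 5: square root


# Hypothetical data: mean 3, squared distances 9, 1, 0, 4 (sum 14).
print(round(population_sd([6, 2, 3, 1]), 3))  # sqrt(14 / 4), about 1.871
```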
👉 An Important Note
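The formula above is for finding the standard deviation of a population. If you're dealing with a sample, you'll want to use a slightly different formula, which uses n−1 instead of N: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the sample mean and $n$ is the number of data points in the sample.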
▶️ Population and Sample Standard Deviation
Standard deviation measures the typical distance between each data point and the mean.
The formula we use for standard deviation depends on whether the data is being considered a population of its own, or the data is a sample representing a larger population.
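As a quick illustration of that choice, Python's standard statistics module exposes both versions: pstdev treats the data as a whole population (divides by N), while stdev treats it as a sample (divides by n - 1). The dataset below is hypothetical.

```python
import statistics

# Hypothetical measurements, used only for illustration.
data = [6, 2, 3, 1]

# Treat the data as the entire population: divide by N.
pop_sd = statistics.pstdev(data)

# Treat the data as a sample of a larger population: divide by n - 1.
sample_sd = statistics.stdev(data)

print(round(pop_sd, 3))     # sqrt(14 / 4), about 1.871
print(round(sample_sd, 3))  # sqrt(14 / 3), about 2.160
```

The sample version comes out slightly larger because it divides by a smaller number, and the gap shrinks as the number of data points grows.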