Mastering Statistical Data Analysis: Concepts, Use Cases & Python Code 🚀📊
Statistics for Machine Learning
1. Descriptive Statistics
Concept Explanation
Descriptive statistics summarize and organize data using measures such as mean, median, mode, variance, and standard deviation.
Business Use Case
Businesses use descriptive statistics to analyze customer demographics, sales performance, and market trends.
Supporting Python Code
import numpy as np
import pandas as pd
# Sample Data
data = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
# Descriptive Statistics
print("Mean:", np.mean(data))
print("Median:", np.median(data))
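# np.std uses the population formula (ddof=0) by default; pass ddof=1 for the sample standard deviation
print("Standard Deviation:", np.std(data))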
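The example above does not compute the mode or the variance, two of the measures named in the concept explanation. A minimal sketch of both, assuming a small made-up sample (`values` is hypothetical and chosen so that one value repeats):

```python
import statistics
import numpy as np

# Hypothetical sample, chosen so that one value (20) repeats
values = [10, 20, 20, 30, 40]

mode_value = statistics.mode(values)       # most frequent value
pop_variance = np.var(values)              # population variance (divides by n)
sample_variance = np.var(values, ddof=1)   # sample variance (divides by n - 1)

print("Mode:", mode_value)
print("Population variance:", pop_variance)
print("Sample variance:", sample_variance)
```

The `ddof=1` form is usually the one to report when the data is a sample rather than the whole population.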
2. Inferential Statistics
Concept Explanation
Inferential statistics allow us to make predictions or generalizations about a population based on a sample using hypothesis testing and confidence intervals.
Business Use Case
Businesses use inferential statistics for A/B testing in marketing campaigns and product performance evaluation.
Supporting Python Code
import numpy as np
from scipy import stats
# Sample Data
sample_data = [12, 14, 18, 19, 24, 26, 30, 32]
# Confidence Interval (95%); the t-distribution is used because the sample is small (n = 8)
confidence_interval = stats.t.interval(0.95, df=len(sample_data) - 1, loc=np.mean(sample_data), scale=stats.sem(sample_data))
print("95% Confidence Interval:", confidence_interval)
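Hypothesis testing, the other tool named in the concept explanation, can be sketched on the same sample. The snippet below runs a one-sample t-test with `scipy.stats.ttest_1samp`; the hypothesized mean of 20 and the 0.05 significance level are assumptions chosen only for illustration.

```python
from scipy import stats

sample_data = [12, 14, 18, 19, 24, 26, 30, 32]
hypothesized_mean = 20  # assumed value, for illustration only

# Two-sided one-sample t-test of H0: population mean == hypothesized_mean
t_stat, p_value = stats.ttest_1samp(sample_data, popmean=hypothesized_mean)

# Reject H0 only if the p-value falls below the chosen significance level
reject_null = p_value < 0.05

print("T-statistic:", t_stat)
print("P-value:", p_value)
print("Reject H0 at the 5% level:", reject_null)
```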
3. Probability Distributions
Concept Explanation
Probability distributions describe how values of a random variable are distributed. Examples include normal, binomial, and Poisson distributions.
Business Use Case
Businesses use probability distributions for risk assessment, customer behavior prediction, and quality control.
Supporting Python Code
import matplotlib.pyplot as plt
import seaborn as sns
# Normal Distribution
samples = np.random.normal(loc=50, scale=15, size=1000)
sns.histplot(samples, kde=True)
plt.title("Normal Distribution")
plt.show()
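The binomial and Poisson distributions mentioned above are discrete, so they are described by a probability mass function rather than a density. A short sketch using `scipy.stats`, where the parameters (10 trials at probability 0.5, and an average of 3 events per interval) are made-up values:

```python
from scipy.stats import binom, poisson

# Binomial: probability of exactly 5 successes in 10 trials with success probability 0.5
p_five_successes = binom.pmf(5, n=10, p=0.5)

# Poisson: probability of zero events when the average is 3 events per interval
p_zero_events = poisson.pmf(0, mu=3)

# Cumulative probability of at most 2 events under the same Poisson model
p_at_most_two = poisson.cdf(2, mu=3)

print("P(5 successes in 10 trials):", p_five_successes)
print("P(0 events):", p_zero_events)
print("P(at most 2 events):", p_at_most_two)
```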
4. Correlation and Covariance
Concept Explanation
Correlation measures the strength and direction of the linear relationship between two variables on a scale from -1 to 1, while covariance indicates how two variables change together but depends on the units of measurement.
Business Use Case
Businesses use correlation analysis in financial market analysis, customer preferences, and feature selection in machine learning.
Supporting Python Code
# Sample Data
x = [10, 20, 30, 40, 50]
y = [15, 25, 35, 45, 55]
# Correlation and Covariance
print("Correlation:", np.corrcoef(x, y)[0, 1])
print("Covariance:", np.cov(x, y)[0, 1])
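The link between the two measures is that correlation is covariance divided by the product of the two standard deviations. The sketch below checks this on the same data and shows, as an illustration, that rescaling `x` (here by an arbitrary factor of 100) changes the covariance but not the correlation.

```python
import numpy as np

x = np.array([10, 20, 30, 40, 50])
y = np.array([15, 25, 35, 45, 55])

# Sample covariance and the correlation derived from it
cov_xy = np.cov(x, y)[0, 1]
corr_manual = cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

# Rescale x (for example, a change of units) and recompute both measures
cov_scaled = np.cov(x * 100, y)[0, 1]
corr_scaled = np.corrcoef(x * 100, y)[0, 1]

print("Covariance:", cov_xy, "-> after rescaling:", cov_scaled)
print("Correlation:", corr_manual, "-> after rescaling:", corr_scaled)
```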
5. Regression Analysis
Concept Explanation
Regression analysis is used to model relationships between dependent and independent variables, commonly using linear regression.
Business Use Case
Used in sales forecasting, pricing strategies, and customer demand prediction.
Supporting Python Code
from sklearn.linear_model import LinearRegression
# Sample Data
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])
# Linear Regression Model
model = LinearRegression()
model.fit(X, y)
predictions = model.predict(X)
print("Predictions:", predictions)
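Beyond predictions, the fitted line itself describes the modeled relationship. A minimal sketch that reads the slope, intercept, and R² from the same kind of model, using scikit-learn's `coef_`, `intercept_`, and `score`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression().fit(X, y)

slope = model.coef_[0]         # change in y per unit change in X
intercept = model.intercept_   # predicted y when X is 0
r_squared = model.score(X, y)  # share of the variance in y explained by the model

print("Slope:", slope)
print("Intercept:", intercept)
print("R-squared:", r_squared)
```

Because this sample lies exactly on the line y = 2x, the fit is perfect; real data would give an R² below 1.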
6. ANOVA (Analysis of Variance)
Concept Explanation
ANOVA is a statistical method that tests whether the means of several groups differ by comparing the variation between groups with the variation within groups.
Business Use Case
Used in comparing customer satisfaction across different service levels.
Supporting Python Code
from scipy.stats import f_oneway
group1 = [20, 21, 22, 23, 24]
group2 = [30, 31, 32, 33, 34]
group3 = [40, 41, 42, 43, 44]
# ANOVA Test
stat, p = f_oneway(group1, group2, group3)
print("ANOVA Test P-Value:", p)
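To see what the test computes, the F-statistic can be rebuilt by hand as the ratio of between-group to within-group mean squares. The sketch below does this for the same three groups and compares the result with `f_oneway`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

groups = [
    np.array([20, 21, 22, 23, 24]),
    np.array([30, 31, 32, 33, 34]),
    np.array([40, 41, 42, 43, 44]),
]
all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Variation of the group means around the grand mean
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_manual = (ss_between / df_between) / (ss_within / df_within)
f_scipy, p_value = f_oneway(*groups)

print("Manual F:", f_manual)
print("SciPy F:", f_scipy)
```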
7. Chi-Square Tests
Concept Explanation
Chi-Square tests determine if there is a significant association between categorical variables.
Business Use Case
Used in market basket analysis to determine relationships between products purchased together.
Supporting Python Code
from scipy.stats import chi2_contingency
# Sample Contingency Table
data = [[10, 20, 30], [15, 25, 35]]
stat, p, dof, expected = chi2_contingency(data)
print("Chi-Square Test P-Value:", p)
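The test works by comparing the observed counts with the counts expected if the two variables were independent (row total times column total, divided by the grand total). A sketch that rebuilds the expected table and the statistic by hand for the same contingency table:

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[10, 20, 30], [15, 25, 35]])

# Expected counts under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected_manual = row_totals * col_totals / observed.sum()

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

chi2_scipy, p_value, dof, expected_scipy = chi2_contingency(observed)

print("Expected counts:\n", expected_manual)
print("Manual statistic:", chi2_manual, "SciPy statistic:", chi2_scipy)
```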
8. K-Means Clustering
Concept Explanation
K-Means is an unsupervised learning algorithm used for clustering similar data points.
Business Use Case
Used in customer segmentation, market segmentation, and pattern recognition.
Supporting Python Code
from sklearn.cluster import KMeans
# Sample Data
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
# K-Means Clustering
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
print("Cluster Centers:", kmeans.cluster_centers_)
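Once fitted, the model can also report which cluster each point belongs to and assign new points to the nearest center. A small sketch, assuming made-up, well-separated data and an explicit `n_init=10` so the result does not depend on a single random start:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated groups along the first feature
X = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster index assigned to each training point
labels = kmeans.labels_

# Assign unseen points to the nearest learned center
new_points = np.array([[0, 0], [12, 3]])
new_labels = kmeans.predict(new_points)

print("Labels:", labels)
print("Labels for new points:", new_labels)
```

The numeric label values (0 or 1) are arbitrary; only the grouping they define is meaningful.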
9. Support Vector Machines (SVM)
Concept Explanation
SVM is used for classification and regression. For classification, it finds the hyperplane that separates the classes with the largest margin.
Business Use Case
Used in text classification, spam detection, and fraud detection.
Supporting Python Code
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Load dataset
iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train SVM Model
svm_model = SVC(kernel='linear')
svm_model.fit(X_train, y_train)
# Predict and Evaluate
predictions = svm_model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("SVM Model Accuracy:", accuracy)
10. Bayesian Statistics
Concept Explanation
Bayesian statistics is a probabilistic approach that incorporates prior knowledge along with new evidence to update probabilities.
Business Use Case
Used in spam filtering, recommendation systems, and medical diagnostics for probability-based decision-making.
Supporting Python Code
from scipy.stats import beta
import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 1, 100)
# Density of a Beta(2, 5) distribution, which can serve as a prior for an unknown probability
plt.plot(x, beta.pdf(x, 2, 5), label='Beta Distribution (2,5)')
plt.legend()
plt.show()
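The updating step itself can be sketched with the Beta-Binomial model: if the prior for an unknown success probability is Beta(a, b), then after observing some successes and failures the posterior is Beta(a + successes, b + failures). The observed counts below (7 successes, 3 failures) are made up for illustration.

```python
from scipy.stats import beta

# Prior belief about an unknown success probability
prior_a, prior_b = 2, 5

# Hypothetical new evidence
successes, failures = 7, 3

# Conjugate update: add the observed counts to the prior parameters
post_a = prior_a + successes
post_b = prior_b + failures

prior_mean = beta(prior_a, prior_b).mean()
posterior_mean = beta(post_a, post_b).mean()
lower, upper = beta(post_a, post_b).interval(0.95)  # 95% credible interval

print("Prior mean:", prior_mean)
print("Posterior mean:", posterior_mean)
print("95% credible interval:", (lower, upper))
```

The posterior mean sits between the prior mean and the observed success rate, which is the sense in which prior knowledge and new evidence are combined.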
11. Central Limit Theorem (CLT)
Concept Explanation
The CLT states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the shape of the population distribution, provided the observations are independent and the population has finite variance.
Business Use Case
Used in quality control, financial risk analysis, and polling predictions.
Supporting Python Code
import numpy as np
import matplotlib.pyplot as plt
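# Draw 1000 samples of size 30 from a uniform integer population (1 to 99) and keep each sample's mean
means = [np.mean(np.random.randint(1, 100, 30)) for _ in range(1000)]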
plt.hist(means, bins=30, density=True)
plt.show()
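The same idea can be checked numerically on a clearly non-normal population. The sketch below assumes an exponential population with mean 1 and standard deviation 1 (an illustrative choice) and compares the spread of the sample means with the value the theory predicts, sigma / sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

sample_size = 30
n_samples = 2000

# Exponential population with scale=1: mean 1, standard deviation 1, strongly right-skewed
samples = rng.exponential(scale=1.0, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

expected_se = 1.0 / np.sqrt(sample_size)  # sigma / sqrt(n)
observed_se = sample_means.std(ddof=1)

print("Mean of sample means:", sample_means.mean())
print("Expected standard error:", expected_se)
print("Observed standard error:", observed_se)
```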
12. Time Series Analysis – Analyzing Data Over Time
Concept Explanation
Time series analysis focuses on trends, seasonal patterns, and forecasting in data collected over time.
Business Use Case
Used in stock market analysis, sales forecasting, and weather prediction.
Supporting Python Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Sample Time Series Data: a random walk over 100 consecutive days
dates = pd.date_range(start='2020-01-01', periods=100)
data = pd.Series(np.random.randn(100).cumsum(), index=dates)
plt.plot(data)
plt.show()
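A common first step when looking for a trend is to smooth the series with a rolling mean. A minimal sketch on the same kind of random-walk data, where the 7-day window is an arbitrary choice for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

dates = pd.date_range(start="2020-01-01", periods=100)
series = pd.Series(rng.standard_normal(100).cumsum(), index=dates)

# 7-day rolling mean; the first 6 values are NaN because the window is not yet full
rolling_mean = series.rolling(window=7).mean()

print(rolling_mean.tail())
```

Plotting `series` and `rolling_mean` together makes the smoothed trend easier to see against the daily noise.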
13. Principal Component Analysis (PCA)
Concept Explanation
PCA is a dimensionality reduction technique used to transform correlated variables into uncorrelated principal components.
Business Use Case
Used in image compression, facial recognition, and reducing features in machine learning models.
Supporting Python Code
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
iris = load_iris()
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(iris.data)
print(reduced_data[:5])
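Two properties are worth checking after a PCA: how much of the original variance the retained components keep, and that the components are in fact uncorrelated. A short sketch on the same iris data, using scikit-learn's `explained_variance_ratio_`:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(iris.data)

# Share of the total variance captured by each retained component
explained = pca.explained_variance_ratio_

# Correlation between the two component scores (should be numerically zero)
component_corr = np.corrcoef(reduced_data[:, 0], reduced_data[:, 1])[0, 1]

print("Explained variance ratio:", explained)
print("Total variance retained:", explained.sum())
print("Correlation between components:", component_corr)
```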
14. Machine Learning Basics – Statistical Models for Predictions
Concept Explanation
Statistical models, such as regression and classification, form the foundation of machine learning by identifying patterns in data.
Business Use Case
Used in predictive maintenance, fraud detection, and customer segmentation.
Supporting Python Code
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)
print(model.predict([[6]]))
15. A/B Testing & Experimental Design – Optimizing Business Strategies
Concept Explanation
A/B testing compares two versions of something, such as a web page or an offer, by exposing each to a separate group and checking whether the difference in performance is statistically significant.
Business Use Case
Used in marketing campaigns, website optimization, and pricing strategies.
Supporting Python Code
from scipy.stats import ttest_ind
A = [20, 23, 25, 30, 40]
B = [22, 24, 28, 35, 38]
# ttest_ind assumes equal variances by default; pass equal_var=False for Welch's t-test
t_stat, p_val = ttest_ind(A, B)
print(f"T-statistic: {t_stat}, P-value: {p_val}")
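Many A/B tests compare conversion rates rather than averages, in which case a two-proportion z-test is a common choice. The sketch below computes it by hand with `scipy.stats.norm`; the visitor and conversion counts are invented for illustration.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical results: conversions out of visitors for each version
conversions_a, visitors_a = 200, 2000
conversions_b, visitors_b = 250, 2000

rate_a = conversions_a / visitors_a
rate_b = conversions_b / visitors_b

# Pooled conversion rate under H0 (both versions convert equally well)
pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
std_error = np.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))

z_stat = (rate_b - rate_a) / std_error
p_value = 2 * norm.sf(abs(z_stat))  # two-sided p-value

print(f"Rate A: {rate_a:.3f}, Rate B: {rate_b:.3f}")
print(f"Z-statistic: {z_stat:.3f}, P-value: {p_value:.4f}")
```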
16. Multivariate Analysis – Understanding Relationships in Complex Data
Concept Explanation
Multivariate analysis examines relationships among multiple variables simultaneously.
Business Use Case
Used in healthcare for disease risk prediction and in finance for portfolio analysis.
Supporting Python Code
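import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
# Sample Data
df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 3, 4, 5, 6], 'Z': [5, 4, 3, 2, 1]})
# Scatter plots for every pair of variables, with histograms on the diagonal
sns.pairplot(df)
plt.show()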
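A correlation matrix gives a numeric counterpart to the pair plot by summarizing every pairwise linear relationship in one table. A minimal sketch on the same data with `DataFrame.corr`:

```python
import pandas as pd

df = pd.DataFrame({'X': [1, 2, 3, 4, 5], 'Y': [2, 3, 4, 5, 6], 'Z': [5, 4, 3, 2, 1]})

# Pearson correlation between every pair of columns
corr_matrix = df.corr()

print(corr_matrix)
```

In this sample, X and Y rise together while Z falls as X rises, so the matrix shows a perfect positive and a perfect negative correlation respectively.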
17. Glossary of Terminologies