Building a Simple Regression Model
Regression analysis is one of the most fundamental techniques in machine learning and statistics. It is used to predict a continuous outcome variable from one or more predictor variables. In this post, we’ll walk through the process of building a simple linear regression model in Python. By the end, you’ll have a clear understanding of how to implement and interpret a regression model.
What is Simple Linear Regression?
Simple linear regression is a statistical method that models the relationship between a dependent variable (target) and a single independent variable (predictor). The goal is to find the best-fitting straight line that describes the relationship between the two variables. The equation of the line is:
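y = mx + b
Where:
y is the dependent variable (target)
x is the independent variable (predictor)
m is the slope of the line
b is the y-intercept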
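To make the equation concrete, here is a minimal sketch of how the slope and intercept can be computed directly from data with the ordinary least squares formulas. The sample points and variable names are made up for illustration and are not part of the housing example.

```python
import numpy as np

# Hypothetical sample points (not from the housing data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Ordinary least squares for a single predictor:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2)
#   b = y_mean - m * x_mean
x_mean, y_mean = x.mean(), y.mean()
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
```

For a single predictor, this is the same line a least-squares fit such as np.polyfit(x, y, 1) returns.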
Steps to Build a Simple Regression Model
1. Import Required Libraries
We’ll use Python libraries like pandas, numpy, matplotlib, and scikit-learn for data manipulation, visualization, and modeling.
python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
2. Load and Explore the Dataset
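For this example, let’s use the Boston Housing dataset. Note that load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 because of ethical concerns about one of its features, so the code below fetches the same data from OpenML instead (this requires an internet connection). A custom dataset with one numeric predictor and a numeric target works just as well.
python
# Load dataset (load_boston was removed in scikit-learn 1.2, so fetch it from OpenML)
from sklearn.datasets import fetch_openml
boston = fetch_openml(name="boston", version=1, as_frame=True)
data = boston.data.copy()
data['PRICE'] = boston.target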
# Display the first few rows
print(data.head())
# Basic statistics
print(data.describe())
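Beyond head() and describe(), two quick checks are often worth running before modeling: counting missing values and seeing how strongly each column correlates with the target. The sketch below uses a small made-up DataFrame (the numbers are illustrative, not taken from the housing data) so that it runs without downloading anything.

```python
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "RM": [5.9, 6.4, 6.0, 7.2, 6.8, 5.5],
    "AGE": [65.0, 78.0, 45.0, 30.0, 52.0, 90.0],
    "PRICE": [19.5, 24.0, 21.0, 34.5, 29.0, 15.0],
})

# Count missing values per column
missing = toy.isna().sum()
print(missing)

# Pearson correlation of every column with the target, strongest first
corr_with_price = toy.corr()["PRICE"].drop("PRICE").sort_values(ascending=False)
print(corr_with_price)
```

A column with a strong correlation to the target is a natural candidate for the single predictor in a simple linear regression.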
3. Select Features and Target
For simple linear regression, we’ll use one feature (independent variable) to predict the target (dependent variable). Let’s use RM (average number of rooms per dwelling) as the predictor and PRICE as the target.
python
# Select feature and target
X = data[['RM']] # Independent variable
y = data['PRICE'] # Dependent variable
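The double brackets in data[['RM']] matter: scikit-learn estimators expect the features as a 2D table of shape (n_samples, n_features), while the target can be 1D. A small sketch with a made-up DataFrame shows the difference.

```python
import pandas as pd

# Hypothetical toy data with the same column names as above
toy = pd.DataFrame({"RM": [5.9, 6.4, 7.2], "PRICE": [19.5, 24.0, 34.5]})

X_2d = toy[["RM"]]   # DataFrame, shape (3, 1): the layout estimators expect for features
x_1d = toy["RM"]     # Series, shape (3,): would need reshaping before fit()
y_1d = toy["PRICE"]  # Series, shape (3,): fine for the target

print(type(X_2d).__name__, X_2d.shape)
print(type(x_1d).__name__, x_1d.shape)
```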
4. Visualize the Data
Before building the model, it’s helpful to visualize the relationship between the feature and the target.
python
# Scatter plot
plt.scatter(X, y, color='blue')
plt.title('Room Count vs House Price')
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('House Price (PRICE)')
plt.show()
5. Split the Data into Training and Testing Sets
We’ll split the data into a training set (to train the model) and a testing set (to evaluate the model).
python
# Split the data (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
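To see what the split does, the following sketch applies train_test_split to 100 synthetic rows (generated here purely for illustration). With test_size=0.2, 20 rows go to the test set, and fixing random_state makes the shuffle reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples, one feature
X_demo = np.arange(100).reshape(-1, 1)
y_demo = 2.0 * X_demo.ravel() + 1.0

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_te, X_te2))
```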
6. Train the Regression Model
Now, we’ll create and train a simple linear regression model using the training data.
python
# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)
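A useful sanity check for fit() is to train on data whose true line is known. The sketch below generates noisy points around y = 3x + 4 (an arbitrary choice for illustration) and checks that the learned coef_ and intercept_ land close to those values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data around a known line: y = 3x + 4 plus small Gaussian noise
X_syn = rng.uniform(0, 10, size=(200, 1))
y_syn = 3.0 * X_syn.ravel() + 4.0 + rng.normal(0, 0.5, size=200)

demo_model = LinearRegression()
demo_model.fit(X_syn, y_syn)

# coef_ holds one slope per feature; intercept_ is the constant term
print(f"slope: {demo_model.coef_[0]:.2f}, intercept: {demo_model.intercept_:.2f}")
```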
7. Make Predictions
Use the trained model to make predictions on the test data.
python
# Predict on the test set
y_pred = model.predict(X_test)
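predict() also works on values the model has never seen, as long as they are passed in the same 2D layout used for training. In the sketch below, the training numbers are made up and lie exactly on a line so the result is easy to verify by hand.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical training data that lies exactly on PRICE = 10 * RM - 40
train = pd.DataFrame({"RM": [5.0, 6.0, 7.0, 8.0], "PRICE": [10.0, 20.0, 30.0, 40.0]})
demo_model = LinearRegression().fit(train[["RM"]], train["PRICE"])

# New inputs must be 2D and use the same column name as during fit
new_homes = pd.DataFrame({"RM": [6.5, 7.5]})
new_pred = demo_model.predict(new_homes)
print(new_pred)
```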
8. Evaluate the Model
Evaluate the model’s performance using metrics like Mean Squared Error (MSE) and R-squared (R²).
python
# Calculate metrics
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
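Both metrics are simple to compute by hand, which helps when interpreting them. The sketch below uses small made-up arrays and compares the manual formulas with scikit-learn's functions. It also derives RMSE, the square root of MSE, which is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE: same units as the target
rmse = np.sqrt(mse_manual)

print(mse_manual, mean_squared_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
print(rmse)
```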
9. Visualize the Regression Line
Plot the regression line to see how well it fits the data.
python
# Plot the regression line
plt.scatter(X_test, y_test, color='blue', label='Actual Prices')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted Prices')
plt.title('Room Count vs House Price (Test Set)')
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('House Price (PRICE)')
plt.legend()
plt.show()
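A numeric complement to the plot is to look at the residuals (actual minus predicted). For a least-squares line with an intercept, the residuals on the training data average to zero, and large individual residuals point to observations the line fits poorly. The data below is synthetic and only meant as a sketch.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data for illustration only
X_syn = rng.uniform(4, 9, size=(50, 1))
y_syn = 9.0 * X_syn.ravel() - 35.0 + rng.normal(0, 3.0, size=50)

demo_model = LinearRegression().fit(X_syn, y_syn)

# Residual = actual value minus the model's prediction
residuals = y_syn - demo_model.predict(X_syn)

print(f"mean residual: {residuals.mean():.6f}")
print(f"largest absolute residual: {np.abs(residuals).max():.2f}")
```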
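10. Interpret the Results
Slope (model.coef_[0]): the change in the predicted price for each additional room.
Intercept (model.intercept_): the predicted price when RM is zero. This lies outside the range of the data, so it mainly positions the line.
MSE: the average squared difference between actual and predicted prices. Lower is better.
R²: the proportion of the variance in PRICE that the model explains. Values closer to 1 indicate a better fit.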
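As a sketch of this kind of interpretation, the snippet below fits a line to made-up numbers and prints plain-language readings of the slope, intercept, and R². The values and variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical data: rooms vs. price (illustrative values only)
rooms = np.array([[5.0], [5.5], [6.0], [6.5], [7.0], [7.5]])
price = np.array([14.0, 18.5, 22.0, 27.5, 31.0, 36.0])

demo_model = LinearRegression().fit(rooms, price)
slope = demo_model.coef_[0]
intercept = demo_model.intercept_
r2_demo = r2_score(price, demo_model.predict(rooms))

print(f"Each additional room changes the predicted price by {slope:.2f} units.")
print(f"At zero rooms the line predicts {intercept:.2f} (an extrapolation).")
print(f"The line explains {r2_demo:.1%} of the variance in this sample.")
```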
Full Code Example
python
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
# Load dataset (load_boston was removed in scikit-learn 1.2, so fetch it from OpenML)
from sklearn.datasets import fetch_openml
boston = fetch_openml(name="boston", version=1, as_frame=True)
data = boston.data.copy()
data['PRICE'] = boston.target
# Select feature and target
X = data[['RM']]
y = data['PRICE']
# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train the model
model = LinearRegression()
model.fit(X_train, y_train)
# Make predictions
y_pred = model.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")
# Visualize the regression line
plt.scatter(X_test, y_test, color='blue', label='Actual Prices')
plt.plot(X_test, y_pred, color='red', linewidth=2, label='Predicted Prices')
plt.title('Room Count vs House Price (Test Set)')
plt.xlabel('Average Number of Rooms (RM)')
plt.ylabel('House Price (PRICE)')
plt.legend()
plt.show()
Conclusion
Building a simple linear regression model is a great way to understand the basics of predictive modeling. By following these steps, you can create, train, and evaluate a regression model using Python. As you progress, you can explore more advanced techniques like multiple linear regression, polynomial regression, and regularization.
Happy modeling! 🚀