
Principal Component Analysis (PCA)

Last Updated : 03 Feb, 2025

Having too many features in a dataset can cause problems like overfitting (good performance on training data but poor performance on new data), slower computation, and lower accuracy. This is called the curse of dimensionality: as the number of features grows, the amount of data needed for reliable results grows exponentially.

The explosion of feature combinations makes sampling harder in high-dimensional data and makes tasks like clustering or classification more complex and slow.

To tackle this problem, we use feature engineering techniques such as feature selection (choosing the most important features) and feature extraction (creating new features from the original ones). One popular family of feature extraction methods is dimensionality reduction, which reduces the number of features while keeping as much important information as possible.

One of the most widely used dimensionality reduction techniques is Principal Component Analysis (PCA).

How Does PCA Work for Dimensionality Reduction?

PCA is a statistical technique introduced by mathematician Karl Pearson in 1901. It works by transforming high-dimensional data into a lower-dimensional space while maximizing the variance (or spread) of the data in the new space. This helps preserve the most important patterns and relationships in the data.

Note: PCA prioritizes the directions where the data varies the most (because more variation = more useful information).

Let’s understand how it works in simple terms:

Imagine you’re looking at a messy cloud of data points (like stars in the sky) and want to simplify it. PCA helps you find the “most important angles” to view this cloud so you don’t miss the big patterns. Here’s how it works, step by step:

Step 1: Standardize the Data

Make sure all features (e.g., height, weight, age) are on the same scale. Why? A feature like “salary” (ranging 0–100,000) could dominate “age” (0–100) otherwise.

Standardizing our dataset ensures that each variable has a mean of 0 and a standard deviation of 1.

[Tex]Z = \frac{X-\mu}{\sigma}[/Tex]

Here,

  • [Tex]\mu [/Tex] is the mean of independent features  [Tex]\mu = \left \{ \mu_1, \mu_2, \cdots, \mu_m \right \} [/Tex]
  • [Tex]\sigma [/Tex] is the standard deviation of independent features  [Tex]\sigma = \left \{ \sigma_1, \sigma_2, \cdots, \sigma_m \right \} [/Tex]
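
To make this concrete, here is a minimal standardization sketch on a small made-up feature matrix (the values are arbitrary and only for illustration):

Python
import numpy as np

# Toy feature matrix: 4 samples, 2 features (arbitrary illustrative values)
X = np.array([[170.0, 65.0],
              [160.0, 58.0],
              [180.0, 80.0],
              [175.0, 72.0]])

# Column-wise mean and standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0)

# Z-score standardization: each column now has mean 0 and standard deviation 1
Z = (X - mu) / sigma
print(Z.mean(axis=0).round(10))  # approximately [0. 0.]
print(Z.std(axis=0))             # [1. 1.]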

Step 2: Find Relationships

Calculate how features move together using a covariance matrix. Covariance measures the strength of joint variability between two or more variables, indicating how much they change in relation to each other. To find the covariance we can use the formula:

[Tex]cov(x1,x2) = \frac{\sum_{i=1}^{n}(x1_i-\bar{x1})(x2_i-\bar{x2})}{n-1}[/Tex]

The value of covariance can be positive, negative, or zero.

  • Positive: As x1 increases, x2 also increases.
  • Negative: As x1 increases, x2 decreases.
  • Zero: No direct linear relation.
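
As a quick illustration, the sketch below computes the covariance of two made-up variables with the formula above and checks it against NumPy’s built-in np.cov (the values are arbitrary):

Python
import numpy as np

# Two toy features (arbitrary illustrative values)
x1 = np.array([2.0, 4.0, 6.0, 8.0])
x2 = np.array([1.0, 3.0, 5.0, 9.0])

# Manual covariance using the formula above (n - 1 in the denominator)
n = len(x1)
cov_manual = np.sum((x1 - x1.mean()) * (x2 - x2.mean())) / (n - 1)

# np.cov returns the full covariance matrix; entry [0, 1] is cov(x1, x2)
cov_matrix = np.cov(x1, x2)

print(cov_manual)        # positive value: x2 tends to increase with x1
print(cov_matrix[0, 1])  # same value as the manual computation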

Step 3: Find the “Magic Directions” (Principal Components)

  • PCA identifies new axes (like rotating a camera) where the data spreads out the most:
    • 1st Principal Component (PC1): The direction of maximum variance (most spread).
    • 2nd Principal Component (PC2): The next best direction, perpendicular to PC1, and so on.
  • These directions are calculated using eigenvectors and eigenvalues: eigenvectors define the new axes, and their importance is ranked by eigenvalues (how much variance each axis captures).

For a square matrix A, an eigenvector X (a non-zero vector) and its corresponding eigenvalue λ (a scalar) satisfy:

[Tex]AX = \lambda X[/Tex]

This means:

  • When A acts on X, it only stretches or shrinks X by the scalar λ.
  • The direction of X remains unchanged (hence, eigenvectors define “stable directions” of A).

It can also be written as :

[Tex]\begin{aligned} AX-\lambda X &= 0 \\ (A-\lambda I)X &= 0 \end{aligned}[/Tex]

where I is the identity matrix of the same shape as matrix A. The above condition holds for a non-zero X only if [Tex](A - \lambda I)[/Tex] is non-invertible (i.e., a singular matrix). That means,

[Tex]|A – \lambda I| = 0[/Tex]

This determinant equation is called the characteristic equation.

  • Solving it gives the eigenvalues [Tex]\lambda[/Tex],
  • and the corresponding eigenvectors can then be found using the equation [Tex]AX = \lambda X[/Tex] (a short numerical check follows below).
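
Before connecting this to PCA, here is a small numerical sanity check of [Tex]AX = \lambda X[/Tex] using NumPy on a made-up 2×2 symmetric matrix (standing in for a covariance matrix):

Python
import numpy as np

# A small symmetric matrix standing in for a covariance matrix (arbitrary values)
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# np.linalg.eig returns the eigenvalues and eigenvectors (one eigenvector per column)
eigenvalues, eigenvectors = np.linalg.eig(A)

# For every eigenpair, A @ X should equal lambda * X up to numerical precision
for lam, X in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ X, lam * X))  # True for each eigenpair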

How Does This Connect to PCA?

  • In PCA, the covariance matrix C (from Step 2) acts as matrix A.
  • Eigenvectors of C are the principal components (PCs).
  • Eigenvalues represent the variance captured by each PC.

Step 4: Pick the Top Directions & Transform Data

  • Keep only the top 2–3 directions (or enough to capture ~95% of the variance).
  • Project the data onto these directions to get a simplified, lower-dimensional version.

PCA is an unsupervised learning algorithm, meaning it doesn’t require prior knowledge of target variables. It’s commonly used in exploratory data analysis and machine learning to simplify datasets without losing critical information.

We know this all sounds complicated, so let’s go through it again with the help of a visual example, where the x-axis (Radius) and y-axis (Area) represent two original features in the dataset.

Principal components PC₁ and PC₂ on a 2D (Radius, Area) dataset

The goal is to transform this 2D dataset into a 1D representation while preserving as much variance as possible.

Principal Components (PCs):

  • PC₁ (First Principal Component): The direction along which the data has the maximum variance. It captures the most important information.
  • PC₂ (Second Principal Component): The direction orthogonal (perpendicular) to PC₁. It captures the remaining variance but is less significant.

Now, the red dashed lines indicate the spread (variance) of the data along different directions. The variance along PC₁ is greater than the variance along PC₂, which means that PC₁ carries more useful information about the dataset.

  • The data points (blue dots) are projected onto PC₁, effectively reducing the dataset from two dimensions (Radius, Area) to one dimension (PC₁).
  • This transformation simplifies the dataset while retaining most of the original variability.

The image visually explains why PCA selects the direction with the highest variance (PC₁). By removing PC₂, we reduce redundancy while keeping essential information. The transformation helps in data compression, visualization, and improved model performance.
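
The same 2D-to-1D idea can be reproduced in a few lines with scikit-learn. The sketch below uses made-up (Radius, Area) values that are strongly correlated, so a single principal component captures almost all of the variance:

Python
import numpy as np
from sklearn.decomposition import PCA

# Made-up (Radius, Area) pairs; Area grows with Radius, so the two features are correlated
rng = np.random.default_rng(0)
radius = rng.uniform(1.0, 10.0, size=50)
area = np.pi * radius ** 2 + rng.normal(0.0, 5.0, size=50)
X = np.column_stack([radius, area])

# Standardize both features (Step 1), then reduce from 2D to 1D
Z = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA(n_components=1)
X_1d = pca.fit_transform(Z)

print(X_1d.shape)                     # (50, 1): one value per sample
print(pca.explained_variance_ratio_)  # close to 1, so PC1 keeps most of the variance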

Principal Component Analysis Implementation in Python

PCA employs a linear transformation that preserves the most variance in the data while using the fewest dimensions. The implementation below walks through these steps, starting with loading the dataset:

Python
import pandas as pd
import numpy as np

# Here we are using inbuilt dataset of scikit learn
from sklearn.datasets import load_breast_cancer

# instantiating
cancer = load_breast_cancer(as_frame=True)
# creating dataframe
df = cancer.frame

# checking shape
print('Original Dataframe shape :',df.shape)

# Input features
X = df[cancer['feature_names']]
print('Inputs Dataframe shape   :', X.shape)

Output:

Original Dataframe shape : (569, 31)
Inputs Dataframe shape   : (569, 30)

Now we will apply the first step, which is to standardize the data. For that, we first have to calculate the mean and standard deviation of each feature in the feature space.

Python
# Mean
X_mean = X.mean()

# Standard deviation
X_std = X.std()

# Standardization
Z = (X - X_mean) / X_std

The covariance matrix helps us visualize how strongly each pair of features in the feature space depends on each other.

Python
# covariance
c = Z.cov()

# Plot the covariance matrix
import matplotlib.pyplot as plt
import seaborn as sns
sns.heatmap(c)
plt.show()

Output:

Covariance matrix heatmap

Now we will compute the eigenvalues and eigenvectors of the covariance matrix, which are key to identifying the principal components of our feature space.

Python
eigenvalues, eigenvectors = np.linalg.eig(c)
print('Eigen values:\n', eigenvalues)
print('Eigen values Shape:', eigenvalues.shape)
print('Eigen Vector Shape:', eigenvectors.shape)

Output:

Eigen values:
 [1.32816077e+01 5.69135461e+00 2.81794898e+00 1.98064047e+00 1.64873055e+00
  1.20735661e+00 6.75220114e-01 4.76617140e-01 4.16894812e-01 3.50693457e-01
  2.93915696e-01 2.61161370e-01 2.41357496e-01 1.57009724e-01 9.41349650e-02
  7.98628010e-02 5.93990378e-02 5.26187835e-02 4.94775918e-02 1.33044823e-04
  7.48803097e-04 1.58933787e-03 6.90046388e-03 8.17763986e-03 1.54812714e-02
  1.80550070e-02 2.43408378e-02 2.74394025e-02 3.11594025e-02 2.99728939e-02]
Eigen values Shape: (30,)
Eigen Vector Shape: (30, 30)

Sort the eigenvalues in descending order and sort the corresponding eigenvectors accordingly.

Python
# Index the eigenvalues in descending order 
idx = eigenvalues.argsort()[::-1]

# Sort the eigenvalues in descending order 
eigenvalues = eigenvalues[idx]

# sort the corresponding eigenvectors accordingly
eigenvectors = eigenvectors[:,idx]

Explained variance gives us an idea of how much of the total variance is retained by the selected principal components instead of the original feature space. Below we compute the cumulative explained variance ratio:

Python
explained_var = np.cumsum(eigenvalues) / np.sum(eigenvalues)
explained_var

Output:

array([0.44272026, 0.63243208, 0.72636371, 0.79238506, 0.84734274, 0.88758796, 0.9100953 , 0.92598254, 0.93987903, 0.95156881, 0.961366 , 0.97007138, 0.97811663, 0.98335029, 0.98648812, 0.98915022, 0.99113018, 0.99288414, 0.9945334 , 0.99557204, 0.99657114, 0.99748579, 0.99829715, 0.99889898, 0.99941502, 0.99968761, 0.99991763, 0.99997061, 0.99999557, 1. ])

Determine the Number of Principal Components 

Here we can either choose the number of principal components directly or set a threshold on the explained variance. In this example we require the cumulative explained variance to be at least 50%. Let’s check how many principal components that takes.

Python
n_components = np.argmax(explained_var >= 0.50) + 1
n_components

Output:

2

Project the Data onto the Selected Principal Components

  • Instead of storing full (x, y) coordinates, PCA stores only the projection values along the principal components, simplifying data processing.
  • Projection matrix: a matrix whose columns are the eigenvectors corresponding to the largest eigenvalues of the covariance matrix of the data. It projects the high-dimensional dataset onto a lower-dimensional subspace.
Python
# Projection matrix: eigenvectors of the top n_components eigenvalues
u = eigenvectors[:,:n_components]
pca_component = pd.DataFrame(u,
                             index = cancer['feature_names'],
                             columns = ['PC1','PC2']
                            )

# plotting heatmap
plt.figure(figsize =(5, 7))
sns.heatmap(pca_component)
plt.title('PCA Component')
plt.show()

Output:

Projection of the features onto the principal components

Then, we project our dataset using the formula (the second equality holds because u is a unit vector, so |u| = 1):

[Tex]\begin{aligned} Proj_{P_i}(u) &= \frac{P_i\cdot u}{|u|} \\ &=P_i\cdot u \end{aligned}[/Tex]

Finding Projection in PCA

The principal component u (green vector) maximizes data variance and serves as the new axis for projection. The data point P1(x1,y1) (red vector) is an original observation, and its projection onto u (blue line) represents its transformed coordinate in the reduced dimension. This projection simplifies the data while preserving its key characteristics.

Python
# Matrix multiplication (dot product) to project the standardized data
Z_pca = Z @ pca_component
# Rename the columns
Z_pca.rename({'PC1': 'PCA1', 'PC2': 'PCA2'}, axis=1, inplace=True)
# Print the principal component values
print(Z_pca)

Output:

          PCA1       PCA2
0     9.184755   1.946870
1     2.385703  -3.764859
2     5.728855  -1.074229
3     7.116691  10.266556
4     3.931842  -1.946359
..         ...        ...
564   6.433655  -3.573673
565   3.790048  -3.580897
566   1.255075  -1.900624
567  10.365673   1.670540
568  -5.470430  -0.670047

[569 rows x 2 columns]

The eigenvectors of the covariance matrix of the data are referred to as the principal axes of the data, and the projection of the data instances onto these principal axes are called the principal components.

Dimensionality reduction is then obtained by only retaining those axes (dimensions) that account for most of the variance, and discarding all others.

PCA Using Sklearn

Several libraries automate the entire principal component analysis workflow behind a single function, to which we only pass the number of principal components we want. Sklearn is one such library and can be used for PCA as shown below.

Python
# Importing PCA
from sklearn.decomposition import PCA

# Let's say, components = 2
pca = PCA(n_components=2)
pca.fit(Z)
x_pca = pca.transform(Z)

# Create the dataframe
df_pca1 = pd.DataFrame(x_pca,
                       columns=['PC{}'.
                       format(i+1)
                        for i in range(n_components)])
print(df_pca1)

Output:

           PC1        PC2
0     9.184755   1.946870
1     2.385703  -3.764859
2     5.728855  -1.074229
3     7.116691  10.266556
4     3.931842  -1.946359
..         ...        ...
564   6.433655  -3.573673
565   3.790048  -3.580897
566   1.255075  -1.900624
567  10.365673   1.670540
568  -5.470430  -0.670047

[569 rows x 2 columns]

We can compare this with the Z_pca result above and see that the values are exactly the same.
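
We can also verify this programmatically. Below is a minimal check (it assumes the earlier cells have been run; note that the sign of a principal component is arbitrary in general, so comparisons are best made up to sign):

Python
import numpy as np

# Compare the manual projection with sklearn's output, up to the arbitrary sign of each PC
print(np.allclose(np.abs(Z_pca.values), np.abs(df_pca1.values)))  # True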

Python
# giving a larger plot
plt.figure(figsize=(8, 6))

plt.scatter(x_pca[:, 0], x_pca[:, 1],
            c=cancer['target'],
            cmap='plasma')

# labeling x and y axes
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()

Output:

Visualizing the first two principal components

Python
# components
pca.components_

Output:

array([[ 0.21890244, 0.10372458, 0.22753729, 0.22099499, 0.14258969, 0.23928535, 0.25840048, 0.26085376, 0.13816696, 0.06436335, 0.20597878, 0.01742803, 0.21132592, 0.20286964, 0.01453145, 0.17039345, 0.15358979, 0.1834174 , 0.04249842, 0.10256832, 0.22799663, 0.10446933, 0.23663968, 0.22487053, 0.12795256, 0.21009588, 0.22876753, 0.25088597, 0.12290456, 0.13178394], [-0.23385713, -0.05970609, -0.21518136, -0.23107671, 0.18611302, 0.15189161, 0.06016536, -0.0347675 , 0.19034877, 0.36657547, -0.10555215, 0.08997968, -0.08945723, -0.15229263, 0.20443045, 0.2327159 , 0.19720728, 0.13032156, 0.183848 , 0.28009203, -0.21986638, -0.0454673 , -0.19987843, -0.21935186, 0.17230435, 0.14359317, 0.09796411, -0.00825724, 0.14188335, 0.27533947]])
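
To relate this back to the manually computed explained variance, sklearn also exposes the fraction of total variance captured by each component through explained_variance_ratio_ (this sketch assumes the pca object fitted above):

Python
# Fraction of total variance captured by each of the two components
print(pca.explained_variance_ratio_)           # approximately [0.443, 0.190]
print(pca.explained_variance_ratio_.cumsum())  # approximately [0.443, 0.632], matching explained_var[:2]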

Apart from what we’ve discussed, there are many more subtle advantages and limitations to PCA.

Advantages and Disadvantages of Principal Component Analysis

Advantages of Principal Component Analysis

  1. Multicollinearity Handling: Creates new, uncorrelated variables to address issues when original features are highly correlated.
  2. Noise Reduction: Eliminates components with low variance (assumed to be noise), enhancing data clarity.
  3. Data Compression: Represents data with fewer components, reducing storage needs and speeding up processing.
  4. Outlier Detection: Identifies unusual data points by showing which ones deviate significantly in the reduced space.

Disadvantages of Principal Component Analysis

  1. Interpretation Challenges: The new components are combinations of original variables, which can be hard to explain.
  2. Data Scaling Sensitivity: Requires proper scaling of data before application, or results may be misleading.
  3. Information Loss: Reducing dimensions may lose some important information if too few components are kept.
  4. Assumption of Linearity: Works best when relationships between variables are linear, and may struggle with non-linear data.
  5. Computational Complexity: Can be slow and resource-intensive on very large datasets.
  6. Risk of Overfitting: Using too many components or working with a small dataset might lead to models that don’t generalize well.

Conclusion

In summary, PCA helps in distilling complex data into its most informative elements, making it simpler and more efficient to analyze.

  1. It identifies the directions (called principal components) where the data varies the most.
  2. It projects the data onto these directions, reducing the number of dimensions while retaining as much information as possible.
  3. The new set of uncorrelated variables (principal components) is easier to work with and can be used for tasks like regression, classification, or visualization.

