One common mistake in applying K-Nearest Neighbors (KNN) is using the algorithm in high-dimensional spaces, such as cases involving millions of features. This situation is often described as the 'curse of dimensionality': 'nearness' becomes less meaningful because all points tend to be far apart in high-dimensional spaces, making it hard to identify truly close neighbors. So, what can we do? Here are some solutions:

👉 Dimensionality Reduction: Compressing high-dimensional data into fewer dimensions can make KNN more efficient and precise. Techniques like Principal Component Analysis (PCA) are commonly used for this purpose (see the short code sketch at the end of this post). However, a significant challenge is the potential distortion of distances. Imagine a three-dimensional U-Haul box with 20 balls inside, representing 20 data points. If we reduce the box's dimensions by installing a small two-dimensional mirror inside and projecting all the balls onto it, balls at different depths might appear misleadingly close in the reflection. Similarly, in dimensionality reduction, the 'nearness' of data points in reduced dimensions might not accurately reflect their relationships in the original high-dimensional space.

👉 Feature Selection with a Larger Sample Size: This approach can help mitigate the challenges KNN faces in high-dimensional spaces. Selecting a subset of relevant features aligned with your project goals reduces dimensionality while preserving meaningful relationships in the data. Increasing the sample size also gives you a better chance of finding truly 'near' neighbors. However, be careful not to oversample or select too few features, as that can lead to underfitting.

👉 Using a Different Algorithm: If dimensionality reduction or feature selection isn't feasible, it's better to switch to algorithms suited for high-dimensional data, such as Random Forests or Support Vector Machines (SVM).

Also, thanks to ChatGPT for generating such a cool image below to illustrate my first solution!
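To make the first two suggestions concrete, here is a minimal sketch assuming scikit-learn and a synthetic dataset (the post doesn't tie these ideas to any particular library or data); the component counts and k values are illustrative, not tuned:

```python
# Minimal sketch (assumption: scikit-learn; synthetic data stands in for a
# real high-dimensional dataset).
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic "wide" data: many features, only a few of them informative.
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Solution 1: dimensionality reduction -- scale, project with PCA, then KNN.
pca_knn = make_pipeline(StandardScaler(),
                        PCA(n_components=20),
                        KNeighborsClassifier(n_neighbors=5))
pca_knn.fit(X_train, y_train)
print("PCA + KNN accuracy:", pca_knn.score(X_test, y_test))

# Solution 2: feature selection -- keep the k features most associated with y.
select_knn = make_pipeline(StandardScaler(),
                           SelectKBest(f_classif, k=20),
                           KNeighborsClassifier(n_neighbors=5))
select_knn.fit(X_train, y_train)
print("SelectKBest + KNN accuracy:", select_knn.score(X_test, y_test))
```

In both pipelines the scaler matters: KNN and PCA are distance-based, so features on very different scales would otherwise dominate the neighbor search.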