K-Nearest Neighbors (KNN) from Scratch

Explain the K-Nearest Neighbors (KNN) algorithm, how it works, and when it is used.

What is the K-Nearest Neighbors (KNN) algorithm?

K-Nearest Neighbors (KNN) is a simple classification and regression algorithm.

It predicts by looking at the "k" most similar training examples and aggregating their labels or values: a majority vote for classification, an average for regression.
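The two aggregation rules can be sketched in a few lines. This is a minimal illustration, not a full KNN implementation; the function names `aggregate_classification` and `aggregate_regression` are made up for this example, and the inputs are assumed to be the labels or values of neighbors that have already been selected.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(neighbor_labels):
    """Majority vote: return the most common label among the neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def aggregate_regression(neighbor_values):
    """Average: return the mean of the neighbors' target values."""
    return mean(neighbor_values)

print(aggregate_classification(["spam", "ham", "spam"]))  # spam
print(aggregate_regression([2.0, 4.0, 6.0]))              # 4.0
```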

It assumes that points close together in the feature space tend to share similar labels/values.

There is no explicit training step; the model is the dataset itself.

Pros: simple, interpretable via neighbors, no training time, a strong baseline.
Cons: slow at inference, memory-heavy, sensitive to feature scaling and irrelevant features, struggles with high-dimensional raw features (curse of dimensionality).
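The curse of dimensionality can be seen in a small experiment: as the number of dimensions grows, the nearest and farthest points end up at almost the same distance, so "nearest" carries less information. The sketch below assumes points drawn uniformly from the unit hypercube and uses a hypothetical helper, `distance_contrast`; the exact numbers depend on the random seed.

```python
import math
import random

def distance_contrast(n_dims, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query.

    A value near 0 means all points look almost equally far away.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low = distance_contrast(n_dims=2)
high = distance_contrast(n_dims=1000)
print(f"contrast in 2 dims:    {low:.2f}")
print(f"contrast in 1000 dims: {high:.2f}")
```

The contrast in 1000 dimensions should come out far smaller than in 2 dimensions.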

How does it work?

Choose "k" (an integer)
- Small k: low bias, high variance (noisy, tends to overfit)
- Large k: higher bias, lower variance (smoother, tends to underfit)
Choose a distance metric (Euclidean, Manhattan, cosine, Hamming)
- Weighting neighbors by distance (closer neighbors count more) can improve accuracy

Euclidean distance between two 3-D points A and B:

$$ \lVert \mathbf{A} - \mathbf{B} \rVert = \sqrt{(A_x - B_x)^2 + (A_y - B_y)^2 + (A_z - B_z)^2} $$
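The distance metrics named above can be written from scratch as follows. This is a sketch assuming equal-length sequences; `cosine_distance` is taken here to be 1 minus the cosine similarity (one common convention) and is undefined for all-zero vectors.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions at which the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

a, b = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
print(euclidean(a, b))                # 5.0
print(manhattan(a, b))                # 7.0
print(hamming("karolin", "kathrin"))  # 3
```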

Normalize features (e.g. to the 0-1 range) so that features with large numeric ranges do not dominate the distance.
- One-hot encode categorical features (all categories are then equidistant)
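A minimal sketch of both preprocessing steps, assuming min-max scaling is the intended normalization. The helpers `min_max_scale` and `one_hot` and the age/income rows are made up for illustration.

```python
def min_max_scale(rows):
    """Rescale each numeric column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled

def one_hot(values):
    """Encode each category as a 0/1 vector with a single 1 (columns sorted by name)."""
    categories = sorted(set(values))
    return [[1.0 if value == cat else 0.0 for cat in categories] for value in values]

# Hypothetical data: age in years, income in dollars.
rows = [[20, 20_000], [40, 60_000], [60, 100_000]]
print(min_max_scale(rows))
# [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(one_hot(["red", "green", "red"]))
# [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Without scaling, the income column would dominate any Euclidean distance simply because its numbers are larger. In practice the minimum and maximum should be computed on the training data and reused for new points.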
For a new point:
- Compute the distance to all training points
- Select the "k" closest neighbors
- Predict:
  - Classification: majority vote
  - Regression: mean or median of the neighbors' values
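The prediction steps above can be put together in a short from-scratch sketch. It assumes Euclidean distance, unweighted votes, and features that are already scaled; the function names and the toy data are made up for illustration.

```python
import math
from collections import Counter

def k_nearest(train_X, train_y, query, k):
    """Return (distance, target) pairs for the k training points closest to `query`."""
    distances = [(math.dist(x, query), y) for x, y in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    return distances[:k]

def knn_classify(train_X, train_y, query, k=3):
    """Majority vote over the labels of the k nearest neighbors."""
    neighbors = k_nearest(train_X, train_y, query, k)
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(train_X, train_y, query, k=3):
    """Mean of the target values of the k nearest neighbors."""
    neighbors = k_nearest(train_X, train_y, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)

# Toy 1-D data (made up): class "a" on the left, class "b" on the right,
# plus one noisy "b" point sitting inside the "a" cluster.
train_X = [(1.0,), (1.5,), (2.0,), (2.2,), (2.5,), (6.0,), (6.5,), (7.0,)]
train_y = ["a", "a", "a", "b", "a", "b", "b", "b"]

print(knn_classify(train_X, train_y, (2.25,), k=1))  # b
print(knn_classify(train_X, train_y, (2.25,), k=5))  # a
```

With k=1 the prediction follows the single noisy neighbor (high variance); with k=5 the surrounding "a" points outvote it, which is the bias/variance trade-off described for the choice of k.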

When is it used?

When interpretability via similar cases is valuable.
When you have reasonably clean, scaled, informative features, i.e. not extremely high-dimensional, sparse data (such as raw bag-of-words with 50k tokens).
When the data still fits in memory or you can index it for speed.
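As one illustration of indexing for speed, a k-d tree lets a query skip whole regions of the data instead of scanning every point. The text does not name a specific index, so the k-d tree is an assumption here; the sketch below finds only the single nearest neighbor and is not tuned for performance.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split points on alternating axes; returns nested dicts or None."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to `query`."""
    if node is None:
        return best
    dist = math.dist(node["point"], query)
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Search the other side only if the splitting plane is closer than the best match.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))  # (1.4142135623730951, (8, 1))
```

Tree-based indexes like this tend to help most in low to moderate dimensions; in very high dimensions the pruning step rarely fires and the search degrades toward a full scan.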

Example use cases:

Image retrieval: surface similar X-rays/MRIs
Patient similarity search: cohort discovery for various outcomes like readmission or adverse events
Presidential election predictions
Email spam filtering
Recommender systems: predict music or movie preferences to suggest to users
- Movie recommendations: Find 10 closest movies in embedding space (3).
Clinical decision support (baseline): predict likelihood of X from tabular vitals/lab results
Identify patients with similar longitudinal patterns (changes over time)
- biological and social changes
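The movie recommendation case above amounts to a nearest-neighbor lookup in embedding space. A minimal sketch, assuming cosine similarity as the measure; the titles and 3-dimensional vectors are invented for illustration, whereas real embeddings would be learned and much larger, and the example returns 2 neighbors rather than 10 to keep the data small.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; assumes neither is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(embeddings, title, n=2):
    """Return the n titles whose embeddings are most similar to `title`'s."""
    target = embeddings[title]
    scored = [
        (cosine_similarity(target, vec), name)
        for name, vec in embeddings.items()
        if name != title
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:n]]

# Hypothetical embeddings with axes (action, romance, sci-fi).
embeddings = {
    "Space Battle": (0.9, 0.1, 0.9),
    "Star Quest": (0.8, 0.2, 1.0),
    "Love in Paris": (0.1, 0.9, 0.0),
    "Robot Romance": (0.2, 0.8, 0.7),
}
print(most_similar(embeddings, "Space Battle", n=2))
# ['Star Quest', 'Robot Romance']
```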

Sources

https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm