Clustering Without Labels: K-Means, Hierarchical, and How They See the World Differently

clustering k-means hierarchical-clustering unsupervised-learning dimensionality-reduction

Everything up to now has been supervised: models that learn from labeled data. Today I hit the first unsupervised algorithms: clustering. The task is to find groups in data that nobody labeled. No answer key.

K-Means is probably the most widely used clustering algorithm. You pick $K$ (the number of clusters) upfront. The algorithm:

1. Places $K$ centroids randomly
2. Assigns each point to its nearest centroid
3. Moves each centroid to the average of its assigned points
4. Repeats steps 2 and 3 until the centroids stop moving
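
To make the loop concrete, here is a minimal sketch in plain Python (standard library only). A few things are my assumptions rather than part of the description above: the function name `kmeans`, the `max_iters` and `seed` parameters, starting the centroids on $K$ randomly chosen data points, and detecting "stopped moving" by checking that the assignments no longer change. Treat it as an illustration, not a reference implementation.

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-Means on a list of equal-length tuples (illustrative sketch)."""
    rng = random.Random(seed)
    # Place K centroids: here, on K distinct randomly chosen data points (an assumption).
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iters):
        # Assign each point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change; the centroids then stay put too.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the average of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, labels

# Two small, well-separated groups of 2-D points (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids, labels)
```

Which cluster gets label 0 and which gets label 1 depends on the random start, so only the grouping is meaningful, not the label values.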

The problems: K-Means is sensitive to initial centroid placement (hence K-Means++ for smarter initialization), only finds roughly spherical clusters, and needs $K$ chosen in advance. A common way to pick $K$ is the elbow method: plot the within-cluster sum of squares against $K$ and pick the point where the curve bends.
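
For reference, the within-cluster sum of squares can be written out as follows. The symbols are my own notation, not from any particular library: $C_k$ is the set of points assigned to cluster $k$, $\mu_k$ is that cluster's centroid, and $x_i$ is a data point.

$$\mathrm{WCSS}(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

This quantity generally shrinks as $K$ grows (it reaches zero when every point is its own cluster), which is why you look for the bend in the curve rather than the minimum.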

Hierarchical Clustering doesn't need $K$ upfront. It builds a dendrogram: a tree showing how points merge into clusters step by step:

Agglomerative (bottom-up): Start with every point as its own cluster. Merge the two closest. Repeat until one cluster remains.
Divisive (top-down): Start with everything in one cluster. Split recursively.

You then "cut" the dendrogram at a height that gives you the number of clusters you want.
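
Here is a rough sketch of the agglomerative version plus the "cut", in plain Python. Several pieces are my assumptions: the names `agglomerative` and `cut`, representing the dendrogram as a flat list of merges, and measuring the distance between two clusters with single linkage (the closest pair of points, via `min`). Other linkage rules exist; passing `max` gives complete linkage.

```python
import math

def agglomerative(points, linkage=min):
    """Bottom-up clustering; returns the merge history (a flat dendrogram)."""
    # Start with every point as its own cluster (stored as lists of indices).
    clusters = {i: [i] for i in range(len(points))}
    merges = []  # each entry: (merge height, members of cluster a, members of cluster b)
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        # Find the two closest clusters under the chosen linkage rule.
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                a, b = ids[x], ids[y]
                d = linkage(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((d, clusters[a], clusters[b]))
        # Merge b into a; loop until one cluster remains.
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges

def cut(points, merges, height):
    """Replay only the merges at or below `height` to get flat clusters."""
    groups = [[i] for i in range(len(points))]
    for d, a, b in merges:
        if d > height:
            break  # heights never decrease for single/complete linkage
        groups = [g for g in groups if g != a and g != b] + [a + b]
    return groups

# Made-up data: two tight pairs and one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5), (20.0, 20.0)]
merges = agglomerative(points)
print(sorted(cut(points, merges, height=3.0)))   # lists of point indices
print(sorted(cut(points, merges, height=10.0)))
```

The nested search over every pair of clusters on every merge is what makes this naive version slow; it is meant to show the mechanics, not to be used on real data.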

What clicked

K-Means is faster and scales better but assumes you know $K$ and assumes roughly spherical clusters. Hierarchical gives more flexibility and a visual picture of structure, but it needs $O(n^2)$ memory for the pairwise distances and at least $O(n^2)$ time (a naive agglomerative implementation is $O(n^3)$), so it doesn't scale to large datasets.

Still shaky on

How do you evaluate clustering quality when there are no labels? I know about silhouette score and within-cluster sum of squares but haven't worked through what "good" looks like in practice.
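
As a starting point for working through this, here is a small sketch of the silhouette score in plain Python. For each point it compares $a$ (mean distance to the other points in its own cluster) with $b$ (mean distance to the points of the nearest other cluster) and computes $(b - a) / \max(a, b)$, so values near 1 mean tight, well-separated clusters and values near 0 or below mean overlapping or wrong assignments. The function name, the toy data, and the two labelings are my own; this sketch assumes at least two clusters and Euclidean distance.

```python
import math

def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other points in the same cluster
        # (the point's distance to itself is 0, so divide by len - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the points of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up data: the same points scored under a sensible and a scrambled labeling.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good = [0, 0, 0, 1, 1, 1]
bad = [0, 1, 0, 1, 0, 1]
print(silhouette_score(points, good), silhouette_score(points, bad))
```

Comparing the two printed numbers gives a feel for the scale: the sensible labeling scores close to 1, while the scrambled one scores far lower.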

What's next

What if the problem isn't grouping but compression: reducing 50 features to 3 while keeping the most important signal? That's PCA.