Research Study
Semantic Clustering Methods
Not all clustering algorithms handle semantic embeddings equally well. Density-based methods consistently outperform centroid methods for text data.
TL;DR
- Density-based clustering (DBSCAN, HDBSCAN) outperforms k-means for semantic embeddings
- Semantic clusters have irregular shapes that centroid methods cannot capture
- Dimensionality reduction improves results; 10-50 dimensions is often optimal
We ran k-means clustering on a corpus of 10,000 financial documents, using state-of-the-art sentence embeddings. The silhouette score looked reasonable. The clusters were balanced. The algorithm converged quickly.

Then we looked at what was actually in the clusters. Cluster 3 contained documents about both "merger arbitrage strategies" and "employee benefits compliance." Cluster 7 mixed "cryptocurrency regulation" with "commercial real estate loans." The clustering was geometrically valid but semantically nonsensical.

We reran the same data with HDBSCAN, a density-based clustering algorithm. The results were dramatically different. Cluster sizes were irregular, some documents were marked as noise, and the silhouette score was actually lower. But when we examined the content, each cluster was semantically coherent. Merger docs with merger docs. Real estate with real estate.

This pattern repeats across domains. K-means optimizes for geometric criteria (minimizing within-cluster variance) that don't align with semantic coherence. Density-based methods find natural groupings, even when those groupings are irregular shapes in high-dimensional space. The lesson: for semantic clustering, algorithm choice matters more than parameter tuning.
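For readers who want to reproduce the comparison, here is a minimal sketch of the two runs. The embedding model, the `load_documents` helper, and the parameter values are illustrative assumptions, not the exact setup from the study (`sklearn.cluster.HDBSCAN` requires scikit-learn 1.3 or newer; the standalone `hdbscan` package exposes the same parameters).

```python
# Minimal sketch of the k-means vs. HDBSCAN comparison described above.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans, HDBSCAN  # HDBSCAN needs scikit-learn >= 1.3
from sklearn.metrics import silhouette_score

docs = load_documents()  # hypothetical loader returning a list of strings

# Model name is a placeholder; the study used domain-specific embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(docs, normalize_embeddings=True)

# Centroid-based: every document is forced into one of k spherical clusters.
km_labels = KMeans(n_clusters=15, n_init="auto", random_state=0).fit_predict(emb)

# Density-based: the cluster count emerges from the data; -1 marks noise.
hdb_labels = HDBSCAN(min_cluster_size=50, min_samples=10).fit_predict(emb)

print("k-means silhouette:", silhouette_score(emb, km_labels))
n_noise = (hdb_labels == -1).sum()
print(f"HDBSCAN found {hdb_labels.max() + 1} clusters, {n_noise} noise points")
```

The metrics alone will not settle the question; the point of the study is that the cluster contents have to be inspected by hand.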
Deep Dive
Why K-Means Fails for Semantic Data
K-means is the most popular clustering algorithm. It's fast, simple, and produces balanced clusters. For semantic embeddings, it's usually wrong.

**The K-Means Assumption**

K-means finds k cluster centers that minimize within-cluster variance. It implicitly assumes:

- Clusters are spherical (same variance in all directions)
- Clusters have similar sizes
- All points belong to some cluster

**Why This Fails for Semantics**

Semantic embeddings don't satisfy these assumptions. Consider clustering documents about "machine learning." You'll find:

**Dense, Tight Clusters**

Highly specific topics (like "transformer architectures") form dense clusters. Documents are very similar.

**Sparse, Broad Clusters**

General topics (like "AI applications") form loose clusters. Documents share themes but vary widely.

**Outliers and Noise**

Some documents genuinely don't belong to any cluster. K-means forces them into the nearest group, polluting cluster coherence.

**Irregular Shapes**

Semantic relationships form complex shapes. "Financial technology" might form a crescent between "finance" and "technology" clusters. K-means, optimizing for spherical clusters, will split this in arbitrary ways.

**The Result**

K-means produces geometrically optimal but semantically incoherent clusters. Documents get grouped by accident of their vector positions, not by meaning.
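The failure mode is easy to reproduce on toy data. The sketch below is a two-dimensional illustration, not study data: two crescent-shaped clusters plus uniform noise. K-means, forced to pick two spherical groups, splits the crescents arbitrarily; HDBSCAN recovers them and flags the strays as noise.

```python
# Toy illustration of the crescent problem (not study data).
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, HDBSCAN

rng = np.random.default_rng(0)
X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
outliers = rng.uniform(low=-1.5, high=2.5, size=(25, 2))  # genuine noise points
X = np.vstack([X, outliers])

km = KMeans(n_clusters=2, n_init="auto", random_state=0).fit_predict(X)
hdb = HDBSCAN(min_cluster_size=25).fit_predict(X)

# K-means assigns every outlier to some cluster; HDBSCAN labels them -1.
print("k-means label counts:", np.bincount(km))
print("HDBSCAN noise points:", (hdb == -1).sum())
```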
Case Study
Case Study: Clustering Healthcare Documents
We clustered 5,000 healthcare policy documents to identify distinct topic areas. Documents were embedded using a BioBERT-based sentence transformer (768 dimensions).

**K-Means Results (k=15)**

- 15 balanced clusters (300-350 docs each)
- Silhouette score: 0.42 (reasonable)
- Computation time: 8 seconds

**Manual inspection revealed problems:**

- Cluster 3: Mixed "diabetes treatment guidelines" with "hospital billing procedures"
- Cluster 8: Combined "mental health policy" with "surgical equipment regulations"
- Cluster 12: "Pediatric care" mixed with "clinical trial protocols"

Geometric clustering metrics looked fine. Semantic coherence was poor.

**HDBSCAN Results**

We reran with HDBSCAN (min_cluster_size=50, min_samples=10):

- 23 clusters (sizes 51 to 847, highly variable)
- 312 documents marked as noise (6%)
- Silhouette score: 0.38 (lower than k-means!)
- Computation time: 47 seconds

**Manual inspection showed dramatic improvement:**

- Cluster 1 (847 docs): Medicare reimbursement policies (highly coherent)
- Cluster 7 (156 docs): Mental health parity requirements (coherent)
- Cluster 15 (51 docs): Rare disease treatment protocols (very specific, very coherent)
- Noise points: Documents genuinely spanning multiple topics

**Why HDBSCAN Won**

HDBSCAN identified that some topics (like Medicare) have huge document volumes, while others (rare diseases) have small but coherent clusters. It correctly identified cross-topic documents as noise rather than forcing them into misleading categories.

The lower silhouette score reflected reality: healthcare topics have variable density and irregular shapes. HDBSCAN captured this. K-means imposed false geometric regularity.
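For reference, here is a minimal sketch of how the summary statistics above can be produced; `emb` is assumed to hold the 768-dimensional embeddings and `labels` the output of an HDBSCAN run like the one described, and the helper name is ours.

```python
# Summarize an HDBSCAN run: cluster count, size range, noise fraction.
import numpy as np
from collections import Counter
from sklearn.metrics import silhouette_score

def summarize(emb: np.ndarray, labels: np.ndarray) -> None:
    """Report cluster statistics like those quoted in the case study."""
    sizes = Counter(int(l) for l in labels if l != -1)
    n_noise = int((labels == -1).sum())
    print(f"{len(sizes)} clusters, sizes {min(sizes.values())} to {max(sizes.values())}")
    print(f"{n_noise} noise points ({n_noise / len(labels):.0%})")
    # Silhouette is computed on clustered points only; noise has no cluster.
    mask = labels != -1
    print("silhouette:", round(silhouette_score(emb[mask], labels[mask]), 2))
```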
Research Question
Which clustering algorithms best reveal natural semantic groupings in high-dimensional embedding spaces?
Key Findings
DBSCAN Superiority
DBSCAN and HDBSCAN produce more semantically coherent clusters than k-means for text embeddings, scoring 23% higher on average in our manual coherence evaluation, even in runs where silhouette favored k-means
Irregular Shapes
Semantic clusters in embedding space have irregular, non-spherical shapes that centroid-based methods struggle to capture
Dimension Threshold
Clustering quality plateaus beyond 50 dimensions; higher dimensionality adds noise without improving semantic grouping
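As a concrete illustration of that threshold, the sketch below reduces embeddings into the 10-50 dimension range before clustering. UMAP is an assumption on our part; the findings above do not name the reducer used, and PCA works similarly for a first pass. Requires the `umap-learn` package.

```python
# Reduce high-dimensional embeddings before density clustering.
import umap
from sklearn.cluster import HDBSCAN

def cluster_reduced(emb, n_components=30):
    """Project embeddings into a smaller space, then cluster there."""
    reduced = umap.UMAP(
        n_components=n_components,  # within the 10-50 range found optimal
        metric="cosine",            # cosine suits normalized text embeddings
        random_state=0,
    ).fit_transform(emb)
    return HDBSCAN(min_cluster_size=50).fit_predict(reduced)
```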
Data & Metrics
- Data: Embedding vectors for 50,000+ domain-specific documents across finance, healthcare, and technology
- Data: Common Crawl text data embedded using modern transformer models
- Metric: Silhouette score (cluster separation)
- Metric: Davies-Bouldin index (cluster compactness)
- Metric: Manual semantic coherence evaluation (sample-based)
Conclusion
Semantic clustering is not just a matter of running sklearn.cluster.KMeans on your embeddings. The algorithm choice fundamentally determines whether you get geometrically optimal nonsense or semantically meaningful groups.

K-means optimizes for criteria that don't align with semantic coherence. It assumes spherical clusters of similar sizes. Semantic data violates both assumptions. The result is often clusters that look good on metrics but fall apart under inspection.

Density-based methods, especially HDBSCAN, align better with how semantic information actually organizes. Topics have variable density. Some are tight and specific, others loose and general. Some documents are genuinely cross-topic. HDBSCAN handles all of this naturally.

The tradeoff is complexity. HDBSCAN has more parameters, runs slower, and produces irregular results that are harder to present. But if you care about finding actual semantic structure rather than imposing artificial geometric regularity, it's worth it.

Practical advice: Start with HDBSCAN. If it's too slow or produces too many small clusters, try regular DBSCAN. Use k-means only when you specifically need exactly k balanced clusters and can tolerate semantic impurity. And always, always validate by looking at what's actually in the clusters, not just what the metrics say.
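As a starting point for that last piece of advice, here is a minimal sketch for eyeballing cluster contents; `docs` and `labels` are assumed to come from an earlier clustering run, and the helper name is hypothetical.

```python
# Print a few sample documents per cluster for manual coherence checks.
import numpy as np

def show_cluster_samples(docs, labels, per_cluster=3, width=80):
    """Spot-check semantic coherence instead of trusting metrics alone."""
    labels = np.asarray(labels)
    for cluster in sorted(set(labels)):
        name = "noise" if cluster == -1 else f"cluster {cluster}"
        idx = np.flatnonzero(labels == cluster)[:per_cluster]
        print(f"--- {name} ({(labels == cluster).sum()} docs) ---")
        for i in idx:
            print(" ", docs[i][:width])  # truncate long documents for display
```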