
Research Study

Semantic Clustering Methods

Not all clustering algorithms handle semantic embeddings equally. In many semantic settings, density-based methods outperform centroid methods.

TL;DR

  • Density-based clustering (DBSCAN, HDBSCAN) often outperforms k-means for semantic embeddings
  • Semantic clusters have irregular shapes that centroid methods cannot capture
  • Dimensionality reduction improves results; 10-50 dimensions is often optimal

Scope

This page may combine literature review, internal analysis, and illustrative examples. Review the cited sources and stated limitations before treating any finding as established empirical fact.

In one internal comparison, we ran k-means clustering on a corpus of 10,000 financial documents, using state-of-the-art sentence embeddings. The silhouette score looked reasonable, the clusters were balanced, and the algorithm converged quickly. Then we looked at what was actually in the clusters: cluster 3 contained documents about both "merger arbitrage strategies" and "employee benefits compliance," and cluster 7 mixed "cryptocurrency regulation" with "commercial real estate loans." The clustering was geometrically valid but semantically nonsensical.

We reran the same data with HDBSCAN, a density-based clustering algorithm, and the results were dramatically different. Clusters had irregular sizes, some documents were marked as noise, and the silhouette score was actually lower. But when we examined the content, each cluster was semantically coherent: merger docs with merger docs, real estate with real estate. We have seen versions of this pattern across multiple internal datasets.

K-means optimizes for geometric criteria (minimizing within-cluster variance) that don't align cleanly with semantic coherence. Density-based methods often find more natural groupings, especially when clusters have irregular shapes in high-dimensional space. The lesson: for semantic clustering, algorithm choice matters more than parameter tuning.
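The "look at what's actually in the clusters" check is easy to automate. Below is a minimal sketch of that inspection step, using a hypothetical six-document mini-corpus and TF-IDF vectors in place of sentence embeddings (all documents and parameters here are illustrative, not from the study):

```python
from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus; a real pipeline would use sentence embeddings
# over thousands of documents instead of TF-IDF over six.
docs = [
    "merger arbitrage strategies for hedge funds",
    "spread analysis in merger arbitrage deals",
    "employee benefits compliance checklist",
    "401k employee benefits compliance rules",
    "cryptocurrency regulation in the EU",
    "commercial real estate loan underwriting",
]

X = TfidfVectorizer().fit_transform(docs)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# The inspection step: dump every document under its cluster label and read it.
clusters = defaultdict(list)
for doc, label in zip(docs, labels):
    clusters[int(label)].append(doc)
for label in sorted(clusters):
    print(label, clusters[label])
```

Metrics summarize geometry; printing cluster membership is what surfaces the "merger arbitrage next to benefits compliance" failures described above.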

Deep Dive

Why K-Means Fails for Semantic Data

K-means is the most popular clustering algorithm. It's fast, simple, and produces balanced clusters. For semantic embeddings, it's often a poor baseline.

**The K-Means Assumption**

K-means finds k cluster centers that minimize within-cluster variance. It implicitly assumes:

  • Clusters are spherical (same variance in all directions)
  • Clusters have similar sizes
  • All points belong to some cluster

**Why This Fails for Semantics**

Semantic embeddings don't satisfy these assumptions. Consider clustering documents about "machine learning." You'll find:

**Dense, Tight Clusters** Highly specific topics (like "transformer architectures") form dense clusters. Documents are very similar.

**Sparse, Broad Clusters** General topics (like "AI applications") form loose clusters. Documents share themes but vary widely.

**Outliers and Noise** Some documents genuinely don't belong to any cluster. K-means forces them into the nearest group, polluting cluster coherence.

**Irregular Shapes** Semantic relationships form complex shapes. "Financial technology" might form a crescent between "finance" and "technology" clusters. K-means, optimizing for spherical clusters, will split this in arbitrary ways.

**The Result**

K-means can produce geometrically tidy but semantically incoherent clusters. Documents may get grouped by accident of their vector positions, not by meaning.
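The crescent-shape failure mode is easy to reproduce on synthetic data. A minimal sketch using scikit-learn's two-moons dataset as a stand-in for irregularly shaped semantic clusters (the dataset and parameter values are illustrative, not from the study):

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two interleaving crescents stand in for irregular semantic clusters.
X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)

# K-means bisects the crescents; DBSCAN follows their density.
print("k-means agreement with truth:", adjusted_rand_score(y_true, km_labels))
print("DBSCAN agreement with truth: ", adjusted_rand_score(y_true, db_labels))
print("k-means silhouette:", silhouette_score(X, km_labels))
print("DBSCAN silhouette: ", silhouette_score(X, db_labels))
```

Note that the geometrically tidier k-means partition can still score the higher silhouette here, mirroring the pattern described in the case studies: geometric metrics and label correctness diverge on non-spherical clusters.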

Case Study

Internal Comparison: Clustering Healthcare Documents

We clustered 5,000 healthcare policy documents to identify distinct topic areas in an internal evaluation. Documents were embedded using a BioBERT-based sentence transformer (768 dimensions).

**K-Means Results (k=15)**

  • 15 balanced clusters (300-350 docs each)
  • Silhouette score: 0.42 (reasonable)
  • Computation time: 8 seconds

Manual inspection revealed problems:

  • Cluster 3: Mixed "diabetes treatment guidelines" with "hospital billing procedures"
  • Cluster 8: Combined "mental health policy" with "surgical equipment regulations"
  • Cluster 12: "Pediatric care" mixed with "clinical trial protocols"

Geometric clustering metrics looked fine. Semantic coherence was poor.

**HDBSCAN Results**

We reran with HDBSCAN (min_cluster_size=50, min_samples=10):

  • 23 clusters (sizes 51 to 847, highly variable)
  • 312 documents marked as noise (6%)
  • Silhouette score: 0.38 (lower than k-means!)
  • Computation time: 47 seconds

Manual inspection showed clearer semantic separation:

  • Cluster 1 (847 docs): Medicare reimbursement policies (highly coherent)
  • Cluster 7 (156 docs): Mental health parity requirements (coherent)
  • Cluster 15 (51 docs): Rare disease treatment protocols (very specific, very coherent)
  • Noise points: Documents genuinely spanning multiple topics

**Why HDBSCAN Won**

HDBSCAN identified that some topics (like Medicare) have huge document volumes, while others (rare diseases) have small but coherent clusters. It correctly identified cross-topic documents as noise rather than forcing them into misleading categories. The lower silhouette score reflected one tradeoff of this example: healthcare topics have variable density and irregular shapes, and a purely geometric metric did not capture semantic usefulness on its own. In this internal comparison, HDBSCAN handled that structure better than k-means.

Research Question

Which clustering algorithms best reveal natural semantic groupings in high-dimensional embedding spaces?

Key Findings

DBSCAN Superiority

Empirical

In our internal evaluations, DBSCAN and HDBSCAN often produced more semantically coherent clusters than k-means for text embeddings, even when geometric metrics such as silhouette score did not improve.

Irregular Shapes

Empirical

Semantic clusters in embedding space have irregular, non-spherical shapes that centroid-based methods struggle to capture.

Dimension Threshold

Empirical

In our evaluations, clustering quality plateaued beyond roughly 50 dimensions; higher dimensionality added noise without improving semantic grouping.
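A reduction step in front of clustering is cheap to add. Below is a minimal sketch using PCA to land in the 10-50 dimension range; the embedding matrix is random stand-in data, and in practice the input would come from a sentence encoder:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import normalize

# Stand-in for 1,000 documents embedded at 768 dimensions.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 768))

# Unit-normalize first so Euclidean distance tracks cosine similarity,
# then project into the 10-50 dimension range before clustering.
reduced = PCA(n_components=30, random_state=0).fit_transform(normalize(embeddings))
print(reduced.shape)
```

PCA is the simple linear baseline; nonlinear reducers such as UMAP are a common alternative when local neighborhood structure matters more than global variance.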

Data & Metrics

  • Data: Embedding vectors for 50,000+ domain-specific documents across finance, healthcare, technology
  • Data: Common Crawl text data embedded using modern transformer models
  • Silhouette score (cluster separation metric)
  • Davies-Bouldin index (cluster compactness)
  • Manual semantic coherence evaluation (sample-based)

Limitations

  • Results specific to transformer-based embeddings; may differ with other embedding methods
  • Manual semantic coherence evaluation introduces subjective bias
  • Computational cost limits testing to moderate-scale datasets
  • Examples reported here are drawn from internal analyses rather than a peer-reviewed benchmark suite


Conclusion

Semantic clustering is not just a matter of running sklearn.cluster.KMeans on your embeddings. The algorithm choice fundamentally determines whether you get geometrically optimal nonsense or semantically meaningful groups.

K-means optimizes for criteria that don't align with semantic coherence. It assumes spherical clusters of similar sizes; semantic data violates both assumptions. The result is often clusters that look good on metrics but fall apart under inspection.

Density-based methods, especially HDBSCAN, often align better with how semantic information organizes. Topics have variable density: some are tight and specific, others loose and general, and some documents are genuinely cross-topic. HDBSCAN is often a better fit for that structure than k-means. The tradeoff is complexity: HDBSCAN has more parameters, runs slower, and produces irregular results that are harder to present. But if you care about finding actual semantic structure rather than imposing artificial geometric regularity, it's worth it.

Practical advice: start with HDBSCAN as a strong baseline. If it's too slow or produces too many small clusters, try regular DBSCAN. Use k-means when you specifically need exactly k balanced clusters and can tolerate some semantic impurity. And always validate by looking at what's actually in the clusters, not just what the metrics say.
