Understanding Clustering Models: A Simple Guide with Examples

Introduction

In the expansive domain of machine learning, clustering algorithms are valuable to data science practitioners because they can discern and group similar data points without prior labels. Clustering is instrumental in revealing hidden patterns within data, supporting applications such as customer segmentation and anomaly detection.

This post covers the theory and practical application of three common clustering techniques: K-Means, DBSCAN, and Hierarchical clustering. We will use Python code examples to walk through each technique’s implementation, visualization, and evaluation using simple sample data. By the end, you should be able to understand and implement these algorithms in more complex use-case scenarios.

Files and Environment

You can download the Python notebook and YAML file to create the environment in my GitHub repository: GitHub SolisAnalytics. The notebook contains all the code used in the post.

Clustering Models Explored

K-Means Clustering

Overview

The K-Means clustering algorithm partitions a dataset into K distinct clusters based on each point's distance to the cluster centroids. It iteratively assigns each data point to the nearest centroid and then recomputes each centroid as the mean of its assigned points, minimizing the sum of squared distances between the data points and their respective cluster centroids. The process continues until the centroids stabilize.
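The assign-and-update loop described above can be sketched in a few lines of plain Python. This is a toy illustration, not the implementation used in the notebook: the function name `kmeans`, the sample points, and the choice to start from the first k points as centroids are all assumptions made for the example.

```python
import math

def kmeans(points, k, max_iter=100):
    """Toy K-Means on a list of coordinate tuples (illustrative only)."""
    # Simplifying assumption: use the first k points as starting centroids.
    centroids = [points[i] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members))
                )
            else:
                # Keep the old centroid if a cluster ends up empty.
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # centroids have stabilized
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical sample data: two loose groups of 2D points.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Production implementations add smarter initialization and multiple restarts, which this sketch leaves out for brevity.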

Code

Main parameters to focus on in a K-Means model:

  • n_clusters = The number of clusters to form. You can choose an optimal value based on an evaluation metric.

  • max_iter = Maximum number of iterations for a single run

  • tol = Tolerance for declaring convergence. In scikit-learn, it is a relative tolerance on how much the cluster centers move between two consecutive iterations.
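The parameters above map directly onto scikit-learn's `KMeans` class. The sketch below assumes scikit-learn is installed and uses synthetic `make_blobs` data as a stand-in for the post's sample data; the blob centers, `n_init`, and `random_state` values are illustrative choices, not taken from the original notebook.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed sample data: 150 points around two well-separated centers.
X, _ = make_blobs(
    n_samples=150, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

# Fit K-Means with the parameters discussed above.
kmeans = KMeans(n_clusters=2, max_iter=300, tol=1e-4, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# Distance from each point to its assigned centroid,
# useful for shading points by distance in a scatter plot.
distances = np.linalg.norm(X - kmeans.cluster_centers_[labels], axis=1)

print("Centroids:\n", kmeans.cluster_centers_)
print("Largest distance to a centroid:", distances.max())
```

The `distances` array is one way to color a scatter plot so that points far from their centroid stand out.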

Visualization

Image 1: The plot shows how the data points are placed near the closest centroid. The darker data points have the largest distances from their nearest centroid.

DBSCAN Clustering

Overview

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identifies clusters based on the density of the data points and marks points in low-density regions as outliers. This method can detect clusters of arbitrary shapes and sizes by grouping points in high-density regions and excluding noise points.

Code

Main parameters to focus on in a DBSCAN model:

  • eps = The maximum distance between two samples for one to be considered a neighbor of the other.

  • min_samples = The minimum number of samples in a point's neighborhood (including the point itself) for it to be considered a core point, which affects the cluster size.
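A minimal sketch of how these two parameters are passed to scikit-learn's `DBSCAN` follows. The data is assumed: two synthetic blobs plus one hand-placed far-away point, so the `eps` and `min_samples` values are tuned to this toy data rather than taken from the original notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Assumed sample data: two tight blobs of 75 points each.
X, _ = make_blobs(
    n_samples=150, centers=[[-5, -5], [5, 5]], cluster_std=0.5, random_state=42
)
# Add one isolated point that should be flagged as noise.
X = np.vstack([X, [[0.0, 20.0]]])

dbscan = DBSCAN(eps=1.0, min_samples=5)
labels = dbscan.fit_predict(X)

# DBSCAN labels noise points as -1; every other label is a cluster id.
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())

print("Clusters found:", n_clusters)
print("Noise points:", n_noise)
```

Unlike K-Means, the number of clusters is not specified up front; it falls out of the density settings.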

Visualization

Image 2: The DBSCAN plot shows an outlier close to the top centroid in the previous plot.

Hierarchical Clustering

Overview

Hierarchical clustering builds a dendrogram of data, showing relationships between individual data points and groups. It can be executed in a bottom-up approach, where data points are progressively merged into clusters, or a top-down approach, where data points start in one cluster that is iteratively split.

Code

Main parameters to focus on in a hierarchical clustering model:

  • method = Linkage algorithm used to compute the distance between sets of observations. Common methods include ‘single’, ‘complete’, ‘average’, and ‘ward’.

  • metric = Distance metric to use when the input ‘y’ is a collection of observation vectors rather than a precomputed distance matrix. Euclidean distance is used by default.

  • optimal_ordering = Optional parameter that modifies the order of observations in the dendrogram to minimize the crossing of dendrogram branches. It can improve clarity but adds computational cost.
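These parameter names match SciPy's `scipy.cluster.hierarchy.linkage` function, so the sketch below assumes that is the function in use. The synthetic data and the decision to cut the tree into two flat clusters with `fcluster` are illustrative assumptions.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Assumed sample data: 30 points around two centers
# (kept small so a dendrogram stays readable).
X, _ = make_blobs(
    n_samples=30, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

# Build the bottom-up (agglomerative) merge tree.
# Each row of Z records one merge: the two clusters joined,
# the distance between them, and the size of the new cluster.
Z = linkage(X, method="ward", metric="euclidean", optimal_ordering=True)

# Cut the tree so that at most 2 flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

print("Linkage matrix shape:", Z.shape)
print("Flat cluster labels:", labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree when matplotlib is available.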

Visualization

Image 3: The dendrogram shows the data's hierarchical relationships. The blue branches represent the number of clusters (2), and the other colors represent sub-clusters in a grouping.

Evaluating Clustering Models

The K-Means model will be used to walk through the evaluation metrics.

Silhouette Coefficient

Overview

The Silhouette Coefficient measures how similar an object is to its own cluster compared to other clusters. The value ranges from -1 (incorrect clustering) to +1 (highly dense clustering), with values near 0 indicating overlapping clusters.
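As a hedged sketch, the coefficient can be computed with scikit-learn's `silhouette_score`, which averages the per-sample silhouette values. The data below is synthetic and assumed, so the printed score will not match the value reported for the post's own sample data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumed sample data: two well-separated blobs.
X, _ = make_blobs(
    n_samples=150, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Mean silhouette value over all samples, between -1 and +1.
score = silhouette_score(X, labels)
print(f"K-Means Silhouette Coefficient: {score:.2f}")
```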

Output

K-Means Silhouette Coefficient: 0.75

A coefficient of 0.75 indicates that the K-Means model produced dense, well-separated clusters.

Elbow Method

Overview

The Elbow method involves plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. It is a useful heuristic for identifying the point at which adding another cluster no longer improves the clustering by much.
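One way to produce the values behind an elbow plot is to fit K-Means for a range of cluster counts and record the within-cluster sum of squares, which scikit-learn exposes as the `inertia_` attribute. The range of 1 to 6 clusters and the synthetic data are assumptions for this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumed sample data: two well-separated blobs.
X, _ = make_blobs(
    n_samples=150, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

# Fit one model per candidate k and store its within-cluster sum of squares.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, wcss in inertias.items():
    print(f"k={k}: WCSS={wcss:.1f}")
```

Plotting `inertias` against k gives the elbow curve; the elbow is where the drop between consecutive values flattens out.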

Output

The elbow plot shows the “Within-Cluster Sum of Squares” (WCSS) for each number of clusters. Note the obvious hinge in the line at two on the x-axis. Clustering further does not significantly reduce WCSS.

Davies-Bouldin Index

Overview

The Davies-Bouldin index evaluates clustering quality by measuring the average similarity between each cluster and its most similar one. The similarity is based on a ratio of within-cluster distances to between-cluster distances. Lower values indicate better clustering.
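A minimal sketch using scikit-learn's `davies_bouldin_score` is shown below. As with the other examples, the data is synthetic and assumed, so the printed index will differ from the value reported for the post's own sample data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Assumed sample data: two well-separated blobs.
X, _ = make_blobs(
    n_samples=150, centers=[[-5, -5], [5, 5]], cluster_std=1.0, random_state=42
)

labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# Lower is better; 0 is the minimum possible value.
dbi = davies_bouldin_score(X, labels)
print(f"K-Means Davies-Bouldin Index: {dbi:.2f}")
```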

Output

K-Means Davies-Bouldin Index: 0.10

The index indicates good clustering since the average similarity between each cluster and its most similar one is low.

Conclusion

This article reviewed three clustering methods, each with unique strengths and applications, from the straightforward partitioning of K-Means to the density-based clusters of DBSCAN and the nested structures revealed by hierarchical clustering. Visualizations play an important role in interpreting the clusters formed by the models. Additionally, we can use evaluation metrics like the Silhouette Coefficient, Elbow Method, and Davies-Bouldin Index to further assess a model’s performance.
