There are many reasons we might want to use cluster analysis in our work. Geologists might want to sort hundreds of rock samples into a handful of rock types, a petrophysicist might want to group data points from well logs (as shown here), or a curious kitchen dweller may want to digitally classify patterns found in his (or her) linen collection.
Two algorithms worth knowing about when faced with any clustering problem are called k-means and fuzzy c-means clustering. They aren't the right solution for all clustering problems, but they are a good place to start.
k-means clustering — each data point gets assigned to one of k centroids (or centres) according to the centroid it is closest to. In the image shown here, the number of clusters is 2. The pink dots are closest to the left centroid, and the black dots are closest to the right centroid. To see how the classification is done, watch this short step-by-step video. The main disadvantage with this method is that if the clusters are poorly defined, the result seems rather arbitrary.
Fuzzy c-means clustering — each data point belongs to all clusters, but only to a certain degree. Each data point is assigned a probability of belonging to each cluster, and is thus easily assigned the class for which it has a highest probability. If a data point is midway between two clusters, it is still assigned to its closest cluster, but with lower probability. As the bottom image shows, data points on the periphery of cluster groups, such as those shown in grey may be equally likely to belong to both clusters. Fuzzy c-means clustering provides a way of capturing quantitative uncertainty, and even visualizing it.
Some observations fall naturally into clusters. It is just a matter of the observer choosing an adequate combination of attributes to characterize them. In the fabric and seismic examples shown in the previous post, only two of the four Haralick textures are needed to show a diagnostic arrangement of the data for clustering. Does the distribution of these thumbnail sections in the attribute space align with your powers of visual inspection?