Which clustering method should you choose?
Goal of this tutorial
This tutorial aims to help XLSTAT users pick an appropriate cluster analysis tool for their data.
What is cluster analysis?
Cluster analysis methods group objects (observations or individuals) into classes (clusters) so that objects in the same class are more similar to one another than to objects in other classes. Proximity between objects is computed from a set of variables measured on all the objects. Cluster analysis methods are widely used in exploratory data mining. Here are a few examples:
In expression data (transcriptomics, proteomics, metabolomics, etc.), these methods detect individuals that have similar expression profiles, or features that have similar expression patterns.
In market research, clustering methods detect different consumer profiles from survey data.
In ecology, these methods help to identify groups of sites that hold similar communities.
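To make the notion of proximity concrete, here is a minimal sketch that computes pairwise Euclidean distances between a few objects described by two variables. The observations and their values are hypothetical, and Euclidean distance is only one of several possible proximity measures; this is an illustration, not a description of how XLSTAT computes proximities internally.

```python
import math

# Hypothetical observations: each entry is one object, each value one variable.
observations = {
    "A": [1.0, 2.0],
    "B": [1.5, 1.8],
    "C": [8.0, 8.0],
}

def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Pairwise distances: a small value means the two objects are similar.
names = sorted(observations)
distances = {
    (p, q): euclidean(observations[p], observations[q])
    for i, p in enumerate(names)
    for q in names[i + 1:]
}

# The pair with the smallest distance is the most natural candidate
# to end up in the same class.
closest_pair = min(distances, key=distances.get)
```

In this toy data, objects A and B lie close together while C is far from both, so a clustering method would be expected to group A with B.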
Available methods in XLSTAT
XLSTAT offers four clustering methods, available from the Analyzing data button:
Agglomerative hierarchical clustering (AHC)
k-means clustering
Gaussian mixture models
Univariate clustering
And one method in the XLSTAT-LG option:
Latent class cluster models
These methods only work on quantitative variables (except for latent class cluster models). Binary variables can also be used in AHC. If you need to cluster objects based on qualitative variables, we recommend running a Multiple Correspondence Analysis first and using the observation scores on the first axes (factors) as the dataset for clustering.
In the same spirit, one may also use observation scores provided by any exploratory analysis, including Correspondence analysis.
Which clustering method to choose
Each method has its own characteristics, summarized in the table below.
| | AHC | k-means | Gaussian Mixture | Univariate clustering | Latent class cluster model |
|---|---|---|---|---|---|
| Number of variables | At least 1 | At least 1 | At least 1 | At most 1 | At least 1 |
| Input variables type | Quantitative continuous | Quantitative continuous | Quantitative continuous | Quantitative continuous | Quantitative continuous, quantitative ordinal, nominal |
| Should the number of classes be chosen prior to computations? | Optional | Mandatory | Mandatory | Mandatory | Mandatory (but the optimal number of classes can be determined by the model) |
| Results: Class membership* | Deterministic | Deterministic | Probabilistic | Deterministic | Probabilistic |
| Results: Special features | Dendrogram, profile plot | Profile plot | Parameter estimation of classes, mixture model plots, MAP classification plot | - | Variable contribution to each class, possibility to predict class membership of new observations (scoring equation) |
Going further
Regarding class membership (marked * in the table above): after computations, the class membership of every observation is reported differently depending on the clustering method. Deterministic membership assigns every object to a single class, whereas probabilistic membership gives the probability that an observation belongs to each class.
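The difference between the two kinds of class membership can be sketched in a few lines. The example below assumes a hypothetical two-class model on a single variable, with made-up weights, means and standard deviations. The deterministic assignment picks the nearest class mean (as k-means would), while the probabilistic one applies Bayes' rule to Gaussian densities (as a mixture model would). It is an illustration only, not XLSTAT's implementation.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical two-class model on one variable (illustrative values only).
classes = [
    {"weight": 0.5, "mean": 0.0, "std": 1.0},
    {"weight": 0.5, "mean": 4.0, "std": 1.0},
]

def membership_probabilities(x):
    """Probabilistic membership: posterior probability of each class."""
    scores = [c["weight"] * gaussian_pdf(x, c["mean"], c["std"]) for c in classes]
    total = sum(scores)
    return [s / total for s in scores]

def hard_assignment(x):
    """Deterministic membership: index of the class with the nearest mean."""
    return min(range(len(classes)), key=lambda k: abs(x - classes[k]["mean"]))

x = 2.5
probs = membership_probabilities(x)  # one probability per class, summing to 1
label = hard_assignment(x)           # a single class index
```

For an observation lying between the two class means, the deterministic result reports only the winning class, while the probabilistic result also shows how uncertain that choice is.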
Very large datasets can be handled by combining different methods. For example, clusters obtained by the k-means method can be used as observations in an agglomerative hierarchical clustering. This tutorial will guide you.
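The two-stage idea can be sketched as follows: k-means first reduces the data to a handful of centroids, and a hierarchical clustering is then run on those centroids only. Everything here is an assumption made for illustration: the synthetic data, the helper names (`kmeans`, `agglomerate`), the choice of 6 intermediate clusters, and the use of unweighted average linkage. XLSTAT's own procedure may differ.

```python
import math
import random

def dist(u, v):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(col) / len(points) for col in zip(*points)]

def kmeans(points, k, iterations=20, seed=0):
    """Basic Lloyd-style k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: dist(p, centroids[j])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean_point(members)
    return centroids, labels

def agglomerate(centroids, n_groups):
    """Merge the closest groups of centroids (average linkage, ignoring
    cluster sizes) until n_groups remain; returns lists of centroid indices."""
    groups = [[i] for i in range(len(centroids))]

    def linkage(g, h):
        total = sum(dist(centroids[i], centroids[j]) for i in g for j in h)
        return total / (len(g) * len(h))

    while len(groups) > n_groups:
        a, b = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: linkage(groups[ij[0]], groups[ij[1]]),
        )
        groups[a] = groups[a] + groups[b]
        del groups[b]  # b > a, so index a is unaffected
    return groups

# Synthetic data: two well-separated blobs of 100 points each.
rng = random.Random(42)
points = [[rng.gauss(0.0, 0.5), rng.gauss(0.0, 0.5)] for _ in range(100)]
points += [[rng.gauss(10.0, 0.5), rng.gauss(10.0, 0.5)] for _ in range(100)]

# Stage 1: compress 200 observations into 6 centroids.
centroids, kmeans_labels = kmeans(points, k=6)
# Stage 2: hierarchical clustering on the 6 centroids only.
groups = agglomerate(centroids, n_groups=2)
# Map every observation to the final group of its k-means cluster.
group_of_centroid = {c: g for g, members in enumerate(groups) for c in members}
final_labels = [group_of_centroid[lab] for lab in kmeans_labels]
```

The hierarchical step only ever compares 6 centroids instead of 200 observations, which is what makes the combination attractive for large datasets.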