Choosing an appropriate multivariate data analysis technique
Here we define multivariate (or multidimensional) datasets as data tables containing more than two variables (usually stored in columns) measured on more than two statistical units (individuals, patients, sites…), usually stored in rows. Multidimensional data analysis techniques extract useful information from large datasets that are hard to read in their raw format. These tools are often referred to as data mining tools.
The following grid will help you choose an appropriate data mining method according to the type of question you want to investigate (exploratory or decisional) and the structure of your data. The list is not exhaustive, but it contains the most commonly used methods, all of which are available in XLSTAT.
We divided the questions into two types:
- Exploratory questions allow the investigation of multivariate datasets without any particular hypothesis to test. Exploratory multivariate data analysis tools often reduce the dimensionality of large datasets, which makes data exploration more convenient.
- Decisional questions involve testing the relationship between two sets of variables (correlation), or explaining a variable or a set of variables by another set (causality).
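The dimensionality reduction mentioned for exploratory tools can be illustrated with a minimal sketch of the first principal component, computed by power iteration on the covariance matrix. This is an illustration only, not how XLSTAT computes PCA: the function name `first_principal_component`, the toy data and the use of unstandardized variables are all assumptions made for the example, and in practice a dedicated statistics package would be used.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Return (axis, scores, explained_share) for the first principal component.

    rows: list of equal-length lists (statistical units in rows, variables in columns).
    Hypothetical helper written for illustration only.
    """
    n, p = len(rows), len(rows[0])
    # Center each variable so the component describes variance, not the mean.
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix (p x p).
    cov = [[sum(z[j] * z[k] for z in centered) / (n - 1) for k in range(p)]
           for j in range(p)]
    # Power iteration: repeated multiplication by the covariance matrix
    # converges to its dominant eigenvector (the first principal axis).
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(cov[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance carried by the axis (Rayleigh quotient) and total variance.
    eigenvalue = sum(v[j] * sum(cov[j][k] * v[k] for k in range(p))
                     for j in range(p))
    total_variance = sum(cov[j][j] for j in range(p))
    # Coordinates of each unit on the first axis.
    scores = [sum(z[j] * v[j] for j in range(p)) for z in centered]
    return v, scores, eigenvalue / total_variance

# Toy data (assumed): the second variable is roughly twice the first,
# the third is almost constant.
rows = [
    [2.0, 4.1, 1.0],
    [3.0, 6.2, 0.8],
    [4.0, 7.9, 1.1],
    [5.0, 10.1, 0.9],
    [6.0, 12.0, 1.2],
]
axis, scores, share = first_principal_component(rows)
print("share of variance kept by one component:", round(share, 3))
```

Here three variables are summarized by a single score per unit while most of the variance is retained, which is the sense in which exploratory tools make large tables easier to read. Because the variables are not standardized in this sketch, those with the largest variance dominate the component.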
| Question | Number of tables | Data description | Tool | Remarks |
|---|---|---|---|---|
| Exploratory | 1 | Quantitative variables only | Principal Component Analysis (PCA) | Considers all the variance in the data; components do not necessarily reflect real phenomena |
| Exploratory | 1 | Quantitative variables only | Factor analysis (FA) | Considers only the covariance between variables; latent factors are intended to reflect real phenomena |
| Exploratory | 1 | Proximity matrix | Multidimensional scaling (MDS) / Principal Coordinate Analysis (PCoA) | |
| Exploratory | 1 | Contingency table (2 qualitative variables) | Correspondence Analysis (CA) | |
| Exploratory | 1 | Qualitative variables only | Multiple Correspondence Analysis (MCA) | |
| Exploratory | 1 | Quantitative and qualitative variables | Factorial analysis of mixed data (PCAmix) | Unlike MFA, the dataset is not structured in groups |
| Exploratory | ≥2 | Tables of qualitative variables, quantitative variables and/or frequencies | Multiple Factor Analysis (MFA) | |
| Exploratory | ≥2 | Quantitative variables tables | Generalized Procrustes Analysis (GPA) | Can include an inferential part: the consensus test |
| Exploratory (clustering) | 1 | Quantitative variables only | Clustering tools (AHC, k-means...) | Classical clustering methods can be applied indirectly to a table of qualitative variables, using the row scores on the dimensions of a Multiple Correspondence Analysis |
| Decisional (causality) | 1 | One dependent variable and several quantitative and/or qualitative explanatory variables | Statistical modelling tools (regression, ANCOVA…) | |
| Decisional (correlation) | 2 | Two quantitative variables tables | Canonical correlation analysis | Linear relationships between the two tables |
| Decisional (causality) | 2 | One contingency table Y (often a site-species data matrix) and one table X of quantitative and/or qualitative explanatory variables | Canonical correspondence analysis | Unimodal relationships between X and Y; can be used to depict species niches along environmental gradients |
| Decisional (causality) | 2 | One table Y of dependent quantitative variables and one table X of quantitative and/or qualitative explanatory variables | Redundancy analysis (RDA) | Linear relationships between X and Y |
| Decisional (causality) | 2 | One table Y of dependent quantitative variables and one table X of quantitative and/or qualitative explanatory variables | Partial Least Squares regression (PLS) | Especially used for prediction |
| Decisional (causality) | ≥2 | Several tables of manifest variables, each table representing a latent variable | Partial Least Squares Structural Equation Modelling (PLS-PM) | |
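The grid can also be read as a small lookup table keyed on the type of question and the number of tables. The sketch below is one possible encoding, not an XLSTAT feature: the names `GridRow`, `GRID` and `suggest_tools` are hypothetical, the data descriptions are shortened, and the "correlation" and "causality" tags given to the two canonical analyses are an interpretation based on the definitions of decisional questions given earlier.

```python
from typing import NamedTuple, Optional

class GridRow(NamedTuple):
    question: str               # "exploratory", "clustering", "correlation" or "causality"
    min_tables: int
    max_tables: Optional[int]   # None means no upper bound (the grid's "≥2")
    data: str                   # shortened data description
    tool: str

# Hypothetical encoding of the grid; labels are abbreviated for readability.
GRID = [
    GridRow("exploratory", 1, 1, "quantitative variables only", "PCA"),
    GridRow("exploratory", 1, 1, "quantitative variables only", "Factor analysis"),
    GridRow("exploratory", 1, 1, "proximity matrix", "MDS / PCoA"),
    GridRow("exploratory", 1, 1, "contingency table", "Correspondence Analysis"),
    GridRow("exploratory", 1, 1, "qualitative variables only", "MCA"),
    GridRow("exploratory", 1, 1, "quantitative and qualitative variables", "PCAmix"),
    GridRow("exploratory", 2, None, "several tables, possibly of mixed types", "MFA"),
    GridRow("exploratory", 2, None, "quantitative variables tables", "GPA"),
    GridRow("clustering", 1, 1, "quantitative variables only", "AHC, k-means"),
    GridRow("causality", 1, 1, "one dependent variable, several explanatory variables",
            "Regression, ANCOVA"),
    # The two tags below are an interpretation, see the lead-in.
    GridRow("correlation", 2, 2, "two quantitative variables tables",
            "Canonical correlation analysis"),
    GridRow("causality", 2, 2, "contingency table Y, explanatory table X",
            "Canonical correspondence analysis"),
    GridRow("causality", 2, 2, "quantitative table Y, explanatory table X", "RDA"),
    GridRow("causality", 2, 2, "quantitative table Y, explanatory table X",
            "PLS regression"),
    GridRow("causality", 2, None, "several tables of manifest variables", "PLS-PM"),
]

def suggest_tools(question, n_tables):
    """Return (data description, tool) pairs matching a question type and table count."""
    return [
        (row.data, row.tool)
        for row in GRID
        if row.question == question
        and row.min_tables <= n_tables
        and (row.max_tables is None or n_tables <= row.max_tables)
    ]

# Example: exploratory analysis of three tables.
for data, tool in suggest_tools("exploratory", 3):
    print(f"{tool}: {data}")
```

The function only narrows the candidates. The final choice still depends on the data description and on the remarks attached to each method.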