Which statistical model should you choose?
A guide to choosing a statistical modelling tool according to the situation
The choice of a statistical model is not straightforward. It is a mistake to think that each data set has one model that suits it. If you are new to statistical modelling, this easy and short tutorial may be useful before exploring the following grid.
Every modelling tool answers specific questions. For example, glycaemia associated with a specific type of diabetes can be explained by a qualitative variable (sex, for example). In this situation, the ANOVA model can be used. Using the same data, we may also use age (a quantitative variable) to see whether glycaemia shows a linear increasing or decreasing trend with the age of the patients. In this situation we would use linear regression.
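To make the distinction concrete, here is a minimal sketch in plain Python. The glycaemia, sex and age values are hypothetical and the two helper functions are illustrative names, not part of any particular software. The same dependent variable is analysed once against a qualitative variable (one-way ANOVA F statistic) and once against a quantitative variable (simple linear regression).

```python
from statistics import mean

# Hypothetical data: glycaemia (g/L), sex and age for eight patients.
glycaemia = [1.10, 1.25, 1.40, 1.55, 1.05, 1.20, 1.30, 1.50]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
age = [30, 42, 55, 68, 28, 40, 51, 66]


def one_way_anova_f(values, groups):
    """Return the one-way ANOVA F statistic (between / within mean squares)."""
    grand_mean = mean(values)
    by_group = {}
    for value, group in zip(values, groups):
        by_group.setdefault(group, []).append(value)
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in by_group.values())
    ss_within = sum((x - mean(v)) ** 2 for v in by_group.values() for x in v)
    df_between = len(by_group) - 1
    df_within = len(values) - len(by_group)
    return (ss_between / df_between) / (ss_within / df_within)


def simple_linear_regression(x, y):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Qualitative explanatory variable: ANOVA.
f_stat = one_way_anova_f(glycaemia, sex)
# Quantitative explanatory variable: linear regression.
slope, intercept = simple_linear_regression(age, glycaemia)

print(f"ANOVA F statistic for sex: {f_stat:.3f}")
print(f"Regression slope for age: {slope:.4f} g/L per year")
```

Turning the F statistic or the slope into a p-value requires the corresponding F or t distribution, which a statistics package would normally provide.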
The choice of a statistical model can also be guided by the shape of the relationships between the dependent and explanatory variables. A graphical exploration of these relationships may be very useful. Sometimes these shapes may be curved, so polynomial or nonlinear models may be more appropriate than linear ones.
The choice of a model can also be closely tied to the specific question you are investigating. For example, estimating the Vmax and Km parameters of Michaelis-Menten enzyme kinetics requires the Michaelis-Menten equation itself, which links reaction rate (dependent variable) to substrate concentration (explanatory variable) in a nonlinear way.
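As an illustration, the Michaelis-Menten equation is usually written v = Vmax × S / (Km + S), where v is the reaction rate and S the substrate concentration. The sketch below fits it by least squares in plain Python. The data are hypothetical (generated without noise from Vmax = 10 and Km = 2), and the grid scan over Km is a simple stand-in chosen for readability, not the algorithm a statistics package would necessarily use.

```python
def michaelis_menten(s, vmax, km):
    """Reaction rate predicted by the Michaelis-Menten equation."""
    return vmax * s / (km + s)


def fit_michaelis_menten(substrate, rate, km_grid):
    """Least-squares fit by scanning candidate Km values.

    For a fixed Km the model is linear in Vmax, so the best Vmax has a
    closed form; the Km giving the smallest residual sum of squares is kept.
    Returns (vmax, km, rss).
    """
    best = None
    for km in km_grid:
        g = [s / (km + s) for s in substrate]
        vmax = sum(v * gi for v, gi in zip(rate, g)) / sum(gi * gi for gi in g)
        rss = sum((v - vmax * gi) ** 2 for v, gi in zip(rate, g))
        if best is None or rss < best[2]:
            best = (vmax, km, rss)
    return best


# Hypothetical measurements generated from Vmax = 10 and Km = 2 (no noise).
substrate = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
rate = [michaelis_menten(s, 10.0, 2.0) for s in substrate]

km_grid = [k / 100 for k in range(1, 1001)]  # candidate Km values from 0.01 to 10.00
vmax_hat, km_hat, rss = fit_michaelis_menten(substrate, rate, km_grid)
print(f"Estimated Vmax = {vmax_hat:.2f}, Km = {km_hat:.2f}")
```

With real, noisy measurements the estimates would only approximate the true parameters, and the resolution of the Km grid limits their precision.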
If the purpose of the study is only to make predictions from a large set of possibly correlated explanatory variables, then solutions other than parametric models may be considered. Partial Least Squares regression, for example, is very popular in chemometrics, where outputs are often predicted from a large spectrum of wavelengths.
What number of parameters should be included in the model?
Once you have chosen the appropriate modelling tool, you will often need to decide how many parameters to include in the model. The more parameters you include, the better the model fits the data (i.e. the smaller the residuals, and hence the higher the R² statistic). So should the number of parameters be maximized so that the residuals are as small as possible? Not really. A model that fits the data too closely (overfitting) reflects the particular sample that was used, and generalizes less accurately to the whole population.
Model quality, measured as the balance between a good fit to the data and a small number of parameters, can be assessed using indices such as Akaike’s Information Criterion (AIC) or the Bayesian Information Criterion (BIC or SBC). When several parametric models are compared, the model with the lowest index has the best quality in the set. These indices have no meaning in absolute terms, that is, when only one model is considered.
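A minimal sketch of such a comparison is given below. It assumes a least-squares fit with Gaussian errors, for which the indices can be written (up to an additive constant) as AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), with n the number of observations, RSS the residual sum of squares and k the number of estimated parameters. The data are hypothetical, and conventions for counting k vary between software packages, so only differences between models fitted to the same data should be read.

```python
import math
from statistics import mean


def aic_bic(rss, n, k):
    """AIC and BIC of a least-squares fit with Gaussian errors (constant term dropped)."""
    fit_term = n * math.log(rss / n)
    return fit_term + 2 * k, fit_term + k * math.log(n)


# Hypothetical data with a roughly linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
n = len(y)

# Model A: mean only (1 parameter).
y_bar = mean(y)
rss_mean = sum((yi - y_bar) ** 2 for yi in y)

# Model B: straight line (2 parameters: intercept and slope).
x_bar = mean(x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx
intercept = y_bar - slope * x_bar
rss_line = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

aic_mean, bic_mean = aic_bic(rss_mean, n, k=1)
aic_line, bic_line = aic_bic(rss_line, n, k=2)
print(f"Mean only:     AIC = {aic_mean:.2f}, BIC = {bic_mean:.2f}")
print(f"Straight line: AIC = {aic_line:.2f}, BIC = {bic_line:.2f}")
```

Here the straight line has the lower AIC and BIC despite its extra parameter, because the reduction in residuals far outweighs the penalty.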
The grid below will help you choose a statistical model that may be appropriate to your situation (types and numbers of dependent and explanatory variables). The grid also includes a column with an example in each situation.
Conditions of validity of parametric models are listed in the paragraph following the grid.
The displayed solutions are the most commonly used tools in statistics. They are all available in XLSTAT. The list is not exhaustive. Many other solutions exist.
|Dependent variable||Explanatory variable(s)||Example||Parametric models||Conditions of validity||Other solutions|
|One quantitative variable||One qualitative variable (= factor) with two levels||Effect of contamination (yes / no) on the concentration of a trace element in a plant||One-way ANOVA with two levels||1 ; 2 ; 3 ; 4||Mann-Whitney test|
|One quantitative variable||One qualitative variable with k levels||Effect of the site (4 factories) on the concentration of a trace element in a plant||One-way ANOVA||1 ; 2 ; 3 ; 4||Kruskal-Wallis test|
|One quantitative variable||Several qualitative variables with several levels||Combined effects of site (4 factories) and plant species on the concentration of a compound in plant tissue||Multi-way ANOVA (factorial designs)||1 ; 2 ; 3 ; 4||-|
|One quantitative variable||One quantitative variable||Effect of temperature on the concentration of a protein||Simple linear regression ; nonlinear models (depending on the shape of the relationship between the dependent and explanatory variables)||1 - 3||Nonparametric regression (*); quantile regression; regression trees (*); Random Forest (*)|
|One quantitative variable||Several quantitative variables||Effect of the concentration of several contaminants on plant biomass||Multiple linear regression ; nonlinear models||1 - 6||PLS regression (*); Lasso; Ridge; Elastic Net|
|One quantitative variable||Mixture of qualitative / quantitative variables||Combined effects of sex and age on glycaemia associated with a type of diabetes||ANCOVA||1 - 6||PLS regression (*); quantile regression; regression trees (*); Random Forest (*); Lasso; Ridge; Elastic Net|
|Several quantitative variables||Qualitative &/or quantitative variable(s)||Effect of a matrix of environmental variables on the transcriptome||MANOVA||1 ; 4 ; 7 ; 8||Redundancy analysis; PLS regression (*)|
|One qualitative variable||Qualitative &/or quantitative variable(s)||Dose effect on survival / death of mice||Logistic regression (binomial, ordinal or multinomial)||5 ; 6||PLS-DA (*); Discriminant Analysis (*); classification trees (*); classification Random Forest (*)|
|One count variable (with many zeros)||Qualitative &/or quantitative variable(s)||Dose effect on the number of necroses in mice||Log-linear regression (Poisson)||5 ; 6||-|
(*) solutions designed more for prediction
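The way the grid is read can be sketched as a simple lookup. The snippet below encodes a few rows of the grid in a Python dictionary; the short keys and the `suggest_model` function are hypothetical names introduced only for this illustration.

```python
# Partial, illustrative encoding of the grid.
# Key: (type of dependent variable, type of explanatory variables).
# Value: (parametric model, other solutions or None when the grid lists none).
GRID = {
    ("quantitative", "one factor, two levels"): (
        "One-way ANOVA with two levels", "Mann-Whitney test"),
    ("quantitative", "one factor, k levels"): (
        "One-way ANOVA", "Kruskal-Wallis test"),
    ("quantitative", "several quantitative"): (
        "Multiple linear regression; nonlinear models",
        "PLS regression; Lasso; Ridge; Elastic Net"),
    ("qualitative", "qualitative and/or quantitative"): (
        "Logistic regression",
        "PLS-DA; Discriminant Analysis; classification trees; classification Random Forest"),
    ("count", "qualitative and/or quantitative"): (
        "Log-linear regression (Poisson)", None),
}


def suggest_model(dependent, explanatory):
    """Return (parametric model, other solutions), or None if the case is not encoded."""
    return GRID.get((dependent, explanatory))


print(suggest_model("quantitative", "one factor, k levels"))
```

A lookup like this only narrows the choice; the conditions of validity still have to be checked for the model that is selected.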
Conditions of validity
The validity conditions we propose are rules of thumb. There are no precise rules in the literature. We strongly advise you to refer to the specific recommendations of your field.
1. Individuals are independent.
2. Variance is homogeneous.
3. Residuals follow a normal distribution.
4. At least 20 individuals (recommended).
5. Absence of multicollinearity (if the purpose is to estimate model parameters).
6. No more explanatory variables than individuals.
7. Multivariate normality of residuals.
8. Variance is homogeneous within every dependent variable. Correlations across dependent variables are homogeneous.