Which statistical model should you choose?

A guide to choosing a statistical modelling tool according to the situation

The choice of a statistical model is not straightforward. It is a mistake to think that each data set has a single model suited to it. If you are new to statistical modelling, this easy and short tutorial may be useful before exploring the following grid.

Every modelling tool answers specific questions. For example, glycaemia linked to a specific type of diabetes can be explained by a qualitative variable (sex, for example). In this situation, the ANOVA model can be used. Using the same data, we may also use age (a quantitative variable) to see whether glycaemia follows a linear increasing or decreasing trend with the age of the patients. In this situation we would use linear regression.
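As a rough illustration of these two questions, the sketch below computes a one-way ANOVA F statistic for the sex effect and a least-squares slope for the age trend on the same observations. The data values and the function names `one_way_anova_f` and `simple_linear_regression` are hypothetical and chosen for this example only; they are not XLSTAT functions. Only the Python standard library is used.

```python
from statistics import mean

# Hypothetical data: glycaemia (g/L), sex and age for 8 patients.
glycaemia = [1.10, 1.25, 1.30, 1.45, 1.15, 1.30, 1.45, 1.60]
sex = ["F", "F", "F", "F", "M", "M", "M", "M"]
age = [30, 40, 50, 60, 28, 42, 55, 65]


def one_way_anova_f(values, groups):
    """Return the one-way ANOVA F statistic (between / within mean squares)."""
    grand_mean = mean(values)
    levels = sorted(set(groups))
    by_level = {lv: [v for v, g in zip(values, groups) if g == lv] for lv in levels}
    ss_between = sum(len(vs) * (mean(vs) - grand_mean) ** 2 for vs in by_level.values())
    ss_within = sum((v - mean(vs)) ** 2 for vs in by_level.values() for v in vs)
    df_between = len(levels) - 1
    df_within = len(values) - len(levels)
    return (ss_between / df_between) / (ss_within / df_within)


def simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Same dependent variable, two different questions.
f_stat = one_way_anova_f(glycaemia, sex)                     # qualitative factor
intercept, slope = simple_linear_regression(age, glycaemia)  # quantitative variable
print(f"ANOVA F statistic for sex: {f_stat:.3f}")
print(f"Regression slope for age: {slope:.4f} g/L per year")
```

The F statistic and the slope would still have to be compared with their reference distributions to obtain p-values, which this sketch leaves out.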

The choice of a statistical model can also be guided by the shape of the relationships between the dependent and explanatory variables. A graphical exploration of these relationships may be very useful. Sometimes these shapes may be curved, so polynomial or nonlinear models may be more appropriate than linear ones.

The choice of a model can also be intimately tied to the very specific question you are investigating. For example, estimating the Vmax and Km parameters of Michaelis-Menten enzyme kinetics requires the specific Michaelis-Menten equation, which links reaction rate (dependent variable) to substrate concentration (explanatory variable) in a nonlinear way.
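The Michaelis-Menten equation is v = Vmax × S / (Km + S), where v is the reaction rate and S the substrate concentration. The sketch below shows one simple way to estimate Vmax and Km by least squares. It is an illustrative approach and not necessarily the algorithm a statistical package uses: for each candidate Km on a grid, the best Vmax has a closed form because the model is linear in Vmax, and the pair with the smallest residual sum of squares is kept. The data are hypothetical and noise-free, and the function names are assumptions made for this example.

```python
def michaelis_menten(s, vmax, km):
    """Reaction rate v = Vmax * S / (Km + S)."""
    return vmax * s / (km + s)


def fit_michaelis_menten(substrate, rate, km_grid):
    """Least-squares fit of (Vmax, Km) by a grid search on Km.

    For a fixed Km the model is linear in Vmax, so the best Vmax is
    sum(g * v) / sum(g * g) with g = S / (Km + S). The Km giving the
    smallest residual sum of squares is returned with its Vmax.
    """
    best = None
    for km in km_grid:
        g = [s / (km + s) for s in substrate]
        vmax = sum(gi * vi for gi, vi in zip(g, rate)) / sum(gi * gi for gi in g)
        rss = sum((vi - vmax * gi) ** 2 for gi, vi in zip(g, rate))
        if best is None or rss < best[0]:
            best = (rss, vmax, km)
    return best[1], best[2]


# Hypothetical noise-free data generated with Vmax = 10 and Km = 2.
substrate = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
rate = [michaelis_menten(s, 10.0, 2.0) for s in substrate]

km_grid = [i / 100 for i in range(1, 1001)]  # candidate Km values from 0.01 to 10.00
vmax_hat, km_hat = fit_michaelis_menten(substrate, rate, km_grid)
print(f"Vmax = {vmax_hat:.2f}, Km = {km_hat:.2f}")
```

With real, noisy measurements the estimates would only approximate the true parameters, and a dedicated nonlinear regression routine would also provide standard errors.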

If the purpose of the study is only to make predictions from a large set of possibly correlated explanatory variables, then solutions other than parametric models may be considered. For example, Partial Least Squares regression is very popular in chemometrics, where outputs are often predicted from a large spectrum of wavelengths.

What number of parameters should be included in the model?

Once you have chosen the appropriate modelling tool, you may often wonder how many parameters to include in the model. The more parameters you include, the better the model fits the data (i.e. the smaller the residuals, which implies a higher R² statistic). So should the number of parameters be maximized so that the residuals become as small as possible? Not really. A model that fits the data too closely (an overfitted model) is too specific to the particular sample used, and it generalizes less accurately to the whole population.

Model quality, understood as the balance between a fair fit to the data and a minimal number of parameters, can be assessed using indices such as Akaike’s Information Criterion (AIC) or the Bayesian Information Criterion (BIC or SBC). When several parametric models are compared on the same data, the model with the lowest index has the best quality in the set. These indices have no meaningful interpretation in an absolute context, in other words when only one model is considered.
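To make the trade-off concrete, the sketch below compares an intercept-only model (1 parameter) with a simple linear regression (2 parameters) on a small hypothetical data set with no real trend. It uses the common least-squares forms AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), which are defined up to a constant shared by all models fitted to the same data. Software packages may count parameters or include constants differently, so only differences between models should be read from these values.

```python
from math import log
from statistics import mean


def aic_least_squares(rss, n, k):
    """AIC for a least-squares fit, up to an additive constant."""
    return n * log(rss / n) + 2 * k


def bic_least_squares(rss, n, k):
    """BIC for a least-squares fit, up to an additive constant."""
    return n * log(rss / n) + k * log(n)


# Hypothetical data: y fluctuates around 5 with no real trend in x.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [5.1, 4.8, 5.3, 4.9, 5.2, 4.7, 5.0, 5.4, 4.6, 5.0]
n = len(y)
x_bar, y_bar = mean(x), mean(y)

# Model A: intercept only (k = 1).
rss_mean = sum((yi - y_bar) ** 2 for yi in y)

# Model B: simple linear regression (k = 2).
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar
rss_linear = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

aic_mean, aic_linear = aic_least_squares(rss_mean, n, 1), aic_least_squares(rss_linear, n, 2)
bic_mean, bic_linear = bic_least_squares(rss_mean, n, 1), bic_least_squares(rss_linear, n, 2)

for name, rss, aic, bic in [
    ("intercept only", rss_mean, aic_mean, bic_mean),
    ("linear", rss_linear, aic_linear, bic_linear),
]:
    print(f"{name:15s} RSS = {rss:.4f}  AIC = {aic:.2f}  BIC = {bic:.2f}")
```

In this example the linear model has the smaller residual sum of squares, as any model with an extra parameter would, yet both indices are lower for the intercept-only model: the gain in fit does not justify the additional parameter.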

The grid

The grid below will help you choose a statistical model that may be appropriate to your situation (types and numbers of dependent and explanatory variables). The grid also includes a column with an example in each situation.

Conditions of validity of parametric models are listed in the paragraph following the grid.

The displayed solutions are the most commonly used tools in statistics. They are all available in XLSTAT. The list is not exhaustive. Many other solutions exist.

Dependent variable	Explanatory variable(s)	Example	Parametric models	Conditions of validity	Other solutions
One quantitative variable	One qualitative variable (= factor) with two levels	Effect of contamination (yes / no) on the concentration of a trace element in a plant	One-way ANOVA with two levels	1 ; 2 ; 3 ; 4	Mann-Whitney test
	One qualitative variable with k levels	Effect of the site (4 factories) on the concentration of a trace element in a plant	One-way ANOVA	1 ; 2 ; 3 ; 4	Kruskal-Wallis test
	Several qualitative variables with several levels	Combinatory effects of site (4 factories) and plant species on the concentration of a compound in plant tissue	Multi-way ANOVA (factorial designs)	1 ; 2 ; 3 ; 4
	One quantitative variable	Effect of temperature on the concentration of a protein	Simple linear regression ; nonlinear models (depends on the shape of the relationship between the dependent / explanatory variable)	1 - 3	nonparametric regression (*); quantile regression; regression trees (*); Random Forest (*)
	Several quantitative variables	Effect of the concentration of several contaminants on plant biomass	Multiple linear regression ; nonlinear models	1 - 6	PLS regression (*); Lasso; Ridge; Elastic Net
	Mixture of qualitative / quantitative variables	Combined effects of sex and age on glycaemia associated with a type of diabetes	ANCOVA	1 - 6	PLS regression (*); quantile regression; regression trees (*); Random Forest (*); Lasso; Ridge; Elastic Net
Several quantitative variables	Qualitative &/or quantitative variable(s)	Effect of an environmental variables matrix on the transcriptome	MANOVA	1 ; 4 ; 7 ; 8	Redundancy analysis; PLS regression (*)
One qualitative variable	Qualitative &/or quantitative variable(s)	Dose effect on survival / death of mouse individuals	Logistic regression (binomial or ordinal or multinomial)	5 ; 6	PLS-DA (*); Discriminant Analysis (*); classification trees (*); classification Random Forest (*)
One count variable (with many zeros)	Qualitative &/or quantitative variable(s)	Dose effect on the number of necroses in mice	Log-linear regression (Poisson)	5 ; 6

(*) solutions designed more for prediction
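For readers who prefer to query the grid programmatically, the parametric-model column can be encoded as a small lookup table. This is only a sketch: the key strings and the function name `suggest_model` are hypothetical labels invented for this example, and the table covers the parametric models only, not the other solutions.

```python
# Parametric models from the grid, keyed by
# (dependent variable, explanatory variables).
# The key strings are hypothetical labels chosen for this sketch.
GRID = {
    ("one quantitative", "one qualitative, two levels"): "One-way ANOVA with two levels",
    ("one quantitative", "one qualitative, k levels"): "One-way ANOVA",
    ("one quantitative", "several qualitative"): "Multi-way ANOVA (factorial designs)",
    ("one quantitative", "one quantitative"): "Simple linear regression or nonlinear models",
    ("one quantitative", "several quantitative"): "Multiple linear regression or nonlinear models",
    ("one quantitative", "mixed qualitative / quantitative"): "ANCOVA",
    # Rows of the grid that accept qualitative and/or quantitative variables.
    ("several quantitative", "any"): "MANOVA",
    ("one qualitative", "any"): "Logistic regression (binomial, ordinal or multinomial)",
    ("one count", "any"): "Log-linear regression (Poisson)",
}


def suggest_model(dependent, explanatory):
    """Return the parametric model listed in the grid, or None if no row matches."""
    if (dependent, explanatory) in GRID:
        return GRID[(dependent, explanatory)]
    # Fall back to rows that accept any kind of explanatory variable.
    return GRID.get((dependent, "any"))


print(suggest_model("one quantitative", "mixed qualitative / quantitative"))
print(suggest_model("one count", "one quantitative"))
```

A lookup of this kind only reproduces the grid; the conditions of validity still have to be checked before a suggested model is used.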

Conditions of validity

The validity conditions we propose are rules of thumb. There are no precise rules in the literature. We strongly advise you to refer to your field’s specific recommendations.

1. Individuals are independent.
2. Variance is homogeneous.
3. Residuals follow a normal distribution.
4. At least 20 individuals (recommended).
5. Absence of multicollinearity (if the purpose is to estimate model parameters).
6. No more explanatory variables than individuals.
7. Multivariate normality of residuals.
8. Variance is homogeneous within every dependent variable. Correlations across dependent variables are homogeneous.
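Some of these conditions can be screened before fitting a model. The sketch below checks the fourth, fifth and sixth conditions in the list above (at least 20 individuals, absence of multicollinearity, no more explanatory variables than individuals) on a table of quantitative explanatory variables. The function name `check_design` and the correlation threshold of 0.9 are assumptions made for this example, not recommendations from the text, and a pairwise correlation screen is only a crude proxy for multicollinearity.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    a_bar, b_bar = mean(a), mean(b)
    sab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    saa = sum((x - a_bar) ** 2 for x in a)
    sbb = sum((y - b_bar) ** 2 for y in b)
    return sab / sqrt(saa * sbb)


def check_design(columns, min_individuals=20, max_abs_corr=0.9):
    """Return a list of warnings for a dict {variable name: list of values}.

    max_abs_corr is an arbitrary threshold assumed for this sketch.
    """
    n = len(next(iter(columns.values())))
    warnings = []
    if n < min_individuals:
        warnings.append(f"only {n} individuals (at least {min_individuals} recommended)")
    if len(columns) > n:
        warnings.append("more explanatory variables than individuals")
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > max_abs_corr:
            warnings.append(f"{name_a} and {name_b} are strongly correlated (r = {r:.2f})")
    return warnings


# Hypothetical explanatory variables measured on 5 individuals.
explanatory = {
    "contaminant_a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "contaminant_b": [2.1, 3.9, 6.2, 8.0, 9.9],  # nearly proportional to contaminant_a
    "contaminant_c": [3.0, 1.0, 4.0, 1.0, 5.0],
}
design_warnings = check_design(explanatory)
for message in design_warnings:
    print("Warning:", message)
```

The remaining conditions (independence, homogeneity of variance, normality of residuals) depend on the study design or on the fitted residuals and are not covered by this sketch.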
