What is statistical modeling?
In simple terms, statistical modeling is a simplified, mathematically formalized way to approximate reality (i.e., the process that generates your data) and, optionally, to make predictions from this approximation. The statistical model is the mathematical equation used for this approximation.
Here is a basic example. Suppose you want to report the weight of a variety of potatoes. We will consider a hard way and an easy way to do it. The hard way is to spend years measuring the weight of every single potato of this variety in the world and to report your data in an endless Excel spreadsheet. The easy way is to select a representative sample of 30 potatoes of this variety, compute the average and standard deviation of their weights, and report only those two numbers as an approximate description of the weight. Representing a quantity by an average and a standard deviation is a very simple form of statistical modeling.
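The easy way can be sketched in a few lines of Python. This is only an illustration: the weights below are made-up values, and the sample is shortened to 10 potatoes to keep the listing compact, where the example above uses 30.

```python
import statistics

# Hypothetical potato weights in grams (illustrative values, not real data).
# A real study would use the full sample of 30 potatoes.
weights_g = [152.0, 148.5, 160.2, 155.1, 149.8,
             158.4, 151.3, 157.0, 146.9, 153.6]

# The two numbers that make up the "model" of potato weight.
mean_weight = statistics.mean(weights_g)
sd_weight = statistics.stdev(weights_g)  # sample standard deviation (n - 1 denominator)

print(f"mean = {mean_weight:.1f} g, standard deviation = {sd_weight:.1f} g")
```

Here `statistics.stdev` uses the sample (n - 1) formula, which is the usual choice when the potatoes measured are a sample rather than the whole population.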
Another example is representing the height of plants as a function of soil water content by a straight line, characterized by a slope and an intercept, fitted after an experiment in which a sample of plants is grown at increasing levels of soil humidity. This particular model is called simple linear regression.
What are dependent and explanatory variables?
In almost all cases, statistical models involve explanatory and dependent variables.
The dependent variable is the one we want to describe, explain, or predict. As a rule of thumb, the dependent variable is often the one we represent on the Y axis in modeling charts. In the plant height example, the dependent variable is plant height.
Explanatory variables, also referred to as independent variables, are the ones we use to explain, describe, or predict the dependent variable(s). Explanatory variables are often represented on the X axis. The plant height example involves only one explanatory variable, which is quantitative: soil water content.
Both dependent and explanatory variables may be single or multiple, quantitative or qualitative. There are models adapted to different situations.
What if I have more explanatory variables than observations?
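Having more explanatory variables than observations, or simply a very large number of them, is a problem for classical statistical analyses such as linear regression, because the model parameters can no longer be estimated uniquely. Regularized regression methods address this problem by penalizing large coefficients. RIDGE regression shrinks all coefficients toward zero but keeps every variable in the model. LASSO regression can set some coefficients exactly to zero, and therefore assumes that only a part of the available explanatory variables is actually relevant to model the dependent variable. Elastic net combines both penalties.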
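To make the problem concrete, here is a minimal sketch in plain Python with 2 observations and 3 explanatory variables. It covers RIDGE regression only (not LASSO or Elastic net), omits the intercept for brevity, and uses made-up data. The helper names `solve_linear_system` and `ridge_coefficients` are hypothetical and not part of any library.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix (copy)
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: no unique solution")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def ridge_coefficients(X, y, penalty):
    """Solve (X^T X + penalty * I) beta = X^T y (no intercept)."""
    p = len(X[0])
    gram = [[sum(row[i] * row[j] for row in X) + (penalty if i == j else 0.0)
             for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(gram, xty)


# Hypothetical data: 2 observations (rows), 3 explanatory variables (columns).
X = [[1.0, 2.0, 0.5],
     [0.5, 1.0, 2.0]]
y = [3.0, 2.5]

# penalty = 0 is ordinary least squares: the system is singular here.
try:
    ridge_coefficients(X, y, penalty=0.0)
    ols_solvable = True
except ValueError:
    ols_solvable = False

# A positive penalty makes the system solvable.
coefs = ridge_coefficients(X, y, penalty=1.0)
print("ordinary least squares solvable:", ols_solvable)
print("ridge coefficients:", [round(c, 3) for c in coefs])
```

The penalty value 1.0 is arbitrary. In practice it is usually tuned, for example by cross-validation.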
What is a model parameter?
In classic, parametric models, the dependent variable(s) is linked to the explanatory ones through a mathematical equation (the model) that involves quantities called model parameters. In the plant height simple linear regression example, parameters are the intercept and the slope.
The equation may be written like this:
Height = intercept + slope*soil water content
Computations behind statistical modeling allow the estimation of model parameters and further predictions of the dependent variable.
Simple linear regression also involves a third parameter, the variance of residuals (see paragraph below).
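As a rough sketch of those computations, the three parameters of the plant height model can be estimated by ordinary least squares in plain Python. The measurements are hypothetical, and the function name `fit_simple_linear_regression` is made up for this illustration.

```python
def fit_simple_linear_regression(x, y):
    """Least-squares estimates for y = intercept + slope * x.

    Returns (intercept, slope, residual_variance).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variance: sum of squared residuals divided by n - 2,
    # because two parameters (intercept and slope) were estimated.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    residual_variance = sse / (n - 2)
    return intercept, slope, residual_variance


# Hypothetical measurements (illustrative values, not real data).
soil_water_content = [10.0, 15.0, 20.0, 25.0, 30.0]  # percent
plant_height = [12.1, 15.8, 20.3, 23.9, 28.2]        # centimeters

intercept, slope, residual_variance = fit_simple_linear_regression(
    soil_water_content, plant_height
)

# Predict the height of a plant grown at 22 % soil water content.
predicted_height = intercept + slope * 22.0

print(f"Height = {intercept:.2f} + {slope:.3f} * soil water content")
print(f"residual variance = {residual_variance:.4f}")
print(f"predicted height at 22 %: {predicted_height:.2f} cm")
```

Once the intercept and slope are estimated, a prediction is simply the equation evaluated at a new value of the explanatory variable.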
What is a model residual?
Technically, model residuals (often loosely called errors) are the vertical distances between the data points and the model (which is represented by the straight line in the plant height linear regression example). In other words, a residual is the observed value of the dependent variable minus the value predicted by the model.
Model residuals represent the part of the variability in the data that the model was unable to capture. The R² statistic is the proportion of the variability that is explained by the model. So the smaller the residuals, the higher the R² statistic.
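The relationship between residuals and R² can be illustrated with a short Python sketch. The data are hypothetical, and the intercept and slope are assumed to have already been estimated by least squares on these same data.

```python
# Hypothetical measurements (illustrative values, not real data).
soil_water_content = [10.0, 15.0, 20.0, 25.0, 30.0]  # percent
plant_height = [12.1, 15.8, 20.3, 23.9, 28.2]        # centimeters

# Assumed least-squares estimates for the hypothetical data above.
intercept = 3.94
slope = 0.806

# Residual = observed height - height predicted by the straight line.
residuals = [
    y - (intercept + slope * x)
    for x, y in zip(soil_water_content, plant_height)
]

mean_height = sum(plant_height) / len(plant_height)
ss_residual = sum(r ** 2 for r in residuals)                  # variability not captured
ss_total = sum((y - mean_height) ** 2 for y in plant_height)  # total variability

r_squared = 1.0 - ss_residual / ss_total

print("residuals:", [round(r, 2) for r in residuals])
print(f"R² = {r_squared:.4f}")
```

With `ss_total` fixed by the data, shrinking the residuals shrinks `ss_residual` and pushes R² toward 1.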
What statistical model should you choose?
This grid will guide you through the choice of the most commonly used models according to the type and number of dependent and independent variables. Solutions other than parametric models are also proposed.
How to do statistical modeling in XLSTAT?
In XLSTAT, you have access to several statistical models, such as one-way ANOVA with multiple comparisons, simple linear regression, random components mixed models, and nonlinear regression.
When launching a regression model, you simply have to select your variables from the datasheet. In XLSTAT, your response variables can be quantitative but also binary, ordinal, or multinomial while your explanatory variables can be quantitative or qualitative depending on the model.
Do not hesitate to visit our tutorial on how to run logistic regression.