Dataset for computing Pearson correlation coefficientsAn Excel sheet with both the data and the results can be downloaded by clicking on the button below:
Download the data
The data represents a sample of customers from an online shoe shop. Rows correspond to customers and columns to the money they spent as well as several other characteristics (e.g. shoe size, weight…).
Goal of this tutorial
The goal here is to compute the correlations between the money spent on the online shoe store and the different attributes. A correlation coefficient depicts the strength of the link between two quantitative variables, whether positive or negative. Our data consist of continuous variables so we will use the Pearson correlation coefficient. We will also test the significance of the correlations.
Finally, we will generate two types of graphs:
- A correlation map to visually explore correlations and
- A matrix of scatter plots to visualize the relationships among all possible pairs of variables.
Setting up a Pearson correlation coefficient computation in XLSTAT1. Once XLSTAT is open, select the Correlation /Association tests / Correlation tests command as shown below.
2. The Correlation tests dialog box appears.
3. In the General tab, select columns A-E in the Observations/Quantitative variables field. Then choose the Pearson correlation coefficient from the drop-down list. As the first row of the table corresponds to headers, we leave the Variable labels option checked.
4. In the Outputs tab, activate the following options.
The p-values will be computed for each coefficient in order to test the null hypothesis that the correlation coefficients are equal to 0.
The coefficients of determination correspond to the squared correlation coefficients. They measure the strength of the correlation, whether it was negative or positive. using the filter variables option, we choose to display only the 4 variables for which the sum of R2 with other variables is the highest.
We finally sort the variables using the BEA (Bond Energy Algorithm). This method applies a permutation on rows and columns of a square matrix in a way that columns having similar values on rows are close to each other. The FPC (First Principal Component) is also available.
The (1-alpha)100 % confidence intervals will be also computed. One table will display the upper bounds and another the lower bounds. We can also display both bounds in a single table.
5. In the Charts tab, activate the following options.
In the Image tab, we can choose to display the correlation matrix as an image. This option can be very useful when correlation matrices contain a high number of variables in order to see quickly which variables have the same structure.
Interpreting the results for Pearson correlation coefficientsThe first results are the descriptive statistics for all variables (mean, std.deviation, etc). The correlation matrix is then displayed followed by the 95% lower and upper confidence bounds for the correlation coefficients :
Correlation coefficients vary between -1 and 1. Negative values indicate negative correlation, and positive values indicate positive correlations. Values close to zero reflect the absence of correlation.
The correlations between the Invoice amount and the attributes Height and Weight are positive and strong (close to 1). On the other hand, we observe a negative correlation between the Time spent and the Invoice amount suggesting that the more time customers spend on the website the less money they spend.
All the coefficients appear to be significant at a 0.05 significance level (values in bold). In other words, the risk of rejecting the null hypothesis (coefficient =0) while this is true is less than 5%. This is confirmed by the table of the p-values below (p-values < 0.0001).
Note that the shoe size is not displayed in the correlation matrix. This variable was excluded because it has the lowest sum of R2 among all the variables.
The next graph is a correlation map that uses a blue-red (cold-hot) scale to display the correlations. The blue color corresponds to a correlation close to -1 (e.g. Time spent on site vs Invoice amount) and the red color corresponds to a correlation close to 1 (e.g. Height vs Invoice amount).
The following graph is a matrix of plots. A histogram is displayed for each variable (diagonal) and a scatter plot for all combinations of variables.
The color of the data points in the scatter plots reveals whether there is a positive (red) or negative correlation (blue). The patterns found in the scatter plots indicate the type but also the strength of the relationship between the two variables. For example, shoe size looks like it is poorly linked to all other attributes (last column or last row of the matrix) implying correlations close to zero.