# Pearson correlation coefficient in Excel

## Dataset for computing Pearson correlation coefficients

The data represents a sample of customers from an online shoe shop.  Rows correspond to customers and columns to the money they spent as well as several other characteristics (e.g. shoe size, weight…).

## Goal of this tutorial

The goal here is to compute the correlations between the money spent on the online shoe store and the different attributes. A correlation coefficient depicts the strength of the link between two quantitative variables, whether positive or negative. Our data consist of continuous variables so we will use the Pearson correlation coefficient.  We will also test the significance of the correlations.
Finally, we will generate two types of graphs:

• A correlation map to visually explore correlations and

• A matrix of scatter plots to visualize the relationships among all possible pairs of variables.

## A bit of theory : How to calculate the Pearson correlation coefficient?

Today, many correlation coefficient calculators are available, but you can easily calculate the linear correlation coefficient yourself.
The Pearson correlation coefficient is calculated with the following formula:

$r=\dfrac{n(\sum xy)-(\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$

### How to interpret correlation?

The correlation coefficient represents the strength of the linear relationship between two variables.
The closer its value is to -1, the stronger the negative link between the variables: when one increases, the other decreases.
The closer its value is to 1, the stronger the positive link: both variables increase or decrease at the same time. A correlation coefficient of 1 represents a perfect positive linear relationship between the variables.
If the correlation coefficient is close to 0, it means that there is no link between the two variables.

### How is the significance tested?

A test statistic is calculated with the formula below that follows a Student distribution with n-2 degrees of freedom.

$t_{n-2}=\dfrac{r}{1-r^2}\sqrt{n-2}$

If these values go out of bounds defined by the alpha = 0.05 value, then the null hypothesis is rejected and the Pearson correlation coefficient is significantly different from 0.

## Setting up a Pearson correlation coefficient computation in XLSTAT

• Open XLSTAT

• Select the Correlation /Association tests / Correlation tests command.

• In the General tab, select columns A-E in the Observations/Quantitative variables field. Then choose the Pearson correlation coefficient from the drop-down list. • In the Outputs tab, activate the display of the p-values, the coefficients of determination (R2), as well as the filtering and sorting of the variables depending on their R2. • Click on OK to start the computations.

## How to interpret the results of a Pearson correlation test?

The first results in XLSTAT are the descriptive statistics for all variables (mean, std.deviation, etc). The correlation matrix is then displayed followed by the 95% lower and upper confidence bounds for the correlation coefficients. One table will display the upper bounds and another the lower bounds. We can also display both bounds in a single table.  The correlations between the Invoice amount and the attributes Height and Weight are positive and strong (close to 1). On the other hand, we observe a negative correlation between the Time spent and the Invoice amount suggesting that the more time customers spend on the website the less money they spend.
All the coefficients appear to be significant at a 0.05 significance level (values in bold). The p-values will be computed for each coefficient in order to test the null hypothesis that the correlation coefficients are equal to 0. In other words, the risk of rejecting the null hypothesis (coefficient =0) while this is true is less than 5%. This is confirmed by the table of the p-values below (p-values < 0.0001).
Note that the shoe size is not displayed in the correlation matrix. This variable was excluded because it has the lowest sum of R2 among all the variables. The coefficients of determination correspond to the squared correlation coefficients. They measure the strength of the correlation, whether it is negative or positive. Here, using the filter variables option, we have chosen to display only the 4 variables for which the sum of R2 with other variables is the highest. Moreover, we have sorted the variables using the BEA (Bond Energy Algorithm). This method applies a permutation on rows and columns of a square matrix in a way that columns having similar values on rows are close to each other. The FPC (First Principal Component) is also available.
The next graph is a correlation map that uses a blue-red (cold-hot) scale to display the correlations. The blue color corresponds to a correlation close to -1 (e.g. Time spent on site vs Invoice amount) and the red color corresponds to a correlation close to 1 (e.g. Height vs Invoice amount). The following graph is a matrix of plots. A histogram is displayed for each variable (diagonal) and a scatter plot for all combinations of variables.
The color of the data points in the scatter plots reveals whether there is a positive (red) or negative correlation (blue).  The patterns found in the scatter plots indicate the type but also the strength of the relationship between the two variables. For example, shoe size looks like it is poorly linked to all other attributes (last column or last row of the matrix) implying correlations close to zero. ## What’s next: exploring quantitative variables with Principal Component Analysis

The Principal Coordinate Analysis in Excel is a multivariate analysis method that will allow us to further explore the relationships between the variables and relate customers to variables and to one another. 