Correspondence Analysis (CA) from a contingency table

2017-10-20

This tutorial will help you set up and interpret a Correspondence Analysis (CA) in Excel using the XLSTAT software.
Not sure if this is the right multivariate data analysis tool you need? Check out this guide.

Dataset for running a Correspondence Analysis

An Excel sheet with both the data and the results can be downloaded by clicking on the button below:

The data correspond to a survey asking moviegoers their opinion on a film they had just seen. The audience was also asked to give their age category.

Setting up a Correspondence Analysis

Once XLSTAT is open, select the XLSTAT / Analyzing data / Correspondence analysis command.

Once you have clicked on the button, the Correspondence Analysis dialog box appears.

Select the data on the Excel sheet. If your data are in a pivot table format (as in the example below), select the Two-way table format. If your data are in an Observations/variables format, select the corresponding option.

Note: If the names of the categories of the contingency table are included, select the Labels included option.

In the Options tab, select None for the Advanced analysis option.

Do not select the Non-symmetrical analysis option and choose Chi-square for the Distance. This combination of options allows to the computation of classical correspondence analysis (CA).

Note:
To run a non-symmetrical correspondence analysis (NSCA), you would select the Non-symmetrical analysis option (for which only the Chi-square distance is available).
To run a correspondence analysis based on the Hellinger distance (HD), you would not select the Non-symmetrical analysis option and choose Hellinger for the Distance.

In the Maps sub-tab of the Charts tab, three alternative ways of mapping the results are available. The rows and columns symmetric map is the most commonly used. For the purpose of the tutorial, all mapping alternatives were chosen.

In the Row options sub-tab and Column options sub-tab of the Charts tab, select the Confidence ellipses option.

Click the OK button. A dialog box appears, select the axes to be displayed on the graphical representation of the correspondence analysis (F1 and F2 in this tutorial) and validate.

Interpretation of the Correspondence Analysis

The objectives of the Correspondence Analysis are to study the association between two variables (rows and columns of a contingency table) and the similarities between the categories of each variable respectively (rows and columns respectively).

Before to start the interpretation let’s introduce the concept of profile. Indeed, Correspondence Analysis is based on the analysis of the profiles. A profile is a set of frequencies divided by their total, i.e. relative frequencies. In other words, a profile reflects the way the category of a variable changes according to the categories of the other variable.

The first displayed result is the test of independence between the rows and the columns, based on a chi-square statistic. If the chi-square observed value is greater than the critical value, then the p-value is below the chosen level alpha and one can conclude that the rows and the columns of the table are significantly associated. In our example, it is very likely that real differences exist between the age groups in terms of their appreciation profiles.

The eigenvalues correspond to the variance extracted by each factor (dimension). The quality of the analysis can be evaluated by consulting the table of the eigenvalues or the corresponding scree plot. If the sum of the two (or a few) first eigenvalues is close to the total represented, then the quality of the analysis is very high. The correspondence analysis in this example is of good quality as the sum of the first two eigenvalues adds up to 97% of the total inertia.

A list of tables is then displayed for the rows (and the columns respectively).

A first table shows the weights, distances and squared distances to the origin, inertias and relative inertias of the rows (and columns respectively). The weights are marginal proportions used to weight the point profiles when computing distances. The larger the distance to the origin, the more dissimilarity there is between the category profile and the mean profile (the more the category participates to the dependence between the two variables). The age groups 25-34, 35-44 and 45-54 have the shortest distance to the origin, indicating that these group profiles are close to the mean profile.

The row (respectively column) profiles are then displayed as well as the mean profile. In our example, the profiles of the age groups 25-34, 35-44 and 45-54 are close to each other and to the mean profile. The latter was foreseen by the short distance to the origin.

The distances between the rows (respectively columns) gives information about the similarity between categories. Again, the age groups 35-34, 35-44 and 45-54 appear to be similar, with distance below 0.2.

Then tables of the principal coordinates and standard coordinates of the rows (columns respectively) are displayed. The standard coordinates are principal coordinates divided by the square root of the corresponding factor eigenvalue. The weighted sum-of-squares of the standard coordinates equals 1 for each factor.

Following is a table of the contributions of the rows (columns respectively). The contributions correspond to the importance of each category for each factor (dimension). The sum of the contributions equals 1 for each factor. As a rule of thumb, if the contribution is greater than 1/I, I being the number of rows (respectively 1/J with J the number of columns), the category is important for the given factor. In our example, the 16-24 group is important for factor F1, groups 65-74 and 75+ are important for factor F2.

The next table shows the squared cosines of the rows (columns respectively). The squared cosines represent the importance of each factor for each category. The sum of the squared cosines equals 1 for a given category. In our example nearly all the variance of the 16-24 group is attributed to factor F1.

The different maps are then displayed.

The symmetric plot or French plot is the most commonly used. The row profiles and columns profiles are overlaid in a joint display (both in principal coordinates). This display is very convenient as both row and column points are equally spread out. The distance between the row points (respectively column points) approximate the inter-row (respectively inter-column) chi-square distance. The age groups 25-34, 35-44 and 45-54 are nearly superimposed on the symmetric map, showing very similar profiles.

The proximity between the row and column points cannot be interpreted directly.

Confidence ellipses can be added on the symmetric plots, as shown on the symmetric row plot. If the origin lies in the ellipse of a given category, this category does not contribute to the dependence between variables. In our example, the ellipses confirm that the age groups 25-34, 34-45 ad 45-54 do not contribute to the dependency between the variables. The age group 16-24 contributes to the dependency between variables.

The asymmetric row plot shows the columns represented in the row space (columns from the standard coordinates and rows from the principal coordinates). Inversely, the asymmetric column plot corresponds to the rows represented in the column space. Distance between rows and columns should be interpreted by projecting the row points on the column vectors. Whether to interpret the axes in terms of rows or columns depends on how appropriate the interpretation is. In our example, we choose to interpret the age group in appreciation space. The first dimension opposes good to bad. In the 16-24 group, a higher proportion qualified the product as good, compared to the proportions of “good” in the other age groups. However, this does not mean that the “good” qualification had the highest proportion compared to the other qualifications in the 16-24 group. The row profiles do not deviate much from the mean profile (row points are close to the origin).

The contribution coordinates of the rows and columns are then displayed. The contribution coordinates are obtained by dividing the standard coordinates by the squared root of the mass of the given category.

In the contribution biplot (rows), rows are in contribution coordinates and columns are in principal coordinates, and inversely for the contribution biplot (columns). In the row (respectively column) contribution biplot, the distances of the row (respectively columns) points to the origin is related to their contribution to the map. In our example, on the row contribution biplot, the position of the row points are unchanged compare to the asymmetric plot. The columns point however are closer to the origin (see the scales of the two maps).

Correspondence analysis is a very effective technique for analyzing 2-way tables. When more than two categorical variables are used in a survey, the best technique to use is Multiple Correspondence analysis (MCA).

The following video addresses CA theory and an implementation in XLSTAT.

1c26995d494fb3061dd0ae8571ffc0a4@xlstat.desk-mail.com
https://cdn.desk.com/
false
desk