Correspondence Analysis (CA) from a contingency table
This tutorial will help you set up and interpret a Correspondence Analysis (CA) in Excel using the XLSTAT software. Not sure if this is the right multivariate data analysis tool you need? Check out this guide.
Dataset for running a Correspondence Analysis
The data correspond to a survey asking moviegoers their opinion on a film they had just seen. The audience was also asked to give their age category.
Goal of this tutorial
The purpose of this tutorial is to learn how to set up and interpret a Correspondence Factorial Analysis. The objectives of this method are to study the association between two variables (rows and columns of a contingency table) and the similarities between the categories of each variable (rows and columns respectively).
Setting up a Correspondence Analysis in XLSTAT
Once XLSTAT is open, select the XLSTAT / Analyzing data / Correspondence analysis feature.
Once you have clicked on the button, the Correspondence Analysis dialog box appears.
In the General tab, select the entire table on the Excel sheet. If your data is in a contingency table, as in this example, select the Two-way table format. If your data is in an Individuals/Variables table, select the Observations/Variables table option.
If the categories names for each variable are present for both rows and columns, make sure the Labels included box is checked.
Set up the Options tab as follows:
Advanced analysis: select None to perform a classic CA.
Do not activate the Non-symmetric analysis option.
Distance: choose the Chi-square distance.
Note:To run a Non-Symmetrical Correspondence Analysis (NSCA), you would select the Non-symmetrical analysis option (for which only the Chi-square distance is available).To run a Correspondence Analysis based on the Hellinger distance (HD), you would not select the Non-symmetrical analysis option and choose Hellinger for the Distance.To run a Detrended Correspondence Analysis, you would select the Detrended analysis in the Advanced analysis option.
In the Outputs tab, select the results below:
In the Maps sub-tab of the Charts tab, three alternative ways of mapping the results are available. The rows and columns symmetric map is the most commonly used. For the purpose of the tutorial, all mapping alternatives were chosen.
Additionally, select Confidence Ellipses to display confidence ellipses that identify the categories which contribute to the dependency between the row and column categories of the contingency table.
The calculations start when you click OK. The results are then displayed.
Interpreting of a Correspondence Analysis
Before to start the interpretation let’s introduce the concept of profile. Correspondence Analysis is based on the analysis of the profiles. A profile is a set of frequencies divided by their total, i.e. relative frequencies. In other words, a profile reflects the way the category of a variable changes according to the categories of the other variable.
The first displayed result is the test of independence between the rows and the columns, based on a Chi-square statistic. If the Chi-square observed value is greater than the critical value, and if the p-value is below the chosen level alpha, then we may conclude that the rows and the columns of the table are significantly associated. In this example, it is very likely that real differences exist between the age groups in terms and their appreciation profiles.
The eigenvalues correspond to the variance extracted by each factor (dimension). The quality of the analysis can be evaluated by consulting the table of the eigenvalues or the corresponding scree plot. If the sum of the two (or a few) first eigenvalues is close to the total represented, then the quality of the analysis is very high. The correspondence analysis in this example is of good quality as the sum of the first two eigenvalues adds up to 97% of the total inertia.
A list of tables is then displayed for the rows (and the columns respectively).
A first table shows the weights, distances and squared distances to the origin, inertias and relative inertias of the rows (and columns respectively). The weights are marginal proportions used to weight the point profiles when computing distances. The larger the distance to the origin, the more dissimilarity there is between the category profile and the mean profile (the more the category participates to the dependence between the two variables). The age groups 25-34, 35-44 and 45-54 have the shortest distance to the origin, indicating that these group profiles are close to the mean profile.
The row (respectively column) profiles are then displayed as well as the mean profile. In our example, the profiles of the age groups 25-34, 35-44 and 45-54 are close to each other and to the mean profile. The latter was foreseen by the short distance to the origin.
The distances between the rows (respectively columns) gives information about the similarity between categories. Again, the age groups 25-34, 35-44 and 45-54 appear to be similar, with distance below 0.2.
Principal coordinates and standard coordinates of the rows (columns respectively) are also displayed. The standard coordinates are principal coordinates divided by the square root of the corresponding factor eigenvalue. The weighted sum-of-squares of the standard coordinates equals 1 for each factor.
Following is a table of the contributions of the rows (columns respectively). The contributions correspond to the importance of each category for each factor (dimension). The sum of the contributions equals 1 for each factor. As a rule of thumb, if the contribution is greater than 1/I, l being the number of rows (respectively 1/J with J the number of columns), the category is important for the given factor. In our example, the 16-24 group is important for factor F1, groups 65-74 and 75+ are important for factor F2.
The next table shows the squared cosines of the rows (columns respectively). The squared cosines represent the importance of each factor for each category. The sum of the squared cosines equals 1 for a given category. In our example nearly all the variance of the 16-24 group is attributed to factor F1.
The different maps are then displayed starting with the symmetric plot of rows and columns or French plot which is the most commonly used. The row profiles and columns profiles are overlaid in a joint display (both in principal coordinates). This display is very convenient as both row and column points are equally spread out. The distance between the row points (respectively column points) approximates the inter-row (respectively inter-column) Chi-square distance. The age groups 25-34, 35-44 and 45-54 are nearly superimposed on the symmetric map, showing very similar profiles.
The proximity between the row and column points cannot be interpreted directly on this graph.
Confidence ellipses can be added on the symmetric row or column plots, as shown on the symmetric row plot below. If the origin lies in the ellipse of a given category, this category does not contribute to the dependence between variables. In our example, the ellipses confirm that the age groups 25-34, 34-45 and 45-54 do not contribute to the dependency between the variables. The age group 16-24 contributes to the dependency between variables.
The asymmetric row plot shows the columns represented in the row space (columns from the standard coordinates and rows from the principal coordinates). Inversely, the asymmetric column plot corresponds to the rows represented in the column space. Distance between rows and columns should be interpreted by projecting the row points on the column vectors. Whether to interpret the axes in terms of rows or columns depends on how appropriate the interpretation is. In our example, we choose to interpret the age group in appreciation space. The first dimension opposes “GOOD” to “BAD”. In the 16-24 group, a higher proportion qualified the product as “GOOD”, compared to the proportions of “GOOD” in the other age groups. However, this does not mean that the “GOOD” qualification had the highest proportion compared to the other qualifications in the 16-24 group. The row profiles do not deviate much from the mean profile (row points are close to the origin).
The contribution coordinates of the rows and columns are then displayed. The contribution coordinates are obtained by dividing the standard coordinates by the squared root of the mass of the given category.
In the contribution biplot (rows), rows are in contribution coordinates and columns are in principal coordinates, and inversely for the contribution biplot (columns). In the row (respectively column) contribution biplot, the distances of the row (respectively columns) points to the origin are related to their contribution to the map. In our example, on the row contribution biplot, the positions of the row points are unchanged compared to the asymmetric plot. The columns point however are closer to the origin (see the scales of the two maps).
Correspondence Analysis is a very effective technique for analyzing 2-way tables. When more than two categorical variables are used in a survey, the best technique to use is Multiple Correspondence Analysis (MCA).
The following video addresses CA theory and an implementation in XLSTAT.
Was this article useful?