Discriminant analysis of principal components in Excel
This tutorial shows how to set up and interpret a discriminant analysis of principal components (DAPC) in Excel with XLSTAT-R.
Discriminant analysis of principal components is a method that aims to describe clusters as well as links between them using synthetic variables. It is commonly used to investigate the genetic structure of biological populations.
Dataset to run a discriminant analysis of principal components with XLSTAT-R
The data come from the adegenet package. For the purpose of this tutorial, we'll use the first dataset from dapcIllus (Jombart et al.). It contains the genotypes of 600 individuals, described by the counts of 140 alleles at 30 loci, as well as the group (1 to 6) to which each individual belongs. Each row represents an individual (observation) and each column represents an allele at a given locus.
Our goal is to describe each group of individuals as well as to explore the links within and between the groups.
Setting up a discriminant analysis of principal components with XLSTAT-R
Once XLSTAT is open, select the dapc function in the XLSTAT-R menu (adegenet package).
The DAPC dialog box appears.
In the General tab, select your data from the Excel sheet. The Data field contains all of the data from column A to EL.
The Groups field contains the groups – in our case, column EL, which specifies the number of the cluster to which each individual belongs.
Since we have selected the variable labels, we should activate the Variable labels option. Moreover, we have also selected the numbers of the individuals, so we check Observation labels (column A).
In the Options tab, we choose how many axes to retain from the PCA that is run on the data, either directly (Nbr of axes(PCA)) or indirectly by specifying a minimum percentage of variance to be explained (% of variance). Here, we choose to explain 60% of the variance.
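To illustrate what the % of variance option implies, here is a minimal Python sketch that finds the smallest number of PCA axes whose cumulative share of variance reaches a target. The function name axes_for_variance and the eigenvalues are hypothetical; they are not taken from XLSTAT-R or from the tutorial dataset.

```python
def axes_for_variance(eigenvalues, target_pct):
    """Return the smallest number of leading axes whose cumulative
    share of the total variance reaches target_pct (0 to 100)."""
    total = sum(eigenvalues)
    ordered = sorted(eigenvalues, reverse=True)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if 100.0 * cumulative / total >= target_pct:
            return count
    return len(ordered)


# Hypothetical eigenvalues, for illustration only.
example_eigenvalues = [4.0, 3.0, 2.0, 1.0]
n_axes = axes_for_variance(example_eigenvalues, 60)
print(n_axes)
```

With these made-up values, the first axis alone explains 40% of the variance and the first two explain 70%, so two axes are needed to reach the 60% target.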
We also choose the number of discriminant axes (synthetic variables) on which the analysis will be based, with Nbr of axes (DA).
In the Missing data tab, we can choose to reject missing data, to remove the affected observations, or to estimate missing values using the mean, the mode or the nearest neighbor.
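As an illustration of the simplest of these estimation strategies, the following Python sketch replaces each missing value with the mean of the observed values in the same column. It is an assumption about what mean imputation typically does, not a description of XLSTAT-R internals; the function name and the small table of allele counts are hypothetical.

```python
def impute_column_means(rows):
    """Replace None entries with the mean of the observed values
    in the same column. Assumes every column has at least one
    observed value."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


# Hypothetical allele counts with two missing cells (None).
allele_counts = [[0, 2], [None, 1], [2, None], [1, 0]]
completed = impute_column_means(allele_counts)
print(completed)
```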
In the Outputs tab, we may check every result that we want to display. Here, we have chosen to check everything in order to explain every possible output.
In the Charts tab, we may choose to display only the Peripherical individuals for each group or all groups along with the tree linking them.
Once you have clicked on OK, the calculations start and the results are displayed.
Interpretation of the results of a discriminant analysis of principal components with XLSTAT-R
Discriminant analysis of principal components returns several tables and graphs.
First, the Descriptive statistics table summarizes your data. For quantitative variables, it shows the number of observations, the number of missing values, the minimum, maximum, mean and standard deviation.
For example, we have 600 observations for allele 03 at locus 1 (loc1.03), and its count ranges from 0 to 2.
For qualitative variables, you can see the categories, their counts and the proportion represented by each category.
For instance, the individuals are evenly distributed among the groups.
Then, you have access to all information regarding the synthetic variables.
First, you can see the five largest eigenvalues, associated with the first five synthetic variables. The eigenvalues sum to the total inertia (or variance), so the larger an eigenvalue, the greater the percentage of variance its synthetic variable explains. In our example, the first synthetic variable therefore explains the largest percentage of variance.
Next, we can see the percentage of variance explained by the PCA (here, we had set it to 60%).
Then, the Loadings table is displayed. The synthetic variables used in the discriminant analysis are linear combinations of the principal components from the PCA. Here, the Observations column lists the 35 dimensions needed to explain 60% of the variance. The two other columns give the coefficients (also called loadings) applied to each dimension to build the synthetic variables (one column per synthetic variable).
Next, we have a table with the coordinates of each individual on the new discriminant analysis axes.
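The relationship between the Loadings table and these coordinates can be sketched in Python as a matrix product: each coordinate is the sum of an individual's principal component scores weighted by the loadings of one discriminant axis. This is a simplified illustration that assumes the principal component scores are already centered; the function name and all numbers are hypothetical.

```python
def discriminant_coordinates(pc_scores, loadings):
    """Multiply an (individuals x PCs) matrix of scores by a
    (PCs x axes) matrix of loadings."""
    n_axes = len(loadings[0])
    return [
        [
            sum(score * loadings[k][a] for k, score in enumerate(row))
            for a in range(n_axes)
        ]
        for row in pc_scores
    ]


# Two hypothetical individuals described by three principal components.
pc_scores = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
# Hypothetical loadings: three principal components, two discriminant axes.
loadings = [[0.5, 0.0], [1.0, -1.0], [0.0, 2.0]]

coords = discriminant_coordinates(pc_scores, loadings)
print(coords)
```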
We can also observe the coordinates of each group.
A table containing the posterior probabilities for each observation to belong to a group is then displayed.
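To give an intuition for this table, here is a hedged Python sketch of how posterior probabilities can be derived in a linear discriminant setting. It assumes that the discriminant axes are scaled so that the within-group covariance is the identity, in which case the posterior of a group is proportional to its prior times exp(-d²/2), where d is the distance from the individual to the group center. The function name, centers and priors are hypothetical, and the exact computation in XLSTAT-R may differ.

```python
import math


def posterior_probabilities(point, centroids, priors):
    """Posterior probability of each group for one individual,
    assuming Gaussian groups with identity covariance."""
    # Log of (prior x Gaussian density), up to a constant shared by all groups.
    log_scores = [
        math.log(prior) - 0.5 * sum((p - c) ** 2 for p, c in zip(point, centroid))
        for centroid, prior in zip(centroids, priors)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    weights = [math.exp(s - top) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Three hypothetical group centers on two discriminant axes, equal priors.
centroids = [[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]]
priors = [1 / 3, 1 / 3, 1 / 3]

# An individual halfway between the first two centers.
posterior = posterior_probabilities([2.0, 0.0], centroids, priors)
print(posterior)
```

In this made-up case the individual is equidistant from the first two centers, so those two groups receive the same posterior probability, and the third, more distant group receives a much smaller one.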
The last two tables show the loadings and the contributions of the variables to each discriminant analysis component.
As before, the loadings are the coefficients used to build the discriminant analysis axes, but this time each row corresponds to an original variable (an allele) rather than to a principal component, and the coefficients are applied to the values of the variables.
The variable contributions indicate the percentage that each variable has contributed to the construction of an axis. The greater the value, the more directly changes in that variable can be read along the axis.
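One common way to turn loadings into contributions, sketched below in Python, is to square each loading and divide by the sum of squared loadings on the same axis, so that the contributions of an axis add up to 100%. We present this as an assumption about how such contributions are usually derived rather than as the exact XLSTAT-R computation; the function name and the loadings are hypothetical.

```python
def variable_contributions(axis_loadings):
    """Percentage contribution of each variable to one axis,
    computed as its squared loading over the sum of squared loadings."""
    total = sum(v * v for v in axis_loadings)
    if total == 0:
        return [0.0 for _ in axis_loadings]
    return [100.0 * v * v / total for v in axis_loadings]


# Hypothetical loadings of three variables on a single axis.
axis_loadings = [3.0, -4.0, 0.0]
contrib = variable_contributions(axis_loadings)
print(contrib)
```

Note that the sign of a loading does not matter here: a variable with a large negative loading contributes as much as one with a large positive loading.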
Finally, two graphs are displayed. The first one represents all groups in a coordinate system defined by the first two components of the discriminant analysis. We can also see the minimum spanning tree, which links the groups based on the squared distances between their centers. Here, we can see that the clusters are generally distinct, but that clusters 1 and 2 (blue and lilac) sometimes overlap, as do clusters 5 and 6 (orange and red). Cluster 4 (yellow) is well defined.
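For readers curious about how such a tree can be built, the following Python sketch applies Prim's algorithm to squared Euclidean distances between group centers. It is only an illustration under that assumption; the function names and the four centers are hypothetical and do not come from the tutorial's results.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def minimum_spanning_tree(centers):
    """Prim's algorithm. Returns the tree as a list of (i, j) index
    pairs, in the order in which the edges are added."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(centers):
        best = None
        # Find the shortest edge from a center in the tree to one outside it.
        for i in in_tree:
            for j in range(len(centers)):
                if j in in_tree:
                    continue
                d = squared_distance(centers[i], centers[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges


# Four hypothetical group centers on two discriminant axes.
group_centers = [[0.0, 0.0], [1.0, 0.0], [5.0, 0.0], [5.0, 1.5]]
tree = minimum_spanning_tree(group_centers)
print(tree)
```

The tree connects every group using the smallest possible total squared distance, so two groups joined by an edge are close neighbors in the discriminant space.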
The second graph shows the position of the peripheral individuals of each cluster as well as the clusters they belong to.
We can see on the side of the graph that this coordinate system is defined by the first two components of the discriminant analysis (we chose to keep only two). Keep in mind that these representations retain only 60% of the variance (distance between clusters and individuals).
To conclude, we were able to describe the 6 groups formed by the 600 individuals on the basis of their allele counts at each locus. That way, we can visualize the similarities between the groups as well as the variables on which these similarities are based. For example, variable loc11.16 contributes 8% to axis 2 (vertical) whereas it contributes only 0.1% to axis 1 (horizontal). We may then suggest that clusters 2 and 5, which lie far apart on axis 2, differ in their counts of allele 16 at locus 11. Representing peripheral individuals also helps us identify which cluster an individual belongs to when overlapping clusters make this unclear. For example, here we can see that although clusters 5 and 6 overlap, individual 487 belongs to cluster 5.