Differential expression (OMICS) in Excel
This tutorial explains how to set up and interpret differential expression analysis in Excel using the XLSTAT statistical software.
This analysis allows to detect the feature most differentially expressed according to the explanatory variables within a feature /individuals data table that can reach very large dimensions. For example, it can be used to study the effects of explanatory variables on protein production or metabolite regulation in a context of high-speed OMICS data.
Dataset for running a differential expression analysis in XLSTAT
For this tutorial we use a simulated data table corresponding to 36 biological samples of diseased and healthy individuals belonging to three different genotypes. For each sample, the expression of 1561 genes is measured through RNA quantification.
RNAs are stored in rows and samples in columns. A genotype factor and a health status factor are added to the right of the data matrix. Factors row numbers correspond to the number of samples (number of columns of the data matrix).
The aim of this tutorial is to use the differential expression tool in XLSTAT to identify differentially expressed genes according to two factors: genotype (three levels: BB, BK, KK) and health status (two levels: healthy and diseased).
Setting up a differential expression analysis in XLSTAT
After opening XLSTAT, select the XLSTAT / Laboratory data analysis / Differential expression analysis as shown in the figure below.
Once you have clicked on the button, the Differential expression dialog box appears.
In the General tab, select the data in the Features/individuals table field. Here, each sample corresponds to an individual. We have kept the features in rows data format, as genes are stored in rows in the dataset.
Then select the explanatory variables corresponding to the genotype and state factors and select the observations labels given in the column A of the file.
In the Options tab, select the Parametric test type. This option will produce a one-way ANOVA per factor and feature. For small sample sizes, we recommend using the non-parametric method instead, which replaces the one-way ANOVAs by Kruskal-Wallis tests.
In the Post hoc corrections, choose the Benjamini-Hochberg procedure, which is very commonly used in differential expression studies. It is part of the False Discovery Rate (FDR) p-value corrections family. It is well suited for studies involving the computation of a large number of p-values as it is less stringent than corrections which are part of the Family Wise Error Rate (FWER) family, such as the Bonferroni correction.
Set the number of low p-values to keep to 30, to avoid displaying huge lists of p-values in the output (high p-values are not quite interesting in the context of our study).
Activate the multiple pairwise comparisons option and choose Tukey (HSD) in order to obtain pairwise multiple comparisons among the genotype levels for each gene.
Finally, activate the non-specific filtering option, choose %(Std. dev.) with a 50% threshold to eliminate 50% of genes based on the lowest standard deviations criterion prior to analyses.
In the Charts tab, activate both the Histogram of p-values and the volcano plot options.
The two following options represent two ways of representing biological effects on the volcano plot’s x axis.
We have chosen the Log2(means ratio), because our data are not transformed.
Activate the Identify features option. XLSTAT will color highly significant features at both statistical and biological levels according to the two following thresholds x and y.
Choose 1 for Threshold(x). A log2(means ratio) of 1 means that the mean on the numerator is twice as high as the mean at the denominator. Conversely, a log2(means ratio) of -1 means that the mean on the denominator is twice as high as the mean of the numerator. A log2(means ratio) of 2 or -2 represents a fold change of 2², and so on.
Choose a 0.001 p-value threshold in the Threshold(y) box. This means that the statistical significance threshold will be –log10(0.001).
Once you have clicked on the OK button, the computations begin and then the results are displayed.
Interpreting the results of a differential expression analysis in XLSTAT
After a summary of the different options used in the analysis, the number of features which were eliminated by non-specific filtering is displayed. Then one analysis is displayed for each factor.
First, a table displaying the 30 most significant features sorted according to increasing p-values is displayed. The table contains the feature names, penalized p-values, significance, and average RNA quantity for each factor level. If a p-value is significant, the user may be interested in multiple pairwise comparisons represented by letters associated to means. Two levels sharing the same letter are not significantly different. Two levels having no letters in common are significantly different.
For the genotype factor, there are no significant p-values at alpha = 0.05. In this case, interpreting multiple comparisons is not relevant for any of the features.
Below the histogram of the p-values shows that the p-values are distributed inhomogenously with a high peak at 1.
The healthy or diseased factor seems to affect the expression of two genes: T1157.01 and T106.02. The first one has a higher expression in healthy samples, and the second one has a higher expression in the diseased samples.
Both features can be visualized on the volcano plot:
Features located at the top-left and top-right corners of the chart are labeled. They correspond to the features that go over the biological and statistical significance thresholds (dashed lines).
Notice that the p-values used to compute the –log10(p-values) on the volcano plot are the raw, uncorrected p-values.
Was this article useful?