This tutorial will help you set up and interpret a Self-Organizing Map (SOM) in Excel using the XLSTAT-R engine.
Self-Organizing Map: an unsupervised Machine Learning method
Self-Organizing Maps are an unsupervised machine learning method developed by Kohonen in the 1980s. They reduce the dimensionality of multivariate data to a low-dimensional space, usually 2 dimensions.
Similar observations are grouped into nodes. The nodes are then arranged on a 2-dimensional map so that similar nodes lie next to one another.
Each node contains information on the number of observations it carries, and on representative values of the different input variables for these observations. These values can be displayed as heatmaps, with one heatmap per variable. Heatmaps sometimes reveal cluster patterns in the nodes, and thus in the underlying observations. Comparing heatmaps can give useful information on what characterizes each cluster of nodes.
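To make the idea of nodes and representative values more concrete, here is a minimal sketch of a SOM training loop in plain Python. It is a toy written for illustration, not the code of the kohonen package: the function names, the rectangular grid, the linear decay of the learning rate and neighborhood radius, and the two made-up groups of observations are all assumptions.

```python
import math
import random


def best_matching_unit(codes, obs):
    """Index of the node whose representative vector is closest to obs."""
    return min(
        range(len(codes)),
        key=lambda k: sum((c - o) ** 2 for c, o in zip(codes[k], obs)),
    )


def train_som(data, xdim, ydim, rlen=100, alpha=(0.05, 0.01), seed=0):
    """Train a toy SOM on a rectangular grid (illustrative only)."""
    rng = random.Random(seed)
    coords = [(x, y) for y in range(ydim) for x in range(xdim)]
    # Start each node on a randomly chosen observation.
    codes = [list(rng.choice(data)) for _ in coords]
    start_radius = max(xdim, ydim) / 2.0
    total_steps = rlen * len(data)
    step = 0
    for _ in range(rlen):  # in this toy, one pass presents every observation once
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            frac = step / total_steps
            rate = alpha[0] + (alpha[1] - alpha[0]) * frac  # learning rate decays
            radius = max(start_radius * (1.0 - frac), 0.5)  # neighborhood shrinks
            bx, by = coords[best_matching_unit(codes, data[i])]
            for k, (cx, cy) in enumerate(coords):
                # Pull the winning node and its grid neighbors toward the observation.
                if math.hypot(cx - bx, cy - by) <= radius:
                    codes[k] = [c + rate * (o - c) for c, o in zip(codes[k], data[i])]
            step += 1
    return codes


# Two well-separated groups of made-up observations.
group_a = [(0.1 * i, 0.1 * i) for i in range(10)]
group_b = [(10 + 0.1 * i, 10 + 0.1 * i) for i in range(10)]
codes = train_som(group_a + group_b, xdim=2, ydim=2)
nodes_a = {best_matching_unit(codes, obs) for obs in group_a}
nodes_b = {best_matching_unit(codes, obs) for obs in group_b}
print("nodes used by group A:", sorted(nodes_a))
print("nodes used by group B:", sorted(nodes_b))
```

Each entry of codes plays the role of the representative values of one node, and the node index returned by best_matching_unit is the equivalent of the class assigned to an observation.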
The som function developed in XLSTAT-R calls the som function from the kohonen package in R (Ron Wehrens and Johannes Kruisselbrink).
Data set for launching a SOM analysis in XLSTAT-R
An Excel sheet with both the data and the results can be downloaded by clicking on the button below:
Download the data
The data correspond to chemical characteristics (compound quantities as well as spectroscopic variables) measured on 177 wine samples from the Piedmont region in Italy [M. Forina, C. Armanino, M. Castino and M. Ubigli. Vitis, 25:189-201 (1986)]. The data are available on the UCI Machine Learning Repository [Lichman, M. (2013). UCI Machine Learning Repository. Irvine, CA: University of California, School of Information and Computer Science.]
Our goal is to quickly gain insight into the data using Self-Organizing Maps.
Setting up a SOM analysis in XLSTAT-R
Open XLSTAT-R / kohonen / Kohonen SOM (som)
In the General tab, select the data under the Data field.
In the Options tab, activate Standardize data so that variables with large variances do not dominate the map. Set Presentation times to 100. This corresponds to the number of iterations the SOM algorithm will perform. The Grid dimensions set the number of nodes along the X axis (xdim) and the Y axis (ydim) of the map. Note that the total number of nodes (xdim*ydim) should always be lower than the number of observations in the data.
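The two checks described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the use of the sample standard deviation is an assumption (XLSTAT-R may scale differently), and the column values are made up.

```python
import statistics


def standardize(columns):
    """Center each variable on 0 and scale it to unit standard deviation."""
    scaled = []
    for col in columns:
        mean = statistics.fmean(col)
        sd = statistics.stdev(col)  # sample standard deviation (assumption)
        scaled.append([(v - mean) / sd for v in col])
    return scaled


def grid_is_reasonable(xdim, ydim, n_obs):
    """True when the map has fewer nodes than there are observations."""
    return xdim * ydim < n_obs


# Made-up values on very different scales (e.g. a percentage and a mg/L quantity).
small_scale = [12.9, 13.2, 14.1, 13.5]
large_scale = [735.0, 1050.0, 1480.0, 1185.0]
for col in standardize([small_scale, large_scale]):
    print([round(v, 2) for v in col])

print(grid_is_reasonable(5, 4, 177))    # 20 nodes for 177 observations
print(grid_is_reasonable(15, 15, 177))  # 225 nodes for 177 observations
```

After standardization both variables vary on the same scale, so neither one dominates the distances used to assign observations to nodes.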
Select a Hexagonal topology. With this topology, each node inside the map has 6 direct neighbors. With the rectangular topology, this number is reduced to 4. Nodes on the border of the map have fewer neighbors in both cases.
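The difference between the two topologies can be checked with a small Python sketch that lists the direct neighbors of a node. The function name and the offset layout used for the hexagonal grid (odd rows shifted half a cell to the right) are assumptions made for this illustration and are not necessarily how the kohonen package stores its grid.

```python
def neighbours(x, y, xdim, ydim, topology="hexagonal"):
    """Direct neighbors of node (x, y) on an xdim-by-ydim grid."""
    if topology == "rectangular":
        offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    elif y % 2 == 0:
        # Even row of a hexagonal grid whose odd rows are shifted to the right.
        offsets = [(1, 0), (-1, 0), (0, -1), (-1, -1), (0, 1), (-1, 1)]
    else:
        # Odd row of the same hexagonal grid.
        offsets = [(1, 0), (-1, 0), (1, -1), (0, -1), (1, 1), (0, 1)]
    return [
        (x + dx, y + dy)
        for dx, dy in offsets
        if 0 <= x + dx < xdim and 0 <= y + dy < ydim
    ]


# Interior node of a 5 x 4 map under each topology.
print(len(neighbours(2, 2, 5, 4, "hexagonal")))
print(len(neighbours(2, 2, 5, 4, "rectangular")))
# Corner node: fewer neighbors because part of the neighborhood falls off the map.
print(len(neighbours(0, 0, 5, 4, "hexagonal")))
```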
Interpretation of a Self-Organizing Map
First, the classification results table shows each observation and the node (Class) to which it belongs.
The Training Progress chart shows whether the SOM algorithm has stabilized over the iterations. If the line is still decreasing at the right end of the chart, the SOM should be re-launched with a higher number of presentation times.
The Codes plot shows normalized values of each variable within each node in the form of fan diagrams. The codes plot is somewhat hard to read in our case because there are many variables.
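The visual check on the training progress line can be approximated numerically. The sketch below is a simple heuristic and not something XLSTAT-R computes: it compares the average of the last few values of the curve with the average of the values just before them. The function name, the window size, the tolerance and the two example curves are all hypothetical.

```python
def has_plateaued(progress, window=10, tolerance=0.01):
    """True when the last `window` values are less than `tolerance`
    (as a fraction) lower than the `window` values that precede them."""
    if len(progress) < 2 * window:
        raise ValueError("need at least 2 * window values")
    recent = sum(progress[-window:]) / window
    previous = sum(progress[-2 * window:-window]) / window
    if previous == 0:
        return True
    return (previous - recent) / previous < tolerance


# Made-up curves: one still falling at iteration 100, one that has flattened out.
still_decreasing = [100.0 - i for i in range(100)]
flattened = [5.0 + 0.8 ** i for i in range(100)]
print(has_plateaued(still_decreasing))
print(has_plateaued(flattened))
```

A curve that fails this kind of check corresponds to the situation described above, where the analysis should be re-launched with more presentation times.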
The Counts plot displays the approximate number of observations contained in each node. Usually we seek to have a Counts plot as homogeneous as possible.
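The numbers behind a Counts plot can be recomputed from the Class column of the classification table. The Python sketch below assumes that nodes are numbered from 1 and uses a made-up list of classes; the function names are hypothetical.

```python
from collections import Counter


def node_counts(classes, n_nodes):
    """Number of observations in each node, for nodes numbered 1..n_nodes."""
    counts = Counter(classes)
    return [counts.get(node, 0) for node in range(1, n_nodes + 1)]


def summarize(counts):
    """Simple indicators of how evenly observations are spread over the nodes."""
    return {
        "empty_nodes": sum(1 for c in counts if c == 0),
        "min": min(counts),
        "max": max(counts),
    }


# Made-up Class column for six observations on a map with four nodes.
classes = [1, 1, 2, 4, 4, 4]
counts = node_counts(classes, n_nodes=4)
print(counts)
print(summarize(counts))
```

Many empty nodes, or a large gap between the smallest and largest counts, indicate a map that is far from homogeneous.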
Finally, the Variables influence plot (heatmaps) shows the normalized values of each variable within each node of the map. Here, we see a cluster of nodes in the top-left corner with low alcohol and proline but high ash alkalinity. The bottom-right clusters are characterized by a high OD280/OD315 ratio and a low concentration of nonflavanoid phenols.
Note that the overall chart dimensions can be customized in the XLSTAT-R XML file for the som function. The file is usually stored at %AppData%\ADDINSOFT\XLSTAT\XLSTAT-R\groups\kohonen.
Open the file using a code editor (e.g. Notepad++). Locate the line that generates the chart and modify the rplotwidth and rplotheight arguments. Example: <Result text="Variables influence plot" chartname="properties" charttype="r" rplotformat="emf" rplotwidth="10" rplotheight="10"…
Save the XML file. Go to XLSTAT, open the XLSTAT-R menu, click on Refresh and re-launch the analysis to obtain the chart in the new dimensions.