This tutorial will help you set up and interpret a K Nearest Neighbors (KNN) machine learning analysis in Excel with the XLSTAT software.
Not sure this is the supervised machine learning feature you are looking for? Check out this guide.
Dataset for running K Nearest Neighbors Classification
An Excel sheet with both the data and results used in this tutorial can be downloaded by clicking here.
This dataset is a subset of the dataset proposed by Dr. William H. Wolberg (University of Wisconsin Hospitals, Madison). Each row corresponds to a tissue sample described by 9 variables (columns C-K) measured on patients suffering from benign or malignant breast cancer (class defined in column B). The complete dataset can be found here.
Goal of this tutorial
The goal of this tutorial is to use the K nearest Neighbors (KNN) classification method to determine whether a mammery tissue is benign or malignant. We will use the 100 first observations as a learning dataset, and the 20 last observations as a prediction data set. Thus, cancer class was removed on purpose in the 20 last observations.
The K Nearest Neighbors classification is performed using the following explanatory learning variables:
1. Clump Thickness: Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
2. Uniformity of Cell Size: Cancer cells tend to vary in size.
3. Uniformity of Cell Shape: Cancer cells tend to vary in shape.
4. Marginal Adhesion: Normal cells tend to stick together. Cancer cells tends to lose this ability. So loss of adhesion is a sign of malignancy.
5. Single Epithelial Cell Size: Is related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be a malignant cell.
6. Bare Nuclei: Nuclei that is not surrounded by cytoplasm (the rest of the cell). Those are typically seen in benign tumors.
7. Bland Chromatin: Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tend to be more coarse
8. Normal Nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.
9. Mitosis: Nuclear division plus cytokinesis which produces two identical cell daughters
The dependent variable corresponds to the class of an observation (benign or malignant).
Setting up a K Nearest Neighbors Classification in XLSTAT
After opening XLSTAT, select the XLSTAT / Machine Learning / K nearest Neighbors command.
The K Nearest Neighbors dialog box appears.
Select the classes of the learning set in the Y / Qualitative variables field. The explanatory variables related to the learning set should be selected in the X / Explanatory variables / quantitative field. In the prediction set field, select the variables associated to the 20 observations at the bottom of the dataset. Set the number of neighbors to 3.
In the Option tab, several algorithm parameters can be selected and modified. For example, it is possible to specify whether to use kernel or distance functions to compute similarities.
The computations begin once you have clicked on OK.
Interpreting the results of a K nearest Neighbors Classification in XLSTAT
The second table displays statistics related the explanatory variables in the learning set, and the third one to the prediction set.
The error rate estimated through cross validation is 0.04, which is good.
Class predictions of the prediction set is displayed first per class…
…Then per observation:
Then, for each tracked observation, the classes of the neighbors and their respective distances to the tracked observation is summarized in the following table. Below, we show the tracking information corresponding to the first two observations: