K Nearest Neighbors (KNN) in Excel tutorial
This tutorial will help you set up and interpret a K Nearest Neighbors (KNN) machine learning analysis in Excel with the XLSTAT software. Not sure this is the supervised machine learning feature you are looking for? Check out this guide.
Dataset for running K Nearest Neighbors Classification
This dataset is a subset of the dataset proposed by Dr. William H. Wolberg (University of Wisconsin Hospitals, Madison). Each row corresponds to a tissue sample described by 9 variables (columns C-K) measured on patients suffering from benign or malignant breast cancer (class defined in column B). The complete dataset can be found here.
Goal of this tutorial
The goal of this tutorial is to use the K Nearest Neighbors (KNN) classification method to determine whether a mammary tissue is benign or malignant. We will use the first 100 observations as a learning dataset, and the last 20 observations as a prediction dataset. The cancer class was therefore deliberately removed from the last 20 observations.
The K Nearest Neighbors classification is performed using the following explanatory learning variables:
- Clump Thickness: Benign cells tend to be grouped in monolayers, while cancerous cells are often grouped in multilayers.
- Uniformity of Cell Size: Cancer cells tend to vary in size.
- Uniformity of Cell Shape: Cancer cells tend to vary in shape.
- Marginal Adhesion: Normal cells tend to stick together, while cancer cells tend to lose this ability. Loss of adhesion is therefore a sign of malignancy.
- Single Epithelial Cell Size: Related to the uniformity mentioned above. Epithelial cells that are significantly enlarged may be malignant.
- Bare Nuclei: Nuclei that are not surrounded by cytoplasm (the rest of the cell). These are typically seen in benign tumors.
- Bland Chromatin: Describes a uniform "texture" of the nucleus seen in benign cells. In cancer cells the chromatin tends to be coarser.
- Normal Nucleoli: Nucleoli are small structures seen in the nucleus. In normal cells the nucleolus is usually very small, if visible at all. In cancer cells the nucleoli become more prominent, and sometimes there are more of them.
- Mitosis: Nuclear division plus cytokinesis, which produces two identical daughter cells.
The dependent variable corresponds to the class of an observation (benign or malignant).
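Before moving to the XLSTAT dialog boxes, it may help to see the core of the method in code. The sketch below is a minimal plain-Python version of KNN classification, using made-up two-feature samples (not the Wolberg dataset): each new observation is assigned the majority class among its k closest training observations.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Rank training observations by Euclidean distance to the query.
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy stand-in for the tissue data (hypothetical values, not the real dataset):
# two features per sample, labelled "Benign" or "Malignant".
train_X = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
train_y = ["Benign", "Benign", "Benign", "Malignant", "Malignant", "Malignant"]

print(knn_predict(train_X, train_y, (2, 2), k=3))  # -> Benign
print(knn_predict(train_X, train_y, (8, 7), k=3))  # -> Malignant
```

In the tutorial's setting, each row of columns C-K plays the role of a feature vector and column B plays the role of the label.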
Setting up a K Nearest Neighbors Classification in XLSTAT
After opening XLSTAT, select the XLSTAT / Machine Learning / K nearest Neighbors command.
The K Nearest Neighbors dialog box appears.
Select the classes of the learning set in the Y / Qualitative variable field. The explanatory variables related to the learning set should be selected in the X / Explanatory variables / quantitative field.
In the Options tab, several algorithm parameters can be selected and modified. For example, it is possible to specify whether Kernel or Distance functions are used to compute similarities (General tab).
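To give an idea of what such a choice changes, here is a sketch of one common variant: distance-weighted voting, where closer neighbors contribute larger votes instead of each neighbor counting equally. This is an illustration of the general idea, not XLSTAT's exact implementation, and the data are hypothetical.

```python
import math
from collections import defaultdict

def weighted_knn_predict(train_X, train_y, query, k=3):
    """Distance-weighted KNN: closer neighbors contribute larger votes."""
    nearest = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )[:k]
    scores = defaultdict(float)
    for dist, label in nearest:
        # Inverse-distance weighting; the epsilon avoids division by zero
        # when the query coincides with a training point.
        scores[label] += 1.0 / (dist + 1e-9)
    return max(scores, key=scores.get)

# Hypothetical two-feature samples (not the tutorial dataset).
train_X = [(1, 1), (2, 2), (7, 7), (8, 8), (9, 9)]
train_y = ["Benign", "Benign", "Malignant", "Malignant", "Malignant"]
print(weighted_knn_predict(train_X, train_y, (3, 3), k=3))  # -> Benign
```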
In the Neighbors tab, the number of neighbors can be set manually (User defined) or automatically (Automatic); in the latter case, XLSTAT determines the optimal number within a range of values.
In the Prediction tab, select the variables associated with the 20 observations on the right of the dataset.
The computations begin once you have clicked on OK.
Interpreting the results of a K nearest Neighbors Classification in XLSTAT
The first table displays statistics for the explanatory variables in the learning set, and the second table displays statistics for the prediction set.
The next table gives the prediction errors estimated through cross-validation for each number of neighbors between the lower and upper bounds, with its associated chart.
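The idea behind this table can be sketched as follows: for each candidate k, predict every learning observation from all the others (leave-one-out cross-validation) and record the error rate, then keep the k with the lowest error. The code below is a minimal illustration on hypothetical data, not XLSTAT's exact procedure.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k nearest training points."""
    neighbors = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], query)
    )[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def loo_error(train_X, train_y, k):
    """Leave-one-out cross-validation error rate for a given k."""
    errors = 0
    for i in range(len(train_X)):
        # Predict observation i from all the other observations.
        rest_X = train_X[:i] + train_X[i + 1:]
        rest_y = train_y[:i] + train_y[i + 1:]
        if knn_predict(rest_X, rest_y, train_X[i], k) != train_y[i]:
            errors += 1
    return errors / len(train_X)

# Hypothetical data; XLSTAT applies the same idea to the 100 learning rows.
train_X = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
train_y = ["Benign"] * 4 + ["Malignant"] * 4
errors = {k: loo_error(train_X, train_y, k) for k in range(1, 6)}
best_k = min(errors, key=errors.get)
print(errors, best_k)
```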
For the prediction, we will use 3 neighbors for our model.
Class predictions of the prediction set are displayed first per class…
…Then per observation:
Given the data, the model predicts that: - Observations 3, 9, 11, 15, 16, 17, 19 and 20 have a benign cancer.
- The other 12 observations have a malignant cancer.
After that, for each tracked observation, the classes of the neighbors and their respective distances to the tracked observation are summarized in the following table. Below, we show the tracking information corresponding to the first three observations:
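The information in this tracking table can be reproduced in a few lines: for a given observation, list its k nearest training observations together with their distances and classes. A minimal sketch on hypothetical data:

```python
import math

def track_neighbors(train_X, train_y, query, k=3):
    """Return the k nearest training observations as (distance, class) pairs,
    mirroring the per-observation tracking table."""
    return sorted(
        (round(math.dist(x, query), 3), label)
        for x, label in zip(train_X, train_y)
    )[:k]

# Hypothetical samples (not the tutorial data).
train_X = [(1, 1), (2, 1), (8, 8), (9, 9)]
train_y = ["Benign", "Benign", "Malignant", "Malignant"]
print(track_neighbors(train_X, train_y, (2, 2), k=3))
```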