Naive Bayes classification in Excel tutorial
This tutorial will help you set up and interpret a Naive Bayes classification in Excel using the XLSTAT software. Not sure this is the supervised machine learning feature you are looking for? Check out this guide .
Dataset for setting up a Naive Bayes Classifier in Excel with XLSTAT
This tutorial uses a dataset made available by the Center for Machine Learning and Intelligent Systems. Their Machine Learning Repository is accessible at this address, and gathers many insightful datasets related to Machine Learning.
Goal of this tutorial
The Naive Bayes classifier is a supervised machine learning algorithm that allows you to classify a set of observations according to a set of rules determined by the algorithm itself. This classifier has first to be trained on a training dataset that shows which class is expected for a set of inputs. During the training phase, the algorithm elaborates the classification rules on this training dataset that will be used in the prediction phase to classify the observations of the prediction dataset.
In this tutorial, we will use a dataset entitled Zoo database that has been created by Richard Forsyth in 1990 to illustrate its PC-Beagle program. It contains a list of 101 animals in rows and their associated attributes described in 17 distinct qualitative variables (columns): hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize.
All but one of these variables are boolean values taking a value of 1 when the corresponding attribute is observed for the animal under consideration, such as a tail or teeth, and 0 otherwise. The remaining variable, the legs attribute, takes one value among 0, 2, 4, 5, 6 and 8.
Finally, the 18th column is an integer value ranging from 1 to 7 that gives the type or subgroup to which the animal belongs.
This type value is the class we want our Naive Bayes classifier to predict. The dataset will then be divided into 2 subgroups. The first one will contain the 94 first rows and will be used to train the classifier. The second one will gather only 7 observations on which we will make our prediction.
Setting up the Naive Bayes Classifier in XLSTAT
After opening XLSTAT, select the XLSTAT / Machine Learning / Naive Bayes classifier command.
The Naive Bayes classifier dialog box appears.
First, select the output class of the training set in the Y / Qualitative variables field. In our case, the output class is the type of animal listed in the 18th column of the dataset.
As mentioned above, only the first 94 rows are used as a training dataset, the selection has to be made accordingly.
Next, the X / Explanatory variables should be selected. In our case, we are using qualitative variables only. The Qualitative checkbox should be activated and the 17 attributes of our training set selected.
Then we should select the prediction dataset which is made up of the 7 animals at the bottom of the list.
In the Option tab, you may choose between several parametric distributions if you are using quantitative data or use an empirical distribution to estimate the conditional probabilities. For qualitative data however, only the empirical distribution makes sense and the distribution selection is therefore deactivated as shown in the figure below.
In order to make your classifier more robust with qualitative variables when classifying new observations, you might want to apply a Laplace Smoothing by setting the Smoothing parameter to an integer value different from 0. In our case, we will set this value to 1.
Finally, we activate all 7 outputs in the Output tab, as shown below.
The computations begin once you have clicked on OK.
Interpreting the results of a Naive Bayes classification in XLSTAT
The first two tables display the observed frequency and relative frequency distributions of the output class and the attributes in the training set.
We can see in the first table that the type 1 class of animals is by far the most frequent one in the training dataset with 41.935%. In the next table shown below, we can see that there were no instances of 5 legged animal in the training set.
However, we can see in the next table that shows the observed frequency distributions in the prediction set that a 5 legged animal is included in the prediction set.
It will be interesting to see how our classifier reacts to this new modality that wasn't present in the training set.
But first, let's have a look to the performance of our classifier on the training dataset with the two confusion matrices displayed next. We can see on the picture below, the classifier presents a high rate of true positive results on both matrices which is very encouraging.
Finally, the classification for our prediction set is displayed via two tables as shown below:
Results are given by class in the first table and for each observation in the second tab. We can see that the observation labeled PredObs4 is affected a class of type 7. This is the 5 legged animal we talked about just a little earlier. It is a starfish and it indeed belongs to the type 7, together with the crab, the clam and the crayfish. Therefore, thanks to the Laplace smoothing, our classifier successfully labeled this observation.
Finally, the posterior probability for each class is given for each observation of the prediction set:
Was this article useful?