Naive Bayes classification in Excel tutorial

This tutorial will help you set up and interpret a Naive Bayes classification in Excel using the XLSTAT software.
Not sure this is the supervised machine learning feature you are looking for? Check out this guide.

Dataset for setting up a Naive Bayes Classifier in Excel with XLSTAT

This tutorial uses a dataset made available by the Center for Machine Learning and Intelligent Systems. Their Machine Learning Repository is accessible at this address, and gathers many insightful datasets related to Machine Learning.

Goal of this tutorial

The Naive Bayes classifier is a supervised machine learning algorithm that classifies a set of observations according to rules determined by the algorithm itself.
The classifier must first be trained on a training dataset that shows which class is expected for each set of inputs.
During the training phase, the algorithm derives classification rules from this training dataset. In the prediction phase, it applies these rules to classify the observations of the prediction dataset.
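
The two phases described above can be illustrated with a minimal Python sketch. It is not XLSTAT's implementation: the function names `train` and `predict` and the toy data are assumptions made for illustration, and no smoothing is applied. Training only counts how often each class and each attribute value occur; prediction multiplies the class frequency by the per-attribute conditional frequencies and keeps the class with the highest score.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Training phase: count classes and per-class attribute values."""
    class_counts = Counter(labels)
    # value_counts[(class, attribute index)][value] -> number of training rows
    value_counts = defaultdict(Counter)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts


def predict(row, class_counts, value_counts):
    """Prediction phase: pick the class with the highest naive Bayes score."""
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(class)
        for j, value in enumerate(row):
            # empirical P(value | class), without smoothing
            score *= value_counts[(label, j)][value] / n_label
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: three Boolean attributes (hair, feathers, milk)
rows = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 0)]
labels = [1, 1, 2, 2]

class_counts, value_counts = train(rows, labels)
predicted = predict((1, 0, 1), class_counts, value_counts)
```

In this sketch, a single attribute value that was never seen with a class drives that class's score to zero, which is why smoothing matters for qualitative data.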

In this tutorial, we will use a dataset entitled Zoo database, created by Richard Forsyth in 1990 to illustrate his PC-Beagle program.
It contains a list of 101 animals in rows. The first column gives the animal name, and the next 16 columns hold its qualitative attributes: hair, feathers, eggs, milk, airborne, aquatic, predator, toothed, backbone, breathes, venomous, fins, legs, tail, domestic, catsize.

All but one of these variables are Boolean: they take the value 1 when the corresponding attribute, such as a tail or teeth, is observed for the animal under consideration, and 0 otherwise.
The remaining variable, the legs attribute, takes one value among 0, 2, 4, 5, 6 and 8.

Finally, the 18th column is an integer value ranging from 1 to 7 that gives the type or subgroup to which the animal belongs.

This type value is the class we want our Naive Bayes classifier to predict. The dataset is divided into 2 subgroups. The first one contains the first 94 rows and is used to train the classifier. The second one gathers the remaining 7 observations, on which we will make our prediction.
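
Outside Excel, the same split amounts to slicing the rows in order. The sketch below assumes the rows are already loaded into a Python list; `split_rows` is a hypothetical helper and the rows are placeholders, not the actual Zoo records.

```python
def split_rows(rows, n_train):
    """Return the first n_train rows and the remaining rows, keeping order."""
    return rows[:n_train], rows[n_train:]


# Placeholder rows standing in for the 101 animals of the dataset
rows = [("animal_%d" % i,) for i in range(101)]

training, prediction = split_rows(rows, 94)
```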

Setting up the Naive Bayes Classifier in XLSTAT

After opening XLSTAT, select the XLSTAT / Machine Learning / Naive Bayes classifier command.

The Naive Bayes classifier dialog box appears.

First, select the output class of the training set in the Y / Qualitative variables field. In our case, the output class is the type of animal listed in the 18th column of the dataset.

As mentioned above, only the first 94 rows are used as a training dataset, so the selection has to be made accordingly.

Next, the X / Explanatory variables should be selected. In our case, we are using qualitative variables only. The Qualitative checkbox should be activated and the 17 attributes of our training set selected.

Then we select the prediction dataset, which is made up of the 7 animals at the bottom of the list.

In the Option tab, if you are using quantitative data, you may either choose among several parametric distributions or use an empirical distribution to estimate the conditional probabilities.
For qualitative data, however, only the empirical distribution makes sense, and the distribution selection is therefore deactivated as shown in the figure below.

To make your classifier more robust when classifying new observations with qualitative variables, you might want to apply Laplace smoothing by setting the Smoothing parameter to a nonzero integer value.
In our case, we will set this value to 1.
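
To see what the smoothing parameter changes, here is a small sketch of the usual additive (Laplace) smoothing formula, (count + alpha) / (class total + alpha × number of possible values). The source does not state the exact formula XLSTAT applies, so this is an assumption; the function name and the counts are hypothetical.

```python
def smoothed_probability(count, class_total, n_values, alpha=1.0):
    """Additive-smoothed estimate of P(attribute = value | class)."""
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical counts: a class with 10 training animals, none of which has
# 5 legs; the legs attribute can take 6 distinct values (0, 2, 4, 5, 6, 8).
unsmoothed = smoothed_probability(0, 10, 6, alpha=0.0)
smoothed = smoothed_probability(0, 10, 6, alpha=1.0)
```

Without smoothing the estimate is exactly zero, which would rule the class out for any 5-legged animal regardless of its other attributes. With a smoothing value of 1 the estimate becomes small but nonzero.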

Finally, we activate all 7 outputs in the Output tab, as shown below.

The computations begin once you have clicked on OK.

Interpreting the results of a Naive Bayes classification in XLSTAT

The first two tables display the observed frequency and relative frequency distributions of the output class and the attributes in the training set.

We can see in the first table that type 1 is by far the most frequent class of animals in the training dataset, accounting for 41.935% of the observations.
In the next table shown below, we can see that there were no instances of 5-legged animals in the training set.
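
Frequency tables of this kind are simple to reproduce. The sketch below uses Python's `collections.Counter`; the function name `frequency_table` and the sample values are hypothetical and are not taken from the Zoo dataset.

```python
from collections import Counter


def frequency_table(values):
    """Map each observed value to (count, relative frequency in percent)."""
    counts = Counter(values)
    total = len(values)
    return {v: (n, 100.0 * n / total) for v, n in sorted(counts.items())}


# Hypothetical class labels and legs values for a tiny training set
types = [1, 1, 1, 2, 2, 4, 7]
legs = [4, 4, 2, 2, 0, 6, 8]

type_table = frequency_table(types)
legs_table = frequency_table(legs)
```

A value that never occurs in the training set, such as 5 legs here, simply does not appear in the table. That is the case smoothing is meant to handle.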