This tutorial will help you set up and interpret a CHAID classification tree in Excel with the XLSTAT software.
Not sure this is the supervised machine learning feature you are looking for? Check out this guide.
Dataset for creating a CHAID classification tree
An Excel sheet with both the data and the results can be downloaded by clicking on the button below:
Download the data
The data are from [Fisher M. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179 -188] and correspond to 150 Iris flowers, described by four variables (sepal length, sepal width, petal length, petal width) and their species. Three different species have been included in this study: setosa, versicolor and virginica.
Iris setosa, versicolor and virginica.
Goal of this CHAID classification tree
Our goal is to test if the four descriptive variables allow to efficiently predict to which species a flower corresponds, and in this case, to identify rules that would help classifying the flowers on the basis of the four variables.
Note: the same case is treated in the tutorial on discriminant analysis.
Setting up the dialog box to generate a CHAID classification tree
After opening XLSTAT, select the XLSTAT / Machine Learning / Classification and regression trees command.
Once you've clicked the button, the dialog box appears. The qualitative dependent variable corresponds here to the "Species" variable.
The quantitative Explanatory variables are the four descriptive variables.
We choose to use the CHAID algorithm and we set the maximum tree depth to 3 to avoid obtaining a too complex tree.
In the Options tab, several technical options allow to better control the way the tree is built.
In Charts tab we first select the Bar charts option to display the distribution of the species at each node.
As we will see later, the Pie charts option is also being used in this tutorial.
The computations begin once you have clicked on OK. The results will then be displayed.
Interpreting the results of a CHAID classification tree
Below the simple statistics for all the selected variables, XLSTAT displays information on the tree structure. This includes for each node, the p-value for the splitting, the number of objects at each node, the corresponding % the parent and son nodes, the split variable, the value(s) or intervals of the latter, and the purity that indicates what is the % of objects that belong to the dominating category of the dependent variable at this node.
The next result displayed is a part of the classification tree.
This diagram allows to visualize the successive steps during which the CHAID algorithm identifies the variables that allow to best split the categories of the dependent variable. Thus, we see that using only the petal length, the algorithm has found a rule that allows to perfectly separate the Iris flowers of the setosa species. If the petal length is between 10 and 24.5 then the species is setosa.
The information available at each node is explained below.
The algorithm stops when no additional rule can be found, or when one of the limits set by the user are reached (number of objects at a parent or son node, maximum tree depth, threshold p-value for splitting).
XLSTAT offers a second possibility to visualize the classification trees. Instead of using bar charts, it uses pie charts. The latter are easier to read when they are many nodes and many categories for the dependent variable. The inner circle of the pie corresponds to the relative frequencies of the categories to which the objects contained in the node correspond. The outer ring shows the distribution of the categories at the parent node.
The following table contains the rules built by the algorithm in a less visual but more readable way: the rules are written in natural language. The purity gives the % that corresponds to the each category at a specific node level. The number of objects corresponding to the category is also displayed.
In this way, we see that "If PETAL LENGHT is in the interval [24,5; 39,5[ then SPECIES is Versicolor in 100% of cases" this rule is verified by 11 flowers.
The rules that correspond to the leaves of the tree (the terminal nodes) allow to compute predictions for each observation, with a probability that depends on the distribution of the categories at the leaf level. These results are displayed in the "Results by object" table.
We see that 3 observations have been miss-classified by the algorithm. This result is almost identical to what is obtained with a discriminant analysis where the miss-classified observations are 5, 9, 12.
The confusion matrix summarizes the reclassification of the observations and allows to quickly see the % of well-classified observations, which is the ratio of the number of observations that have been well classified over the total number of observations. It is here equal to 98%.