This tutorial will help you set up and interpret a C&RT classification tree in Excel with the XLSTAT software.
Not sure this is the supervised machine learning feature you are looking for? Check out this guide.
Dataset for creating a C&RT classification tree
An Excel sheet with both the data and the results can be downloaded by clicking on the button below:
Download the data
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the tutorial is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Setting up the dialog box to generate a C&RT classification tree
After opening XLSTAT, select the XLSTAT / Machine Learning / Classification and regression trees command.
Once you've clicked on the command, the Classification and regression trees dialog box appears.
In the General tab, select the column Diabetes in the qualitative dependent variable field. The value Yes indicates that the patient has diabetes. Otherwise, the value is No. Select the remaining variables as the quantitative explanatory variables. We choose to use the C&RT algorithm to build the tree. Select the Variable labels option so that the variable names provided in the first row of the dataset are taken into account.
In the Options tab, we set the maximum tree depth to 3 to avoid obtaining an overly complex tree, and we fix the value of the complexity parameter (CP) at 0,001. This means that the construction of the tree does not continue unless a split reduces the overall impurity by at least a factor of CP.
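To make the CP rule concrete, here is a minimal Python sketch (an illustration of the idea, not XLSTAT's internal code): a node's impurity is measured with the Gini index, and a candidate split is kept only if the impurity reduction it achieves, relative to the root impurity, is at least CP. Function names and the toy data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_accepted(parent, left, right, root_impurity, cp=0.001):
    """Accept a split only if the relative impurity reduction is >= CP.

    Simplified illustration of the complexity-parameter rule; XLSTAT's
    exact internal criterion may differ in detail.
    """
    n = len(parent)
    weighted_children = (len(left) * gini(left) + len(right) * gini(right)) / n
    reduction = gini(parent) - weighted_children
    return reduction / root_impurity >= cp

# Toy example: splitting a mixed node into two pure children
parent = ["No"] * 6 + ["Yes"] * 4
left, right = ["No"] * 6, ["Yes"] * 4
print(split_accepted(parent, left, right, root_impurity=gini(parent)))  # True
```

With CP = 0,001 almost any useful split passes; raising CP prunes the tree more aggressively.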
Several technical options give finer control over the way the tree is built.
In the Charts tab, we select the Bar charts option to display the distribution of the outcome categories at each node.
The computations begin once you have clicked on OK. The results will then be displayed.
Interpreting the results of a C&RT classification tree
The summary statistics for all variables and the correlation matrix are displayed first, followed by the confusion matrix, which summarizes the reclassification of the observations. The latter lets you quickly see the % of well-classified observations, which is the ratio of the number of well-classified observations to the total number of observations. Here, it is equal to 77,3%.
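That percentage can be recomputed directly from the confusion matrix: it is the sum of the diagonal counts (correct predictions) divided by the total number of observations. The counts below are made up for illustration and are not the tutorial's actual matrix:

```python
def accuracy(confusion):
    """% of well-classified observations from a confusion matrix.

    confusion[i][j] = number of observations of true class i
    predicted as class j; the diagonal holds the correct predictions.
    """
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return 100.0 * correct / total

# Hypothetical 2x2 matrix (rows: true No/Yes, columns: predicted No/Yes)
matrix = [[430, 70],
          [104, 164]]
print(round(accuracy(matrix), 1))  # 77.3
```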
Next, information on the tree structure is provided. For each node, the table shows: the number of objects at the node and the corresponding %; the improvement, which corresponds to the number of observations at the node multiplied by the reduction in impurity caused by the split of the node; the purity, which indicates the % of objects that belong to the dominant category of the dependent variable at the node; the parent and child nodes; the split variable and its value(s) or intervals; and the class predicted by the node.
The following table contains the rules built by the algorithm, written in natural language. At each node, the rule corresponding to the predicted class is displayed. The % of observations in the node gives the proportion that corresponds to the predicted category at that node.
In this way, we see that the rule "If Glucose <= 127.5 then Diabetes = No in 63,2% of cases" is verified by 485 observations (63% of the data set) with a node purity of 80,6%, as shown in the Tree structure table.
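The split value itself (here 127.5) comes from an exhaustive search: for a numeric variable, C&RT evaluates the midpoints between consecutive observed values and keeps the threshold giving the largest impurity reduction. A toy sketch of that search, with made-up glucose readings (not the tutorial's data):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, impurity_reduction) for the best binary split
    on a single numeric variable. Illustrative only."""
    parent_gini, n = gini(labels), len(labels)
    ordered = sorted(values)
    best = (None, 0.0)
    for t in sorted(set((a + b) / 2 for a, b in zip(ordered, ordered[1:]))):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        if not left or not right:
            continue
        reduction = parent_gini - (len(left) * gini(left)
                                   + len(right) * gini(right)) / n
        if reduction > best[1]:
            best = (t, reduction)
    return best

# Hypothetical glucose readings and outcomes
glucose = [90, 105, 120, 135, 150, 170]
outcome = ["No", "No", "No", "Yes", "Yes", "Yes"]
t, red = best_threshold(glucose, outcome)
print(t)  # 127.5
```

In this toy data the midpoint between 120 and 135 separates the classes perfectly, so 127.5 maximizes the impurity reduction.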
The next result is a part of the classification tree.
This diagram lets you visualize the successive steps during which the C&RT algorithm identifies the variables that best separate the categories of the dependent variable.
The algorithm stops when no additional rule can be found, or when one of the limits set by the user is reached (minimum number of objects at a parent or child node, maximum tree depth, or an impurity reduction below the user-defined CP threshold).
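These stopping rules can be summarized in a few lines of Python (parameter names are hypothetical, chosen to mirror the dialog box options rather than XLSTAT's internals):

```python
def should_stop(n_node, depth, reduction, root_impurity,
                max_depth=3, min_size=2, cp=0.001):
    """Stop splitting a node if any user-set limit is reached."""
    return (depth >= max_depth                      # maximum tree depth
            or n_node < min_size                    # too few objects at the node
            or reduction / root_impurity < cp)      # improvement below CP

print(should_stop(n_node=100, depth=3, reduction=0.2, root_impurity=0.5))  # True
print(should_stop(n_node=100, depth=1, reduction=0.2, root_impurity=0.5))  # False
```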
XLSTAT offers a second way to visualize classification trees: pie charts instead of bar charts. Pie charts are easier to read when there are many nodes and many categories for the dependent variable.