QUEST classification tree in Excel tutorial
This tutorial will help you set up and interpret a QUEST classification tree in Excel with the XLSTAT software.
Not sure this is the supervised machine learning feature you are looking for? Check out this guide.
Dataset for creating a QUEST classification tree
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the tutorial is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. In particular, all patients here are females at least 21 years old of Pima Indian heritage. The dataset consists of several medical predictor (independent) variables and one target (dependent) variable, Outcome. Independent variables include the number of pregnancies the patient has had, their BMI, insulin level, age, and so on.
Setting up the dialog box to generate a QUEST classification tree in XLSTAT
After opening XLSTAT, select the XLSTAT / Machine Learning / Classification and regression trees command.
Once you've clicked the button, the Classification and regression trees dialog box appears.
In the General tab, select the column Diabetes in the qualitative dependent variable field. The value Yes indicates that the patient has diabetes; otherwise, the value is No. Select the remaining variables as the quantitative explanatory variables. We choose the QUEST algorithm to build the tree. Activate the Variable labels option to take into account the variable names provided in the first row of the dataset.
In the Options/General tab, we choose the Automatic option to find optimal parameters for our model.
Several technical options let you better control the way the tree is built.
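To give an idea of what happens under the hood: the QUEST method selects the split variable by testing each predictor's association with the class, using a one-way ANOVA F statistic for quantitative predictors. The sketch below is purely illustrative (it is not XLSTAT's implementation, and the toy data are invented); it shows how such an F statistic singles out a discriminating variable.

```python
# Illustrative sketch only, not XLSTAT's code: QUEST rates each quantitative
# predictor with a one-way ANOVA F statistic across the classes and splits
# on a variable with strong association.
def anova_f(values, labels):
    """One-way ANOVA F statistic of `values` grouped by class `labels`."""
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(y, []).append(v)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-group sum of squares: spread of the group means
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups.values())
    # Within-group sum of squares: spread inside each group
    ss_within = sum((v - sum(g) / len(g)) ** 2
                    for g in groups.values() for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy glucose values with the diabetes outcome (invented numbers)
glucose = [90, 95, 100, 150, 160, 170]
outcome = ["No", "No", "No", "Yes", "Yes", "Yes"]
print(anova_f(glucose, outcome))  # a large F marks a strong split candidate
```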
In the Charts tab, we select the Bar charts option to display the distribution of the classes at each node.
The computations begin once you have clicked on OK. The results will then be displayed.
Interpreting the results of a QUEST classification tree in XLSTAT
The summary statistics for all variables and the correlation matrix are displayed first, followed by the confusion matrix, which summarizes the reclassification of the observations. The latter lets you quickly see the percentage of well-classified observations, which is the ratio of the number of well-classified observations to the total number of observations. Here, it is equal to 75.0%.
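The well-classified percentage is just the diagonal of the confusion matrix divided by the total. A minimal sketch, with invented counts (not the tutorial's actual confusion matrix):

```python
# Hedged sketch: how the "% well-classified" figure is derived from a
# confusion matrix -- correct predictions (the diagonal) over the total.
# The counts below are made up for illustration.
def accuracy(confusion):
    """confusion[i][j] = observations of true class i predicted as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

matrix = [[450, 50],   # true No : 450 predicted No, 50 predicted Yes
          [142, 126]]  # true Yes: 142 predicted No, 126 predicted Yes
print(f"{accuracy(matrix):.1%}")  # 75.0%
```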
Next, information on the tree structure is provided. For each node, the table shows the number of objects it contains, the corresponding percentage, the test statistic for the split, the purity (the percentage of objects belonging to the dominating category of the dependent variable at that node), the parent and child nodes, the split variable and its value(s) or intervals, and the class predicted by the node.
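Node purity, as reported in this table, is simply the share of a node's observations that belong to its majority class. A small sketch with a hypothetical node:

```python
# Sketch of node purity: the fraction of a node's observations that belong
# to its dominating (majority) class. The node contents are hypothetical.
from collections import Counter

def node_purity(labels):
    counts = Counter(labels)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(labels)

# Hypothetical node holding 8 observations, 6 of class "No"
node = ["No"] * 6 + ["Yes"] * 2
print(f"purity = {node_purity(node):.1%}")  # 75.0%
```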
The following table contains the rules built by the algorithm, written in natural language. At each node, the rule corresponding to the predicted class is displayed. The percentage of observations in the node gives the percentage that corresponds to the predicted category at that node level.
In this way, we see that the rule "If Glucose <= 143.592 then Diabetes = No in 77.1% of cases" is verified by 592 observations (77% of the data set), with a node purity of 76.01%, as shown in the Tree structure table.
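Such a rule can be read directly as a threshold classifier: the cutoff 143.592 comes from the tutorial's tree, and applying the rule is a single comparison. A minimal sketch:

```python
# The rule "If Glucose <= 143.592 then Diabetes = No" applied as a simple
# threshold classifier; the cutoff is the one produced by the tree above.
def predict_diabetes(glucose, threshold=143.592):
    return "No" if glucose <= threshold else "Yes"

print(predict_diabetes(120))  # No
print(predict_diabetes(155))  # Yes
```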
Next, the following table contains all the tested parameters for our model, with the selected parameters shown in bold and their respective cross-validation scores.
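The mechanism behind this table can be sketched as follows: each candidate setting (for example, a maximum tree depth) is scored by k-fold cross-validation, and the best-scoring setting is the one retained. Everything below is a toy illustration; in particular `make_scorer` is an invented stand-in for actually fitting and evaluating a tree.

```python
# Hedged sketch of parameter selection by cross-validation. The scorer is a
# toy stand-in (it pretends accuracy peaks at depth 3), NOT a real tree fit.
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds."""
    fold_size = n // k
    return [list(range(i * fold_size, (i + 1) * fold_size)) for i in range(k)]

def cross_val_score(score_fn, n_obs, k=5):
    """Mean score over k train/test splits."""
    scores = []
    for test_idx in k_fold_indices(n_obs, k):
        test_set = set(test_idx)
        train_idx = [i for i in range(n_obs) if i not in test_set]
        scores.append(score_fn(train_idx, test_idx))
    return sum(scores) / len(scores)

def make_scorer(depth):
    # Hypothetical behaviour: accuracy improves up to depth 3, then overfits.
    def score(train_idx, test_idx):
        return 0.70 + 0.03 * min(depth, 3) - 0.02 * max(depth - 3, 0)
    return score

candidates = [1, 2, 3, 4, 5]  # maximum depths to try
best = max(candidates,
           key=lambda d: cross_val_score(make_scorer(d), n_obs=100))
print(best)  # the depth with the highest mean CV score
```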