Extreme Gradient Boosting (XGBOOST) in Excel
This tutorial will show you how to set up and train an XGBOOST classifier in Excel using the statistical software XLSTAT.
Dataset for setting up a Gradient Boosting model (XGBOOST)
The dataset used in this tutorial is extracted from the data science platform, Kaggle, and might be accessed at this address.
The “Banknotes” dataset comprises a list of 200 banknotes with some information: This dataset contains 6 variables; One is qualitative and concerns the banknotes' authenticity and the others are quantitative and related to banknotes' shapes.
Counterfeit: in case the banknote is genuine we put “0”, in the contrary case the banknote is counterfeit we put “1”. 100 banknotes are counterfeits when the 100 others are genuine in this dataset. Length, Left, Right, Bottom, Top are quantitative variables.
Setting up an XGBOOST model with XLSTAT
Select the XLSTAT/ Machine learning / Extreme Gradient Boosting. The dialog box pops up.
In the Response variable field, select the "Counterfeit" variable.
Set the Response type to binary because the response variable has only two distinct values.
In the quantitative explanatory variables, select all the remaining variables in the dataset.
In the Options tab, many parameters are available to set up the model. Here we let the default parameters.
In the Validation tab, choose to keep 30 observations at random to be able to test the model performance on new data.
Go to the Outputs tab and activate the confusion matrix, Variable importance with importance type set to Gain and Results by objects to display the corresponding results.
Click on OK to perform the calculations and display the results.
Interpret results from XGBOOST
The misclassification rates indicate how well the model performs both on the training and validation set. Here, the misclassification rate is 1.2% for the training and 10% for the validation set.
The confusion matrix for the training sample is then displayed in the report. This table shows the percentage of observations that were well classified for each modality (true positives and true negatives). For example, we can see that the observations of modality 0 (genuine) were well-classified at 98.78% while the observations of modality 1 (counterfeit ) were well-classified at 98.86%.
The confusion plot allows synthetically visualizing this table. The gray squares on the diagonal represent the observed numbers for each modality. The orange squares represent the predicted numbers for each modality. Thus, we can see that the surfaces of the squares completely overlap for the two modalities (81 well-predicted observations out of 82 observed observations for modality 0 and 87 well-predicted observations out of 88 observed observations for modality 1).
The Results by object table contains, for each observation, the real class, the predicted class, and the probability of belonging to the different categories of the response variable.
Next, we can look at the Variable importance table chart to see which variables are most important in predicting banknote authenticity.
The Gain metric corresponds to the relative contribution of a feature to the model. A higher value of this metric when compared to another feature implies it is more important for generating a prediction. For example, we can see that the most important feature in predicting authenticity here is the feature BOTTOM. Therefore, we may suggest that there is a link between the banknote authenticity and the Bottom margin width.
Conclusion of this XGBOOST classification with XLSTAT
In this example, we obtain a rate of well-classified observations of 97.65% and identify that the most essential characteristic that verifies authenticity is the lower margin of the banknote. Also, the validation phase with a rate of well-classified observations of 90% allows us to verify that the model generalizes to new data.
Was this article useful?