Training a Support Vector Machine (SVM) in Excel
This tutorial will help you set up and train a Support Vector Machine (SVM) classifier in Excel using the XLSTAT statistical software.
Dataset for training an SVM classifier
The dataset used in this tutorial is extracted from the Machine Learning competition entitled "Titanic: Machine Learning from Disaster" on Kaggle, the well-known data science platform.
The Titanic dataset can be accessed at this address. It refers to the sinking of the RMS Titanic in 1912. During this tragedy, more than 1,500 of the 2,224 passengers and crew lost their lives due to an insufficient number of lifeboats.
The dataset is made up of a list of 1,209 passengers together with the following information:
- survived: Survival (0 = No; 1 = Yes)
- pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- name: Name
- sex: Gender (male; female)
- age: Age
- sibsp: Number of Siblings/Spouses Aboard
- parch: Number of Parents/Children Aboard
- fare: Passenger Fare
- cabin: Cabin
- embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
Goal of this tutorial
The goal of this tutorial is to learn how to set up and train an SVM classifier on the Titanic dataset and to see how well the classifier performs on a validation set.
Setting up an SVM classifier
To set up an SVM classifier, click on Machine Learning / Support Vector Machine as shown below. Once you have clicked on the button, the dialog box appears. Select the data on the Excel sheet.
In the Response variable field, select the binary variable we want to predict when classifying the data. In our case, this is the column giving the survival information.
We also select both quantitative and qualitative explanatory variables by checking both checkboxes as shown below. In the quantitative field, we select the columns corresponding to the following fields:
- age
To select multiple columns, you may use the Ctrl key.
In the qualitative field, we select the columns with qualitative information:
- pclass
As the name of each variable is present at the top of the table, we must check the Variable labels checkbox. In the Options tab, the classifier parameters must be set up.
For the SMO parameters, we will keep the default options. The C field corresponds to the regularization parameter: it controls how much misclassification you are willing to allow during the optimization. A large value of C means a strong penalty on each misclassified observation. In our case, we set the value of C to 1.
The tolerance parameter is a stopping criterion. If you want to speed up calculations, you can increase the tolerance parameter. We leave the tolerance at its default value.
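The role of C can be illustrated with a short sketch of the soft-margin objective that SMO minimizes. This is plain Python with made-up toy weights and data, not XLSTAT's internal implementation:

```python
# Soft-margin SVM objective:
#   0.5 * ||w||^2  +  C * sum_i max(0, 1 - y_i * (w . x_i + b))
# The first term rewards a wide margin; the second (hinge loss) penalizes
# margin violations. A larger C makes every violation cost more.

def svm_objective(w, b, X, y, C):
    margin_term = 0.5 * sum(wj * wj for wj in w)
    hinge = sum(
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    )
    return margin_term + C * hinge

# Made-up 2-D data; labels must be -1/+1 for the hinge loss.
X = [[1.0, 2.0], [0.5, 0.5], [-1.0, -1.5]]
y = [1, 1, -1]
w, b = [0.5, 0.5], 0.0

print(svm_objective(w, b, X, y, C=1.0))    # one margin violation, mild cost
print(svm_objective(w, b, X, y, C=100.0))  # same violation, much larger cost
```

The second observation sits inside the margin, so its hinge loss is nonzero; raising C from 1 to 100 multiplies that penalty accordingly, which is why large C values push the optimizer toward fitting every training point.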
We select standardization in the preprocessing field and use linear kernels as shown below.

As we want to see how well our classifier performs, we will make a validation sample out of the training sample. For this purpose, in the Validation tab, we check the Validation checkbox and select 100 observations randomly drawn from the training sample as indicated below.

Finally, in the Outputs tab, we select the outputs we want to obtain as shown below. The computations begin once you have clicked on OK, and the results are then displayed.
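The two preprocessing steps above can be sketched in plain Python. XLSTAT performs both internally; the age values and the sample sizes here are illustrative:

```python
import math
import random

# Standardization (z-score): rescale a quantitative column to mean 0 and
# standard deviation 1, so variables on different scales (age vs. fare)
# contribute comparably to the kernel.
def standardize(values):
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

ages = [22.0, 38.0, 26.0, 35.0, 54.0]  # made-up sample of the age column
z = standardize(ages)
print(round(sum(z), 10))  # close to 0: the column is now centered

# Random validation split, as in the Validation tab: hold out n_valid
# observations drawn at random; the rest are used for training.
def split_validation(rows, n_valid, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    valid = [rows[i] for i in idx[:n_valid]]
    train = [rows[i] for i in idx[n_valid:]]
    return train, valid

train, valid = split_validation(list(range(1043)), 100)
print(len(train), len(valid))  # 943 training rows, 100 validation rows
```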
Interpreting the results of the SVM classifier
The first table displays a summary of the optimized SVM classifier. You can see in the figure below that the classifier had to discriminate between classes 0 and 1, and that class 0 has been labeled as the positive class. Of the 943 observations used to train the classifier, 452 support vectors have been identified.

The second table, shown below, gives the complete list of the 452 support vectors together with the associated alpha coefficient values and the positive or negative value of the output class. Together with the bias value from the former table, this information is sufficient to fully describe the optimized classifier.

The next table displays the confusion matrix obtained on the validation sample. This matrix indicates how well our classifier performed: on the validation sample, we obtain 78% correct classifications.

Following this table, the graph below shows the ROC curve for the validation sample. The ROC curve is the graphical representation of the pairs (1 – specificity, sensitivity) for the various possible threshold values.
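To see why the support-vector list plus the bias fully describe the classifier: with a linear kernel, a new observation x is scored as the sum over support vectors of (alpha_i * y_i) * <x_i, x> plus the bias, and classified by the sign of that score. A minimal sketch, using made-up support vectors and coefficients rather than the values from the tutorial's table:

```python
def linear_kernel(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_decision(x, support_vectors, signed_alphas, bias, kernel=linear_kernel):
    # signed_alphas[i] plays the role of alpha_i * y_i: the alpha
    # coefficient with the sign of its output class.
    score = sum(a * kernel(sv, x) for sv, a in zip(support_vectors, signed_alphas))
    return score + bias

# Hypothetical 2-D support vectors and coefficients, for illustration only.
svs = [[1.0, 1.0], [2.0, 0.0], [-1.0, -1.0]]
signed_alphas = [0.5, 0.25, -0.75]
bias = -0.1

x_new = [1.5, 0.5]
score = svm_decision(x_new, svs, signed_alphas, bias)
label = "positive" if score > 0 else "negative"
print(score, label)
```

No other training observation is needed at prediction time, which is why only the 452 support vectors (out of 943 training rows) are reported.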
We want a curve close to the upper left corner. Finally, the area under the curve (AUC) is 83.1%, which means the model has an 83.1% chance of ranking a randomly chosen observation from the positive class above a randomly chosen observation from the negative class.
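This probabilistic reading of the AUC can be checked directly by comparing classifier scores pairwise. A small sketch on made-up validation scores (not the tutorial's actual outputs):

```python
def auc_from_scores(scores_pos, scores_neg):
    # Fraction of (positive, negative) pairs where the positive
    # observation gets the higher score; ties count for half, per the
    # standard rank-based definition of the AUC.
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores on a tiny validation set.
pos = [0.9, 0.8, 0.4]  # observations whose true class is positive
neg = [0.7, 0.3, 0.2]  # observations whose true class is negative
print(auc_from_scores(pos, neg))  # 8 of the 9 pairs are ranked correctly
```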
Conclusion on the SVM classification
We trained our classifier using linear kernels and obtained fairly good results, with 78% correct classification. Some additional tuning might yet be necessary to challenge the best data scientists on Kaggle. One approach could be to change the kernel family and see how well a higher-dimensional feature space performs on our dataset.
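One common candidate for such a change is the Gaussian (RBF) kernel, which implicitly maps observations into a much higher-dimensional space while only ever computing pairwise similarities. A sketch of the kernel function itself, with an illustrative gamma value:

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    # Gaussian (RBF) kernel: k(u, v) = exp(-gamma * ||u - v||^2).
    # gamma controls how quickly similarity decays with distance.
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * sq_dist)

# Identical points are maximally similar (k = 1); similarity shrinks
# toward 0 as points move apart.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
print(rbf_kernel([1.0, 2.0], [3.0, 0.0]))  # much smaller
```

Swapping the linear kernel for this one in the decision function changes nothing else about the SVM machinery, which is exactly what the kernel choice in XLSTAT's Options tab controls.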