# Training a Support Vector Machine (SVM) in Excel

2016-06-13

This tutorial will help you set up and train a Support Vector Machine (SVM) classifier in Excel using the XLSTAT statistical software.

## Dataset for training an SVM classifier

An Excel sheet with both the data and results of this tutorial can be downloaded by clicking here.
The dataset used in this tutorial is extracted from the Machine Learning competition entitled "Titanic: Machine Learning from Disaster" on Kaggle, the famous data science platform.
The Titanic dataset can be accessed at this address. It refers to the sinking of the RMS Titanic in 1912. During this tragedy, more than 1,500 of the 2,224 passengers lost their lives due to an insufficient number of lifeboats.
The dataset consists of a list of 1,309 passengers along with the following information:
• survived: Survival (0 = No; 1 = Yes)
• pclass: Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
• name: Name
• sex: Gender (male; female)
• age: Age
• sibsp: Number of Siblings/Spouses Aboard
• parch: Number of Parents/Children Aboard
• fare: Passenger Fare
• cabin: Cabin
• embarked: Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

## Goal of this tutorial

The goal of this tutorial is to learn how to set up and train an SVM classifier on the Titanic dataset and to see how well the classifier performs on a validation set.

## Setting up an SVM classifier

To set up an SVM classifier, click on Machine Learning / Support Vector Machine as shown below:

Once you have clicked on the button, the SVM dialog box appears. Select the data on the Excel sheet.
In the Response variable field, select the binary variable we want to predict when classifying our data. In our case, this is the column giving the survival information.
We also select both quantitative and qualitative explanatory variables by checking both checkboxes as shown below.

In the quantitative field, we select columns corresponding to the following fields:
• age
• sibsp
• parch
• fare

To select multiple columns, you may use the Ctrl key.

In the qualitative field, we select the columns with qualitative information:
• pclass
• sex
• embarked

As the name of each variable is present at the top of the table, we must check the Variable labels checkbox.
In the Options tab, the classifier parameters must be set up.
For the SMO parameters, we keep the default options. The C field corresponds to the regularization parameter: it controls how much misclassification you are willing to allow during the optimization. A large value of C means a strong penalty on each misclassified observation. In our case, we set C to 1. The Epsilon field is a numerical precision parameter; it is machine dependent and can be left at 1e-12. The tolerance parameter sets how accurate the optimization algorithm will be when comparing support vectors. If you want to speed up calculations, you can increase the tolerance. We leave it at its default value.
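To make the role of C concrete, here is a minimal sketch of the soft-margin objective that C weights, written in plain Python. The data, weights, and bias below are made up for illustration; XLSTAT's solver optimizes this kind of objective internally, it is not something you type in.

```python
# Illustrative sketch of the soft-margin SVM objective controlled by C.
# All values below are hypothetical, not taken from the Titanic model.

def hinge_loss(w, b, x, y):
    """Hinge loss for one observation: max(0, 1 - y * (w.x + b)), y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def svm_objective(w, b, X, Y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses over the sample)."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    loss_term = sum(hinge_loss(w, b, x, y) for x, y in zip(X, Y))
    return margin_term + C * loss_term

# Two toy observations: the second one violates its margin.
X = [[1.0, 2.0], [0.5, 0.0]]
Y = [1, -1]
w, b = [0.5, 0.5], 0.0

# A larger C makes the same margin violation cost more in the objective.
print(svm_objective(w, b, X, Y, C=1.0))   # 1.5
print(svm_objective(w, b, X, Y, C=10.0))  # 12.75
```

The margin term is unchanged between the two calls; only the penalty on the misclassified point grows with C, which is exactly the trade-off the C field exposes.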

We select Rescaling in the preprocessing field and we use linear kernels as shown below.
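The two choices above can be sketched in a few lines of plain Python: min-max rescaling maps each quantitative column to [0, 1] so that variables like fare and sibsp are on comparable scales, and a linear kernel is simply a dot product with no lifting into a higher-dimensional space. This is a sketch of the standard definitions, not of XLSTAT's internals.

```python
# Sketch of min-max rescaling and of the linear kernel (standard definitions).

def rescale(column):
    """Map the values of one quantitative column linearly onto [0, 1]."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def linear_kernel(x1, x2):
    """Linear kernel: a plain dot product between two observations."""
    return sum(a * b for a, b in zip(x1, x2))

ages = [22.0, 38.0, 26.0, 54.0]  # hypothetical age column
print(rescale(ages))             # [0.0, 0.5, 0.125, 1.0]
print(linear_kernel([1, 2], [3, 4]))  # 11
```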

As we want to see how well our classifier performs, we will make a validation sample out of the training sample. For this purpose, in the Validation tab, we check the Validation check box and select 100 observations randomly drawn from the training sample as indicated below.
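The holdout step can be sketched as follows: draw 100 row indices at random and keep the rest for training. The row count of 1043 is an assumption here (the labeled part of the Titanic table), chosen so the split matches the observation counts reported later in this tutorial.

```python
# Sketch of the validation holdout: 100 rows set aside at random.
import random

random.seed(0)   # fixed seed so this example is reproducible
n_rows = 1043    # assumed number of labeled rows in the training table
validation_idx = set(random.sample(range(n_rows), 100))
training_idx = [i for i in range(n_rows) if i not in validation_idx]

print(len(validation_idx), len(training_idx))  # 100 943
```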

Finally, in the Outputs tab, we select the outputs we want to obtain as shown below:

The computations begin once you have clicked on OK. The results will then be displayed.

## Interpreting the results of the SVM classifier

The first table displays a summary of the optimized SVM classifier. You can see on the figure below that the classifier had to discriminate between classes 0 and 1, and that class 0 has been labeled as the positive class. There were 943 observations used to train the classifier, out of which 766 support vectors have been identified.

The second table shown below gives the complete list of the 766 support vectors together with the associated alpha coefficient values and the positive or negative value of the output class. Together with the bias value of the former table, this information is sufficient to fully describe the optimized classifier.
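To see why these quantities suffice, recall that an SVM prediction takes the form f(x) = Σᵢ αᵢ yᵢ K(xᵢ, x) + b, where the sum runs over the support vectors only, and the sign of f(x) gives the predicted class. Below is a minimal sketch of that decision function with made-up support vectors, coefficients, and bias:

```python
# Sketch of the SVM decision function built from the exported quantities:
# support vectors, alpha coefficients, class signs, and bias.
# All numbers here are hypothetical, not values from the report tables.

def linear_kernel(x1, x2):
    return sum(a * b for a, b in zip(x1, x2))

def decision(x, support_vectors, alphas, labels, bias):
    """Return the predicted class sign of f(x) = sum(alpha*y*K(sv, x)) + b."""
    f = bias + sum(a * y * linear_kernel(sv, x)
                   for sv, a, y in zip(support_vectors, alphas, labels))
    return 1 if f >= 0 else -1

svs = [[1.0, 1.0], [-1.0, -1.0]]  # two hypothetical support vectors
alphas = [0.7, 0.7]
labels = [1, -1]
bias = 0.0

print(decision([2.0, 2.0], svs, alphas, labels, bias))    # 1
print(decision([-2.0, -1.0], svs, alphas, labels, bias))  # -1
```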

The next two tables display the confusion matrices obtained on the training and validation samples. These matrices indicate how well our classifier performed: on the training dataset we obtain 66.49% of correct classifications, and this figure rises to 74% on the validation set.
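These percentages are read off a confusion matrix as the number of correctly classified observations (the diagonal) divided by the total. As a sketch, with an illustrative matrix (not the one from the report):

```python
# Accuracy from a confusion matrix: trace divided by total count.
# The matrix below is illustrative only.

def accuracy(confusion):
    """confusion[i][j] = observations of true class i predicted as class j."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

validation_confusion = [[60, 14],
                        [12, 14]]
print(round(100 * accuracy(validation_confusion), 2))  # 74.0
```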

## Conclusion on the SVM classification

We trained our classifier using linear kernels and obtained fairly good results, with 74% correct classification on the validation sample. Some additional tuning would still be necessary to challenge the best data scientists on Kaggle. One approach could be to change the kernel family and see how well a higher-dimensional feature space performs on our dataset.
