Training a one-class support vector machine in Excel
This tutorial will help you set up and train a One-class Support Vector Machine (One-class SVM) classifier in Excel using the XLSTAT statistical software.
Dataset for training an SVM classifier
The dataset used in this tutorial comes from the data science platform Kaggle and can be accessed at this address.
The “Banknotes” dataset lists 200 banknotes described by 7 variables: one qualitative variable that indicates each banknote's authenticity, and six quantitative variables that describe its shape.
Counterfeit: “0” if the banknote is genuine, “1” if it is counterfeit. The dataset contains 100 counterfeit and 100 genuine banknotes. Length, Left, Right, Bottom, Top and Diagonal are the quantitative variables.
A second dataset of 10 banknotes is available and will be used to predict whether those banknotes are genuine or counterfeit.
Goal of this tutorial
The goal of this tutorial is to learn how to detect anomalies using a One-class SVM on the “Banknotes” dataset and evaluate how well the algorithm performs based on a 10-fold cross-validation.
Setting up a One-class SVM classifier and a cross-validation
To set up a One-class SVM method, click on Machine Learning/One-class Support Vector Machine:
Once you have clicked on the button, the dialog box appears. Select the data on the Excel sheet.
First, select the quantitative explanatory variables by checking the corresponding checkbox as shown below.
In the Quantitative field, we select all quantitative variables: “Length”, “Left”, “Right”, “Bottom”, “Top” and “Diagonal”. To select multiple columns, you may use the Ctrl key.
To use the references we have for each banknote, check Known classes and select the binary variable “Counterfeit”. Then, we enter “1” as the outlier class because it is the class of the counterfeit banknotes.
As the name of each variable is present at the top of the table, we must check the Variable labels checkbox.
In the Options tab, the classifier parameters must be set up. For the SMO parameters, the Nu field corresponds to the regularization parameter. It controls the proportion of observations that may be treated as outliers during the optimization: the closer Nu is to 1, the greater that proportion.
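To get a feel for Nu outside XLSTAT, the sketch below uses scikit-learn's `OneClassSVM` on synthetic data rather than the Banknotes file. These are assumptions for illustration: XLSTAT's own implementation may differ, and an RBF kernel is used here only because it makes the effect easy to see. Raising Nu should increase the share of training observations flagged as outliers.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))  # synthetic stand-in: 200 rows, 6 features

def outlier_fraction(nu):
    """Fit a one-class SVM and return the share of training rows flagged as outliers."""
    model = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X)
    # scikit-learn labels outliers as -1 and inliers as +1
    return float(np.mean(model.predict(X) == -1))

low = outlier_fraction(0.05)
high = outlier_fraction(0.5)
print(low, high)
```

The two printed fractions should sit close to the two Nu values, which is why Nu is often read as the expected share of outliers.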
The tolerance parameter tells how accurate the optimization algorithm will be when comparing support vectors. If you want to speed up calculations, you can increase the tolerance parameter. We leave the tolerance at its default value.
We select Standardization in the preprocessing field, and we use a linear kernel as shown below.
Finally, to evaluate the quality of the classifier, we can run a 10-fold cross-validation.
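For readers who want to see what a 10-fold cross-validation of a one-class SVM involves, here is a minimal sketch with scikit-learn. It is not XLSTAT's procedure: the data are synthetic (180 inliers and 20 outliers, an arbitrary choice), the RBF kernel and `nu=0.1` are assumptions, and the known classes are used only to score each fold, never to fit the model.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Synthetic stand-in data: 180 inliers (class 0) and 20 outliers (class 1)
X = np.vstack([rng.normal(0, 1, size=(180, 6)), rng.normal(4, 1, size=(20, 6))])
y = np.array([0] * 180 + [1] * 20)

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
error_rates = []
for train_idx, test_idx in folds.split(X, y):
    # The scaler is fitted on the training fold only, then reused on the test fold
    model = make_pipeline(StandardScaler(), OneClassSVM(kernel="rbf", nu=0.1))
    model.fit(X[train_idx])  # known classes are not used for fitting
    # scikit-learn returns -1 for outliers; map that to outlier class 1
    predicted = (model.predict(X[test_idx]) == -1).astype(int)
    error_rates.append(float(np.mean(predicted != y[test_idx])))

mean_error = float(np.mean(error_rates))
print(error_rates, mean_error)
```

Fitting the standardization inside each fold keeps information from the held-out observations out of the training step.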
We can predict the 10 new observations by selecting the 6 quantitative variables in the Prediction tab. Again, as the name of each variable is present at the top of the table, we must check the Variable labels checkbox.
Finally, in the Outputs tab, we select the outputs we want to obtain as shown below:
The computations begin once you have clicked on OK. The results will then be displayed on a new Excel sheet.
Interpreting the results of the SVM classifier and a cross-validation
The first table shows performance results for the cross-validation. For each fold, the classification error rate, the F-score and the balanced accuracy (BAC) are displayed. The mean classification error rate is 29.1% for a model using the parameters chosen above.
The following table displays performance metrics on the training dataset including 10 indicators:
In the case of One-class classification, the F-score is the preferred metric for assessing the classifier. Here, the F-score is 78%, which indicates that the model performs reasonably well for anomaly detection.
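The three indicators reported per fold can be computed from a confusion matrix in which the outlier class is the positive class. The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not taken from the XLSTAT output.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return error rate, F-score and balanced accuracy, with the outlier class as positive."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # recall on the outlier class
    specificity = tn / (tn + fp)  # recall on the inlier class
    return {
        "error_rate": (fp + fn) / total,
        "f_score": 2 * precision * sensitivity / (precision + sensitivity),
        "bac": (sensitivity + specificity) / 2,
    }

# Hypothetical counts for illustration only (not the tutorial's results)
metrics = binary_metrics(tp=8, fp=2, fn=2, tn=8)
print(metrics)
```

With these counts, the error rate is 0.2 and both the F-score and the balanced accuracy are 0.8.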
The following graph can help to evaluate the classifier. The ROC curve plots the pair (1 – specificity, sensitivity) for each possible threshold value. We want a curve close to the upper left corner, which is the case here:
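As a rough illustration of how such a curve is built, the sketch below sweeps a threshold over a set of outlier scores and records one (1 – specificity, sensitivity) pair per threshold. The function name, the labels and the scores are hypothetical, and the convention that a higher score means "more outlier-like" is an assumption.

```python
import numpy as np

def roc_points(labels, scores):
    """Return (1 - specificity, sensitivity) pairs, one per threshold on the scores."""
    labels = np.asarray(labels)
    scores = np.asarray(scores)
    points = []
    # Sweep thresholds from strictest to loosest; higher score = more outlier-like
    for threshold in np.r_[np.inf, np.sort(np.unique(scores))[::-1]]:
        flagged = scores >= threshold
        sensitivity = np.sum(flagged & (labels == 1)) / np.sum(labels == 1)
        false_positive_rate = np.sum(flagged & (labels == 0)) / np.sum(labels == 0)
        points.append((float(false_positive_rate), float(sensitivity)))
    return points

# Hypothetical labels (1 = outlier) and outlier scores, for illustration only
points = roc_points([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(points)
```

The curve always starts at (0, 0), where nothing is flagged, and ends at (1, 1), where everything is flagged.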
We can now look at the results for the classifier itself.
The first table displays a summary of the optimized SVM classifier. You can see on the figure below that the outlier class is 1, as chosen previously. The classifier was trained on 200 observations, of which 113 were identified as support vectors, and the computed bias is close to zero.
The second table shown below gives the complete list of the 113 support vectors with their associated alpha coefficients. Together with the bias value from the previous table, this information is sufficient to fully describe the optimized classifier.
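To see why support vectors, alphas and a bias are enough, the sketch below refits a linear one-class SVM with scikit-learn on synthetic data and rebuilds its decision function by hand. This relies on scikit-learn's conventions (`dual_coef_`, `support_vectors_`, `intercept_`); how XLSTAT scales its alphas or signs its bias is not stated here, so treat the mapping to the XLSTAT tables as an assumption.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=1.0, size=(200, 6))  # synthetic stand-in data

model = OneClassSVM(kernel="linear", nu=0.1).fit(X)

alphas = model.dual_coef_.ravel()          # one coefficient per support vector
support_vectors = model.support_vectors_   # rows of X kept by the optimizer
bias = model.intercept_[0]

# Linear kernel: f(x) = sum_i alpha_i * <sv_i, x> + bias
manual = (X @ support_vectors.T) @ alphas + bias
matches = bool(np.allclose(manual, model.decision_function(X), atol=1e-6))
print(len(alphas), matches)
```

Observations that are not support vectors have a zero coefficient, which is why they can be dropped from the description of the model.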
Finally, we can find predicted classes for the 10 observations of the prediction dataset: The first column shows the predicted class for each observation, while the second column is the decision function.
As we chose “1” as the outlier class, observations predicted as “1” are outliers according to the method.
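The link between the two columns can be sketched as a simple sign test on the decision function. The convention used below (negative values flagged as outliers) is the one scikit-learn follows and is an assumption here; the function name and the decision values are hypothetical.

```python
import numpy as np

OUTLIER_CLASS = 1  # the outlier class chosen in the dialog box
INLIER_CLASS = 0

def predicted_classes(decision_values):
    """Map decision-function values to classes: negative values are flagged as outliers."""
    decision_values = np.asarray(decision_values, dtype=float)
    return np.where(decision_values < 0, OUTLIER_CLASS, INLIER_CLASS)

# Hypothetical decision values for illustration only
classes = predicted_classes([0.8, -0.3, 0.1, -1.2])
print(classes.tolist())  # [0, 1, 0, 1]
```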