DBSCAN clustering in Excel
This tutorial shows how to set up and interpret a DBSCAN clustering in Excel using the XLSTAT software.
Dataset for DBSCAN clustering
The data are from [Fisher R. A. (1936). The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188] and correspond to 150 Iris flowers, described by four variables (sepal length, sepal width, petal length, petal width) and their species.
Three species are included in this study: setosa, versicolor and virginica. The dataset contains 50 observations from each species.
Goal of this tutorial
The goal of this tutorial is to set up and interpret a DBSCAN clustering and see how well the clustering performs on the Iris dataset.
Setting up a DBSCAN clustering in XLSTAT
Once XLSTAT is open, click on Machine Learning / DBSCAN:
The DBSCAN dialog box appears. Select the data on the Excel sheet.
In the General tab, check the Quantitatives checkbox and select the following columns:
- sepal length
- sepal width
- petal length
- petal width
Because the name of each variable appears in the first row of the table, check the Variable labels checkbox. To see how the points are clustered, check the Observation labels checkbox and select the column with the species.
In the Options tab, set up the DBSCAN parameters:
- Epsilon: in this example, we enter 0.85. If the value of epsilon is too high, a single class will contain all observations. If epsilon is too low, all observations will be considered as noise.
- Minimum number of points: XLSTAT lets you run several analyses with different minimum numbers of points. Here, we enter 3 and 4 points.
The Distance matrix method is used to search for neighbors within a radius equal to Epsilon, and the Euclidean distance is chosen as the distance. Finally, in the Outputs tab, we can choose to display one or several output tables.
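To make the role of the two parameters concrete, here is a minimal Python sketch of the DBSCAN procedure with a Euclidean distance. It is not XLSTAT's implementation: the function names, the toy points, the use of label 0 for noise, and the choice to count a point itself toward the minimum number of points are all assumptions made for illustration.

```python
from math import dist

NOISE = 0  # label used for noise in this sketch (an assumption)

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i], itself included."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 1, 2, ... for classes, NOISE for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still become a border point later
            continue
        cluster += 1  # points[i] is a core point: start a new class
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical toy data: two tight groups of three points and one isolated point
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),
          (5.0, 5.0), (5.5, 5.0), (5.0, 5.5),
          (10.0, 0.0)]

labels = dbscan(points, eps=0.85, min_pts=3)
labels_eps_high = dbscan(points, eps=20.0, min_pts=3)
labels_eps_low = dbscan(points, eps=0.1, min_pts=3)
labels_min_4 = dbscan(points, eps=0.85, min_pts=4)
print(labels, labels_eps_high, labels_eps_low, labels_min_4)
```

On this toy data, an epsilon of 0.85 with 3 points gives two classes and one noise point, a very large epsilon merges everything into one class, and a very small epsilon leaves only noise. Raising the minimum number of points to 4 also turns every point into noise here, because no point has 4 neighbors within the radius under the counting convention assumed above.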
Interpreting a DBSCAN clustering
The first table summarizes the noise found in the dataset and the number of observations per class for each minimum number of points. When the minimum number of points is equal to 3, class 1 contains 50 observations, class 2 contains 100 observations, and no observation is considered as noise. If the minimum number of points is 4, only one observation is considered as noise.
Results are then displayed for each minimum number of points. The number of classes is displayed first, and the following table gives the class and the silhouette score for each observation. Here, the first 10 Iris setosa are assigned to class 1.
A graph of the silhouette scores makes it possible to assess the quality of the clustering visually. If the score is close to 1, the observation lies well within its class. On the contrary, if the score is close to -1, the observation is assigned to the wrong class.
Here, observations assigned to class 1 have a higher silhouette score than observations assigned to class 2. The last table gives an overview of the noise and of the observations sorted by class. The first 10 rows and the last 5 rows of the table show that DBSCAN separates all setosa observations, in class 1, from the other observations, assigned to class 2. The same tables and graphs are displayed for a minimum number of points equal to 4.
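The silhouette score can be illustrated with a short Python sketch based on its usual definition: for each observation, a is the mean distance to the other members of its class, b is the smallest mean distance to the members of any other class, and the score is (b - a) / max(a, b). The function name, the toy points, and the conventions used here (Euclidean distance, no noise points, score 0 for a class with a single member) are assumptions, and XLSTAT may handle these cases differently.

```python
from math import dist

def silhouette_scores(points, labels):
    """Silhouette score of each point; needs at least two classes, no noise."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # convention for a class with a single point
            continue
        # a: mean distance to the other members of the same class
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of another class
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        scores.append((b - a) / max(a, b))
    return scores

# Hypothetical toy data: two well separated pairs of points
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = silhouette_scores(points, [1, 1, 2, 2])  # sensible assignment
bad = silhouette_scores(points, [1, 1, 2, 1])   # last point put in the wrong class
print([round(s, 2) for s in good])
print([round(s, 2) for s in bad])
```

With the sensible assignment every score is close to 1. When the last point is moved to the distant class, its score drops close to -1, which is the pattern the silhouette graph is meant to reveal.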
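One way to check how classes line up with species is a simple cross-tabulation. The Python sketch below rebuilds the class and species columns from the counts reported in this tutorial for a minimum number of points equal to 3 (class 1: the 50 setosa, class 2: the 100 other observations). The lists are a reconstruction for illustration, not an export of the XLSTAT results.

```python
from collections import Counter

# Reconstructed from the counts reported in the tutorial, not real XLSTAT output
species = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
classes = [1] * 50 + [2] * 100

# Count how many observations fall in each (class, species) pair
table = Counter(zip(classes, species))
for (cls, name), count in sorted(table.items()):
    print(f"class {cls}  {name:<10}  {count}")
```

A class that maps to a single species (class 1 here) indicates a clean separation, while a class shared by two species (class 2) shows where the clustering does not distinguish them.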
Conclusion on DBSCAN clustering
The DBSCAN algorithm produces 2 groups for the 3 Iris species: class 1 contains all setosa observations and class 2 contains the two other species.
DBSCAN is therefore a good way to separate the setosa species from the versicolor and virginica species, but it does not separate the 3 species better than other clustering methods.