Creating workflows to combine analyses in XLSTAT
This tutorial shows you how to create a workflow, which allows you to run a series of XLSTAT features in a simple and fluid way.
Dataset to illustrate the XLSTAT Workflow
The dataset used to illustrate the Workflow feature comes from the Kaggle data science platform (FIFA World Cup 2018 Prediction). It describes 668 soccer players of the English League One in 2018 using 84 variables.
Since the exploration of our data may require the use of several consecutive features, the goal here will be to chain the following analyses:
A Principal Component Analysis (PCA) to reduce the dimension of the dataset by creating new synthetic variables.
A k-means clustering algorithm to identify clusters of soccer players with similar profiles.
Descriptive Statistics or a data visualization tool on the obtained clusters to characterize the profiles that compose them.
How does a workflow work?
A workflow is a succession of analyses represented by interconnected blocks. It is a simple way of chaining analyses that need the results of other analyses as input data.
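The chaining idea can be sketched in a few lines of Python. This is only a conceptual illustration, not the XLSTAT API: the names run_workflow, center and summarize are hypothetical, and the two toy blocks stand in for real analyses. Each block receives the results of every block that ran before it, which is what lets a later analysis reuse an earlier output as input data.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical types and names for illustration only; this is not the XLSTAT API.
Block = Tuple[str, Callable[[Dict[str, Any]], Any]]


def run_workflow(input_data: Any, blocks: List[Block]) -> Dict[str, Any]:
    """Run blocks in order; each block can read the results of all earlier blocks."""
    results: Dict[str, Any] = {"input": input_data}
    for name, step in blocks:
        results[name] = step(results)
    return results


def center(results: Dict[str, Any]) -> List[float]:
    """Toy block: subtract the mean from the input values."""
    data = results["input"]
    mean = sum(data) / len(data)
    return [x - mean for x in data]


def summarize(results: Dict[str, Any]) -> Dict[str, float]:
    """Toy block: describe the output of the 'center' block."""
    centered = results["center"]
    return {"min": min(centered), "max": max(centered)}


blocks = [("center", center), ("summarize", summarize)]
outputs = run_workflow([1.0, 2.0, 3.0, 6.0], blocks)
print(outputs["summarize"])
```

Because the blocks are stored separately from the data, the same chain can be run again on a different input by calling run_workflow with new values.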
How to initialize a workflow in XLSTAT?
Select the XLSTAT / Workflow / New workflow menu. The workspace associated with the workflow appears with a first node related to the input data.
In the General tab of the dialog box that appears, select the input data either manually or automatically. In manual mode, select the data and click Add. Finally, check the Column labels option and click OK to complete the configuration of the first node.
How to configure a workflow in XLSTAT?
Once the workflow is initialized, we can add statistical methods as follows:
Click on the first block to open the menu, then select the analysis to add in the Block to add after sub-menu.
Configure the dialog box of the analysis and click on OK.
The results tables of the performed analysis are automatically detected and can be used in the following blocks.
Repeat for each new block you want to create.
Save the workflow by clicking the Save button, or click the Export button to be able to share it.
How to interpret the results of a workflow?
Each node in the workflow provides a result sheet. To view it, use the menu of the node or click on the node icon in the left sidebar. Once built, the workflow can be relaunched with updated data or with the initial data.
Example of workflow creation
Initialize the workflow and select the data
Add a PCA (Principal Component Analysis), then configure the dialog box by selecting the 77 quantitative variables of the dataset. Then click on OK.
According to the following table and scree plot, the first 5 principal components account for more than 80% of the information provided by the initial dataset.
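The rule used here (keep the leading components until their cumulative share of variance reaches a threshold) can be written as a short Python sketch. The function name components_needed is hypothetical, and the eigenvalues below are made-up illustrative numbers, not the values obtained on the soccer dataset.

```python
from typing import List


def components_needed(eigenvalues: List[float], threshold: float = 0.80) -> int:
    """Smallest number of leading components whose cumulative variance share reaches the threshold."""
    total = sum(eigenvalues)
    cumulative = 0.0
    # Components are ranked by decreasing eigenvalue, as in a scree plot.
    for count, value in enumerate(sorted(eigenvalues, reverse=True), start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return count
    return len(eigenvalues)


# Made-up eigenvalues for illustration only (they sum to 10).
example = [4.0, 2.5, 1.0, 1.0, 1.0, 0.5]
print(components_needed(example))  # cumulative shares: 0.40, 0.65, 0.75, 0.85 -> 4
```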
Add a k-means clustering block, using the new synthetic variables computed by the PCA as input data. These data are automatically detected and available in all the following blocks.
A quick analysis of the evolution of the silhouette score allows us to determine that the 4-cluster partition seems to be the most appropriate.
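For readers who want to see what the silhouette score measures, here is a minimal pure-Python sketch of the standard average silhouette. It is not XLSTAT's implementation, and the six toy points and the two candidate partitions are invented for illustration. For each point, a is the mean distance to the other members of its own cluster and b is the lowest mean distance to the members of any other cluster; the point's score is (b - a) / max(a, b), and the partition with the highest average score is preferred.

```python
import math
from typing import Dict, List, Sequence


def silhouette_score(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Average silhouette over all points (points in singleton clusters score 0)."""
    clusters: Dict[int, List[int]] = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    def mean_dist(i: int, members: List[int]) -> float:
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # singleton cluster: contributes 0
        a = mean_dist(i, own)
        b = min(mean_dist(i, members)
                for other, members in clusters.items() if other != label)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)


# Invented toy data: three well-separated pairs of points.
toy_points = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
candidates = {
    2: [0, 0, 0, 0, 1, 1],  # a 2-cluster partition that merges two pairs
    3: [0, 0, 1, 1, 2, 2],  # a 3-cluster partition, one cluster per pair
}
scores = {k: silhouette_score(toy_points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On this toy data the 3-cluster partition scores higher, which mirrors the way the 4-cluster partition is retained in the tutorial.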
Lastly, add one or more scatter plots to visualize the principal components with colors defined by the different clusters.
The 4-cluster partition corresponds to very distinct player profiles. A quick analysis of the outputs of our PCA allows us to say that cluster n°3 would correspond to goalkeepers.
Once these different player profiles have been identified, it might be interesting to perform more detailed analyses within each of these groups.
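A first step toward such per-group analyses is a table of variable means per cluster. The sketch below assumes the cluster labels and the player variables are available as plain Python lists; the function name cluster_means, the variable names "diving" and "finishing", and all the values are hypothetical placeholders, not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def cluster_means(rows: List[Dict[str, float]], labels: List[int]) -> Dict[int, Dict[str, float]]:
    """Mean of each variable within each cluster."""
    grouped: Dict[int, List[Dict[str, float]]] = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return {
        label: {var: mean(r[var] for r in members) for var in members[0]}
        for label, members in grouped.items()
    }


# Hypothetical variables and values, for illustration only.
players = [
    {"diving": 80.0, "finishing": 10.0},
    {"diving": 70.0, "finishing": 20.0},
    {"diving": 10.0, "finishing": 60.0},
    {"diving": 20.0, "finishing": 80.0},
]
player_clusters = [3, 3, 1, 1]
profile = cluster_means(players, player_clusters)
print(profile[3])
```

A cluster whose means stand out on a few variables, as cluster 3 does on the invented "diving" variable here, is a natural candidate for a more detailed analysis.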