Feature Extraction Tutorial in Excel
This tutorial explains how to extract feature vectors from a collection of text documents in Excel using the XLSTAT software.
Feature extraction reduces the resources required to describe a large set of textual data. It is a general term for methods that construct combinations of the original variables so that the data can be described with fewer variables while remaining sufficiently accurate. The extracted features are commonly used in document classification, where the frequency of occurrence of each word in a document serves as a feature for training a classifier.
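To make the idea of word frequencies as features concrete, here is a minimal Python sketch that is independent of XLSTAT. It assumes whitespace tokenization and a toy two-document corpus; the function name `bag_of_words` is hypothetical and chosen only for illustration.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, vectors): one word-count vector per document."""
    # Assumption: lowercase + whitespace split is enough for this toy example.
    tokenized = [doc.lower().split() for doc in documents]
    # The shared vocabulary defines the columns of every feature vector.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors

docs = ["a great great movie", "a dull movie"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['a', 'dull', 'great', 'movie']
print(vectors)     # [[1, 0, 2, 1], [1, 1, 0, 1]]
```

Each row is the feature vector of one document, and each column counts one word of the vocabulary.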
Dataset for Running a Feature Extraction in Excel
In this tutorial, we will use a dataset from the Internet Movie Database (IMDB) consisting of 4000 movie reviews written in English.
Setting Up a Feature Extraction in Excel Using XLSTAT
- Open XLSTAT.
- Select the XLSTAT / Text mining / Feature extraction command.
- Select the column "Review" in the Worksheet field.
- Select the column "Id" in the Document labels field.
- In the Preprocessing sub-menu of the Options tab:
  - Exclude a list of English stop words
  - Remove punctuation marks and numbers
  - Activate the Stemming option
- In the Intermediate form sub-menu, apply filtering:
  - Activate the Remove sparse terms option
  - Enter 10 as the Minimum frequency
- Click on OK.
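For readers who want to see what the preprocessing options do to a review, the following Python sketch mimics them outside of XLSTAT. It is an approximation under stated assumptions: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a crude suffix stripper used in place of a real stemming algorithm, not the one XLSTAT applies.

```python
import re

# Hypothetical, deliberately tiny stop-word list; XLSTAT uses its own list.
STOP_WORDS = {"a", "an", "and", "the", "is", "was", "it", "this", "of", "in", "to"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three letters so short words are left untouched.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, drop punctuation and numbers, remove stop words, then stem."""
    letters_only = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = letters_only.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

tokens = preprocess("The acting was amazing, and it lasted 120 minutes!")
print(tokens)  # ['act', 'amaz', 'last', 'minut']
```

Stems such as "amaz" are not dictionary words; stemming only aims to map related word forms to the same term.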
Interpret the Results of the Feature Extraction
The document-term matrix is displayed.
Terms that are present in fewer than 5% of the reviews have been removed. Terms that appear fewer than 10 times across all the reviews are likewise absent from the generated document-term matrix.
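The two filtering rules described above can be sketched in Python as follows. This is an illustration of the rules as stated in this tutorial, not XLSTAT's implementation; the function name `filter_terms` and its parameter names are hypothetical, and the toy call uses looser thresholds because the example corpus has only four documents.

```python
from collections import Counter

def filter_terms(tokenized_docs, min_doc_proportion=0.05, min_frequency=10):
    """Keep terms present in enough documents and frequent enough overall."""
    total_counts = Counter()  # occurrences of each term across all documents
    doc_counts = Counter()    # number of documents that contain each term
    for tokens in tokenized_docs:
        total_counts.update(tokens)
        doc_counts.update(set(tokens))
    n_docs = len(tokenized_docs)
    return sorted(
        term for term in total_counts
        if doc_counts[term] / n_docs >= min_doc_proportion
        and total_counts[term] >= min_frequency
    )

toy_docs = [["good", "plot"], ["good", "act"], ["good", "good"], ["bad", "plot"]]
kept = filter_terms(toy_docs, min_doc_proportion=0.5, min_frequency=2)
print(kept)  # ['good', 'plot']
```

With the default arguments, the function applies the 5% presence and 10-occurrence thresholds used in this tutorial.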
The word cloud, which represents the frequency of words present in all the reviews, is displayed after the document-term matrix.
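As a rough illustration of the numbers behind a word cloud, the Python sketch below totals term occurrences over all documents and scales a display size linearly with frequency. The linear scaling and the size range of 10 to 40 are assumptions made for this example; XLSTAT's own rendering rules may differ.

```python
from collections import Counter

def term_frequencies(tokenized_docs):
    """Total number of occurrences of each term over all documents."""
    totals = Counter()
    for tokens in tokenized_docs:
        totals.update(tokens)
    return totals

def font_sizes(freqs, smallest=10, largest=40):
    """Scale sizes linearly so the most frequent term gets the largest size."""
    top = max(freqs.values())
    return {
        term: smallest + (largest - smallest) * count / top
        for term, count in freqs.items()
    }

toy_docs = [["good", "plot"], ["good", "act"], ["bad", "plot", "good"]]
freqs = term_frequencies(toy_docs)
sizes = font_sizes(freqs)
print(freqs.most_common(2))  # [('good', 3), ('plot', 2)]
print(sizes["good"])         # 40.0
```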