Feature selection in an industrial data set
Published
Author
Type
Master's thesis
Abstract
Feature selection is a technique for reducing the dimensionality of data sets, which
can provide benefits in terms of computational time, performance and interpretability.
This thesis presents the development of a genetic algorithm for feature selection
in an industrial data set on investigations, where a large proportion of the features
are categorical. The genetic algorithm is designed to always select one-hot encoded
categorical features as a group. The quality of a proposed feature selection subset
was assessed using Naive Bayes classifiers, decision trees, artificial neural networks,
support vector machines and logistic regression classifiers. The classification performance
of the subsets obtained from the genetic algorithm was further compared with that of
subsets from stepwise forward selection, Relief, LASSO and random forests. The results showed
that the dimensionality of the data set could be reduced drastically while maintaining
good classification accuracy. The most significant results were obtained for
the Naive Bayes classifier, where the genetic algorithm and stepwise forward selection
managed to produce subsets with prediction performances that significantly
exceeded both the full data set and the subsets from the other feature selection algorithms.
For the other classifiers, the differences were smaller. Given the extensive
time required to run the genetic algorithm and stepwise forward selection, the other
feature selection algorithms are a better choice for these classifiers.
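
The group-wise handling of one-hot encoded features described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation from the thesis: the feature names, the column layout in GROUPS, and the choice of bit-flip mutation and single-point crossover are all hypothetical, since the abstract does not specify the operators. The idea shown is that the chromosome carries one bit per original feature, so every genetic operator acts at the feature level and the one-hot columns of a categorical feature are always selected or dropped together.

```python
import random

# Hypothetical layout (an assumption for illustration): each original feature
# maps to the column indices it occupies after one-hot encoding.
# A numeric feature occupies a single column.
GROUPS = {
    "temperature": [0],
    "machine_type": [1, 2, 3],
    "operator_shift": [4, 5],
}
FEATURES = list(GROUPS)


def expand_mask(chromosome):
    """Turn a per-feature bit list into a sorted list of selected column indices."""
    columns = []
    for name, bit in zip(FEATURES, chromosome):
        if bit:
            columns.extend(GROUPS[name])
    return sorted(columns)


def mutate(chromosome, rate, rng):
    """Flip each feature-level bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in chromosome]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between feature-level genes, so groups are never split."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


if __name__ == "__main__":
    rng = random.Random(0)
    child = crossover([1, 0, 1], [0, 1, 0], rng)
    child = mutate(child, rate=0.1, rng=rng)
    print(child, expand_mask(child))
```

In a full search, the column indices returned by expand_mask would be used to slice the encoded data before training one of the classifiers, and the resulting classification performance would serve as the fitness of the chromosome.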
Subject/keywords
feature selection, genetic algorithms, categorical features