Feature selection in an industrial data set

Type
Master's thesis
Programme
Complex adaptive systems (MPCAS), MSc
Published
2019
Author
Andreasson, Philip
Abstract
Feature selection is a technique for reducing the dimensionality of data sets, which can provide benefits in terms of computational time, performance and interpretability. This thesis presents the development of a genetic algorithm for feature selection in an industrial data set on investigations, where a large proportion of the features are categorical. The genetic algorithm is designed to always select one-hot encoded categorical features as a group. The quality of a proposed feature subset was assessed using Naive Bayes classifiers, decision trees, artificial neural networks, support vector machines and logistic regression classifiers. The classification performance of the subsets obtained from the genetic algorithm was further compared to that of subsets from stepwise forward selection, Relief, LASSO and random forests. The results showed that the dimensionality of the data set could be reduced drastically while maintaining good classification accuracy. The most significant results were obtained for the Naive Bayes classifier, where the genetic algorithm and stepwise forward selection produced subsets whose prediction performance significantly exceeded that of both the full data set and the subsets from the other feature selection algorithms. For the other classifiers, the differences were smaller. Given the extensive time required to run the genetic algorithm and stepwise forward selection, the other feature selection algorithms are a better choice for these classifiers.
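The abstract does not describe how the group-wise selection is implemented. The following is only a minimal sketch of one way to guarantee that one-hot encoded columns are selected together: the chromosome holds one gene per original feature, and that gene is expanded to all columns belonging to the feature. The column layout in `groups` and the function names `expand_mask`, `mutate` and `crossover` are hypothetical and not taken from the thesis.

```python
import random


def expand_mask(group_genes, groups):
    """Expand one gene per original feature into a per-column boolean mask."""
    n_columns = sum(len(cols) for cols in groups)
    mask = [False] * n_columns
    for gene, cols in zip(group_genes, groups):
        for c in cols:
            mask[c] = bool(gene)
    return mask


def mutate(group_genes, rate, rng):
    """Flip each group-level gene independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in group_genes]


def crossover(parent_a, parent_b, rng):
    """Single-point crossover applied to the group-level genes."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]


# Hypothetical layout: feature 0 is numeric (column 0), feature 1 is a
# categorical variable one-hot encoded into columns 1-3, feature 2 into
# columns 4-5, and feature 3 is numeric (column 6).
groups = [[0], [1, 2, 3], [4, 5], [6]]

rng = random.Random(0)
parent_a = [1, 0, 1, 0]
parent_b = [0, 1, 0, 1]
child = mutate(crossover(parent_a, parent_b, rng), rate=0.1, rng=rng)
mask = expand_mask(child, groups)
```

Because mutation and crossover only ever act on the group-level genes, the columns of a one-hot encoded feature can never be split; `mask` can then be used to pick the columns passed to whichever classifier evaluates the subset.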
Subject/keywords
feature selection, genetic algorithms, categorical features