Keyword Spotting Within an Automotive Environment
Published
Author
Type
Master's Thesis
Abstract
The aim of this project is to implement a keyword spotting (KWS) system, i.e., a system
that detects keywords within speech, that performs well in an automotive environment.
The system extracts features from sound waves captured by a microphone and applies
machine learning (ML) to them. The Google Speech Commands (GSC) dataset
is used to develop the models in combination with audio book samples from the
LibriSpeech (LS) dataset. The combination of these two datasets is unique and was
chosen to increase the robustness of the models. In addition, data augmentation
and the insertion of background noise are key tools in this project, used
to adapt the system to an automotive environment.
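
As an illustration of inserting background noise, the sketch below mixes a noise recording into a speech clip at a chosen signal-to-noise ratio (SNR). It is a minimal example under stated assumptions, not the procedure used in the thesis: the function name `mix_at_snr`, the SNR-based scaling, and the looping of short noise clips are all choices made here for illustration.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db, rng=None):
    """Add `noise` to `speech` so that the result has the requested SNR in dB.

    Hypothetical helper for illustration; the thesis does not specify this API.
    """
    rng = np.random.default_rng() if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Loop the noise if it is shorter than the speech clip.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)

    # Take a random crop of the noise with the same length as the speech.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        # Nothing meaningful to scale against; return the speech unchanged.
        return speech.copy()

    # Scale the noise so that speech_power / scaled_noise_power = 10^(snr_db / 10).
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

A lower `snr_db` gives a noisier clip; for example, `mix_at_snr(clip, cabin_noise, snr_db=5.0)` would produce a considerably noisier training sample than `snr_db=20.0` (both variable names are placeholders).
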
Aside from standard performance metrics, the complexity of the model, which the user
experiences as a time delay, is also an important aspect for enabling real-time
usage. Performance is examined using recorded speech and in real-world settings,
both in a noise-free and a noisy automotive environment. The best-performing model
was found to be a temporal convolutional residual neural network (TC-ResNet) using
mel-frequency cepstral coefficient (MFCC) features, which achieved an accuracy of
95.34% on the validation dataset. The model complexity is low compared to models
in previous studies, with 152.7 K parameters and 3.22 M multiplications.
Performance drops substantially in an automotive
environment, to an average accuracy of 83.71%, but the result is considered promising
because the capture and filtering of the speech signals could be improved in several
ways by using the car’s hardware instead of the laptop that was used. Because the
models performed poorly on continuous speech, it is suggested
that the system be combined with a voice activity detection system or a
"push-to-talk" button rather than run as an always-on process. Data collection is
proposed as the main focus for future improvements, as more labeled audio segments
are needed to build a higher-quality model with broader functionality.
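
To make the complexity figures more concrete, the sketch below shows how parameters and multiplications are commonly counted for a single one-dimensional (temporal) convolution layer. The helper `conv1d_cost` and the layer sizes in the usage example are hypothetical and are not taken from the thesis; they only illustrate the arithmetic behind such totals.

```python
def conv1d_cost(in_channels, out_channels, kernel_size, out_length, bias=True):
    """Return (parameters, multiplications) for one 1D convolution layer.

    Hypothetical helper for illustration. Each output value needs
    in_channels * kernel_size multiplications, and there are
    out_channels * out_length output values.
    """
    weights = in_channels * out_channels * kernel_size
    params = weights + (out_channels if bias else 0)
    mults = weights * out_length
    return params, mults


# Made-up example layer: 40 input channels (e.g. MFCC coefficients),
# 16 output channels, kernel size 3, 98 output time frames.
params, mults = conv1d_cost(40, 16, 3, 98)
print(params, mults)  # 1936 188160
```

Summing these counts over all layers of a network gives totals of the kind quoted above; the exact counting conventions (for example, whether bias or normalization layers are included) vary between studies.
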
Subject/keywords
Keyword spotting, machine learning, artificial neural networks, automotive environment.