Speech Categorization with Prosodic Features and Deep Learning
Type
Bachelor's thesis
Abstract
The purpose of this thesis is to investigate whether three separate categorizations
of speech can be performed based only on pitch and intensity. Using pitch and
intensity curves together with neural networks, the goal is to distinguish between
the spoken languages Swedish, Spanish, English, German, French, and Chinese, and to
determine the sex and age group of the speaker.
The pitch and the intensity were extracted from thousands of hours of audio files
collected from the Swedish Riksdag and from LibriVox, a website with audiobooks in
the public domain. When categorizing the age group and the sex, only the audio files
from the Swedish Riksdag were used, since they were the only audio files labeled
with the speaker's sex and birth year.
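As an illustration of what pitch and intensity curves are, the sketch below computes one pitch value and one intensity value per short frame of audio. The abstract does not state which tool or algorithm was used for the extraction, so everything here is an assumption: a simple autocorrelation pitch estimate, an RMS level in dB, and the hypothetical names `frame_pitch_hz`, `frame_intensity_db`, and `pitch_intensity_curves`, with illustrative frame sizes and pitch limits.

```python
import math


def frame_intensity_db(frame, eps=1e-12):
    """RMS level of one frame in dB relative to full scale (amplitude 1.0)."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(mean_square + eps)


def frame_pitch_hz(frame, sample_rate, fmin=75.0, fmax=500.0):
    """Estimate pitch from the strongest autocorrelation peak.

    Returns 0.0 for silent frames or frames too short for the search range.
    The search range fmin..fmax is an assumed range for speech.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    if min_lag >= max_lag or not any(frame):
        return 0.0
    best_lag, best_corr = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        corr = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return sample_rate / best_lag if best_lag else 0.0


def pitch_intensity_curves(samples, sample_rate, frame_len=0.04, hop=0.01):
    """Return a list of (pitch_hz, intensity_db) pairs, one per frame."""
    n = int(frame_len * sample_rate)
    step = int(hop * sample_rate)
    curves = []
    for start in range(0, len(samples) - n + 1, step):
        frame = samples[start:start + n]
        curves.append((frame_pitch_hz(frame, sample_rate),
                       frame_intensity_db(frame)))
    return curves
```

Real speech would need voicing detection and octave-error handling that this sketch leaves out; a dedicated speech analysis tool is the more realistic choice for thousands of hours of audio.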
The categorization was performed using two different methods. The first was to
extract several language-characteristic features from the pitch and intensity and
use them as input data to train multiple feedforward neural networks (the FFNN
model), one or more for each categorization. The other method was to use the pitch
and the intensity directly as input data to multiple recurrent neural networks (the
LFLB-LSTM model), again one or more networks for each categorization.
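The difference between the two input representations can be sketched as follows. The first method reduces each variable-length curve to a fixed-length feature vector, whereas the second passes the curves to the network as sequences. The abstract does not list the features that were actually used, so the statistics below (mean, standard deviation, range, and linear slope) and the names `contour_features` and `feature_vector` are purely hypothetical stand-ins.

```python
import statistics


def contour_features(values):
    """Summarise one contour as [mean, std, range, slope per frame]."""
    n = len(values)
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    value_range = max(values) - min(values)
    # Least-squares slope of the contour against the frame index.
    t_mean = (n - 1) / 2
    denom = sum((t - t_mean) ** 2 for t in range(n))
    if denom:
        slope = sum((t - t_mean) * (v - mean)
                    for t, v in enumerate(values)) / denom
    else:
        slope = 0.0
    return [mean, std, value_range, slope]


def feature_vector(pitch, intensity):
    """Fixed-length input for a feedforward network (8 values).

    Unvoiced frames (pitch == 0) are dropped before summarising pitch;
    this convention is an assumption of the sketch.
    """
    voiced = [p for p in pitch if p > 0] or [0.0]
    return contour_features(voiced) + contour_features(intensity)
```

For example, `feature_vector([0.0, 100.0, 110.0, 120.0, 130.0], [-30.0, -20.0, -20.0, -20.0, -20.0])` yields eight numbers regardless of how many frames the recording has, which is what a feedforward network requires; a recurrent model can instead consume the frame sequence itself.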
The conclusion is that the LFLB-LSTM model can distinguish between the six
languages as well as the sexes and the age groups solely using the pitch and intensity
extracted from the audio files. The FFNN model performed significantly worse than
the LFLB-LSTM model but still better than chance, potentially because it was not
well understood what in the data differentiated the categories from one another.
Further, it was concluded that it is essential to have sufficient variance in the
audio data both within the groups and between the groups. To capture this
successfully, it is advisable to use audio sources with a wide variety of genders,
ages, audio quality, and dialects, preferably from a large number of diverse
speakers in each group.