Speech Categorization with Prosodic Features and Deep Learning
dc.contributor.author | Davallius, Daniel | |
dc.contributor.author | Ingvarsson, Markus | |
dc.contributor.author | Ortheden, Julia | |
dc.contributor.author | Pettersson, Markus | |
dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | sv |
dc.contributor.examiner | Ahrendt, Wolfgang | |
dc.contributor.supervisor | Prasad, K V S | |
dc.date.accessioned | 2020-10-19T12:02:14Z | |
dc.date.available | 2020-10-19T12:02:14Z | |
dc.date.issued | 2019 | sv |
dc.date.submitted | 2020 | |
dc.description.abstract | The purpose of this thesis is to investigate whether it is possible to perform three separate categorizations of speech based only on pitch and intensity. Using these pitch and intensity curves, the goal is to distinguish between the spoken languages Swedish, Spanish, English, German, French, and Chinese, as well as to determine the sex and age group of the speaker, with the use of neural networks. The pitch and intensity were extracted from thousands of hours of audio files collected from the Swedish Riksdag and from LibriVox, a website with public-domain audiobooks. For categorizing age group and sex, only the audio files from the Swedish Riksdag were used, since these were the only files labelled with the speaker's sex and birth year. The categorization was performed using two different methods. The first was to extract several language-characteristic features from the pitch and intensity and use them as input data to multiple feedforward neural networks (the FFNN model), one or more per categorization. The second was to use the pitch and intensity curves directly as input data to multiple recurrent neural networks (the LFLB-LSTM model), again one or more networks per categorization. The conclusion is that the LFLB-LSTM model can distinguish between the six languages, as well as between the sexes and the age groups, solely from the pitch and intensity extracted from the audio files. The FFNN model performed significantly worse than the LFLB-LSTM model but still better than chance, potentially because of a lack of understanding of what in the data differentiated the categories from one another. Further, it was concluded that sufficient variance in the audio data, both within and between the groups, is essential. To capture this successfully, it is advisable to use audio sources with a wide variety of genders, ages, audio qualities, and dialects, preferably covering a large number of diverse speakers in each group. | sv |
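The abstract describes two pipelines: hand-picked prosodic features fed to feedforward networks (the FFNN model), and raw pitch and intensity curves fed to a recurrent LFLB-LSTM model. As a rough illustration of the second pipeline only, the Python sketch below extracts frame-level pitch (F0) and intensity (RMS energy) with librosa and builds a small convolutional-block-plus-LSTM classifier in Keras. The choice of libraries, layer sizes, hop length, and the exact composition of a local feature learning block (LFLB) are assumptions made for illustration, not details taken from the thesis.

import numpy as np
import librosa
import tensorflow as tf
from tensorflow.keras import layers

def pitch_and_intensity(path, sr=16000, hop_length=256):
    # Frame-level pitch (F0 via pYIN) and intensity (RMS energy) curves.
    y, sr = librosa.load(path, sr=sr)
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
        sr=sr, hop_length=hop_length)
    f0 = np.nan_to_num(f0)                        # unvoiced frames -> 0 Hz
    rms = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    n = min(len(f0), len(rms))
    return np.stack([f0[:n], rms[:n]], axis=-1)   # shape: (frames, 2)

def build_lflb_lstm(n_classes):
    # One Conv1D "local feature learning block" (conv + batch norm +
    # activation + pooling), assumed structure, followed by an LSTM
    # and a softmax classification head.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(None, 2)),          # variable-length (frames, 2)
        layers.Conv1D(64, 5, padding="same"),
        layers.BatchNormalization(),
        layers.Activation("elu"),
        layers.MaxPooling1D(4),
        layers.LSTM(128),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example: a six-way language classifier over pitch/intensity curves
# (padded_curves and language_labels are hypothetical prepared arrays).
# model = build_lflb_lstm(n_classes=6)
# model.fit(padded_curves, language_labels, epochs=10, batch_size=32)

Clips would need to be padded or truncated to a common frame count before batching, and the sex and age-group classifiers would follow the same pattern with a different number of output classes.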
dc.identifier.coursecode | DATX02 | sv |
dc.identifier.uri | https://hdl.handle.net/20.500.12380/301891 | |
dc.language.iso | eng | sv |
dc.setspec.uppsok | Technology | |
dc.title | Speech Categorization with Prosodic Features and Deep Learning | sv |
dc.type.degree | Bachelor's thesis | sv |
dc.type.uppsok | M2 |