Multi-task French speech analysis with deep learning: Emotion recognition and speaker diarization models for end-to-end conversational analysis tool

dc.contributor.author: Sintes, Jules
dc.contributor.department: Chalmers tekniska högskola / Institutionen för fysik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Physics (en)
dc.contributor.examiner: MIRKHALAF, Mohsen
dc.contributor.supervisor: DOAN NGUYEN, Nhut
dc.date.accessioned: 2023-07-03T11:05:10Z
dc.date.available: 2023-07-03T11:05:10Z
dc.date.issued: 2023
dc.date.submitted: 2023
dc.description.abstract: Automatic Speech Recognition has become a key application of deep learning and neural networks. Thanks to the development of new model architectures such as transformers, audio processing technologies such as speech-to-text, audio classification, and audio segmentation are now a crucial part of human-computer interaction systems and are widely used in commercial products. In addition, as models become more accurate and robust, interest is growing in emotion recognition systems that assist operators in their interactions with customers (or with patients, in a healthcare context). This thesis aims to improve the previous proof of concept and to develop speech emotion recognition and speaker diarization models for real-life data. First, for the speech emotion recognition task, we create a new conversational dataset in French based on real-life recordings from TV documentaries. It contains a wide range of speakers in various contexts, expressing a broad diversity of emotions. We conduct a comparative study of various approaches and models on our dataset and achieve state-of-the-art performance, beating pre-trained English-based benchmark models on real-life data while still achieving acceptable results on the RAVDESS benchmark dataset. Next, speaker diarization addresses the question "Who spoke when?" We conduct an in-depth comparative study of major open-source frameworks on chosen test cases, with an emphasis on optimizing accuracy along with inference time and hardware requirements. Finally, we implement the emotion recognition and speaker diarization models in an end-to-end conversational analysis tool, which generates a diarized text transcription of the conversational content, along with intensity and emotion recognition at the segment level for both text and audio. The tool also includes a zero-shot topic detection feature and can easily be extended with various other NLP tasks.
The web application can be used as a demonstration tool for business cases and showcases the scalability and flexibility of the proposed approach.
dc.identifier.coursecode: TIFX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/306535
dc.language.iso: eng
dc.setspec.uppsok: PhysicsChemistryMaths
dc.subject: deep learning, automatic speech recognition, speech emotion recognition, speaker diarization
dc.title: Multi-task French speech analysis with deep learning: Emotion recognition and speaker diarization models for end-to-end conversational analysis tool
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Biotechnology (MPBIO), MSc
Download
Original bundle
Name: Master_Thesis_Jules_Sintes.pdf
Size: 4.25 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon to submission