Multi-task French speech analysis with deep learning: Emotion recognition and speaker diarization models for end-to-end conversational analysis tool

Master's Thesis
Biotechnology (MPBIO), MSc
Sintes, Jules
Automatic Speech Recognition has become a key application of deep learning and neural networks. Thanks to the development of new model architectures such as transformers, audio processing technologies such as speech-to-text, audio classification, and audio segmentation are now a crucial part of human-computer interaction systems and are widely used in commercial products. In addition, as models become more accurate and robust, interest is growing in emotion recognition systems that assist operators in their interactions with customers (or with patients in a healthcare context). This thesis aims to improve on a previous proof of concept and to develop speech emotion recognition and speaker diarization models for real-life data. First, for the speech emotion recognition task, we create a new conversational dataset in French based on real-life recordings from TV documentaries. It contains a wide range of speakers in various contexts, expressing a wide diversity of emotions. We conduct a comparative study of various approaches and models on our dataset and achieve state-of-the-art performance, beating pre-trained English-based benchmark models on real-life data while still achieving acceptable results on the RAVDESS benchmark dataset. Next, speaker diarization addresses the question "Who spoke when?" We conduct an in-depth comparative study of major open-source frameworks on chosen test cases, with an emphasis on optimizing accuracy along with inference time and hardware requirements. Finally, we integrate the emotion recognition and speaker diarization models into an end-to-end conversational analysis tool, which generates a diarized text transcription of the conversation, along with segment-level intensity and emotion recognition for both text and audio. The tool also includes a zero-shot topic detection feature and can easily be extended with other NLP tasks.
The web application can be used as a demonstration tool for business cases and showcases the scalability and flexibility of the proposed approach.
deep learning, automatic speech recognition, speech emotion recognition, speaker diarization.