Multi-task French speech analysis with deep learning: Emotion recognition and speaker diarization models for end-to-end conversational analysis tool
Type
Master's Thesis
Program
Biotechnology (MPBIO), MSc
Published
2023
Author
Sintes, Jules
Abstract
Automatic speech recognition has become a key application of deep learning and
neural networks. Thanks to new model architectures such as transformers, audio
processing technologies such as speech-to-text, audio classification, and audio
segmentation are now a crucial part of human-computer interaction systems and are
widely used in commercial products. In addition, as models become more accurate
and robust, interest is growing in emotion recognition systems that assist operators
in their interactions with customers (or patients, in the context of healthcare).
This thesis aims to improve the previous proof of concept and to develop speech
emotion recognition and speaker diarization models for real-life data.
First, for the speech emotion recognition task, we create a new French-language
conversational dataset based on real-life recordings from TV documentaries. It contains
a large number of speakers in various contexts, expressing a wide diversity of emotions.
We conduct a comparative study of various approaches and models on our
dataset and achieve state-of-the-art performance, outperforming pre-trained English-based
benchmark models on real-life data while still achieving acceptable results on the
RAVDESS benchmark dataset.
Next, speaker diarization answers the question "Who spoke when?" We
conduct an in-depth comparative study of major open-source frameworks on selected
test cases, with an emphasis on optimizing accuracy along with inference time and
hardware requirements.
Finally, we implement the emotion recognition and speaker diarization models in an
end-to-end conversational analysis tool, which generates a diarized text transcription
of the conversational content, along with intensity and emotion recognition at the
segment level for both text and audio. The tool also includes a zero-shot topic
detection feature, which can be easily extended with various other NLP tasks. The
web application can be used as a demonstration tool for business cases and showcases
the scalability and flexibility of the proposed approach.
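As an illustration of how a diarized transcription with segment-level emotion labels might be assembled, the following is a minimal sketch and not the implementation used in the thesis. It assumes that diarization yields speaker turns with start and end times, that transcription yields timed text segments already carrying an emotion label, and that each segment is attributed to the speaker whose turn overlaps it the most. The names `Turn`, `Segment`, `assign_speakers`, and `format_segment` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    """A diarization turn: one speaker talking from start to end (seconds)."""
    speaker: str
    start: float
    end: float


@dataclass
class Segment:
    """A transcribed segment with an emotion label (hypothetical schema)."""
    text: str
    start: float
    end: float
    emotion: Optional[str] = None
    speaker: Optional[str] = None


def overlap(seg: Segment, turn: Turn) -> float:
    """Duration in seconds shared by a segment and a turn (0.0 if disjoint)."""
    return max(0.0, min(seg.end, turn.end) - max(seg.start, turn.start))


def assign_speakers(segments: List[Segment], turns: List[Turn]) -> List[Segment]:
    """Label each segment with the speaker of the most-overlapping turn.

    Segments that overlap no turn keep speaker=None.
    """
    for seg in segments:
        best = max(turns, key=lambda t: overlap(seg, t), default=None)
        if best is not None and overlap(seg, best) > 0.0:
            seg.speaker = best.speaker
    return segments


def format_segment(seg: Segment) -> str:
    """Render one line of the diarized transcript."""
    speaker = seg.speaker or "unknown"
    emotion = seg.emotion or "n/a"
    return f"[{seg.start:.1f}-{seg.end:.1f}] {speaker} ({emotion}): {seg.text}"


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    turns = [Turn("A", 0.0, 5.0), Turn("B", 5.0, 10.0)]
    segments = [
        Segment("Bonjour", 0.5, 4.0, emotion="neutral"),
        Segment("Merci", 4.8, 9.0, emotion="joy"),
    ]
    for seg in assign_speakers(segments, turns):
        print(format_segment(seg))
```

Maximum-overlap assignment is only one simple way to align the two outputs. Segments that straddle a speaker change could instead be split at the turn boundary.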
Subject/keywords
deep learning, automatic speech recognition, speech emotion recognition, speaker diarization.