Multi-task French speech analysis with deep learning: Emotion recognition and speaker diarization models for end-to-end conversational analysis tool

Sintes, Jules

Download

Master_Thesis_Jules_Sintes.pdf (4.25 MB)

Published

2023

Author

Sintes, Jules

Type

Master's Thesis

Program

Biotechnology (MPBIO), MSc

Abstract

Automatic speech recognition has become a key application of deep learning and neural networks. Thanks to the development of new model architectures such as transformers, audio processing technologies such as speech-to-text, audio classification, and audio segmentation are now a crucial part of human-computer interaction systems and are widely used in commercial products. In addition, as models become more accurate and robust, interest is growing in emotion recognition systems that assist operators in their interactions with customers (or with patients in a healthcare context). This thesis aims to improve on the previous proof of concept and to develop speech emotion recognition and speaker diarization models for real-life data. First, for the speech emotion recognition task, we create a new conversational dataset in French based on real-life recordings from TV documentaries. It contains a wide range of speakers in various contexts, expressing a wide diversity of emotions. We conduct a comparative study of various approaches and models on our dataset and achieve state-of-the-art performance, outperforming pre-trained English-based benchmark models on real-life data while still achieving acceptable results on the RAVDESS benchmark dataset. Next, speaker diarization addresses the question "Who spoke when?" We conduct an in-depth comparative study of major open-source frameworks on chosen test cases, with an emphasis on optimizing accuracy together with inference time and hardware requirements. Finally, we integrate the emotion recognition and speaker diarization models into an end-to-end conversational analysis tool, which generates a diarized text transcription of the conversational content, along with intensity and emotion recognition at the segment level for both text and audio. The tool also includes a zero-shot topic detection feature, which can be easily extended with various other NLP tasks.
The web application can be used as a demonstration tool for business cases and showcases the scalability and flexibility of the proposed approach.
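
As an illustration of what a "diarized text transcription" involves, the sketch below attaches a speaker label to each transcribed segment by choosing the speaker whose diarization turns overlap it the most in time. This is only a minimal sketch under stated assumptions, not the method used in the thesis: the `Turn` and `Segment` types, the `assign_speakers` function, the `UNKNOWN` fallback label, and the example timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One diarization turn: a speaker label and a time span in seconds (hypothetical type)."""
    speaker: str
    start: float
    end: float


@dataclass
class Segment:
    """One transcribed segment: text and a time span in seconds (hypothetical type)."""
    text: str
    start: float
    end: float


def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns, unknown="UNKNOWN"):
    """Label each segment with the speaker whose turns overlap it the most."""
    labelled = []
    for seg in segments:
        # Sum the overlap per speaker, since one speaker may have several turns.
        totals = {}
        for turn in turns:
            ov = overlap(seg.start, seg.end, turn.start, turn.end)
            if ov > 0.0:
                totals[turn.speaker] = totals.get(turn.speaker, 0.0) + ov
        speaker = max(totals, key=totals.get) if totals else unknown
        labelled.append((speaker, seg.text))
    return labelled


# Made-up example data, for illustration only.
turns = [Turn("SPEAKER_00", 0.0, 4.0), Turn("SPEAKER_01", 4.0, 9.0)]
segments = [
    Segment("Bonjour, comment allez-vous ?", 0.5, 3.5),
    Segment("Bien, merci.", 3.8, 6.0),
    Segment("(no speech detected)", 10.0, 11.0),
]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Segment-level outputs such as an emotion or intensity label could be attached to the same records in a similar way, although how the thesis tool combines them is not specified here.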

Subject/keywords

deep learning, automatic speech recognition, speech emotion recognition, speaker diarization.

URI

http://hdl.handle.net/20.500.12380/306535

Collections

Master's theses
