Transfer learning for domain specific automatic speech recognition in Swedish: An end-to-end approach using Mozilla’s DeepSpeech

Type

Master's thesis (Examensarbete för masterexamen)

Abstract

Modern systems for automatic speech recognition (ASR), powered by artificial neural networks, form the core of many day-to-day products and services. This development has largely been driven by the tech giants of the industry, which has naturally led to a strong focus on models for the English language. For the Swedish language, in contrast, there exists little research and there are no open, available models. This lack of open models means that products and services relying on Swedish ASR must depend on third-party commercial solutions, which becomes an issue when privacy, integrity, and cost are taken into account. Moreover, in specialised domains such as health care, creating a new model from scratch might not be feasible due to a lack of data. The goal of this thesis has been to explore how recent research in ASR can be applied to the Swedish language. Recent papers have proposed transfer learning as a technique for developing ASR models for languages where sufficient training data is lacking. In this thesis we use the same technique to create a new state-of-the-art model for Swedish ASR, comparing against previous research on Swedish ASR as well as commercial solutions. Additionally, we explore whether transfer learning can be utilised to achieve even better ASR in specialised domains. To this end, the NST Acoustic Database for Swedish has been used to train a model based on Mozilla's DeepSpeech, and two domain-specific datasets have been created as part of the thesis to explore whether they can be used to fine-tune the general Swedish model for certain domains. The resulting general model for Swedish ASR achieves a new state-of-the-art result on the test part of the NST Acoustic Database for Swedish, with a 13.80% word error rate and a 4.78% character error rate.
Additionally, we show that transfer learning can improve results in specialised domains, with on average 12% lower word error rate and 6% lower character error rate compared to the general Swedish model. We conclude that recent research in ASR also applies to the Swedish language, and we reaffirm that transfer learning is a powerful technique for creating new ASR models based on existing ones, both for new languages and for specialised domains, with little extra effort in terms of data and resources.
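The word error rate (WER) and character error rate (CER) reported above are both normalised Levenshtein edit distances, computed over word sequences and character sequences respectively. A minimal sketch of how such metrics can be computed (function names are illustrative, not taken from the thesis or from DeepSpeech):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))  # dp[j] = distance between ref[:i] and hyp[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i  # prev holds the diagonal (i-1, j-1) value
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution / match
            )
            prev = cur
    return dp[n]

def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, comparing the reference "hej på dig" against the hypothesis "hej dig" gives a WER of 1/3: one deleted word out of three reference words.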

Subject / keywords

automatic speech recognition, speech-to-text, transfer learning, artificial neural networks, end-to-end, DeepSpeech, specialised domains, Swedish
