Transfer learning for domain specific automatic speech recognition in Swedish: An end-to-end approach using Mozilla’s DeepSpeech
Type
Master's thesis
Abstract
Modern systems for automatic speech recognition (ASR), powered by artificial neural
networks, form the core of many day-to-day products and services. This development
has largely been driven by the industry's tech giants, which has naturally led to a
strong focus on models for the English language. For Swedish, by contrast, there is
little research, and no openly available models exist. As a consequence, products and
services relying on Swedish ASR must depend on third-party commercial solutions,
which becomes an issue when privacy, integrity, and cost are taken into account.
Moreover, in specialised domains such as health care, creating a new model from
scratch may not be feasible due to a lack of data.
The goal of this thesis is to explore how recent research in ASR can be applied
to the Swedish language. Transfer learning has recently been proposed as a
technique for developing ASR models for languages that lack sufficient training
data. We use the same technique to create a new state-of-the-art model for
Swedish ASR, comparing it against previous research on Swedish ASR as well as
commercial solutions. Additionally, we explore whether transfer learning can be
successfully utilised to achieve even better ASR in specialised domains.
To achieve the aim of the thesis, the NST Acoustic Database for Swedish has been
used to train a model based on Mozilla's DeepSpeech. Additionally, two domain-specific
datasets have been created as part of the thesis to explore whether they can be
used to fine-tune the general Swedish ASR model for certain domains.
The resulting general model for Swedish ASR achieves a new state-of-the-art result
on the test partition of the NST Acoustic Database for Swedish, with a 13.80% word
error rate and a 4.78% character error rate. Additionally, we show that transfer
learning can improve results in specialised domains, with on average 12% lower word
error rate and 6% lower character error rate compared to the general Swedish model.
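The two metrics reported above can be illustrated with a minimal sketch: word error rate (WER) and character error rate (CER) are both the Levenshtein (edit) distance between a reference transcript and the model's hypothesis, normalised by the reference length, computed over words and characters respectively. The function names and the example transcripts below are illustrative, not taken from the thesis.

```python
def edit_distance(ref, hyp):
    """Classic dynamic-programming Levenshtein distance over sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance over word sequences."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: edit distance over character sequences."""
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical reference/hypothesis pair: one substituted word ("pa" for "på")
print(round(wer("hej på dig", "hej pa dig"), 3))  # -> 0.333
print(round(cer("hej på dig", "hej pa dig"), 3))  # -> 0.1
```

A reported 13.80% WER thus means that, on average, about 14 word-level edits were needed per 100 reference words to turn the model's output into the ground-truth transcript.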
We conclude that recent research in ASR also applies to the Swedish language. We
reaffirm that transfer learning is a powerful technique for creating new ASR models
based on existing ones, both for new languages and for specialised domains, with
little extra effort in terms of data and resources.
Keywords
automatic speech recognition, speech-to-text, transfer learning, artificial neural networks, end-to-end, DeepSpeech, specialised domains, Swedish