Brachycephalic Obstructive Airway Syndrome (BOAS) classification in dogs based on respiratory noise analysis using machine learning

Moa Mårtensson
Master Thesis in Biomedical Engineering
February 2021

Department of Physics
Chalmers University of Technology
Gothenburg, Sweden 2021

Supervisors: Magnus Karlsteen, Department of Physics, Chalmers University of Technology; Eva Skiöldebrand, Swedish University of Agricultural Sciences
Examiner: Magnus Karlsteen, Department of Physics, Chalmers University of Technology

© Moa Mårtensson, 2021
Typeset in LaTeX
Printed by Chalmers Reproservice, Gothenburg, Sweden 2021

Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
+46 31 772 1000

Abstract

Brachycephalic Obstructive Airway Syndrome (BOAS) is a problem in several dog breeds due to a compressed shape of the skull. It is classified as BOAS grade 0-3, where 0 is normal breathing and 3 is the most severe grade of the syndrome. Grades 2-3 can cause great suffering for the affected dogs and need treatment. This study aimed to find a machine learning method to classify the BOAS grade based on audio recordings of respiratory noise. The recordings were converted into Mel-Frequency Cepstral Coefficients (MFCCs) to be processed as images by the network. The results showed that a Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM) successfully classified the four BOAS grades, with an accuracy of about 86-87% for dictaphone recordings and about 62-66% for stethoscope recordings. Convolutional Neural Networks (CNNs) also managed to classify the BOAS grades, but this method was less accurate, with an accuracy of approximately 74-76% for dictaphone recordings and 50-54% for stethoscope recordings. The study was a collaboration between Chalmers University of Technology and the Swedish University of Agricultural Sciences.

Keywords: Brachycephalic Obstructive Airway Syndrome, BOAS, Machine learning, Convolutional Neural Network, CNN, Mel-Frequency Cepstral Coefficients, MFCC, Recurrent Neural Network, Long Short-Term Memory, RNN-LSTM, Respiratory noise analysis

Acknowledgements

This study is a collaboration between Chalmers University of Technology and the Swedish University of Agricultural Sciences. I would like to thank the team at the Swedish University of Agricultural Sciences: Eva, Ingrid and Maria. Thank you for your expertise in the field of veterinary medicine and for a fun cooperation!

Thanks to all dog owners and dogs who were willing to take the time and effort to participate in this study. Without you this project would not have been possible!

The greatest thanks possible I want to give to Magnus, who has been the best supervisor anyone could ask for! Thank you for helping me with Python, the report and everything else, and for being available at all hours of the day. Thank you for your support, dedication and patience!
Abbreviations

BOAS - Brachycephalic Obstructive Airway Syndrome
CNN - Convolutional Neural Network
ET - Exercise Test
LSTM - Long Short-Term Memory
MFCCs - Mel-Frequency Cepstral Coefficients
RNN - Recurrent Neural Network

Medical terms

Apnoea - respiratory arrest
Brachycephalic - short skull
Dyspnoea - difficulty breathing
Hyperplasia - overgrowth of tissue
Hypoplasia - underdevelopment of tissue
Larynx - voice box, the part of the respiratory tract above the trachea
Nasopharynx - the rear part of the nasal cavity above the soft palate
Regurgitation - reflux
Rhinoplasty - surgery to change the shape of the nose
Staphylectomy - removal of a part of the soft palate
Stenosis - narrowing
Stertor - a low-pitched snoring sound during inspiration
Stridor - a high-pitched wheezing sound from the laryngeal area
Trachea - windpipe

Contents

1 Introduction
  1.1 Aim
  1.2 Brachycephalic Obstructive Airway Syndrome (BOAS)
2 Methods
  2.1 Audio recording
  2.2 Preprocessing the signal
    2.2.1 Mel-Frequency Cepstral Coefficients (MFCCs)
  2.3 Machine Learning
    2.3.1 Overfitting
    2.3.2 Convolutional Neural Networks (CNN)
    2.3.3 Recurrent Neural Networks (RNN)
    2.3.4 Recurrent Neural Network - Long Short-Term Memory (RNN-LSTM)
3 Results
  3.1 CNN
  3.2 RNN-LSTM
    3.2.1 Four BOAS classes
    3.2.2 Three BOAS classes
    3.2.3 Two BOAS classes
4 Discussion
  4.1 Potential sources of error
  4.2 Future work
5 Conclusions

1 Introduction

All dog breeds descend from the wolf [1]. This can be hard to imagine given the variety of shapes and sizes of dogs today. But what consequences does excessive breeding to promote a certain physical attribute, instead of good health and mentality, have for the affected dogs?

1.1 Aim

The aim of this project is to find a method using machine learning to classify Brachycephalic Obstructive Airway Syndrome (BOAS) grades in brachycephalic dogs based on respiratory noises.

1.2 Brachycephalic Obstructive Airway Syndrome (BOAS)

Dog breeds with flat faces and short noses, such as pugs and bulldogs, are called brachycephalic dogs, which refers to the compressed shape of the skull [2]. Many of these dogs suffer from Brachycephalic Obstructive Airway Syndrome (BOAS). Dogs with BOAS have anatomical deviations such as nostril stenosis, nasopharyngeal hyperplasia, tracheal hypoplasia and an elongated and thickened soft palate that may obstruct parts of the airways [3]. Figure 1 presents normal nostrils and nostril stenosis.

Figure 1: Normal nostrils to the left and nostril stenosis to the right. Photo: Moa Mårtensson

Figure 2 presents normal upper airway anatomy in dogs.
Figure 2: Normal upper airway anatomy in dogs. Image from Hill's Atlas of Veterinary Clinical Anatomy, used with permission.

BOAS results in symptoms such as snoring, inspiratory dyspnoea, sleep apnoea, regurgitation/vomiting and complications related to anaesthesia [3]. The symptoms from the upper airways increase during and after exercise, at high temperatures, when the dog is stressed or excited, and with overweight [3].

With a stethoscope a veterinarian can auscultate abnormal respiratory sounds such as stertor and stridor. Stertor is a low-pitched noise usually caused by an elongated soft palate or nasopharyngeal obstruction. Stridor is a high-pitched noise caused by a compromised or collapsed laryngeal area. The presence of stridor is associated with a higher BOAS grade [3].

BOAS is graded from 0 to 3, where 0 is no respiratory noises and 3 is severe respiratory noises. Generally, grades 0-1 do not affect the dog, whereas grades 2-3 are a problem that affects the dog's quality of life and requires treatment [4]. Treatment for BOAS is usually weight loss and/or surgery [4]. Surgery aims to correct some of the airway malformations so the dog can breathe better. Airway surgery may include procedures such as staphylectomy (shortening of the soft palate) and/or rhinoplasty (nostril enlargement) [3] [2].

Extreme brachycephalic breeds have a shorter life expectancy than other breeds of the same size. They also have a higher risk of dying from respiratory-related causes [4].

2 Methods

The programming language used is Python version 3.8 with additional libraries such as TensorFlow, Librosa and Keras. The code is inspired by Valerio Velardo's series of tutorials on YouTube called "The Sound of AI", which classifies music genres. The code has been modified to accommodate BOAS grading of respiratory noises from dogs.

2.1 Audio recording

Recording of the dogs' respiratory noises was performed during September 21st-25th 2020 at the University Animal Hospital in Uppsala, Sweden. All participants needed to be at least one year old, and their owners had to sign a consent form and answer some questions, including medical history. There were 41 dogs participating: 24 of them were brachycephalic dogs and 17 were reference dogs. The brachycephalic dogs consisted of eleven pugs, eleven French bulldogs, one English bulldog and one Boston terrier. All dogs were examined by veterinarian Maria Dimopoulou, an EBVS European Specialist in Small Animal Surgery, and graded according to the BOAS scale from 0 to 3. The nostrils were photographed, and respiratory patterns were video recorded if needed during the analysis process. The respiratory sounds at rest were recorded with a 3M Littmann electronic stethoscope model 3200 placed on the side of the larynx. Simultaneously, an Olympus linear PCM recorder LS-P1 dictaphone was placed 10-15 cm from the dog's mouth and nose. The recordings were carried out for 30 s. Thereafter the dogs performed a three-minute exercise test (ET) where they ran back and forth with a handler. Immediately after, they were recorded again in the same way for another 30 s. The Littmann stethoscope was connected via Bluetooth to the software StethAssist, where the recordings could be saved and exported.

2.2 Preprocessing the signal

The audio recordings were manually edited into 30 s wav files, and major distortions and errors were removed. The software used for this was BandLab. An attempt was made to amplify the signal from the stethoscope recordings due to low volume. The amplification created a lot of distortion and was hence not used.
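For illustration, a minimal sketch of this kind of gain adjustment is shown below. The actual editing was done in BandLab, so this Python equivalent is only an assumed workflow and the file name is hypothetical; it also shows one reason a simple gain distorts the signal, namely that samples pushed outside the valid range are clipped on export.

import librosa
import numpy as np
import soundfile as sf

# load a quiet stethoscope recording (hypothetical file name)
signal, sr = librosa.load("littmann_rest.wav", sr=22050)

gain = 10.0
amplified = signal * gain

# samples outside [-1, 1] are clipped when the file is written,
# which distorts the waveform
clipped_fraction = np.mean(np.abs(amplified) > 1.0)
print("Fraction of clipped samples: {:.2%}".format(clipped_fraction))

sf.write("littmann_rest_amplified.wav", np.clip(amplified, -1.0, 1.0), sr)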
2.2.1 Mel-Frequency Cepstral Coefficients (MFCCs)

To convert the audio signal into a time-frequency domain image representation that the machine learning network can interpret, MFCCs are used [5] [6]. MFCCs are widely used in signal processing for sound recognition [7] [6] and for classification, to find similarities in or between signals. Human hearing sensitivity differs between frequencies, and MFCCs can mimic the non-linear frequency characteristics of human hearing [6]. The MFCCs are the inverse discrete cosine transform of the logarithmic energy in the mel-frequency bands of the signal [7] [5], as presented in equation (1):

$$\mathrm{mfcc}_c = \sqrt{\frac{2}{M_{\mathrm{mfcc}}}} \sum_{m=1}^{M_{\mathrm{mfcc}}} \log\!\left(X_m(t)\right) \cos\!\left(\frac{c\left(m - \tfrac{1}{2}\right)\pi}{M_{\mathrm{mfcc}}}\right) \qquad (1)$$

where $M_{\mathrm{mfcc}}$ is the number of mel-frequency bands, $m$ is the index of the mel-frequency band, $X_m(t)$ is the energy of the $m$-th band and $c$ is the index of the coefficient [5].

Figure 3 presents the waveform of a 30 s stethoscope recording graded BOAS 3. The x-axis represents time and the y-axis represents amplitude. This waveform was transformed into an MFCC spectrogram for the machine learning process.

Figure 3: The waveform of a 30 s stethoscope recording graded BOAS 3. Image created by Moa Mårtensson.

For visualisation purposes, Figure 4 presents a Mel spectrogram. The x-axis represents time and the y-axis represents frequency. The colorbar represents the amplitude in dB. The image suggests that the frequency as well as the amplitude are relatively low.

Figure 4: A Mel spectrogram of a 30 s stethoscope recording graded BOAS 3. Image created by Moa Mårtensson.

Figure 5 presents an MFCC spectrogram with 39 coefficients. The x-axis represents time and the y-axis represents the coefficients. The color bar to the right represents the values of the coefficients.

Figure 5: The MFCCs of a 30 s stethoscope recording graded BOAS 3, with 39 coefficients. Image created by Moa Mårtensson.

A Python script divided the audio files into 3 s segments, created 39 MFCCs for each segment and saved them all into a json file that was later used for machine learning. The code for creating the MFCCs can be found in Appendix I.
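As a condensed sketch of that pipeline (the full version is in Appendix I), the segmentation and MFCC extraction can be expressed with librosa as follows; the file name is hypothetical:

import librosa
import numpy as np

SAMPLE_RATE = 22050
signal, _ = librosa.load("recording.wav", sr=SAMPLE_RATE)  # hypothetical file name

# split the 30 s signal into 3 s segments
segment_len = 3 * SAMPLE_RATE
segments = [signal[i:i + segment_len]
            for i in range(0, len(signal) - segment_len + 1, segment_len)]

# 39 MFCCs per segment, with the same FFT settings as in Appendix I
mfccs = [librosa.feature.mfcc(y=seg, sr=SAMPLE_RATE, n_mfcc=39,
                              n_fft=2048, hop_length=512).T
         for seg in segments]
print(np.shape(mfccs))  # (number of segments, frames per segment, 39)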
2.3 Machine Learning

Deep learning is a type of machine learning method that contains multiple levels of nonlinear operations with hidden layers in a neural network. Deep learning can discover complex relationships within datasets using algorithms that keep relevant information from previous layers. It is used for classification purposes since its advanced functions can learn to differentiate between response classes [8].

The initial task was to transform relevant information from the audio recordings into representative images; in this project those were the MFCC images. Then a model was created and trained using the MFCCs. After the training, the model needed to be evaluated to know how well it worked. An unbalanced number of outcomes between the class labels during training can cause problems when measuring the accuracy of the model [9].

There are different types of machine learning problems. If the output is numeric, it is called a regression problem. If the output is a class label, it is a classification problem. The number of labels decides the subcategory: a yes-or-no problem with only two possible options is called a binary classification problem, while more than two classes makes it a multi-class classification problem. There is also multi-label classification, where a sample belongs to multiple classes [9].

2.3.1 Overfitting

Overfitting is when the model is overtrained on the training dataset. It is then unable to generalize to the test dataset and therefore performs poorly [10] [11]. It can be suspected if the training dataset has a significantly higher accuracy than the test dataset. This can be a result of a too small dataset [10].

Common techniques for limiting overfitting are simplifying the network architecture, regularization, collecting more data and data augmentation. Unfortunately, there is no universal solution for overfitting, so testing different techniques and evaluating the results is the way to handle it [11].

Simplifying the model can consist of reducing the number of layers and the number of neurons in the layers. This is a time-consuming strategy since there are countless combinations to try and evaluate [11].

There are dozens of regularization techniques. The most common ones are L1, L2, dropout and early stopping. L1 and L2 are weight penalties that reduce the network's capacity to adapt to the dataset; this is done by adding a term to the cost function [11]. Dropout randomly removes neurons from the network in each epoch of training [10] [11]. Randomly removing units creates a smaller network where the weights are smaller and distributed across the predictors in the model [8]. Early stopping reduces the number of iterations/epochs during training so the model stops before it has overtrained on the dataset [10].

Collecting more data is probably the easiest way to reduce overfitting, but in reality it may not always be possible [11].

Data augmentation has the goal of generating additional data. It is the same data, only modified before being added, so that it is unrecognizable to the model. Augmentation for an image can consist of, for example, rotating, shifting and stretching [11].

To address the overfitting in this project, simplifying the model and the regularization methods L1 and L2, dropout and early stopping were implemented, as sketched below. These actions had inadequate results and the overfitting persisted. Gathering more data was unfortunately not an option. Data augmentation was not implemented due to time limitations, but would be very interesting as part of future work.
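A minimal sketch of how these regularization techniques look in the Keras API used in this project; the layer sizes here are illustrative, not the project's final architecture:

import tensorflow.keras as keras

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu",
                       kernel_regularizer=keras.regularizers.l2(0.001)),  # L2 penalty on the weights
    keras.layers.Dropout(0.3),  # randomly drop 30% of the units each update
    keras.layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# early stopping: halt training when the validation loss stops improving
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=20,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_val, y_val),
#           epochs=1000, callbacks=[early_stop])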
2.3.2 Convolutional Neural Networks (CNN)

CNNs are a type of artificial neural network inspired by the neurons in the visual cortex of the human brain. This makes them suitable for artificial vision and image analysis. CNNs mainly consist of convolutional and pooling layers. The convolutional layers perform a mathematical operation called convolution, which in practice means applying a filter matrix to the input. This makes it possible to detect and learn shapes, patterns and other features in the input. The pooling layers down-sample the input by reducing its dimension, which makes the network more robust [9]. The code for the CNN used in this project is presented in Appendix II.

2.3.3 Recurrent Neural Networks (RNN)

RNN is a method for processing sequential data. It passes forward what was learned from the previous time step (the output) as an additional input to the next time step. This way it attempts to predict the output from the history of previous input data [12] [8]. It uses feed-forward neural networks with cyclic connections. In the network there are three main connection types that are very important: input to hidden layer, hidden to hidden layer and hidden to output layer. The weights of these connections are represented by different matrices that are processed into a scalar value that is classified as a binary variable. The loss function then compares the predicted binary variable to the actual label [8].

The RNN architecture has some limitations. Even though it was created to learn long-term dependencies, it does not do this very well due to problems called vanishing and exploding gradients: the weights in the network become extremely small or large during training, because the error signal can only be traced back a few steps, and the weight deviations hence increase exponentially over time. To circumvent this problem a variant of the RNN model was created: Long Short-Term Memory (LSTM) [8] [12].

2.3.4 Recurrent Neural Network - Long Short-Term Memory (RNN-LSTM)

RNN-LSTM avoids vanishing and exploding gradients by remembering important information from previous steps while eliminating information that is unnecessary for future steps. The model is able to remember context and dependencies over long periods of time [12].

RNN-LSTM has an architecture built on connected sub-networks called memory blocks. The memory blocks recall input [8]. They contain accumulator cells and three different types of gates: input gate, forget gate and output gate. The gates are multiplication units that can store and access information [8] [12]. Each gate can learn which inputs are useful for predicting the outputs. The network passes input information forward, back-propagates the error and adjusts the weights [12].

In Figure 6 the architecture of the RNN-LSTM network used in this project is presented. The complete code is presented in Appendix III.

Figure 6: Architecture of the RNN-LSTM network. Image created by Magnus Karlsteen.

To find the optimum number of neurons in each layer, a programming loop was created that tested different combinations. The same method was used to evaluate the most suitable learning rate and the best combinations of activation functions. To find other optimum hyperparameters, such as the number of layers and the mini-batch size, different combinations were tested manually. Based on the results from these procedures, the activation functions hard sigmoid and softmax were used. The numbers of neurons were set to 1024, 256 and 16 in the different layers. The dataset was divided into 55% training data, 25% test data and 20% validation data. For the dropout layers the fraction of units dropped was set to 0.3. The learning rate optimizer used for compiling the model was adaptive moments, or Adam, and the learning rate was set to $1 \times 10^{-5}$. The mini-batch size is the number of training set samples processed in each iteration; a large mini-batch size takes a long time per iteration, while a small mini-batch size may not reach the local minimum [8]. The mini-batch size used in this network was 32 and the number of epochs/iterations was 1000.
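The search loop itself is not reproduced in the appendices; below is a sketch of its assumed structure, sweeping layer sizes and learning rates and keeping the combination with the best validation accuracy. The candidate values and the reduced model are illustrative.

import itertools
import tensorflow.keras as keras

def build_lstm(n1, n2, input_shape, num_classes=4):
    # reduced version of the Appendix III model, without regularization
    return keras.Sequential([
        keras.layers.LSTM(n1, input_shape=input_shape, return_sequences=True),
        keras.layers.LSTM(n2),
        keras.layers.Dense(num_classes, activation="softmax"),
    ])

def search(X_train, y_train, X_val, y_val):
    best_acc, best_params = 0.0, None
    for n1, n2, lr in itertools.product([512, 1024], [128, 256], [1e-4, 1e-5]):
        model = build_lstm(n1, n2, X_train.shape[1:])
        model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr),
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(X_train, y_train, validation_data=(X_val, y_val),
                  epochs=50, batch_size=32, verbose=0)
        _, acc = model.evaluate(X_val, y_val, verbose=0)
        if acc > best_acc:
            best_acc, best_params = acc, (n1, n2, lr)
    return best_acc, best_params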
3 Results

The 41 participants were graded by veterinarian Maria Dimopoulou, an EBVS European Specialist in Small Animal Surgery, into BOAS classifications; the distribution is presented in Table 1.

Table 1: The distribution of BOAS grades among the participants.

                        BOAS 0   BOAS 1   BOAS 2   BOAS 3
  Pugs                     2        5        4        0
  French bulldogs          0        5        2        4
  English bulldogs         0        0        1        0
  Boston terriers          0        1        0        0
  Reference dogs          17        0        0        0
  Total number of dogs    19       11        7        4

3.1 CNN

When the CNN method is used, the model has an average accuracy ranging from about 50% to 76% for four BOAS classes, as can be seen in Table 2.

Table 2: Classification accuracy for the different recording types using four BOAS classes and CNN.

                     Littmann Before ET   Littmann After ET   Olympus Before ET   Olympus After ET
  Highest accuracy        57.10%               62.90%              80.50%             81.30%
  Lowest accuracy         41.00%               42.90%              66.90%             69.10%
  Average accuracy        49.53%               53.67%              73.73%             75.57%

When BOAS 2 and 3 are combined into one class, because of the limited data in these classes, three BOAS classes can be evaluated; this is done in Table 3. A slight improvement can be seen, as the model's average accuracy varies from about 53% to 79%.

Table 3: Classification accuracy for the different recording types using three BOAS classes and CNN.

                     Littmann Before ET   Littmann After ET   Olympus Before ET   Olympus After ET
  Highest accuracy        58.10%               68.60%              78.00%             86.20%
  Lowest accuracy         46.70%               57.10%              68.60%             69.90%
  Average accuracy        53.33%               64.13%              73.88%             78.85%

3.2 RNN-LSTM

The RNN-LSTM method is used in the following sections.

3.2.1 Four BOAS classes

In Table 4 the model's classification accuracy is presented after 20 tries in each recording category, with four BOAS classes: 0, 1, 2 and 3. The average accuracy varies from 62% for Littmann After ET to almost 87% for Olympus After ET, which is a significant improvement compared to using CNN.

Table 4: Classification accuracy for the different recording types using four BOAS classes and RNN-LSTM.

                     Littmann Before ET   Littmann After ET   Olympus Before ET   Olympus After ET
  Highest accuracy        77.10%               69.50%              92.40%             94.30%
  Lowest accuracy         59.00%               52.40%              81.40%             81.30%
  Average accuracy        66.39%               61.80%              86.28%             86.92%

In Figures 7-8, 10-11 and 13-14, confusion matrices illustrate the distribution of classifications for each 3 s audio segment. The observant reader may notice that the matrices have a total of 101 test samples; this is due to Python crashing if 100 was used, which would otherwise have corresponded directly to percent. The matrices present the classification accuracy for each BOAS class, as opposed to the tables, which show the classification accuracy of the entire model. The x-axis represents the predicted BOAS class and the y-axis represents the true BOAS class. The colorbar shows lighter colors for higher values; a light diagonal from the top left to the bottom right with dark sides indicates a successful classification. The accuracy for each BOAS class is calculated from equation (2):

$$\mathrm{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives for the class. The weighted overall precision for each matrix is calculated using equation (3):

$$\mathrm{precision} = \sum_{i=1}^{C} \frac{\mathrm{Samples}_i}{\mathrm{TotalSamples}} \cdot \frac{TP_i}{TP_i + FP_i} \qquad (3)$$

where $i$ represents the rows of the matrix and $C$ is the number of classes.
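To make equations (2) and (3) concrete, the sketch below computes both metrics from a confusion matrix whose rows are true classes and columns are predicted classes, as in the figures that follow. The counts are hypothetical, not the thesis results:

import numpy as np

cm = np.array([[20, 3, 1, 0],    # hypothetical counts
               [2, 18, 3, 1],
               [1, 2, 15, 3],
               [0, 1, 2, 12]])

def class_accuracy(cm, i):
    # equation (2) for class i
    tp = cm[i, i]
    fp = cm[:, i].sum() - tp
    fn = cm[i, :].sum() - tp
    tn = cm.sum() - tp - fp - fn
    return (tp + tn) / cm.sum()

def weighted_precision(cm):
    # equation (3): precision per class, weighted by class frequency
    weights = cm.sum(axis=1) / cm.sum()
    precisions = np.diag(cm) / cm.sum(axis=0)
    return (weights * precisions).sum()

print([round(class_accuracy(cm, i), 2) for i in range(4)])
print(round(weighted_precision(cm), 3))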
In Figure 7 the confusion matrices for Littmann with four BOAS classes are presented. For Littmann Before ET the accuracy is 81% for BOAS 0, 86% for BOAS 1, 86% for BOAS 2 and 93% for BOAS 3. The weighted overall precision is 72.7%. For Littmann After ET the accuracy is 80% for BOAS 0, 90% for BOAS 1, 87% for BOAS 2 and 96% for BOAS 3. The precision is 69.9%.

Figure 7: Confusion matrices for RNN-LSTM with four BOAS classes, for Littmann Before ET on the left and Littmann After ET on the right. Images created by Magnus Karlsteen.

In Figure 8 the confusion matrices for Olympus with four BOAS classes are presented. A clear lighter diagonal with darker sides can be seen, indicating a successful result. Olympus Before ET has 93% accuracy for BOAS 0, 96% for BOAS 1, 93% for BOAS 2 and 98% for BOAS 3. The weighted overall precision is 90.4%. Olympus After ET has 93% accuracy for BOAS 0, 93% for BOAS 1, 96% for BOAS 2 and 100% for BOAS 3. The precision is 91.2%.

Figure 8: Confusion matrices for RNN-LSTM with four BOAS classes, for Olympus Before ET on the left and Olympus After ET on the right. Images created by Magnus Karlsteen.

A problem with overfitting was discovered when comparing the training accuracy and the test accuracy. In Figure 9 a problem with overfitting for Littmann After ET with four BOAS classes is visualized on the left: the train accuracy is significantly higher than the test accuracy, and the test error/validation loss increases over time. On the right a graph without overfitting is presented for comparison; the train and test graphs follow each other well.

Figure 9: Overfitting to the left and no overfitting to the right. Images created by Magnus Karlsteen.

3.2.2 Three BOAS classes

If BOAS grades 2 and 3 are combined into one group, the overall result for the model improves, as presented in Table 5. Here 12 tries for each recording type were used. The average accuracy varies from 69-88% for the different recording types. A minimal sketch of this kind of class merging is shown at the end of this section.

Table 5: Classification accuracy for the different recording types using three BOAS classes and RNN-LSTM.

                     Littmann Before ET   Littmann After ET   Olympus Before ET   Olympus After ET
  Highest accuracy        73.30%               79.00%              90.70%             92.70%
  Lowest accuracy         60.00%               64.80%              78.80%             84.60%
  Average accuracy        68.56%               70.95%              85.81%             88.48%

In Figure 10 two matrices for Littmann Before and After ET with three BOAS classes are presented. For Littmann Before ET on the left, the accuracy is 81% for BOAS 0, 75% for BOAS 1 and 80% for BOAS 2 and 3. The precision is 69.7%. For Littmann After ET on the right, the accuracy is 81% for BOAS 0, 78% for BOAS 1 and 83% for BOAS 2 and 3. The precision is 70.8%.

Figure 10: Confusion matrices with three BOAS classes. Littmann Before ET on the left and Littmann After ET on the right. Images created by Magnus Karlsteen.

In Figure 11 two matrices for Olympus Before and After ET with three BOAS classes are presented. For Olympus Before ET, BOAS 0 has 96% accuracy, BOAS 1 has 94% and BOAS 2 and 3 combined have 94%. The precision is 92.1%. Olympus After ET has identical accuracy and precision to Olympus Before ET.

Figure 11: Confusion matrices with three BOAS classes. Olympus Before ET on the left and Olympus After ET on the right. Images created by Magnus Karlsteen.

Four Littmann After ET recordings were then excluded from the training data in an attempt to circumvent overfitting. The whole 30 s recordings were divided into 3 s segments, and each segment's classification as well as the whole recording's classification are presented in Figure 12. The same procedure was performed for the other recording types, but with significantly less accurate results.

Figure 12: Four Littmann After ET files successfully BOAS-graded. Image created by Magnus Karlsteen and Moa Mårtensson.
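Merging classes only requires remapping the label array loaded from the json file (see Appendix I); a minimal sketch, with hypothetical labels:

import numpy as np

y = np.array([0, 1, 2, 3, 3, 1, 2])   # hypothetical labels
merged = np.where(y == 3, 2, y)       # grades 2 and 3 become one class
print(merged)                          # [0 1 2 2 2 1 2]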
3.2.3 Two BOAS classes

In Table 6, BOAS 0 and 1 are combined into one class and BOAS 2 and 3 into another. 25 tries for each recording type were performed. The average accuracy of the model varies from 79-93% for the different recording types.

Table 6: Classification accuracy for the different recording types using two BOAS classes and RNN-LSTM.

                     Littmann Before ET   Littmann After ET   Olympus Before ET   Olympus After ET
  Highest accuracy        86.70%               92.40%              94.90%             97.60%
  Lowest accuracy         63.80%               77.10%              86.40%             86.20%
  Average accuracy        79.16%               85.23%              90.97%             93.04%

In Figure 13 two confusion matrices for Littmann Before and After ET with two BOAS classes are presented. Littmann Before ET has an accuracy of 84% for BOAS 0 and 1, and 84% for BOAS 2 and 3. The precision is 83.8%. Littmann After ET has an accuracy of 92% for BOAS 0 and 1, and 92% for BOAS 2 and 3. The precision is 92.4%.

Figure 13: Confusion matrices with two BOAS classes. Littmann Before ET on the left and Littmann After ET on the right. Images created by Magnus Karlsteen.

In Figure 14 two matrices for Olympus Before and After ET with two BOAS classes are presented. Olympus Before ET has an accuracy of 93% for BOAS 0 and 1, and 93% for BOAS 2 and 3. The precision is 93.1%. Olympus After ET has an accuracy of 96% for BOAS 0 and 1, and 96% for BOAS 2 and 3. The precision is 96.2%.

Figure 14: Confusion matrices with two BOAS classes. Olympus Before ET on the left and Olympus After ET on the right. Images created by Magnus Karlsteen.

4 Discussion

RNN-LSTM has a significantly better classification accuracy than CNN. This was discovered early on, and the number of tests with CNN was reduced because of it but kept for comparison; the focus has been on RNN-LSTM.

The general trend is that Olympus gives a better classification accuracy than Littmann, and that fewer BOAS classes give better accuracy than many BOAS classes. Overall, the Littmann results are significantly lower than the Olympus results. A possible reason is that the Littmann recordings have a lower sound volume. An attempt was made to circumvent this by amplifying the signal, but the sound quality became too poor, with extensive noise, and could hence not be used. Generally, the overfitting is more prominent for Littmann than for Olympus, and higher for many BOAS classes than for few, which may be another reason why Olympus and few BOAS classes perform better.

4.1 Potential sources of error

A limited number of recordings, as well as fewer recordings in some BOAS classes than in others, led to limited training data, especially for BOAS 3. This is a cause of overfitting. Actions such as different regularization methods and simplifying the model have been taken, but with inadequate results. If more time had been available, further attempts to address this issue would have been made, mainly using data augmentation. A larger number of recordings and a more even distribution between the BOAS grades would probably be very beneficial for decreasing overfitting, and hence for increasing the accuracy for every recording type.

The recordings are not 100% free from disturbances, but that may not be achievable when working with animals. Many of the dogs were panting, which may mask the respiratory sounds of interest. In some cases the handler who ran with the dog during the exercise test was panting during the After ET recording. At some points doors closed and people walked or talked in adjacent rooms.

4.2 Future work

For future work, more recordings from a larger number of dogs would be useful. Gathering more data would probably limit the overfitting, which would improve the classification accuracy. The training data can be augmented with pitch shift, time stretch and background noise applied to the audio files, as sketched below; if applied to the MFCC images instead, rotation, shifting and stretching can be used.
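A sketch of what such audio-level augmentation could look like with librosa's built-in effects; the parameter values are illustrative and the file name is hypothetical:

import librosa
import numpy as np

signal, sr = librosa.load("recording.wav", sr=22050)  # hypothetical file name

pitched = librosa.effects.pitch_shift(signal, sr=sr, n_steps=2)  # shift up two semitones
stretched = librosa.effects.time_stretch(signal, rate=1.1)       # play 10% faster
noisy = signal + 0.005 * np.random.randn(len(signal))            # add background noise

# each variant would be saved as a new wav file and added to the training set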
Augmentation was not performed in this project because of time limitations. This method would increase the number of audio files, since the augmented files would be added to the existing dataset. The network can also be further developed to possibly achieve a better model for classifying BOAS.

The method may be used to develop a mobile phone application for dog owners to get an assessment of their dog's breathing. If classified as BOAS 2 or 3, it would be recommended to see a veterinarian for further assessment and possible treatment. It could also be a tool for smaller veterinary clinics to make a first assessment of a dog, to see if it needs a referral to a specialist on BOAS.

A tool based on this technique could also benefit buyers of brachycephalic breeds when visiting a breeder with the intention of buying a dog. The tool could be used to obtain the BOAS grade of the dog of interest, if the dog is at least one year old, or of the parents of the puppy of interest. Hopefully, most people would refrain from buying a dog that is proven to struggle with breathing or has parents who do; such a dog may require expensive and risky surgery, as well as have a compromised quality of life.

The best-case scenario would be if a tool using this technique could improve the guidelines and laws for the breeding industry. A tool could decide which brachycephalic dogs are suitable for breeding purposes regarding their airways and which are not. Other factors, such as overall health and temper, should of course also be taken into account. Making the airway problems easily measurable would in this situation be a huge benefit compared to the current arbitrary opinion of the dog owner, who may not be aware of existing breathing difficulties.

5 Conclusions

The general trends are that Olympus gives a better accuracy than Littmann, that fewer BOAS classes give better accuracy than many BOAS classes, and that RNN-LSTM performs better than CNN, although both methods manage to classify BOAS grades.

RNN-LSTM has proven to be an efficient method for classifying BOAS grades from 0 to 3. The average accuracy is 86% and the precision 90% for the Olympus dictaphone Before ET, and the accuracy is 87% and the precision 91% for Olympus After ET. The method also works for Littmann stethoscope recordings, but with an accuracy of 66% and a precision of 73% for Littmann Before ET, and an accuracy of 62% and a precision of 70% for Littmann After ET.

The accuracy can probably be further improved as part of future work using more data and data augmentation. Ideally the additional data would be evenly distributed between the BOAS classes. More data, data augmentation and an even class distribution would probably address the overfitting of the training data and hence improve the classification accuracy.

The results could be used in a tool such as a mobile phone application to easily measure the BOAS grade of a dog using the mobile phone's microphone. The dictaphone results are more interesting than the stethoscope results if the next goal is to develop a mobile phone application, since a dictaphone is more similar to a phone's microphone.

References

[1] P. Jouventin, Y. Christen, and F. S. Dobson, "Altruism in wolves explains the coevolution of dogs and humans," Ideas in Ecology and Evolution, vol. 9, pp. 4-11, May 2016.

[2] S. J. Ettinger, E. C. Feldman, and E. Côté, Textbook of Veterinary Internal Medicine, 8th ed. USA: Elsevier, 2017.
[3] J. Riggs, N.-C. Liu, D. R. Sutton, D. Sargan, and J. F. Ladlow, "Validation of exercise testing and laryngeal auscultation for grading brachycephalic obstructive airway syndrome in pugs, French bulldogs, and English bulldogs by using whole-body barometric plethysmography," Veterinary Surgery, vol. 48, pp. 488-496, 2019.

[4] J. Ladlow, N.-C. Liu, L. Kalmar, and D. Sargan, "Brachycephalic obstructive airway syndrome," Veterinary Record, pp. 375-378, 2018.

[5] T. Virtanen, M. D. Plumbley, and D. Ellis, Computational Analysis of Sound Scenes and Events, 1st ed. Cham, Switzerland: Springer International Publishing, 2018.

[6] S. Jin, X. Wang, L. Du, and D. He, "Evaluation and modeling of automotive transmission whine noise quality based on MFCC and CNN," Applied Acoustics, vol. 172, no. 107562, 2021.

[7] I. D. Jokić, S. D. Jokić, V. D. Delić, and Z. H. Perić, "One solution of extension of mel-frequency cepstral coefficients feature vector for automatic speaker recognition," Information Technology and Control, vol. 49, no. 1, pp. 224-236, 2020.

[8] B. K. Reddy and D. Delen, "Predicting hospital readmission for lupus patients: An RNN-LSTM-based deep-learning methodology," Computers in Biology and Medicine, vol. 101, pp. 199-209, 2018.

[9] J. Ramírez and M. Flores, "Machine learning for music genre: multifaceted review and experimentation with Audioset," Journal of Intelligent Information Systems, vol. 55, pp. 469-499, 2020.

[10] H. H. Aghdam and E. J. Heravi, Guide to Convolutional Neural Networks: A Practical Application to Traffic-Sign Detection and Classification, 1st ed. Cham, Switzerland: Springer International Publishing, 2017.

[11] U. Michelucci, Applied Deep Learning: A Case-Based Approach to Understanding Deep Neural Networks, 1st ed. Switzerland: Apress, 2018.

[12] B. D. Bowes, J. M. Sadler, M. M. Morsy, M. Behl, and J. L. Goodall, "Forecasting groundwater table in a flood prone coastal city with long short-term memory and recurrent neural networks," Water, vol. 11, no. 1098, 2019.

Appendix I

Code for MFCC extraction from audio recordings

import json
import os
import math
import librosa
import numpy

DATASET_PATH = "C:\\Users\\moama\\Desktop\\boaslibrary_z\\littmann_after\\"
JSON_PATH = "C:\\Users\\moama\\Desktop\\boaslibrary_z\\littmann_after\\data_littmann_after.json"
SAMPLE_RATE = 22050
TRACK_DURATION = 30  # measured in seconds
SAMPLES_PER_TRACK = SAMPLE_RATE * TRACK_DURATION

def save_mfcc(dataset_path, json_path, num_mfcc=39, n_fft=2048, hop_length=512, num_segments=5):
    """Extracts MFCCs from an audio dataset and saves them into a json file along with the class labels.
    :param dataset_path (str): Path to dataset
    :param json_path (str): Path to json file used to save MFCCs
    :param num_mfcc (int): Number of coefficients to extract
    :param n_fft (int): Interval we consider to apply FFT. Measured in # of samples
    :param hop_length (int): Sliding window for FFT. Measured in # of samples
    :param num_segments (int): Number of segments we want to divide sample tracks into
    :return:
    """
    # dictionary to store mapping, labels, and MFCCs
    data = {
        "mapping": [],  # names of the classes (sub-folder names)
        "labels": [],   # outputs, e.g. 0 = first class
        "mfcc": []      # inputs
    }

    samples_per_segment = int(SAMPLES_PER_TRACK / num_segments)
    num_mfcc_vectors_per_segment = math.ceil(samples_per_segment / hop_length)

    # loop through all class sub-folders
    for i, (dirpath, dirnames, filenames) in enumerate(os.walk(dataset_path)):

        # ensure we're processing a class sub-folder level
        if dirpath != dataset_path:

            # save class label (i.e., sub-folder name) in the mapping
            semantic_label = os.path.basename(dirpath)
            data["mapping"].append(semantic_label)
            print("\nProcessing: {}".format(semantic_label))

            # process all audio files in the class sub-folder
            for f in filenames:

                # load audio file
                file_path = os.path.join(dirpath, f)
                signal, sample_rate = librosa.load(file_path, sr=SAMPLE_RATE)

                # process all segments of the audio file
                for d in range(num_segments):

                    # calculate start and finish sample for current segment
                    start = samples_per_segment * d
                    finish = start + samples_per_segment  # number of samples per segment

                    # extract mfcc
                    mfcc = librosa.feature.mfcc(signal[start:finish], sample_rate,
                                                n_mfcc=num_mfcc, n_fft=n_fft,
                                                hop_length=hop_length)
                    mfcc = mfcc.T

                    # store only mfcc features with the expected number of vectors
                    if len(mfcc) == num_mfcc_vectors_per_segment:
                        data["mfcc"].append(mfcc.tolist())
                        data["labels"].append(i - 1)
                        print("{}, segment:{}".format(file_path, d + 1))

    # save MFCCs to json file
    with open(json_path, "w") as fp:
        json.dump(data, fp, indent=4)

if __name__ == "__main__":
    save_mfcc(DATASET_PATH, JSON_PATH, num_segments=10)

Appendix II

Code and architecture for CNN

import json
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import matplotlib.pyplot as plt

DATA_PATH = "C:\\Users\\moama\\Desktop\\boaslibrary_z\\littmann_after\\data_littmann_after.json"

def load_data(data_path):
    """Loads training dataset from json file.
    :param data_path (str): Path to json file containing data
    :return X (ndarray): Inputs
    :return y (ndarray): Targets
    """
    with open(data_path, "r") as fp:
        data = json.load(fp)
    X = np.array(data["mfcc"])
    y = np.array(data["labels"])
    return X, y

def plot_history(history):
    """Plots accuracy/loss for training/validation set as a function of the epochs
    :param history: Training history of model
    :return:
    """
    fig, axs = plt.subplots(2)

    # create accuracy subplot
    axs[0].plot(history.history["accuracy"], label="train accuracy")
    axs[0].plot(history.history["val_accuracy"], label="test accuracy")
    axs[0].set_ylabel("Accuracy")
    axs[0].legend(loc="lower right")
    axs[0].set_title("Accuracy eval")

    # create error subplot
    axs[1].plot(history.history["loss"], label="train error")
    axs[1].plot(history.history["val_loss"], label="test error")
    axs[1].set_ylabel("Error")
    axs[1].set_xlabel("Epoch")
    axs[1].legend(loc="upper right")
    axs[1].set_title("Error eval")

    plt.show()

def prepare_datasets(test_size, validation_size):
    """Loads data and splits it into train, validation and test sets.
    :param test_size (float): Value in [0, 1] indicating percentage of data set to allocate to test split
    :param validation_size (float): Value in [0, 1] indicating percentage of train set to allocate to validation split
    :return X_train (ndarray): Input training set
    :return X_validation (ndarray): Input validation set
    :return X_test (ndarray): Input test set
    :return y_train (ndarray): Target training set
    :return y_validation (ndarray): Target validation set
    :return y_test (ndarray): Target test set
    """
    # load data
    X, y = load_data(DATA_PATH)

    # create train, validation and test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=validation_size)

    # add an axis to input sets (the CNN expects a channel dimension)
    X_train = X_train[..., np.newaxis]
    X_validation = X_validation[..., np.newaxis]
    X_test = X_test[..., np.newaxis]

    return X_train, X_validation, X_test, y_train, y_validation, y_test

def build_model(input_shape):
    """Generates CNN model
    :param input_shape (tuple): Shape of input set
    :return model: CNN model
    """
    # build network topology
    model = keras.Sequential()

    # 1st conv layer
    model.add(keras.layers.Conv2D(1024, (3, 3), activation='relu', input_shape=input_shape))
    model.add(keras.layers.MaxPooling2D((3, 3), strides=(1, 1), padding='same'))
    model.add(keras.layers.BatchNormalization())

    # 2nd conv layer
    model.add(keras.layers.Conv2D(256, (3, 3), activation='relu'))
    model.add(keras.layers.MaxPooling2D((3, 3), strides=(1, 1), padding='same'))
    model.add(keras.layers.BatchNormalization())

    # 3rd conv layer
    model.add(keras.layers.Conv2D(16, (3, 3), activation='relu'))
    model.add(keras.layers.MaxPooling2D((3, 3), strides=(1, 1), padding='same'))
    model.add(keras.layers.BatchNormalization())

    # flatten output and feed it into dense layer
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(512, activation='relu'))
    model.add(keras.layers.Dropout(0.3))

    # output layer
    model.add(keras.layers.Dense(4, activation='softmax'))

    return model

def predict(model, X, y):
    """Predict a single sample using the trained model
    :param model: Trained classifier
    :param X: Input data
    :param y (int): Target
    """
    # add a dimension to input data for sample - model.predict() expects a 4d array in this case
    X = X[np.newaxis, ...]
    # array shape (1, 130, 13, 1)

    # perform prediction
    prediction = model.predict(X)

    # get index with max value
    predicted_index = np.argmax(prediction, axis=1)
    print("Target: {}, Predicted label: {}".format(y, predicted_index))

if __name__ == "__main__":

    # get train, validation, test splits
    X_train, X_validation, X_test, y_train, y_validation, y_test = prepare_datasets(0.25, 0.2)

    # create network
    input_shape = (X_train.shape[1], X_train.shape[2], 1)
    model = build_model(input_shape)

    # compile model
    optimiser = keras.optimizers.Adam(learning_rate=0.0001)
    model.compile(optimizer=optimiser, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()

    # train model
    history = model.fit(X_train, y_train, validation_data=(X_validation, y_validation), batch_size=32, epochs=1000)

    # plot accuracy/error for training and validation
    plot_history(history)

    # evaluate model on test set
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
    print('\nTest accuracy:', test_acc)

    # pick a sample to predict from the test set
    X_to_predict = X_test[40]
    y_to_predict = y_test[40]

    # predict sample
    predict(model, X_to_predict, y_to_predict)

Appendix III

Code and architecture for RNN-LSTM

import json
import tensorflow as tf
import numpy as np
import seaborn as sns
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import matplotlib.pyplot as plt

# allow GPU memory to grow instead of being allocated up front
physical_devices = tf.config.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)

DATA_PATH = "data_olympus_before.json"

def load_data(data_path):
    """Loads training dataset from json file.
    :param data_path (str): Path to json file containing data
    :return X (ndarray): Inputs
    :return y (ndarray): Targets
    """
    with open(data_path, "r") as fp:
        data = json.load(fp)
    X = np.array(data["mfcc"])
    y = np.array(data["labels"])
    return X, y

def plot_history(history, way):
    """Plots accuracy/loss for training/validation set as a function of the epochs
    :param history: Training history of model
    :return:
    """
    fig, axs = plt.subplots(2)

    # create accuracy subplot
    axs[0].plot(history.history["accuracy"], label="train accuracy")
    axs[0].plot(history.history["val_accuracy"], label="test accuracy")
    axs[0].set_ylabel("Accuracy")
    axs[0].legend(loc="lower right")
    axs[0].set_title("Accuracy eval")

    # create error subplot
    axs[1].plot(history.history["loss"], label="train error")
    axs[1].plot(history.history["val_loss"], label="test error")
    axs[1].set_ylabel("Error")
    axs[1].set_xlabel("Epoch")
    axs[1].legend(loc="upper right")
    axs[1].set_title("Error eval")

    plt.savefig('images/error_accuracy' + way)

def prepare_datasets(test_size, validation_size):
    """Loads data and splits it into train, validation and test sets.
    :param test_size (float): Value in [0, 1] indicating percentage of data set to allocate to test split
    :param validation_size (float): Value in [0, 1] indicating percentage of train set to allocate to validation split
    :return X_train (ndarray): Input training set
    :return X_validation (ndarray): Input validation set
    :return X_test (ndarray): Input test set
    :return y_train (ndarray): Target training set
    :return y_validation (ndarray): Target validation set
    :return y_test (ndarray): Target test set
    """
    # load data
    X, y = load_data(DATA_PATH)

    # create train, validation and test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=validation_size)

    return X_train, X_validation, X_test, y_train, y_validation, y_test

def build_model(batch_input_shape):
    """Generates RNN-LSTM model
    :param batch_input_shape (tuple): Shape of input set
    :return model: RNN-LSTM model
    """
    aktiv = 'hard_sigmoid'
    aktiv1 = 'softmax'
    N1 = 1024
    N2 = 256
    N3 = 32

    # build network topology
    model = keras.Sequential()

    # 2 LSTM layers
    model.add(keras.layers.LSTM(N1, input_shape=batch_input_shape, return_sequences=True,
                                kernel_regularizer=keras.regularizers.l2(0.001)))
    model.add(keras.layers.Dropout(0.3))
    model.add(keras.layers.LSTM(N2, kernel_regularizer=keras.regularizers.l2(0.001)))
    model.add(keras.layers.Dropout(0.3))

    # dense layer
    model.add(keras.layers.Dense(N3, activation=aktiv,
                                 kernel_regularizer=keras.regularizers.l2(0.001)))
    model.add(keras.layers.Dropout(0.3))

    # output layer
    model.add(keras.layers.Dense(4, activation=aktiv1))

    return model

def draw(way):
    # get train, validation, test splits
    X_train, X_validation, X_test, y_train, y_validation, y_test = prepare_datasets(101/470, 0.2)

    # create network
    batch_input_shape = (X_train.shape[1], X_train.shape[2])  # e.g. (130, 13)
    model = build_model(batch_input_shape)

    # compile model
    optimiser = keras.optimizers.Adam(learning_rate=1E-5)
    model.compile(optimizer=optimiser, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    model.summary()

    # train model
    history = model.fit(X_train, y_train, validation_data=(X_validation, y_validation),
                        batch_size=32, epochs=1000)

    # plot accuracy/error for training and validation
    plot_history(history, way)

    # evaluate model on test set
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
    print('\nTest accuracy:', test_acc)

    metrics = history.history
    plt.plot(history.epoch, metrics['loss'], metrics['val_loss'])
    plt.legend(['loss', 'val_loss'])
    plt.savefig('images/loss_valloss' + way)

    print(y_test)

    # predict the whole test set
    predictions = model.predict(X_test)
    print("predictions shape:", predictions.shape)

    y_pred = np.argmax(predictions, axis=1)
    print("y_pred shape:", y_pred.shape)
    print(y_pred)

    # confusion matrix over the test set
    confusion_mtx = tf.math.confusion_matrix(y_test, y_pred)
    plt.figure(figsize=(4, 4))
    commands = ("Boas0", "Boas1", "Boas2", "Boas3")
    sns.heatmap(confusion_mtx, xticklabels=commands, yticklabels=commands, annot=True, fmt='g')
    plt.xlabel('Prediction')
    plt.ylabel('Label')
    plt.savefig('images/matrix' + way)
    return

if __name__ == "__main__":
    way = '_OB4_grafer.jpg'
    draw(way)