Data Augmentation for Audio Based
Machine Learning
Classifying Brachycephalic Obstructive Airway Syndrome (BOAS)
in Dogs

Master’s thesis in Electrical Engineering

HENRIK PETTERSSON
OLIVIA STENSÖTA

DEPARTMENT OF PHYSICS

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2021
www.chalmers.se




Data Augmentation for Audio Based Machine Learning
Classifying Brachycephalic Obstructive Airway Syndrome (BOAS) in Dogs
HENRIK PETTERSSON
OLIVIA STENSÖTA

© HENRIK PETTERSSON, OLIVIA STENSÖTA, 2021.

Supervisors: Magnus Karlsteen, Department of Physics,
Chalmers University of Technology

Maria Dimopoulou, Swedish University of Agricultural Sciences
Ingrid Ljungvall, Swedish University of Agricultural Sciences
Eva Skiöldebrand, Swedish University of Agricultural Sciences

Examiner: Magnus Karlsteen, Department of Physics, Chalmers University of Technology

Master’s Thesis 2021
Department of Physics
Division of Material Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: An example of an MFCC image of an audio recording.

Typeset in LaTeX, template by Magnus Gustaver
Printed by Chalmers Reproservice
Gothenburg, Sweden 2021




Abstract
Breathing problems of varying degree are common amongst dog breeds with shorter
snouts, also called brachycephalic dogs. Classifying each case currently requires
a visit to a veterinarian, where tests are performed to assess the severity on a
scale from zero to three. In this master's thesis, we aim to simplify this procedure
using machine learning, and we work with two hypotheses. Hypothesis I is a
continuation of the master's thesis Brachycephalic Obstruction Airway Syndrome
(BOAS) classification in dogs based on respiratory noise analysis using machine
learning by Moa Mårtensson. Here we augment the audio files to generate a larger
data set and extract multiple features. The features include MFCC, ZCR and RMS,
which are fed to an LSTM network. Hypothesis II aims to classify dogs as BOAS(-)
or BOAS(+); it uses frequency data augmented with SMOTE together with a CNN. We
show that it is possible to classify BOAS using machine learning, but that more
data is required in order to diagnose BOAS with confidence. We conclude that
hypothesis II, using data collected with the Littmann device, shows the best result
on unseen audio files. There is a possibility to further develop this into a tool
for both veterinarians and dog owners.

This thesis is a collaboration between Chalmers University of Technology and the
Swedish University of Agricultural Sciences in Uppsala.

Keywords: machine learning, augmenting, MFCC, RMS, ZCR, SMOTE and BOAS.



Acknowledgements
We would first like to thank our incredible supervisor and examiner Magnus
Karlsteen. Without your enthusiasm and help, these months would not have been as
joyful as they have been, nor would our thesis be as exceptional as it is. Thank you!

We would also like to give a special thank you to Moa Mårtensson. Thank you for
allowing us to continue your work and for the amazing support you have provided
for our entire thesis.

Maria Dimopoulou, Eva Skiöldebrand and Ingrid Ljungvall, thank you for all your
work gathering more data and for your quest to vanquish BOAS. You have also been
a great help in our understanding of the breathing problems in dogs.

The computations were enabled by resources provided by the Swedish National
Infrastructure for Computing (SNIC) at [SNIC CENTRE] partially funded by the
Swedish Research Council through grant agreement no. 2018-05973.

And lastly, but certainly not least, we would like to thank our opponents, Oskar
Andersson, Julia Nystrand and Arnita Spule. Thank you for a rewarding discussion
as well as valuable feedback on our thesis.

Henrik Pettersson & Olivia Stensöta
Göteborg, June 2021



Contents

List of Figures

List of Tables

List of Abbreviations

1 Introduction
1.1 Previous Work
1.2 Aim
1.3 Two Hypotheses
1.4 Limitations

2 Theory
2.1 Audio Recordings
2.2 Data Augmentation
2.2.1 Splitting
2.2.2 Time Shift
2.2.3 Pitch Shift
2.2.4 Speed Shift
2.2.5 Noise Introduction
2.2.6 Synthetic Minority Oversampling Technique
2.3 Training Features
2.3.1 Mel-Frequency Cepstrum
2.3.2 Root Mean Square
2.3.3 Zero Cross Rate
2.3.4 Frequency
2.4 Networks
2.5 Presenting the Results

3 Methods - Hypothesis I
3.1 Training Classes
3.2 Augmentation
3.3 Optimal Settings
3.3.1 Littmann Settings
3.3.2 Olympus Settings
3.3.3 Settings for Splitting
3.4 Overfitting Problem
3.4.1 BOAS3 Size Limit
3.4.2 Class-weights

4 Methods - Hypothesis II
4.1 Training Classes
4.2 Networks and Data Augmentation
4.3 Optimal Settings

5 Results - Hypothesis I
5.1 Augmentation
5.1.1 Littmann Data Set
5.1.2 Olympus Data Set
5.2 Classification
5.2.1 Littmann Data Set
5.2.2 Olympus Data Set
5.3 Networks

6 Results - Hypothesis II

7 Discussion
7.1 Hypothesis II
7.2 Hypothesis I
7.2.1 Augmentation
7.2.2 Classification
7.2.3 Network
7.3 Hypothesis I vs II
7.4 Littmann vs. Olympus
7.5 Recordings Performed Before vs. After Exercise
7.6 Additional Possible Errors
7.7 Ethical Consideration

8 Conclusion
8.1 Future Work

Bibliography

A BOAS Classification Protocol

B Networks
B.1 Network 2
B.2 Network 3
B.3 Network 4
B.4 Network 5
B.5 Network 6
B.6 Network 7
B.7 freq2



List of Figures

2.1 Illustration of MFCC, the x-axis is the index of the frames.
2.2 FFT plots for Olympus after, one can notice a few differences between
the different classes.
2.3 Example accuracy and loss chart (a) and confusion matrix (b).

3.1 In the accuracy we can see that both the training and the validation
are still increasing after the 2500 epochs.

4.1 The accuracy approaches 1 faster for the freq2 network than for the
freq network.

5.1 The distribution of the different BOAS classes in the recorded data.
5.2 The distribution of the different dog breeds in the recorded data.
Breeds with two or fewer dogs are placed in the other section.
5.3 Training and validation (a) as well as confusion matrix (b) for the
Littmann before exercise data set using the optimal settings.
5.4 Training and validation (a) as well as confusion matrix (b) for the
Littmann after exercise data set using the optimal settings.
5.5 Training and validation (a) as well as confusion matrix (b) for the
Olympus before exercise data set using the optimal settings.
5.6 Training and validation (a) as well as confusion matrix (b) for the
Olympus after exercise data set using the optimal settings.



List of Tables

2.1 Description of the naming convention for the recording files.
2.2 The class spread of the original 41 dogs.
2.3 Very simple table explaining the splitting augmentation. In the original
audio file we have one segment containing 9 seconds (123456789).
When we split this into 3-second-long segments we get segment 1, 2
and 3.

2.4 A simplified image of how the time shift and split augmentation would
work. The original signal would be shifted until segment 2 is the same
as segment 1 was originally.

2.5 A simplified table of how time stretching works. In reality the stretching
would be much smaller in relation to the segment than in this
simplified table.

2.6 The base-(LSTM)-network.
2.7 The freq network.
2.8 The default values when we train the networks.
2.9 Example accuracy table.
2.10 Example classification table.

3.1 Naming convention in tables; Even indicates that the data set has
been time shifted to have the same number of audio files in each class.
In Pitch, the data set has been pitch shifted and in Speed, the data
set has been speed shifted. In Noise, noise has been introduced to the
audio file and during Combo noise has been introduced as well as pitch
and speed shift. And finally, Combo × 2 means that the combination
of all the augmentations has been used with two parameters.

3.2 The table shows the minimum, maximum and average accuracy of
the network when introducing noise, speed and pitch shifting. The
settings described in Section 3.3.1 are used, except for the previous
results [9] which used a hop length of 512.

3.3 The minimum, maximum and average accuracy of the network when
introducing noise, speed and pitch shifting. The settings described in
Section 3.3.2 are used.

3.4 The table shows the minimum, maximum and average accuracy of the
network with different hop lengths; default = 512, 256, 128 and 64
for the Littmann data set, recorded after exercise, with even number
of segments in each BOAS class.


3.5 The table shows the minimum, maximum and average accuracy of
the network with different number of MFCC; 39, 26 and 13 for the
Littmann data set, recorded after exercise.

3.6 The table shows the minimum, maximum and average accuracy of
the network with different width of FFT window length; 1024, 2048
and 4096 for the Littmann data set, recorded after exercise.

3.7 The table shows the minimum, maximum and average accuracy of
the network with additional training features for the Littmann data
set, recorded after exercise.

3.8 The table shows the minimum, maximum and average accuracy of the
network with different hop lengths; 512, 256 and 128 for the Olympus
data set, recorded after exercise which has been augmented to have
the same number of segments in every BOAS class.

3.9 The minimum, maximum and average accuracy of the network with
different number of MFCC; 39, 26 and 13 for the Olympus after
exercise data set which has been augmented to have the same number
of segments in every BOAS class.

3.10 The minimum, maximum and average accuracy of the network with
different width of FFT window length; 1024, 2048 and 4096 for the
Olympus after exercise data set which has been augmented to have
the same number of segments in every BOAS class.

3.11 The minimum, maximum and average accuracy of the network with
additional training features for the Olympus data set recorded after
exercise.

3.12 The minimum, maximum and average accuracy of the network with
additional training features for the Littmann data set recorded after
exercise using optimal settings.

3.13 The minimum, maximum and average accuracy of the network. The
first network uses the Olympus data set recorded before exercise
which has been evened with optimal settings. The network ran 3
cycles. The second network uses a smaller version of the Olympus before
exercise data set with only 13 dogs. Optimal settings were used
and the network ran for two cycles.

3.14 Classification on omitted files. The network uses a smaller version of
the Olympus before exercise data set but with only 13 dogs. Optimal
settings were used and the network ran for two cycles.

3.15 The minimum, maximum and average accuracy of the network. The
network uses the Olympus data set, before exercise. Optimal settings
were used, except for the evened time shift. Note: not all runs are
added here, only one run was finished for the optimal.

3.16 Classification on omitted files. The network used is the best performing
in Table 3.15. Optimal settings were used, except for the evened
time shift. The network ran for one cycle.

4.1 The minimum, maximum and average accuracy as well as the variance
for two different Convolutional Neural Networks (CNN).


4.2 The minimum, maximum and average accuracy as well as the variance
for the freq2 network. We vary the neighbors while using Synthetic
Minority Oversampling Technique (SMOTE) with 2, 5 and 9 neighbors.

4.3 The minimum, maximum and average accuracy of the network, with
different settings for the Littmann data set, recorded before exercise.

4.4 The minimum, maximum and average accuracy of the network, with
different settings for the Olympus data set, recorded before exercise.
The 22050 run was only run once because of its size.

5.1 The optimal values for each variable for the Littmann data set before
and after exercise based on Section 3.3.1.

5.2 The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Littmann data set, recorded
before exercise. Previous results are the results from the previous
thesis [9].

5.3 The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Littmann data set, recorded
after exercise. This training uses all data acquired before 2021-05-01.
Previous results are the results from the previous thesis [9].

5.4 The optimal values for each variable for the Olympus data set based
on Section 3.3.2.

5.5 The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Olympus data set, recorded
before exercise. This training uses all data acquired before 2021-05-01.
Previous results are the results from the previous thesis [9].

5.6 The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Olympus data set, recorded
after exercise. This training uses all data acquired before 2021-05-01.
Previous results are the results from the previous thesis [9].

5.7 Classification on omitted files for the Littmann data set, recorded
before exercise, using optimal settings.

5.8 Classification on omitted files for the Littmann data set, recorded
after exercise, using optimal settings.

5.9 Classification on omitted files for the Olympus data set, recorded
before exercise, using optimal settings.

5.10 Classification on omitted files for the Olympus data set, recorded after
exercise, using optimal settings.

5.11 The minimum, maximum and average accuracy of the network with
different networks with the Littmann data set recorded after exercise.

5.12 The minimum, maximum and average accuracy of the network with
different networks with the Olympus data set recorded after exercise.

6.1 Classification on omitted files for the Littmann data set, recorded
before exercise, using optimal settings.

6.2 Classification on omitted files for the Littmann data set, recorded
after exercise, using optimal settings.


6.3 Classification on omitted files for the Olympus data set, recorded
before exercise, 22050 frequency points. Very bad result.

6.4 Classification on omitted files for the Olympus data set, recorded after
exercise, 22050 frequency points.

B.1 A network based on the base-(LSTM)-network but with fewer neurons
in the LSTM layers.

B.2 A network based on the base-(LSTM)-network but with an extra
LSTM network and fewer neurons.

B.3 A network based on the base-(LSTM)-network but with higher dropout
rate.

B.4 A network based on the base-(LSTM)-network but with an extra
LSTM layer.

B.5 A smaller network; one LSTM layer which is half as big.
B.6 Much fewer nodes per level.
B.7 A bigger CNN to work with basic frequency data.



List of Abbreviations

BOAS Brachycephalic Obstructive Airway Syndrome.

CNN Convolutional Neural Networks.

FFT Fast Fourier Transform.

JSON JavaScript Object Notation.

LR Learning Rate.
LSTM Long Short-Term Memory.

MFCC Mel-Frequency Cepstral Coefficients.

RMS Root Mean Square.

SMOTE Synthetic Minority Oversampling Technique.

ZCR Zero Crossing Rate.



1
Introduction

The dog was the first animal that humans domesticated. Dogs warned us of
threatening animals, helped us with chores such as herding sheep, and kept us
company. Today, for the vast majority of owners, the dog is a pet. This has led
to a shift in dog breeding, from practical traits to more aesthetic ones. The
pursuit of aesthetic perfection has brought some complications. Two examples are
that some dogs have difficulty walking [1] and difficulty breathing [2]. This
thesis focuses on detecting dogs that have trouble breathing.

Pugs and bulldogs are dog breeds with a flat face and short nose, also called
brachycephalic. Brachycephalic breeds often suffer from Brachycephalic Obstructive
Airway Syndrome (BOAS) because of the way their skull is shaped. BOAS can be
classified in two ways. The first is a grade between zero and three, where BOAS 0
is a minor inconvenience and BOAS 3 requires surgery [3]. The second is a binary
label, BOAS(-) or BOAS(+). In layman's terms, BOAS means that the dog cannot
breathe properly and can, in severe cases, die. By giving potential dog owners the
ability to assess if a dog or its offspring may suffer from BOAS before purchase,
we can in the long run reduce the number of dogs with BOAS and potentially save
future generations from surgery.

According to data from the Swedish Board of Agriculture (Jordbruksverket) [4], there
were 12607 French bulldogs in Sweden in 2020 compared to 11239 in 2019, an increase
of 1368 dogs or 12 %. According to Sveriges Television [5], the French bulldog was
the second most common breed of dog that Swedes adopted in 2020. According to a
study from 2019 [6], 64 % of French bulldogs suffer from at least one of the typical
BOAS ailments. This makes it apparent that there is a need to inform the public
about the negative aspects of BOAS and eventually to produce an easy and accessible
way of diagnosing dogs.

This thesis is a collaboration between Chalmers University of Technology and
the Swedish University of Agricultural Sciences in Uppsala. The code is written in
Python using Visual Studio Code, and the central libraries are Librosa [7] and
Keras [8]. GitHub was used for version control as well as a way to collaborate on
multiple devices.


1.1 Previous Work
This thesis is the continuation of Brachycephalic Obstruction Airway Syndrome
(BOAS) classification in dogs based on respiratory noise analysis using machine
learning by Moa Mårtensson [9]. In her thesis, two different devices were used to
capture recordings. The devices were a Littmann electronic stethoscope and an
Olympus linear PCM recorder. Veterinarians captured audio sequences on both
devices of dogs breathing, before and after physical exercise.

Machine learning was then used to train a network on the recordings so that it
could classify which BOAS class a dog belonged to. The results from the previous
thesis are very good, with an average classification accuracy as high as 88.48 %.
This result was obtained on the Olympus data set recorded after physical exercise,
but the variance is large: the highest accuracy is 92.7 % and the lowest 86.6 %.

1.2 Aim
The aim of this thesis is to lay the groundwork for a future application that can
classify whether a dog suffers from BOAS by recording its breathing, instead of
requiring a thorough examination by a professional. It is also of interest to have
an application that veterinarians can use to get an initial indication of the BOAS
class. We do this by comparing two different methods for improving on the
performance of the previous thesis.

1.3 Two Hypotheses
The first hypothesis is to continue Mårtensson's work [9] using the methods she
derived, as well as to introduce new features, use variations of the network and
further investigate MFCC settings. We also continue to use four classes during
training (BOAS 0 to 3). Our second hypothesis explores another approach using a
frequency feature and a new type of network. For this hypothesis we use two classes
(BOAS(-) and BOAS(+)). In both hypotheses we augment our data to generate a larger
data set, but with two different approaches, as described in Chapter 2.

1.4 Limitations
Time is the main limiting factor, both for coding and writing the report and
for the computational time needed to train the networks. We limit ourselves to
implementing and testing MFCC, RMS, ZCR and frequency as training features. In this
thesis we only test variants of an LSTM network as well as two CNNs. We also only
investigate two settings for the pitch, speed and noise augmentations.



2
Theory

In this chapter we discuss the theory behind our thesis. We first present how the data
sets have been recorded. We continue with the different methods of augmentation.
The next section revolves around the different training features. We conclude with a
description of the different networks that are examined in the report as well as an
explanation of how we present our results in Chapter 5.

2.1 Audio Recordings

Veterinarians at Uppsala University take a total of four recordings of each dog:
two with a Littmann stethoscope [10] and two with an Olympus Dictaphone [11]. One
recording on each device is made while the dog is calm and resting, and the other
is made after a few minutes of exercise. New recordings were made continuously
throughout the thesis work and will hopefully continue afterwards.
The files are in .wav format and the naming convention is described in Table 2.1.

The Littmann audio files are 30 seconds long while the Olympus files have been
edited down to 30-60 seconds. The Olympus Dictaphone has a wide pickup range and
captures more than just the patient breathing. We have therefore manually removed
disturbances such as talking and doors slamming. The Littmann stethoscope has
a narrower pickup range, as it does not capture surrounding sounds to the same
extent as the Olympus does. These files have therefore not been manually edited
for disturbances.

Table 2.1: Description of the naming convention for the recording files.

XXX_ABY.WAV
  XXX   Recording ID
  AB    Type of recording (shorthand)
          LB   Littmann Before
          LA   Littmann After
          OB   Olympus Before
          OA   Olympus After
  Y     BOAS grade 0-3


The Littmann and the Olympus files are not identical; the Littmann device has
a sampling rate of 4000 samples/s while the Olympus device operates at the more
common sampling rate of 44100 samples/s.

Each dog is also examined by a veterinarian, who assigns a BOAS grade from 0
(best) to 3 (worst). The classification process involves listening to breathing sounds.

2.2 Data Augmentation

The data inherited from the previous thesis consisted of recordings from 41
different dogs. The distribution of BOAS classes is not uniform, as can be seen
in Table 2.2.

Table 2.2: The class spread of the original 41 dogs.

Class     Number of dogs
BOAS 0    19
BOAS 1    11
BOAS 2     7
BOAS 3     4

Since we have limited data available, data augmentation could be a useful tool to
synthesise new training data. There are multiple ways of doing this. We could, for
instance, augment either the audio files or the images that the network eventually
trains on. In hypothesis I, we focus on augmenting the audio files by splitting, time,
speed or pitch shifting, and introducing noise. In hypothesis II we augment the
frequency data fed to the network using SMOTE.

2.2.1 Splitting
Every audio file can be split into X shorter audio segments, which increases the
number of data files by a factor of X, see Table 2.3. This was already
implemented in an earlier stage of the project, where each file was split into
multiple 3-second-long segments. Splitting is applied to each audio file after its
primary augmentation.

Table 2.3: Very simple table explaining the splitting augmentation. In the original
audio file we have one segment containing 9 seconds (123456789). When we split
this into 3-second-long segments we get segment 1, 2 and 3.

Original    Segment
            123456789

Split       Segment 1    Segment 2    Segment 3
            123          456          789
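
The splitting step can be sketched in a few lines. The following is a minimal illustration, not the project's implementation: the function name `split_into_segments` and the choice to drop an incomplete final segment are assumptions made here. It works on any sliceable sequence, including a NumPy array of samples.

```python
def split_into_segments(signal, sampling_rate, segment_seconds=3):
    """Split a signal into consecutive, non-overlapping segments.

    Hypothetical helper: an incomplete segment at the end is dropped
    (an assumption; the text does not state how a remainder is handled).
    """
    samples_per_segment = int(sampling_rate * segment_seconds)
    n_segments = len(signal) // samples_per_segment
    return [
        signal[i * samples_per_segment:(i + 1) * samples_per_segment]
        for i in range(n_segments)
    ]


# Toy example from Table 2.3: one "sample" per second, 9 seconds in total
original = [1, 2, 3, 4, 5, 6, 7, 8, 9]
segments = split_into_segments(original, sampling_rate=1, segment_seconds=3)
print(segments)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

With the recording parameters from Section 2.1, a 30-second Littmann file sampled at 4000 samples/s would give ten segments of 12000 samples each.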


2.2.2 Time Shift
During time shift, the signal is shifted in small steps to the right. When the
signal is then split into segments using the splitting augmentation, the first
segment is discarded, as it is padded with zeros and using it would corrupt the
data set. Table 2.4 shows a simplified version of this augmentation method. In
reality the step size is not 1 but rather a predetermined number of samples or
seconds. The time shift must be shorter than the length of the segments after
splitting. The total number of segments can be calculated as

\[
\text{total segments} = \text{segments} + (\text{segments} - 1) \cdot \frac{\text{segment length}}{\text{time shift}} \tag{2.1}
\]

The variables are thus the number of segments in the unshifted signal, the segment
length and the time by which the signal is shifted. The time shift must be larger
than zero and smaller than the segment length. It is easy to pick a very small time
shift and gain a lot of extra segments, but these new segments will be almost
identical to the other augmented segments, and near-identical training data is not
very valuable to train on. A large time shift, on the other hand, results in few
new segments. A balance must be struck between the total number of segments and
the difference between them.

Table 2.4: A simplified image of how the time shift and split augmentation would
work. The original signal would be shifted until segment 2 is the same as segment
1 was originally.

segment 1    segment 2    segment 3
123          456          789
-12          345          678
--1          234          567
---          123          456

To perform the time shift augmentation, the roll function from numpy [12] is utilised
and the samples that wrap around to the start are set to 0, see Code 2.1.

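    import numpy as np

    def timeShift(data, sampling_rate, shift):
        # Shift the signal `shift` samples to the right (shift must be positive).
        # sampling_rate is not used in this function.
        augmented_data = np.roll(data, shift)
        # np.roll wraps the tail around to the start; replace it with silence
        augmented_data[:shift] = 0
        return augmented_data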

Code 2.1: Code for augmentation, timeshift
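
Equation 2.1 can be checked against the toy signal in Table 2.4. The sketch below is illustrative only: the helper names are hypothetical, it uses plain Python lists rather than NumPy, and it assumes that shifted copies are generated at every multiple of the time shift up to and including one full segment length, as in the last row of Table 2.4.

```python
def split(signal, segment_length):
    """Consecutive, non-overlapping segments; an incomplete tail is dropped."""
    count = len(signal) // segment_length
    return [signal[i * segment_length:(i + 1) * segment_length] for i in range(count)]


def shift_right(signal, shift):
    """Shift right by `shift` samples and pad the start with zeros,
    which has the same effect as Code 2.1 for a positive shift."""
    return [0] * shift + list(signal[:len(signal) - shift])


def time_shift_and_split(signal, segment_length, time_shift):
    """Original segments plus the usable segments of every shifted copy."""
    segments = split(signal, segment_length)
    for shift in range(time_shift, segment_length + 1, time_shift):
        shifted_segments = split(shift_right(signal, shift), segment_length)
        segments.extend(shifted_segments[1:])  # first segment holds the zero padding
    return segments


def total_segments(n_segments, segment_length, time_shift):
    """Equation 2.1, assuming the time shift divides the segment length."""
    return n_segments + (n_segments - 1) * segment_length // time_shift


signal = [1, 2, 3, 4, 5, 6, 7, 8, 9]
augmented = time_shift_and_split(signal, segment_length=3, time_shift=1)
print(len(augmented), total_segments(3, 3, 1))  # 9 9
```

The last shifted copy reproduces segments 1 and 2 of the original signal, so for this toy signal the count of 9 includes two repeated segments.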

2.2.3 Pitch Shift
In this process the pitch of the sound is changed without altering the speed of the
file. To augment the pitch we use the pitch_shift function from the librosa effects module [13].
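
librosa's `pitch_shift` specifies the shift as a number of steps, where one step is a semitone with the default of 12 bins per octave. As a rough guide to what a given setting means, the sketch below converts steps into a frequency ratio. The function name and the example values are illustrative assumptions, not settings taken from the thesis.

```python
def pitch_ratio(n_steps, bins_per_octave=12):
    """Factor by which every frequency is multiplied for a shift of n_steps."""
    return 2.0 ** (n_steps / bins_per_octave)


# Hypothetical settings: two steps down, unchanged, two steps up, one octave up
for n_steps in (-2, 0, 2, 12):
    print(n_steps, round(pitch_ratio(n_steps), 3))
```

A shift of 12 steps doubles every frequency, while a shift of a few steps changes the frequencies by roughly 6 % per step.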


2.2.4 Speed Shift

During this procedure the signal is stretched or compressed slightly; however, the
segment length is kept the same when the Mel-Frequency Cepstral Coefficients (MFCCs)
are extracted. Stretching our data gives more segments, but the individual
segments may not contain all the relevant data needed to train the network. On the
other hand, compressing our data might result in the individual segments containing
more data than necessary, see Table 2.5. The speed shift uses the time_stretch
function from librosa [14].

Table 2.5: A simplified table of how time stretching works. In reality the stretching
would be much smaller in relation to the segment than in this simplified table.

            s1   s2   s3   s4   s5   s6   s7   s8   s9
compressed  123  456  789
original    112  233  445  566  778  899
stretched   111  222  333  444  555  666  777  888  999
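The effect on the number of segments can be estimated with simple arithmetic. The sketch below assumes the librosa convention that a rate above 1 shortens the signal and a rate below 1 lengthens it, so that the new length is roughly the old length divided by the rate. The 60 s recording is a hypothetical example and the helper name is ours.

```python
def segments_after_stretch(n_samples, segment_length, rate):
    """Number of full segments after changing the playback speed by `rate`."""
    stretched_length = round(n_samples / rate)
    return stretched_length // segment_length

sample_rate = 4000                # Littmann sampling rate in samples/s
n_samples = 60 * sample_rate      # hypothetical 60 s recording
segment_length = 3 * sample_rate  # 3 s segments

for rate in (0.95, 1.0, 1.05):
    print(rate, segments_after_stretch(n_samples, segment_length, rate))
```

For this example the slowed-down signal yields one segment more and the sped-up signal one segment fewer than the original 20.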

2.2.5 Noise Introduction

Introducing noise would be detrimental if the listener were human, but since
the target is a neural network, it can be beneficial. The idea is that the noise will
slightly alter the signal, giving more data to train on, while the result stays similar
enough for the original signal to be the main characteristic. We use white noise with
zero mean and variance equal to one, scaled by a noise factor, see Code 2.2.

    import numpy as np

    def noiseAddition(data, noise_factor):
        noise = np.random.randn(len(data))
        augmented_data = data + noise_factor * noise
        # Cast back to same data type
        augmented_data = augmented_data.astype(type(data[0]))
        return augmented_data

Code 2.2: Code for noise augmentation.
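A short usage sketch of Code 2.2 follows. It assumes that the audio is stored as floating-point samples in the range -1 to 1; the sine wave is a hypothetical stand-in for a recording and the seed is only there to make the sketch reproducible. The function is repeated so that the block runs on its own.

```python
import numpy as np

def noiseAddition(data, noise_factor):
    noise = np.random.randn(len(data))
    augmented_data = data + noise_factor * noise
    # Cast back to same data type
    return augmented_data.astype(type(data[0]))

np.random.seed(0)  # reproducibility of the sketch only
t = np.linspace(0, 1, 4000, endpoint=False)
clean = np.sin(2 * np.pi * 100 * t).astype(np.float32)  # stand-in signal

noisy = noiseAddition(clean, noise_factor=0.005)
added = noisy - clean
# The added noise has a standard deviation close to the noise factor
print(noisy.dtype, float(added.std()))
```

Because the result is cast back to the input type, an integer-valued signal would not represent noise this small faithfully, so a floating-point representation is the safer assumption.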

2.2.6 Synthetic Minority Oversampling Technique

As the frequency data is not time dependent like the MFCCs are, we can use a
method called SMOTE to up-sample the smaller classes [15]. SMOTE works by
finding the nearest neighbors of a data point within its class and then generating
extra points in between them. After the augmentation, all classes have the same size.
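The interpolation step of SMOTE can be illustrated with a small numpy sketch. This is not the implementation referenced in [15]; the function name smote_sketch, the brute-force neighbor search and the toy data are assumptions made for illustration.

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic rows by interpolating between a minority
    sample and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances within the minority class
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for row in range(n_new):
        i = rng.integers(len(minority))  # random minority sample
        j = rng.choice(neighbors[i])     # one of its k neighbors
        gap = rng.random()               # position on the line from i to j
        synthetic[row] = minority[i] + gap * (minority[j] - minority[i])
    return synthetic

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sketch(minority, n_new=6, k=2)
print(new_points.shape)
```

Every synthetic point lies on a line segment between two existing points of the same class, so the new data stays inside the region that the class already occupies.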


2.3 Training Features
In machine learning, there are different features a network can be trained on.
Hypothesis I uses MFCC, as this has already been implemented in a previous stage
of the project, as well as Root Mean Square (RMS) and Zero Crossing Rate (ZCR).
We use Librosa [7], which extracts each feature point from the data points inside a
window that is n data points wide. The window is then moved one hop length of m
data points. This gives overlapping windows (when m < n) whose feature values are
eventually stacked into the image. For our second hypothesis we use a frequency
feature.

2.3.1 Mel-Frequency Cepstrum
To calculate the MFCC we apply the Fourier transform to the audio file, which
gives a Fourier spectrum [16]. The spectrum is mapped onto the mel frequency
scale, and we take the logarithm of its magnitude, on which we then perform a
cosine transform. The resulting spectrum is neither in the time nor in the frequency
domain, since it is a spectrum of a frequency spectrum. Because of this, the
creators [17] called it the quefrency domain and decided to call this specific
spectrum a cepstrum. To extract MFCC we use the command
librosa.feature.mfcc() [18].

A segment can look like Figure 2.1b. It consists of frames that are fed into the
network. A single frame can look like Figure 2.1a.

(a) One single frame, the frames are
strung together to form the segment.

(b) One 3 second segment.

Figure 2.1: Illustration of MFCC, the x-axis is the index of the frames.

When extracting the MFCC there are three major parameters: the number of MFCC,
the Fast Fourier Transform (FFT) window size and the hop length. These can be used
to control the amount of data that is used. Both a higher number of MFCC and a
smaller hop length give more data and a longer run time. We investigate the optimal
settings in Section 3.3.

2.3.2 Root Mean Square
The data points in the window are used to calculate the root mean square value


\mathrm{RMS} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^{2}}. (2.2)

The librosa command for this is librosa.feature.rms() [19]. This method has
been proven to be effective in classifying audio files [20].

2.3.3 Zero Crossing Rate

ZCR is a type of feature that can be extracted from audio files. It is represented by
a number between 0 and 1, where 0 means that the signal never changes sign within
the window and 1 means that the sign changes at every data point. To extract this
feature the Librosa command librosa.feature.zero_crossing_rate() [21] is used. ZCR
has been effectively used to classify audio files [20].
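To make the two frame-wise features concrete, the sketch below computes RMS and ZCR per window with plain numpy. It is an illustration under our own assumptions and not the Librosa implementation, which among other things pads the signal so that its frame count differs. The helper names and the 100 Hz test tone are hypothetical.

```python
import numpy as np

def frame_signal(signal, frame_length, hop_length):
    """Stack overlapping windows of `frame_length` samples, `hop_length` apart."""
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    starts = hop_length * np.arange(n_frames)
    return np.stack([signal[s:s + frame_length] for s in starts])

def rms(frames):
    # Square root of the mean of the squared samples in each window
    return np.sqrt(np.mean(frames ** 2, axis=1))

def zero_crossing_rate(frames):
    # Fraction of neighboring sample pairs whose signs differ
    signs = np.signbit(frames)
    return np.mean(signs[:, 1:] != signs[:, :-1], axis=1)

sr = 4000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 100 * t)  # 100 Hz test tone with amplitude 1
frames = frame_signal(tone, frame_length=2048, hop_length=256)
print(frames.shape, rms(frames)[0], zero_crossing_rate(frames)[0])
```

For a sine wave with amplitude 1 the RMS is close to 0.71, and a 100 Hz tone sampled at 4000 samples/s changes sign roughly 200 times per second, which gives a ZCR close to 0.05.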

2.3.4 Frequency

When observing the four classes in the frequency spectrum, one can notice a
difference between the classes, see Figure 2.2. In an attempt to utilize this, we
created a method to extract the frequency feature. Unlike the other features, the
frequency feature does not depend on time. We can therefore not train a Long
Short-Term Memory (LSTM) network on it. Because of this, we also created a CNN;
more about the networks in Section 2.4.


Figure 2.2: FFT plots for the Olympus data set recorded after exercise. One can
notice a few differences between the different classes.

2.4 Networks

There are many variables and structures that can be combined to generate a network,
to the degree that it is not feasible for us to explore them all. For the first hypothesis,
we use the network that Mårtensson arrived at, an LSTM network [9], and
try different alterations. We will call this network the base-(LSTM)-network, see
Table 2.6. The variations of this network that are used can be found in Appendix B.

Table 2.6: The base-(LSTM)-network.

Layer (type) Output Shape Param #
lstm (LSTM) (None, 47, 1024) 4358144
dropout (Dropout=0.3) (None, 47, 1024) 0
lstm1 (LSTM) (None, 256) 1311744
dropout1 (Dropout) (None, 256) 0
dense (Dense) (None, 32) 8224
dropout2 (Dropout=0.3) (None, 32) 0
dense1 (Dense) (None, 4) 132
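The parameter counts in Table 2.6 can be checked by hand. The sketch below uses the standard parameter formulas for an LSTM layer and a dense layer and assumes that the first layer receives the default 39 MFCC per time step; the helper names are ours.

```python
def lstm_params(units, input_dim):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(units, input_dim):
    # One weight per input and unit, plus one bias per unit
    return units * input_dim + units

n_mfcc = 39  # assumed input size of the first layer
counts = [
    lstm_params(1024, n_mfcc),  # lstm
    lstm_params(256, 1024),     # lstm1
    dense_params(32, 256),      # dense
    dense_params(4, 32),        # dense1
]
print(counts, sum(counts))
```

The four values agree with the Param # column, which supports the assumption that the network input has 39 features per frame.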


Because our second hypothesis uses a frequency feature that is not dependent on
time, we cannot use an LSTM network. Hence, we construct a CNN that we call the
freq network, see Table 2.7. We also try a variation of the network which can be
seen in Appendix B.7.

Table 2.7: The freq network.

Layer (type) Output Shape Param #
conv1d (Conv1D) (None, 1, 1024) input dependent
conv1d (Conv1D) (None, 1, 512) 2621952
dropout (Dropout=0.3) (None, 1, 512) 0
conv1d (Conv1D) (None, 1, 128) 327808
dense1 (Dense) (None, 4) 516

The models are trained with some additional parameters, such as Learning Rate
(LR), batch size and optimizer. These parameters are the same for all networks
unless stated otherwise and can be found in Table 2.8.

Optimizer Adam
Learning rate 1E-5
Batch size 128
Epochs 2500

Table 2.8: The default values when we train the networks.

2.5 Presenting the Results
The bulk of the results will be presented in the form of tables that show the accuracy
in percentages for the given altered variable, see Table 2.9. We will also present
accuracy and loss charts that show how the accuracy and loss change for each
epoch the network is trained. The accuracy is represented by a number between 0
and 1; at 0 the model classifies nothing correctly and at 1 it classifies everything
correctly. As we have four classes, an accuracy of 25 % would be the same as choosing
a class at random. The model aims to minimize the loss with each epoch. The blue
lines in the charts use the training data while the orange lines use the validation
data, i.e. data not used in training, see Figure 2.3a.

The validation data is also represented in the confusion matrix. The confusion
matrix has four rows and four columns. Each row represents a batch of segments
from the corresponding BOAS class. Each number is the percentage of the segments
in that row that the model assigns to the BOAS class of the column. This means that
the diagonal boxes represent what percentage the model classifies correctly. For
example, in Figure 2.3b the final model correctly classifies 96 % of the BOAS0
segments.
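A row-normalized confusion matrix of this kind can be computed as in the following sketch. The labels are made-up examples and the function name is hypothetical; rows are the true BOAS classes and columns the predicted classes, which is our reading of the description above.

```python
import numpy as np

def confusion_matrix_percent(true_labels, predicted_labels, n_classes=4):
    counts = np.zeros((n_classes, n_classes))
    for t, p in zip(true_labels, predicted_labels):
        counts[t, p] += 1  # row = true class, column = predicted class
    row_totals = counts.sum(axis=1, keepdims=True)
    # Guard against empty classes to avoid division by zero
    return 100 * counts / np.maximum(row_totals, 1)

true_labels      = [0, 0, 0, 0, 1, 1, 2, 2, 3, 3]
predicted_labels = [0, 0, 0, 1, 1, 1, 2, 3, 3, 3]
matrix = confusion_matrix_percent(true_labels, predicted_labels)
print(matrix)
```

In this toy example the diagonal reads 75, 100, 50 and 100, meaning that for instance 75 % of the BOAS0 segments are classified correctly.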


Table 2.9: Example accuracy table.

Altered variable Min [%] Max [%] Average [%] Variance [%]
MFCC 73.8 84.4 78.9 10.5
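The columns of such an accuracy table can be compiled from the accuracies of the individual cycles as sketched below. In the example row of Table 2.9, the Variance entry (10.5) is close to the difference between Max and Min (84.4 - 73.8 = 10.6), so the sketch assumes that this column is the spread between the best and worst cycle rather than a statistical variance. The list of accuracies is hypothetical.

```python
from statistics import mean

def summarise(accuracies):
    """Min, max, average and max-min spread of per-cycle accuracies in %."""
    return {
        "min": min(accuracies),
        "max": max(accuracies),
        "average": round(mean(accuracies), 1),
        "spread": round(max(accuracies) - min(accuracies), 1),
    }

accuracies = [73.8, 78.0, 79.4, 84.4]  # hypothetical accuracies, one per cycle
summary = summarise(accuracies)
print(summary)
```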

(a) (b)

Figure 2.3: Example accuracy and loss chart (a) and confusion matrix (b).

The final way we present our results is with classification tables. In the
classification tables, entire audio files that have been omitted from the training are
used. The files are split into segments, much like the splitting augmentation, and each
segment is classified. The tables show the BOAS class, the filename and what
percentage of the audio file's segments are classified as the different BOAS classes.
For example, in Table 2.10, we see that the model classifies 90 % of the segments in
the file called 999_lb0.wav correctly as BOAS0.

Table 2.10: Example classification table.

Classification [%]
Class Filename BOAS0 BOAS1 BOAS2 BOAS3
BOAS0 999_lb0.wav 90 10 0 0
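A row of a classification table can be derived from the per-segment predictions of one file, as in this minimal sketch. The predictions are invented so that they reproduce the example row, and the function name is hypothetical.

```python
from collections import Counter

def classification_row(segment_predictions, n_classes=4):
    """Percentage of a file's segments assigned to each BOAS class."""
    counts = Counter(segment_predictions)
    total = len(segment_predictions)
    return [round(100 * counts[c] / total) for c in range(n_classes)]

# Ten segments from one file: nine predicted as BOAS0, one as BOAS1
predictions = [0] * 9 + [1]
print(classification_row(predictions))  # -> [90, 10, 0, 0]
```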

The examples above are specifically for hypothesis I. The results for hypothesis II
will be presented the same way except that it uses two classes instead of four.



3
Methods - Hypothesis I

In the following chapter we demonstrate our approach to data augmentation, settings
for extracting features as well as different networks for our first hypothesis. We begin
by explaining our training classes. We continue by finding augmentations which
increase the accuracy for each recording device. We then move on to finding the
optimal settings for extracting MFCC and splitting, as well as when to use RMS and
ZCR. We end the chapter with an overfitting problem that was discovered during the
thesis.

3.1 Training Classes

BOAS can be classified in four different categories. When we train a network,
we decide what we want the model to classify. Since Mårtensson's work [9] showed
promising results, we continued to develop a network that could classify all four
grades.

3.2 Augmentation

We augment in two steps. Early results indicated a bias towards classifying BOAS0.
This was thought to be a result of the larger size of the BOAS0 data, see Figure 5.1.
Because of this, we first time shift the data so that all four classes contain
approximately the same number of segments. We use time shifting for this as it does
not alter the signal; it is only sampled in different intervals. We then continue by
increasing the amount of data. For this, we use speed and pitch shifting as well as
noise introduction.

All tests in this section will be using the Littmann data set as well as the Olympus
data set, both recorded after exercise. The base-(LSTM)-network is used and all
tests will run for 15 cycles meaning that each setup will have 15 models trained.
Statistics from all cycles will then be compiled in the following tables.

The naming convention in Tables 3.2 and 3.3 and later on in the result section is
explained in Table 3.1.


Table 3.1: Naming convention in tables; Even indicates that the data set has been
time shifted to have the same number of audio files in each class. In Pitch, the data
set has been pitch shifted and in Speed, the data set has been speed shifted. In
Noise, noise has been introduced to the audio file and during Combo noise has been
introduced as well as pitch and speed shift. And finally, Combo × 2 means that the
combination of all the augmentations has been used with two parameters.

Name Variables
Even Time shift to get the classes of same size
Pitch 1
Pitch × 2 1 and -1
Speed 1.05
Speed × 2 0.95 and 1.05
Noise 0.005
Noise × 2 0.005 and 0.0075
Combo Pitch, Speed, Noise
Combo × 2 Pitch × 2, Speed × 2, Noise × 2

For the Littmann data set, pitch and speed shifting increased the accuracy, see
Table 3.2. Noise, however, seems to have no or even a negative effect. A data set
which has been evened as well as pitch and speed shifted shows the most promising
results, with an average accuracy of 83 % and a variance of 4 %.

Table 3.2: The table shows the minimum, maximum and average accuracy of the
network when introducing noise, speed and pitch shifting. The settings described in
Section 3.3.1 are used, except for the previous results [9] which used a hop length
of 512.

Littmann Min [%] Max [%] Average [%] Variance [%]
Previous results 52.4 69.5 61.8 17.1
Even segments 68.8 84.9 78.7 16.1
Even segments + pitch 75.3 85.1 81 9.8
Even segments + speed 81 85.7 83.1 4.7
Even segments + noise 68.1 77 72.7 8.9
Even segments combo × 2 74.3 78.7 76.8 4.4
Even segments, pitch, speed 81 85.1 83 4

The results for the Olympus data set are different and can be seen in Table 3.3.
We see an increase in accuracy and a decrease in variance for speed shift and noise
introduction, both separately and combined. Shifting the pitch
resulted in a smaller variance but a lower overall average. With all augmentations,
the variance decreased from almost 10 % to ≈ 5 %. The highest average accuracy
was obtained by combining all augmentations with two parameters.


Table 3.3: The minimum, maximum and average accuracy of the network when
introducing noise, speed and pitch shifting. The settings described in Section 3.3.2
are used.

Olympus Min [%] Max [%] Average [%] Variance [%]
Previous results 81.3 94.3 86.9 13
Even segments 85.9 95.1 91.6 9.2
Even segments + pitch 87.5 93.6 91.2 6.1
Even segments + speed 91.3 96.7 94.4 5.4
Even segments + noise 91.6 96.8 94.1 5.2
Even segments + speed + noise 92.7 98.3 96.2 5.6
Even segments + combo × 2 97.2 98.8 97.7 1.6

Comparing Table 3.3 and Table 3.2 it is apparent that the data sets react differently
to the augmentations. For the Olympus data set, noise augmentation showed the
greatest improvement while for the Littmann data set, noise was by far the worst
type of augmentation. Because of this we will use all three types of augmentation
for the Olympus data sets but only pitch and speed shifting for the Littmann data
sets.

3.3 Optimal Settings
To achieve the best results, the optimal values for the different parameters used to
extract MFCC have to be determined. It is also of interest to see if features such as
RMS and ZCR improve the results. For all tests, the base-(LSTM)-network is
used. The tests run for 15 cycles, and statistics from all cycles are then compiled
in the tables.

3.3.1 Littmann Settings
The default variables used to extract the json files from the Littmann data sets are

#MFCC 39
Windowsize 2048
hop length 256

Tables 3.4 and 3.5 suggest that a hop length of 128 and 26 MFCC should yield the
best result. The results in Table 3.6 are not as decisive, as the minimum accuracy and
the variance improve with both a higher and a lower FFT window.


Table 3.4: The table shows the minimum, maximum and average accuracy of the
network with different hop lengths; default = 512, 256, 128 and 64 for the Littmann
data set, recorded after exercise, with even number of segments in each BOAS class.

Hop length Min [%] Max [%] Average [%] Variance [%]
512 72.8 83.4 77.9 5.6
256 68.8 84.9 78.7 16.1
128 80.4 86.9 83.2 6.5
64 72.8 82.9 78.2 10

Table 3.5: The table shows the minimum, maximum and average accuracy of the
network with different number of MFCC; 39, 26 and 13 for the Littmann data set,
recorded after exercise.

MFCC Min [%] Max [%] Average [%] Variance [%]
39 68.8 84.9 78.7 16.1
26 72.8 84.4 78.8 11.6
13 68.8 82.9 74.1 14.1

Table 3.6: The table shows the minimum, maximum and average accuracy of the
network with different width of FFT window length; 1024, 2048 and 4096 for the
Littmann data set, recorded after exercise.

FFT window Min [%] Max [%] Average [%] Variance [%]
1024 73.8 84.4 78.9 10.5
2048 68.8 84.9 78.7 16.1
4096 72.8 81.4 76.8 8.5

From Table 3.7 it seems that RMS on its own raises the accuracy above chance, but
to a much lower degree than MFCC. RMS has a single row of data, which can be
compared to our default case of MFCC that has 39 rows of data. Because of this, the
lower accuracy is not surprising. The accuracy for ZCR is around 25 %, which is the
same as randomly choosing a class.

Table 3.7: The table shows the minimum, maximum and average accuracy of the
network with additional training features for the Littmann data set, recorded after
exercise.

Feature Min [%] Max [%] Average [%] Variance [%]
MFCC 68.8 84.9 78.7 16.1
RMS 26.1 38.1 32.7 12
ZCR 19.5 29.6 26.3 10

MFCC+RMS+ZCR 74.8 84.4 79.4 9.5
MFCC+RMS 75.3 84.9 78.9 9.5


3.3.2 Olympus Settings

The default variables used to extract the JavaScript Object Notation (JSON) files
from the Olympus data sets are

#MFCC 39
Windowsize 2048
hop length 512

All tests are performed with the Olympus data set recorded after exercise, which has
been augmented to have the same number of segments in every BOAS class.

The default settings which have been used in previous tests for the Olympus data
sets are MFCC = 39, FFT window = 2048 and hop length = 512. In Table 3.8
we can see that a higher hop length gives a higher accuracy on average; however, a
hop length of 512 gives the highest maximum accuracy. Table 3.9 suggests that 26
coefficients are preferable as they produce a high accuracy with a smaller variance
than, for example, 39 coefficients. The width of the FFT window should be 1024
according to Table 3.10.

Table 3.8: The table shows the minimum, maximum and average accuracy of
the network with different hop lengths; 1024, 512 and 256 for the Olympus data
set, recorded after exercise, which has been augmented to have the same number of
segments in every BOAS class.

Hop length Min [%] Max [%] Average [%] Variance [%]
1024 87.2 93.6 92.2 6.4
512 85.9 95.1 91.6 9.2
256 85.7 93.6 90.4 7.8

Table 3.9: The minimum, maximum and average accuracy of the network with
different number of MFCC; 39, 26 and 13 for the Olympus after exercise data set
which has been augmented to have the same number of segments in every BOAS
class.

MFCC Min [%] Max [%] Average [%] Variance [%]
39 85.9 95.1 91.6 9.2
26 87.7 94.6 91.3 6.8
13 86.7 92.6 89.6 5.9


Table 3.10: The minimum, maximum and average accuracy of the network with
different width of FFT window length; 1024, 2048 and 4096 for the Olympus after
exercise data set which has been augmented to have the same number of segments
in every BOAS class.

FFT window Min [%] Max [%] Average [%] Variance [%]
1024 91.6 95 93.2 3.4
2048 85.9 95.1 91.6 9.2
4096 89.7 95.5 92.6 5.8

In Table 3.11 we notice that both RMS and ZCR perform just slightly better than
randomly choosing a class, but when all three features are combined the results are
better than when only using MFCC. The minimum accuracy has been raised by ≈ 5 %
and the variance is halved.

Table 3.11: The minimum, maximum and average accuracy of the network with
additional training features for the Olympus data set recorded after exercise.

Feature Min [%] Max [%] Average [%] Variance [%]
MFCC 85.9 95.1 91.6 9.2
RMS 21.5 35.7 26.0 14
ZCR 25 41.6 29 16

MFCC+RMS+ZCR 90.6 95 92.6 4.4
MFCC+RMS 86.2 94.6 91.7 8.3

3.3.3 Settings for Splitting
As mentioned before, splitting was used even before this thesis. Each audio file
was split into segments of three seconds. The width of the image that was
created and fed to the network is calculated by

\text{width} = \frac{\text{segment length} \cdot \text{sample rate}}{\text{hop length}}. (3.1)

We know that the sample rate is 4000 samples/s for Littmann and 44100 samples/s
for Olympus, which means that Olympus has 11 times more data than Littmann.
If we use the inherited hop length of 512 on both Littmann and Olympus we get
23 and 258 columns respectively to feed the LSTM network. The images created
from the Littmann recordings are significantly smaller than those from Olympus.
Different segment lengths were therefore tested on the Littmann data set recorded
after exercise using the optimal settings found in Table 5.1. The tests run for 15
cycles and statistics from all cycles are then compiled in the tables.
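The column counts quoted above follow directly from Equation (3.1). The small sketch below rounds the result down to whole columns, which is an assumption on our part; the exact number of frames produced by a feature extractor can differ by one depending on how it pads the signal.

```python
def image_width(segment_length_s, sample_rate, hop_length):
    """Number of whole columns according to Equation (3.1)."""
    return int(segment_length_s * sample_rate / hop_length)

print(image_width(3, 4000, 512))   # Littmann -> 23
print(image_width(3, 44100, 512))  # Olympus  -> 258
```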

In Table 3.12 we see that a segment length of 5 seconds performs better in every
aspect except for the variance. It would therefore be preferable to use a 5 second
segment length for the Littmann data set. These tests were performed in the final
stages of the thesis, which means that the tests performed in the result chapter still
use a 3 second segment length.

Table 3.12: The minimum, maximum and average accuracy of the network with
different segment lengths for the Littmann data set recorded after exercise using
optimal settings.

Segment length Min [%] Max [%] Average [%] Variance [%]
3 sec 83.2 87.1 85.2 3.8
5 sec 83.6 90.1 87.8 6.5

3.4 Overfitting Problem

In this section we touch upon an overfitting complication that is seen in Chapter 5
and discussed in Chapter 7. Since discovering this obstacle, attempts have been
implemented to circumvent the problem. Here we account for the two methods used
and discuss their potential.

When comparing the results we got during training with the results we got during
classification of omitted files, we see very different results. Instead of correctly
classifying up towards 90 % of the segments, the models classify almost everything
as BOAS0. Earlier in the project we got a bias towards BOAS0 in the training
as well, which we tried to counter by evening the BOAS classes, see Section 3.2. This
had positive effects on the training; both the training accuracy and the validation
confusion matrices showed significant improvement. But when testing the models
on the omitted files the results were very unsatisfactory, as there was an aversion to
BOAS3, the class that the model was most apt at identifying during training. We
have, in effect, created a hidden overfitting problem.

We make two different attempts to solve this; we limit the data in the classes to
the size of the smallest class (BOAS3) and we use an unevened data set with class-
weights.

3.4.1 BOAS3 Size Limit

In Table 3.13 we see that the smaller data set has similar training results compared
to a larger data set. During classification, Table 3.14, the network has a bias towards
BOAS0 and BOAS1. This is neither better nor worse than other results. Thus this
seems to be a dead end and will be abandoned.


Table 3.13: The minimum, maximum and average accuracy of the network. The
first network uses the Olympus data set recorded before exercise, which has been
evened, with optimal settings. This network ran for three cycles. The second network
uses a smaller version of the Olympus before exercise data set with only 13 dogs.
Optimal settings were used and the network ran for two cycles.

Data set Min [%] Max [%] Average [%] Variance [%]
Optimal settings 88.7 90 89.4 1.2
Smaller data set 89.3 90.6 89.9 1.2

Table 3.14: Classification on omitted files. The network uses a smaller version of
the Olympus before exercise data set but with only 13 dogs. Optimal settings were
used and the network ran for two cycles.

Classification [%]
Class Filename BOAS0 BOAS1 BOAS2 BOAS3
BOAS1 010_ob1.wav 0 70 0 30
BOAS0 035_ob0.wav 40 50 0 10
BOAS1 074_ob1.wav 83 16 0 0
BOAS3 076_ob3.wav 37 45 16 0
BOAS0 081_ob0.wav 64 35 0 0
BOAS2 106_ob2.wav 14 85 0 0

3.4.2 Class-weights
Another attempt to counter the hidden overfitting problem is to use class weights.
In this approach, the network's verdict is influenced by the weight of each class. To
test this method, the Olympus data set, recorded before exercise, which has not
been evened is used. When training the model it was fed the class-weight vector

{0 : 1/49, 1 : 1/35, 2 : 1/26, 3 : 1/13}.

This means that a BOAS0 file is worth less than a file from the other classes during
training, as a smaller class-weight decreases the impact of each segment.
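The class-weight vector has the form of one over a class size. The sketch below rebuilds it under the assumption that 49, 35, 26 and 13 are the sizes of the four BOAS classes, and shows how strongly a BOAS3 sample is weighted relative to a BOAS0 sample. Exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction

# Assumed class sizes, read off the denominators of the class-weight vector
class_counts = {0: 49, 1: 35, 2: 26, 3: 13}
class_weight = {label: Fraction(1, n) for label, n in class_counts.items()}

# With inverse-size weights every class contributes the same total weight
totals = {label: class_counts[label] * w for label, w in class_weight.items()}
ratio_3_to_0 = class_weight[3] / class_weight[0]
print(totals, float(ratio_3_to_0))
```

A BOAS3 sample thus counts roughly 3.8 times as much as a BOAS0 sample. If the networks are trained with Keras, a dictionary like this, converted to floats, could be passed through the class_weight argument of the fit method.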

Table 3.15: The minimum, maximum and average accuracy of the network. The
networks use the Olympus data set, before exercise. Optimal settings were used,
except for the evened time shift. Note that not all runs are included here; only one
run was finished for the optimal settings.

Settings Min [%] Max [%] Average [%] Variance [%]
Default settings 74.8 80.6 77.3 5.8
Optimal settings 86.4 86.4 86.4 0

Optimal settings, higher LR 85.4 89.9 88.6 4.4


Table 3.16: Classification on omitted files. The network used is the best performing
in Table 3.15. Optimal settings were used, except for the evened time shift. The
network ran for one cycle.

Classification [%]
Class Filename BOAS0 BOAS1 BOAS2 BOAS3
BOAS1 010_ob1.wav 40 40 10 10
BOAS0 035_ob0.wav 90 10 0 0
BOAS1 074_ob1.wav 100 0 0 0
BOAS3 076_ob3.wav 100 0 0 0
BOAS0 081_ob0.wav 78 21 0 0
BOAS2 106_ob2.wav 64 35 0 0

The results in Table 3.15 for Default settings and Optimal settings show a decrease
in accuracy compared to the result presented in Figure 5.5a. However, in the accuracy
plot in Figure 3.1 we notice that both the training and validation curves are still
increasing after 2500 epochs. Therefore the learning rate is increased from 1E-5 to
1E-4 in Optimal settings, higher LR. But as seen in Table 3.16 this did not
improve the classification.

Figure 3.1: In the accuracy chart we can see that both the training and the
validation accuracy are still increasing after the 2500 epochs.



4
Methods - Hypothesis II

In this chapter we explore our methods for hypothesis II. First we talk about the
number of training classes. We then find the best network and augmentation set-
tings. We conclude with finding the optimal settings for the frequency feature.

4.1 Training Classes

As we have mentioned before, BOAS has four grading levels. In our
first hypothesis, we use the four BOAS grades as our classes. In our second
hypothesis, we use the fact that both grade 0 and 1 indicate a somewhat normal
breathing that does not require surgery, while grade 2 and 3 indicate severe breathing
problems that do require surgery. Instead of developing a network that can distinguish
between all four categories, we create a model that classifies BOAS- and BOAS+.

4.2 Networks and Data Augmentation

For this hypothesis we use the frequency feature and augment our data using
SMOTE. In SMOTE, one can choose the number of neighbors that are used to create
the new data points. The following tests are to establish which of the two networks
is favorable as well as what the number of neighbors should be. All tests use the
Olympus data set, recorded before exercise. The tests run for 15 cycles each. To
save time and computing resources we do not run these tests for all four data sets.

When comparing the two networks freq and freq2, in Table 4.1, it is hard to draw
any real conclusion. If we look at the accuracy plots in Figure 4.1, we see that the
training accuracy for freq2 approaches 1 much faster than freq. Thus, the freq2
network is considered to be superior. Both networks show signs of overfitting as can
be seen in the loss charts in Figure 4.1. This is believed to be because of the class
spread of the dogs. To counter this we try to augment the data using SMOTE to
even out the classes. As we can see in Table 4.2 this proves to have a significant
effect. The choice of the number of neighbors seems to have little effect, although a
higher number seems to have a lower variance.


Table 4.1: The minimum, maximum and average accuracy as well as the variance
for two different CNN.

Network Min [%] Max [%] Average [%] Variance [%]
freq 43.7 45.3 44.5 1.5
freq2 39.5 47.2 42.9 7.6

(a) Accuracy/loss for the freq
network.

(b) Accuracy/loss for the freq2
network.

Figure 4.1: The accuracy approaches 1 faster for the freq2 network than for the
freq network.

Table 4.2: The minimum, maximum and average accuracy as well as the variance
for the freq2 network. We vary the neighbors while using SMOTE with 2, 5 and 9
neighbors.

SMOTE Min [%] Max [%] Average [%] Variance [%]
2 66.2 71.4 68.9 5.1
5 66.9 72.5 69.6 5.6
9 67.7 70.2 69 2.4

4.3 Optimal Settings

When taking the FFT of our signal, we get an array which will be of different size
for Littmann and Olympus. For our frequency feature, we then have to decide how
many of the data points to use. A rough estimate is to use half the sampling rate
(2000 for Littmann and 22050 for Olympus), as the spectrum of a real signal is
mirrored above that point. All tests use the Olympus and Littmann data sets,
recorded before exercise, with the freq2 network. The tests run for 15 cycles each
except for Olympus with 22050. Because of the large number of trainable
parameters, we only run it for one cycle.
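A frequency feature of this kind could be extracted as in the sketch below. This is our own minimal illustration, assuming that the feature is the magnitude of the FFT truncated to the first n_points values; the function name and the 440 Hz test tone are hypothetical and the thesis method may differ in detail.

```python
import numpy as np

def frequency_feature(signal, n_points):
    """Magnitude spectrum of `signal`, truncated to the first `n_points` bins."""
    spectrum = np.abs(np.fft.fft(signal))
    return spectrum[:n_points]

sr = 4000                           # Littmann sampling rate
t = np.arange(sr) / sr              # one second of signal, so bin k is k Hz
tone = np.sin(2 * np.pi * 440 * t)  # stand-in for a recording
feature = frequency_feature(tone, n_points=sr // 2)

# The bins above half the sampling rate mirror the ones below it
full = np.abs(np.fft.fft(tone))
mirrored = np.allclose(full[1:], full[1:][::-1])
print(feature.shape, int(np.argmax(feature)), mirrored)
```

For the one second test tone the feature has 2000 values with its peak at bin 440, and the mirrored upper half confirms that nothing is lost by stopping at half the sampling rate.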


Table 4.3: The minimum, maximum and average accuracy of the network, with
different settings for the Littmann data set, recorded before exercise.

# data points Min [%] Max [%] Average [%] Variance [%]
1500 50.3 80.2 75.7 29
2000 50.4 79.4 75.7 28

In Table 4.3 the results are practically interchangeable. As fewer data points give
a faster training, 1500 data points were chosen.

Table 4.4 shows that a higher range of data points is preferable. Since Olympus has
a higher sampling rate, this is expected. There are, however, limits to our computing
capabilities; we therefore use 22050 data points for the Olympus data sets.

Table 4.4: The minimum, maximum and average accuracy of the network, with
different settings for the Olympus data set, recorded before exercise. The 22050 run
was only run once because of its size.

# data points Min [%] Max [%] Average [%] Variance [%]
1500 50.3 79.8 73.3 29
3000 73.4 79.6 76.4 6.2
22050 82.4 82.4 82.4 0



5
Results - Hypothesis I

In the following chapter we present the final results for hypothesis I. We begin
with the data sets that have been augmented in various ways with different settings
to determine the best data sets. Then we continue with how the networks created
with the best performing data sets perform with new recordings. We end this
chapter by comparing the different networks.

5.1 Augmentation

Originally the data sets consisted of 41 dogs. During the project, 83 additional
dogs have been recorded, totaling 124 dogs. The interest in BOAS classification
amongst dog owners is great, so even more recordings are expected to take place.
From this alone, the data set has grown to about 300 % of its original size during
the project. The Olympus data set recorded after exercise is first time shifted to
make the BOAS classes more equally large. This gives ≈ 230 recordings. These 230
recordings are then augmented by pitch, speed and noise according to the
combination in Table 3.1. The end result is the equivalent of 1610 recordings, which
is ≈ 1300 % of the 124 recorded dogs, or 39 times the data that the previous thesis
had to train with. For the Littmann data sets, the equivalent is about 1150
recordings, which corresponds to about 920 %.
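The figures for the Olympus data set can be retraced with a few lines of arithmetic. The factor of seven is our reading of Combo × 2 in Table 3.1: the evened recording itself plus two pitch, two speed and two noise variants. It is an assumption, although it reproduces the stated 1610 recordings.

```python
original_dogs = 41       # dogs available to the previous thesis
total_dogs = 124         # dogs recorded by the end of the project
evened_recordings = 230  # after time shifting to even out the classes

# Assumed: the evened recording plus 2 pitch, 2 speed and 2 noise variants
variants_per_recording = 1 + 2 + 2 + 2
augmented = evened_recordings * variants_per_recording

print(augmented)                           # equivalent number of recordings
print(round(augmented / total_dogs, 1))    # relative to the 124 dogs
print(round(augmented / original_dogs))    # relative to the original 41 dogs
print(round(total_dogs / original_dogs, 1))
```

The ratios come out as roughly 13 and 39, matching the ≈ 1300 % and the factor of 39 quoted in the text, while 124 dogs is about three times the original 41.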

The final distribution of BOAS classes can be seen in Figure 5.1 and the breed of
dogs in Figure 5.2.


Figure 5.1: The distribution of the different BOAS classes in the recorded data.

Figure 5.2: The distribution of the different dog breeds in the recorded data.
Breeds with two or fewer dogs are placed in the other section.

All tests are performed with the base-(LSTM)-network. The tests will run for 15
cycles and statistics will then be compiled in tables.

5.1.1 Littmann Data Set
The optimal values were obtained through various tests as explained in Section 3.3.1
and can be seen in Table 5.1. In Table 5.2 we see that the Littmann data set,
recorded before exercise, with optimal settings performs the best in all categories
with an average accuracy of 86.1 %. The same can be said for the Littmann data
set recorded after exercise, also using the optimal settings. Here, we have an average
accuracy of 85.2 %, see Table 5.3.

An accuracy/loss chart as well as a confusion matrix for the best performing models
for both Littmann data sets can be seen in Figures 5.3 and 5.4. We see that the
training data is close to perfect while the loss could be lowered further. It would
seem that the Littmann data set, recorded before exercise, is better at classifying
BOAS3 compared to the Littmann data set, recorded after exercise, based on the
confusion matrices.

Table 5.1: The optimal values for each variable for the Littmann data set before
and after exercise based on Section 3.3.1

Variable Values
Timeshift Even
Pitch -1, 1
Speed 0.95, 1.05
#MFCC 26
FFT window 2048
Hop length 128
Network base-(LSTM)-network
Additional features RMS

Table 5.2: The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Littmann data set, recorded before
exercise. Previous results are the results from the previous thesis [9].

Littmann Before Min [%] Max [%] Average [%] Variance [%]
Previous results 69.5 77.1 66.4 18.1
Default settings 63 69.2 66.5 6.1
Even, default settings 63.3 70 65.5 6.6
Optimal settings 83.4 87.4 86.1 3.9

Figure 5.3: Training and validation (a) as well as confusion matrix (b) for the
Littmann before exercise data set using the optimal settings.

Table 5.3: The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Littmann data set, recorded after ex-
ercise. This training uses all data acquired before 2021-05-01. Previous results are
the results from the previous thesis [9].

Littmann After Min [%] Max [%] Average [%] Variance [%]
Previous results 52.4 69.5 61.8 17.1
Default settings 55.3 64 58.9 8.6
Even, default settings 61.6 67.1 64.2 5.4
Optimal settings 83.2 87.1 85.2 3.8

Figure 5.4: Training and validation (a) as well as confusion matrix (b) for the
Littmann after exercise data set using the optimal settings.

5.1.2 Olympus Data Set
In Section 3.3.2, tests were performed to determine the optimal values for the
different parameters; the results are listed in Table 5.4. In Table 5.5 we see
that the optimal settings give the highest performance for the Olympus data set
recorded before exercise, with an average accuracy of 89.4 % and a variance of
only 1.2 %. Table 5.6 shows an accuracy of 93.6 % for the data set recorded after
exercise with optimal settings. However, this figure comes from a single run,
which is why the minimum, maximum and average coincide.

An accuracy/loss chart and a confusion matrix for the best-performing model on
each Olympus data set can be seen in Figures 5.5 and 5.6. Both Olympus data sets
perform better than the Littmann data sets in terms of both accuracy and loss.
The Olympus data set recorded after exercise performs better for all classes
except BOAS3.

Table 5.4: The optimal values for each variable for the Olympus data set based on
Section 3.3.2

Variable Value
Timeshift Even
Pitch -1, 1
Speed 0.95, 1.05
Noise 0.005, 0.0075

#MFCC 26
FFT window 2048
Hop length 256
Network base-(LSTM)-network

Additional features RMS, ZCR

Table 5.5: The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Olympus data set, recorded before
exercise. This training uses all data acquired before 2021-05-01. Previous results
are the results from the previous thesis [9].

Olympus Before Min [%] Max [%] Average [%] Variance [%]
Previous results 81.4 92.4 86.3 11
Default settings 73.2 83.5 80 10
Even, default settings 85.3 89.1 87.2 3.8
Optimal settings 88.7 90 89.4 1.2

Figure 5.5: Training and validation (a) as well as confusion matrix (b) for the
Olympus before exercise data set using the optimal settings.

Table 5.6: The minimum, maximum and average accuracy of the network with
different settings and augmentations for the Olympus data set, recorded after ex-
ercise. This training uses all data acquired before 2021-05-01. Previous results are
the results from the previous thesis [9].

Olympus After Min [%] Max [%] Average [%] Variance [%]
Previous results 81.3 97.3 86.9 13
Default settings 74.3 82.9 79.3 8.6
Even, default settings 87.5 90 88.7 2.5
Optimal settings 93.6 93.6 93.6 0

Figure 5.6: Training and validation (a) as well as confusion matrix (b) for the
Olympus after exercise data set using the optimal settings.

5.2 Classification

To test the final models, we use files that were randomly omitted from the training.
Note that only one file each of BOAS2 and BOAS3 is omitted, because these classes
contain far fewer recordings and omitting more might harm the training.
The audio files are split into 3-second segments, which the model then tries to classify.
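
The per-file evaluation described above can be sketched as follows: a recording is cut into 3-second segments, each segment receives a class label, and the share of segments per class is reported. This is a minimal illustration, not the thesis implementation. The function names are hypothetical, dropping a trailing partial segment is an assumption, and the per-segment labels would in practice come from the trained model.

```python
from collections import Counter

SEGMENT_SECONDS = 3  # segment length stated in the text


def split_into_segments(samples, sample_rate, seconds=SEGMENT_SECONDS):
    """Cut a 1-D sample sequence into consecutive full-length segments."""
    size = int(sample_rate * seconds)
    # Assumption: a trailing remainder shorter than one segment is dropped.
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def class_percentages(segment_labels, classes=("BOAS0", "BOAS1", "BOAS2", "BOAS3")):
    """Percentage of segments assigned to each class."""
    total = len(segment_labels)
    if total == 0:
        raise ValueError("no segments to summarize")
    counts = Counter(segment_labels)
    return {c: 100.0 * counts[c] / total for c in classes}


# Hypothetical 10-second recording at 4000 Hz: three full 3-second segments.
segments = split_into_segments([0.0] * 40000, sample_rate=4000)
print(len(segments))

# Hypothetical per-segment predictions for one file.
labels = ["BOAS0"] * 9 + ["BOAS2"]
print(class_percentages(labels))
```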

5.2.1 Littmann Data Set

The classification is mostly incorrect for the Littmann data set, recorded before
exercise, with optimal settings. The results lean towards BOAS0,
see Table 5.7. The network classifies only two out of six dogs correctly.
Although it is very certain of these two classifications, it is equally certain of
other classifications that are incorrect.

Table 5.7: Classification on omitted files for the Littmann data set, recorded before
exercise, using optimal settings.

Classification [%]
Class Filename BOAS0 BOAS1 BOAS2 BOAS3 Correct
BOAS1 013_lb1.wav 90 0 10 0 X
BOAS0 033_lb0.wav 90 0 10 0 X
BOAS2 060_lb2.wav 27 18 0 54 X
BOAS3 068_lb3.wav 81 0 9 9 X
BOAS1 093_lb1.wav 58 0 0 41 X
BOAS0 102_lb0.wav 91 0 8 0 X

For the Littmann data set recorded after exercise with optimal settings, Table 5.8
shows a bias towards BOAS3. The model does classify BOAS2 correctly, which is rare.
Overall, the results are poor, as it classifies only one out of six dogs correctly.

Table 5.8: Classification on omitted files for the Littmann data set, recorded after
exercise, using optimal settings.

Classification [%]
Class Filename BOAS0 BOAS1 BOAS2 BOAS3 Correct
BOAS0 006_la0.wav 0 0 0 100 X
BOAS1 014_la1.wav 0 40 0 60 X
BOAS2 055_la2.wav 0 0 72 27 X
BOAS3 069_la3.wav 0 8 58 33 X
BOAS1 093_la1.wav 0 0 0 100 X
BOAS0 102_la0.wav 0 0 30 69 X

5.2.2 Olympus Data Set
In contrast to the very high performance in Table 5.5, the classification with the
model trained on the Olympus recordings before exercise is mostly incorrect, see
Table 5.9. Only two out of six dogs are classified correctly. The model has a hard
time with BOAS2 and BOAS3, and almost no segments are classified as BOAS3.

Table 5.9: Classification on omitted files for the Olympus data set, recorded before
exercise, using optimal settings.

Classification [%]
Class Filename BOAS0 BOAS1 BOAS2 BOAS3 Correct
BOAS1 010_ob1.wav 50 10 30 10 X
BOAS0 035_ob0.wav 90 10 0 0 X
BOAS1 074_ob1.wav 100 0 0 0 X
BOAS3 076_ob3.wav 4 95 0 0 X
BOAS0 081_ob0.wav 78 21 0 0 X
BOAS2 106_ob2.wav 50 42 7 0 X

We see similar results for the Olympus recordings after exercise in Table 5.10.
This model assigns more segments to BOAS2 and BOAS3, although the
classification is still mostly incorrect.

Table 5.10: Classification on omitted files for the Olympus data set, recorded after
exercise, using optimal settings.

Classification [%]
Class Filename BOAS0 BOAS1 BOAS2 BOAS3 Correct
BOAS1 029_oa1.wav 30 0 40 30 X
BOAS0 038_oa0.wav 0 10 10 80 X
BOAS3 059_oa3.wav 83 8 0 8 X
BOAS2 063_oa2.wav 16 24 60 0 X
BOAS1 090_oa1.wav 7 30 23 38 X
BOAS0 103_oa0.wav 0 0 14 85 X

5.3 Networks
As mentioned before, all previous tests have used the base-(LSTM)-network found
in Table 2.6. We further developed that network to try to increase the accuracy.
All tests in this section use the Littmann data set, recorded after exercise, or the
Olympus data set, recorded after exercise, which both have been evened. The default
settings found in Section 3.3.1 and Section 3.3.2 are used and each network has been
run 15 times.

We see that the base-(LSTM)-network and network 4 have the highest average
accuracy. Of the two, the base-(LSTM)-network has the lower variance,
see Table 5.11.

In Table 5.12 we can see that the base-(LSTM)-network and networks 5 and 6
have the highest average accuracy. Network 6, which is a very small network, has
a high variance, while the base-(LSTM)-network and network 5 both have a lower
variance of 4.3 %.

Table 5.11: The minimum, maximum and average accuracy of the different
networks with the Littmann data set recorded after exercise.

Network Min [%] Max [%] Average [%] Variance [%]
Base 62.1 67.3 64.5 5.1
2 61.3 66.8 64.2 5.4
3 59.8 66.7 64.0 6.9
4 59.8 67.0 64.5 7.2
5 59.3 67.3 63.6 8.0
6 53.8 59.4 56.9 5.6
7 60.1 64.8 62.3 4.7

Table 5.12: The minimum, maximum and average accuracy of the different
networks with the Olympus data set recorded after exercise.

Network Min [%] Max [%] Average [%] Variance [%]
Base 85.6 89.9 88.3 4.3
2 82.7 89.9 86.7 7.1
3 83.9 90.2 86.6 6.3
4 83.9 89.3 87.8 5.4
5 86.6 90.9 88.3 4.3
6 84.0 91.5 88.3 7.4
7 79.1 85.7 82.3 6.6

6 Results - Hypothesis II

In this chapter we show the classification results for each data set, with optimal
settings, using our second hypothesis. Each of the Littmann tests was run for 15
cycles, while the Olympus tests were run for only one cycle due to the large number
of parameters.

In Table 6.1 we can see that the model classifies five out of six dogs correctly. It is
unsure about the remaining dog, classifying 50 % of the segments as BOAS- and 50 % as
BOAS+. These results are for the Littmann data set recorded before exercise.
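
The step from per-segment predictions to one decision per dog can be sketched as a majority vote with an explicit undecided outcome. The sketch rests on two assumptions: BOAS0 and BOAS1 form the BOAS- group while BOAS2 and BOAS3 form the BOAS+ group, and a 50/50 split is reported as undecided ("-"). The names `GROUP_OF`, `dog_level_decision` and `is_correct` are hypothetical.

```python
# Assumed grouping of the four grades into the two classes of hypothesis II.
GROUP_OF = {"BOAS0": "BOAS-", "BOAS1": "BOAS-", "BOAS2": "BOAS+", "BOAS3": "BOAS+"}


def dog_level_decision(segment_groups):
    """Majority vote over per-segment BOAS-/BOAS+ predictions; None on a tie."""
    minus = segment_groups.count("BOAS-")
    plus = segment_groups.count("BOAS+")
    if minus == plus:
        return None  # undecided, shown as "-" in the tables
    return "BOAS-" if minus > plus else "BOAS+"


def is_correct(true_grade, segment_groups):
    """True or False when the vote is decisive, None when it is undecided."""
    decision = dog_level_decision(segment_groups)
    return None if decision is None else decision == GROUP_OF[true_grade]


# Hypothetical per-segment predictions for three omitted files.
print(is_correct("BOAS1", ["BOAS-"] * 10))
print(is_correct("BOAS2", ["BOAS-"] * 7 + ["BOAS+"] * 3))
print(is_correct("BOAS1", ["BOAS-", "BOAS+"]))
```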

Table 6.1: Classification on omitted files for the Littmann data set, recorded before
exercise, using optimal settings.

Classification [%]
Class Filename BOAS- BOAS+ Correct
BOAS1 013_lb1.wav 100 0 X
BOAS0 033_lb0.wav 60 40 X
BOAS2 060_lb2.wav 18 81 X
BOAS3 068_lb3.wav 36 63 X
BOAS1 093_lb1.wav 50 50 -
BOAS0 114_lb0.wav 30 69 X

For the Littmann data set recorded after exercise, the model classifies four out of
six dogs correctly, see Table 6.2.

Table 6.2: Classification on omitted files for the Littmann data set, recorded after
exercise, using optimal settings.

Classification [%]
Class Filename BOAS- BOAS+ Correct
BOAS0 006_la0.wav 60 40 X
BOAS1 014_la1.wav 50 50 -
BOAS2 055_la2.wav 72 27 X
BOAS3 069_la3.wav 41 58 X
BOAS1 093_la1.wav 75 25 X
BOAS0 102_la0.wav 92 7 X

Table 6.3 shows our results for the Olympus data set, recorded before exercise.
The model cannot classify a single dog correctly. We see better results in
Table 6.4, where the model correctly classifies four out of six dogs.

Table 6.3: Classification on omitted files for the Olympus data set, recorded before
exercise, using 22050 frequency points.

Classification [%]
Class Filename BOAS- BOAS+ Correct
BOAS1 010_ob1.wav 40 60 X
BOAS0 035_ob0.wav 20 80 X
BOAS1 074_ob1.wav 16 83 X
BOAS3 076_ob3.wav 66 33 X
BOAS0 081_ob0.wav 0 100 X
BOAS2 106_ob2.wav 57 42 X

Table 6.4: Classification on omitted files for the Olympus data set, recorded after
exercise, using 22050 frequency points.

Classification [%]
Class Filename BOAS- BOAS+ Correct
BOAS1 029_oa1.wav 10 90 X
BOAS0 038_oa0.wav 80 20 X
BOAS3 059_oa3.wav 0 100 X
BOAS2 063_oa2.wav 20 80 X
BOAS1 090_oa1.wav 15 84 X
BOAS0 103_oa0.wav 71 28 X

7 Discussion

Here we analyse our results, reason about why they behave as they do and point
out possible errors that may have contaminated them. We also touch on
the ethical aspects of the results and the changes they could lead to.

7.1 Hypothesis II
Hypothesis II proved to be a very good solution for the Littmann data sets, as the
network successfully classified five out of six and four out of six of the
unheard audio files for the recordings before and after exercise, respectively.
The data set is still limited, which makes it hard to say whether or not the model
can classify all BOAS gradings in all dogs. However, the high classification
results indicate that this method might be good enough to be used in a prototype
application today.

The hypothesis does not work as well for the Olympus data sets. Recordings after
exercise show a reasonably good result, while the results for recordings before
exercise are very poor. We do not know the reason for this with certainty, but a
potential explanation is given in Section 7.4.

One interesting observation is that, across all four models, only one
BOAS1 file is correctly classified, which is worse than for any other class. We do
not really know why. It might simply be harder to distinguish this class, or there
might be some bad files in the BOAS1 training set that corrupt
the training.

7.2 Hypothesis I
In the following sections we discuss the augmentation, classification and network
results for our first hypothesis.

7.2.1 Augmentation
In Chapter 5 we see that for both the Littmann and Olympus data sets, recorded
before and after exercise, the optimal settings give the best results in most
categories. Since these are the settings determined in Chapter 3, it is not
surprising that they give the best accuracy. If we compare the Littmann and
Olympus data sets, we see that a network using the Olympus data sets yields better
accuracy. This is unsurprising, since the Olympus recordings contain about 11 times
more data than the Littmann recordings. For the Olympus data sets, recordings
performed after exercise also show a higher accuracy than recordings performed
before exercise, which is expected since the breathing sounds are more pronounced
after light exercise.

All data has been split into 3-second segments. There is a balance here between
gaining more data and keeping the segments long enough to contain the relevant
features. We experimented with five-second segments for the Littmann data sets,
since the image created from 3 seconds was small. This showed promising results,
and the segment length is a parameter that should be explored further in the future.

The Littmann data sets have been augmented by over 900 %. The corresponding figure
for the Olympus data sets is 1300 %. It would be very easy to make even larger data
sets by adding more variables and values for pitch and speed shifting as well as
noise to Tables 5.1 and 5.4. However, we believe that the augmentation will at some
point give diminishing returns. A more in-depth study of the settings used during
augmentation would be beneficial.
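
How quickly the data set grows when values are added to the augmentation tables depends on how the effects are combined, which is not restated in this chapter. The sketch below therefore shows two counting conventions as assumptions: each value applied on its own, and every combination of one value per effect. The values are those listed in Table 5.4, while the names `SETTINGS`, `single_effect_variants` and `combined_variants` are hypothetical.

```python
from itertools import product

# Values from Table 5.4 (Olympus); Table 5.1 lists the same pitch and speed values.
SETTINGS = {
    "pitch": [-1, 1],
    "speed": [0.95, 1.05],
    "noise": [0.005, 0.0075],
}


def single_effect_variants(settings):
    """One augmented copy per individual value, each effect applied on its own."""
    return [(name, value) for name, values in settings.items() for value in values]


def combined_variants(settings):
    """One augmented copy per combination of one value from every effect."""
    names = list(settings)
    return [dict(zip(names, combo)) for combo in product(*settings.values())]


print(len(single_effect_variants(SETTINGS)))  # copies if effects are applied alone
print(len(combined_variants(SETTINGS)))       # copies if effects are combined

# Adding one more pitch value grows the combined grid faster than the single one.
larger = dict(SETTINGS, pitch=[-2, -1, 1])
print(len(single_effect_variants(larger)), len(combined_variants(larger)))
```

Under the combined convention, every added value multiplies the number of copies made from the same original recording, which is one way the diminishing returns mentioned above could arise.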

7.2.2 Classification

Even though Section 5.1 shows promising results, we have problems using the
models to classify new recordings. Classifying BOAS2 and BOAS3 seems to be the
biggest obstacle. A likely cause is that there was very little original data in
these classes. We first augment this data by time shifting to even out the number
of recordings, and then continue to augment the files with pitch and speed shifting
as well as added noise. The data in these classes has been augmented so heavily
that 24 original recordings produce 483 sample segments for BOAS2 and 11 original
recordings produce 392 sample segments for BOAS3, that is, roughly 20 and 36
segments per original recording. Because of this, when we train our network on the
augmented files, we get an overfitting problem that is not visible in our accuracy
and loss charts. We are unknowingly teaching our network to recognize these 24 and
11 original files, just in various forms. When we then feed the model a new
recording of BOAS2 or BOAS3, it does not recognize it as such. One solution to this
problem would be to have more original data. The more original data there is, the
less augmentation is needed to obtain a sufficiently large data set.

As stated in Section 3.4, two ways of counteracting the overfitting problem were to
limit the size of the data sets to that of the smallest class and to use
class-weights. The idea behind limiting the number of files for BOAS0, 1 and 2 to
that of BOAS3 was that no class should be more augmented than any other. The
training accuracy was comparable to that of the optimal settings, but the
classification percentage did not improve. The attempt at using class-weights
showed a preliminary decrease in both training and classification performance.
Neither of these attempts provided the desired result.
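
As an illustration of the class-weight idea, the sketch below computes weights that are inversely proportional to the number of segments per class, using the common "balanced" heuristic total / (number of classes * class count). The heuristic, the function name and the segment counts are assumptions for illustration; the weighting actually used in the thesis is described in Section 3.4.

```python
def balanced_class_weights(segment_counts):
    """Weight each class by total / (n_classes * count), so rare classes count more."""
    total = sum(segment_counts.values())
    n_classes = len(segment_counts)
    return {label: total / (n_classes * count) for label, count in segment_counts.items()}


# Hypothetical segment counts per class (illustration only, not thesis data).
counts = {"BOAS0": 400, "BOAS1": 300, "BOAS2": 200, "BOAS3": 100}
weights = balanced_class_weights(counts)
for label, weight in weights.items():
    print(label, round(weight, 3))
```

In Keras, such weights can be passed to `Model.fit` through its `class_weight` argument, keyed by integer class index rather than by label. Note that weighting changes how much each segment contributes to the loss, but it does not add new original recordings to a class.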

7.2.3 Network
In Table 5.12 the base-(LSTM)-network and networks 5 and 6 perform
similarly. Network 5 is larger than the base-(LSTM)-network and performs only 1
percentage point better in maximum accuracy, while the average and variance
are the same. Is this one percentage point worth the extra computations
needed for the larger network? One could argue that the maximum accuracy is
not as important as the average accuracy, and that it is therefore more economical,
in both time and cost, to use as small a network as possible while still
achieving the highest average accuracy. Network 6 is much smaller than the
base-(LSTM)-network and has the same average accuracy. It would be the preferred
network if its variance were lower.

When comparing the networks for the Littmann and Olympus data sets, different
networks show the best performance. The data from the two devices has different
properties, for example the sample rate, which could be a factor when choosing
the best network. As discussed before, the amount of data gathered
with the devices also varies a lot and could affect which network is best suited.

7.3 Hypothesis I vs II
If we compare the classification results of the two hypotheses, Section 5.2 and
Chapter 6, it is clear that the second hypothesis performs better for all data sets
except the Olympus data set recorded before exercise.

If we consider the methods used, the two hypotheses have a different number of
training classes. The first hypothesis uses a more complex set of four classes,
chosen because of the promising results from Mårtensson [9], while hypothesis II
uses only two. A network learning to distinguish between only two classes has an
easier task than one that has to separate four. This could be a reason for the
poorer results of hypothesis I. It would be beneficial to be able to classify all
four gradings, for statistics, for veterinarians and for a more precise model. But
because BOAS0 and 1, as well as BOAS2 and 3, have such similar consequences, a
model that can separate these two groups would also contribute to the end goal.

We also have a difference in which features we extract. When we extract the
frequency feature, we assume that the important information lies in the frequency
data. The MFCC feature is based on frequency data, but this data is then reduced to
the chosen number of coefficients. Because of this we lose a lot of information,
even though we gain information over time when we use the LSTM network. This could
also contribute to the better performance of hypothesis II. Another advantage of
the frequency feature over MFCC is that it generates less complex training data, as
it is only one vector of values for each segment instead of a time series of vectors.
Because of this we can use a smaller network and less computational power.

Even though hypothesis I is much worse at classifying the omitted files, it is right
every so often. However, it never correctly classifies a dog with BOAS1, much like
hypothesis II.

7.4 Littmann vs. Olympus
The sampling rate for Littmann is 4000 Hz, while Olympus has a sampling rate of
44 100 Hz. This means that the Olympus recordings contain more than ten times as
much information as the Littmann recordings. In hypothesis I, we counteract this
somewhat by using different hop lengths, 256 for Olympus and 128 for Littmann. This
lowers the extra information to about five times that of Littmann, which is still
significant. This could be a reason why the Littmann data sets perform worse than
Olympus in the first hypothesis. Different segment lengths during the splitting
augmentation could be a way to reduce the difference, but Olympus would still have
more information.
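
The factors mentioned above can be checked with a short calculation of the number of feature frames per 3-second segment. The sketch assumes centred framing, where the number of frames is 1 + n_samples // hop_length, as librosa produces by default. The function name `frame_count` is hypothetical.

```python
def frame_count(sample_rate, seconds, hop_length):
    """Frames per segment, assuming centred framing: 1 + n_samples // hop_length."""
    n_samples = int(sample_rate * seconds)
    return 1 + n_samples // hop_length


littmann = frame_count(4000, 3, 128)         # Littmann, hop length 128
olympus = frame_count(44100, 3, 256)         # Olympus, hop length 256
olympus_same_hop = frame_count(44100, 3, 128)

print(littmann, olympus, olympus / littmann)   # 94 517 5.5
print(olympus_same_hop / littmann)             # 11.0
```

With the same hop length the Olympus segments would contain about eleven times as many frames as the Littmann segments. Doubling the Olympus hop length to 256 roughly halves this factor to 5.5.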

For our second hypothesis, the larger amount of information is a disadvantage.
Since the Olympus device has a very high sampling rate, we need to use a large
array of data points. This in turn leads to many training parameters, which
increases the computational demands. Because of this, it was not possible to use a
larger number of frequency points in this project. Increasing the number of
frequency points could yield better results for the Olympus data sets with
hypothesis II.

In Chapters 3 and 5, we see that the Littmann and Olympus data sets react dif-
ferently to augmentations and networks. It is therefore questionable to have the
uniform approach that this thesis has where we implement the same changes to all
data sets. It could be beneficial to treat the data sets as separate projects in an
effort to focus on the best solutions for the specific data set.

The Olympus recordings, as well as the device itself, resemble a mobile phone more
than the Littmann does. Since one of the end goals is to implement a mobile
application, the Olympus data set, especially the one recorded before exercise, is
of great interest. Networks trained with the Olympus data sets are also the better
performers of the two devices in hypothesis I. This is promising and might indicate
that the Olympus data sets should be the main focus. However, Littmann will still
be used by veterinarians for other ailments and is the better performer in
hypothesis II, with even better classification results. There is also the issue
that cell phones use different microphones, which makes classification harder.

7.5 Recordings Performed Before vs. After Exercise

Recordings were performed both before and after exercise. In both our hypotheses,
the networks trained with the Littmann recordings before exercise performed
better than the ones trained with recordings after exercise. This is somewhat
surprising, since breathing sounds should be more prominent after light exercise.
A possible explanation is disturbances caused by the difficulty of keeping the
stethoscope in the right position when the dog is agitated.

If we instead focus on the Olympus recordings, there seems to be no difference in
the number of correctly classified dogs during classification with our first
hypothesis. For our second hypothesis, recordings after exercise show a much better
performance than the ones before.

If we study the classification results for our first hypothesis, we can see that
the models using recordings after exercise classify more segments as BOAS2 and 3
than the models using recordings before exercise, even when the correct label is
BOAS0 or 1. This is strange and might be because our network identifies
characteristics other than breathing patterns as BOAS2 and 3. It could, for
instance, pick up on movements of the device caused by the dog being agitated.

7.6 Additional Possible Errors
The Olympus recordings have been edited, both to bring them into the range of 30-60
seconds and to remove voices and other major disturbances. Multiple
people have edited different files without a strict protocol. This creates a bias
in the recordings that may affect how well the network handles new, unedited
recordings. In this regard, the Littmann files are better, as there is no bias from
the editors. This will hopefully be solved as more data is gathered, since the same
disturbances will not be present in all recordings.

When classifying a BOAS degree, veterinarians follow a strict protocol. This means
that even though a recording of a dog does not show a high degree of BOAS, other
factors can influence the decision. This makes it difficult for our network to classify
the dog correctly. The future mobile application could also include a questionnaire
which the network also takes into consideration when making its judgment.

7.7 Ethical Consideration
There are approximately 12000 French bulldogs in Sweden alone. If 64 % of
them suffer from at least one type of BOAS affliction, at least
7500 dogs have a decreased standard of living due to a preventable disease. If we
extrapolate this to other short-nosed dog breeds and to dogs in other countries, it
becomes apparent that there is a need to do something. We think that an easy-to-use
tool with which ordinary dog owners can assess whether they are buying a healthy
dog, or one that will suffer throughout its life, would have a great impact on many
dogs’ lives.

This thesis investigates a non-invasive way to diagnose BOAS. If we are able to
correctly classify the BOAS degree of a calm, resting dog, we do not expose the
dog to any additional stress. Since a future application will process recordings
that could include a person's private affairs, precautions to ensure that the data
is not leaked will need to be implemented at a later stage.

In our assessment, the benefits of this work outweigh the risks.

8 Conclusion

We have shown that frequency data together with a CNN makes it possible to
classify the BOAS grade to a high degree. Our conclusion is to use this solution for
a potential product. The MFCC and other features together with an LSTM network
show that it is possible to train the network, but the limited data together with
the augmentation produced a hidden overfitting problem that appeared when the model
was used on unseen files. Therefore we think that our hypothesis II is better than
hypothesis I with the current amount of data. Even though the system works with
hypothesis II, it could still benefit greatly from more training data. To increase
the accuracy and, in time, perhaps expand to more classes, we recommend that the
recordings continue. If limitations must be made, the focus should be on the
recordings before exercise.

We conclude that the Littmann and Olympus data sets respond differently to the
same augmentations and networks. It could therefore be valuable to continue with
two separate projects in order to achieve the best results for both the Littmann and
Olympus data sets.

A sufficient result without physical exercise is preferable since it offers an even easier
and faster diagnosis.

8.1 Future Work
As stated before, the limited data is a large factor for the somewhat poor results. A
continuation of data gathering for all four classes is crucial for the success of using
machine learning to classify breathing severity in dogs.

Different variations in LSTM networks were tested as well as a CNN. A deeper
study in more types of networks for the purpose of classifying breathing sounds
would hopefully yield even better results than this thesis.

The work presented in this thesis will hopefully be used by both veterinarians and
prospective dog owners to easily get an estimate of potential breathing problems
in dogs. In order to facilitate distribution, an easy-to-use mobile application would
be necessary.

It is also of interest to investigate parameters such as learning rate, batch size and
number of epochs to further optimize the results. One could also further research
parameters for extracting MFCC as well as class-weights.

We think that it would be beneficial to separate Littmann and Olympus into different
projects because of the difference between them.

Bibliography

[1] Genetic Welfare Problems of Companion Animals. url:
https://www.ufaw.org.uk/dogs/german-shepherd-hip-dysplasia (visited on 06/01/2021).

[2] Things to think about before buying a flat-faced (brachycephalic) dog. url:
https://www.bluecross.org.uk/pet-advice/things-think-about-buying-flat-faced-dog
(visited on 06/02/2021).

[3] Julia Riggs et al. “Validation of exercise testing and laryngeal auscultation for
grading brachycephalic obstructive airway syndrome in pugs, French bulldogs,
and English bulldogs by using whole-body barometric plethysmography”. In:
Veterinary Surgery 48 (Jan. 2019). doi: 10.1111/vsu.13159.

[4] Statistik ur hundregistret - Jordbruksverket.se. May 2021. url:
https://jordbruksverket.se/e-tjanster-databaser-och-appar/e-tjanster-och-databaser-djur/hundregistret/statistik-ur-hundregistret.

[5] Alice Nordevik. “Här är valpboomens mest populära raser”. In: (May 2021).
url: https://www.svt.se/nyheter/har-ar-valpboomens-popularaste-raser
(visited on 05/18/2021).

[6] Ida Bertilsson and Linda Keeling. Phenotypic variation for BOAS within four
brachycephalic dog breeds-Can good welfare be obtained? Fenotypisk variation
for BOAS within four brakycefala hundraser-Kan god djurvälfärd uppnås?
2019. url: https://stud.epsilon.slu.se.

[7] Brian McFee et al. “librosa/librosa: 0.8.0”. In: (July 2020). doi:
10.5281/ZENODO.3955228. url:
https://doi.org/10.5281/zenodo.3955228#.YJFZeQt24SM.mendeley.

[8] Keras. Keras layers API. Available at
https://www.thermofisher.com/order/catalog/product/R37601#/R37601
[Accessed 2021-03-05].

[9] Mårtensson Moa.

[10] 3M™ Littmann® Electronic Stethoscope Model 3200. url:
https://www.littmann.com/3M/en_US/littmann-stethoscopes/products/~/3M-Littmann-Electronic-Stethoscope-Model-3200/?N=5932256+8711017+3293188392&rt=rud
(visited on 05/07/2021).

[11] Linear PCM Recorder LS-P1. url:
https://asia.olympus-imaging.com/product/audio/lsp1/spec.html
(visited on 05/07/2021).

[12] Overview - NumPy v1.20 Manual. url: https://numpy.org/doc/stable/
(visited on 05/07/2021).

[13] librosa.effects.pitch_shift - librosa 0.8.0 documentation. url:
https://librosa.org/doc/main/generated/librosa.effects.pitch_shift.html
(visited on 05/07/2021).

[14] librosa.effects.time_stretch - librosa 0.8.0 documentation. url:
https://librosa.org/doc/main/generated/librosa.effects.time_stretch.html
(visited on 05/07/2021).

[15] Jason Brownlee. SMOTE for Imbalanced Classification with Python. Jan.
2021. url:
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/.

[16] The dummy’s guide to MFCC. url:
https://medium.com/prathena/the-dummys-guide-to-mfcc-aceab2450fd
(visited on 05/07/2021).

[17] A.V. Oppenheim and Ronald Schafer. “From Frequency to Quefrency: A History
of the Cepstrum”. In: Signal Processing Magazine, IEEE 21 (Oct. 2004),
pp. 95–106. doi: 10.1109/MSP.2004.1328092.

[18] librosa.feature.mfcc - librosa 0.8.0 documentation. url:
https://librosa.org/doc/latest/generated/librosa.feature.mfcc.html
(visited on 05/07/2021).

[19] librosa.feature.rms - librosa 0.8.0 documentation. url:
https://librosa.org/doc/latest/generated/librosa.feature.rms.html#librosa.feature.rms
(visited on 05/07/2021).

[20] Costas Panagiotakis and Georgios Tziritas. “A speech/music discriminator
based on RMS and zero-crossings”. In: Multimedia, IEEE Transactions on 7
(Mar. 2005), pp. 155–166. doi: 10.1109/TMM.2004.840604.

[21] librosa.feature.zero_crossing_rate - librosa 0.8.0 documentation. url:
https://librosa.org/doc/latest/generated/librosa.feature.zero_crossing_rate.html
(visited on 05/07/2021).

Date: 2020-09- ________    Location: _____________

Dog name: Breed:                                  male/female

Name of owner: Neutered:   Yes        No

E-mail address: Microchip number:

Date of birth: 
Weight: _________kg

Body condition score (BCS): ______/9
Colour: 

Medical history:   No         Yes: 

Medications:   No         Yes: ………………………………………………………………………………………

Do you perceive that your dog has abnormal breathing sounds: 

Never       Seldom (once a month)        Often (several times a week)      Daily but intermittent   

Constant   Only when asleep 

Does your dog have trouble breathing: 

Never       Seldom (once a month)        Often (several times a week)      Daily but intermittent   

Constant   Only when asleep     Only when it is warm outside

Does your dog usually sleep: 

   lying on the back          lying on the side    lying on the chest/belly    in a sitting position

Does your dog sleep with: 

a normal head position     an elevated head position     a toy in its mouth 

Does your dog have episodes of apnea (periodically not breathing at all/holding its breath) during sleep? 

   No         Yes

If Yes, how often? 

Seldom (happened once or twice)        Rare (happens monthly)      Often (happens weekly)   Daily

Has your dog ever had episodes of collapse? 

   No         Yes

If Yes, how often? 

Seldom (happened once or twice)        Rare (happens monthly)      Often (happens weekly)  

If Yes, when does it happen? 

during rest         during exercise    both during rest and exercise

Does your dog wake up or get disturbed frequently during the night/sleeping cycles?   No         Yes


A
BOAS Classification Protocol



Other information: 



Date: 2020-09- ________    Location: ______________
Test id: _______________ BOAS grading set by:___________ 
Photo of dog/nostrils by:_____________
Film of respiratory pattern by:______________
Recordings before ET _____________________ Panting: ______Operator:___________

Recordings after ET_______________________ Panting: ______Operator:___________

Physical examination Pre ET

⦁ Stress level:        Normal          Mild          Moderate          Severe

⦁ Open mouth breathing:       No       Intermittent         Constant    

⦁ Nostrils:        Open       Mild stenosis        Moderate stenosis      Severe stenosis

⦁ Stertors (low pitch noise):     Not audible     Mild     Moderate     Severe 

⦁ Stridors (high pitch noise):    Not audible     Mild     Moderate     Severe

⦁ Inspiratory effort:     Not present       Mild         Moderate      Severe

⦁ Expiratory effort:       Not present       Mild         Mod