Simulating Shazam
Acoustic Fingerprinting for Music Identification

Elias Lennevi, Ludvig Sandh, Noah Sundström, Tom Önnermalm

Bachelor Thesis 2024, EENX16-24-44
Simulating Shazam – Constructing an algorithm that creates, utilizes and analyzes acoustic fingerprints for music identification

Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
www.chalmers.se
Gothenburg 2024

© Elias Lennevi, 2024
© Ludvig Sandh, 2024
© Noah Sundström, 2024
© Tom Önnermalm, 2024

Supervisor: Javad Aliakbari, Doctoral Student - Communication, Antennas and Optical Networks, Electrical Engineering
Examiner: Giuseppe Durisi, Full Professor - Communication, Antennas and Optical Networks, Electrical Engineering

Written in LaTeX

Abstract

Identifying music in a real-time audio recording, often containing noisy elements, is not a trivial audio signal processing task. In this study, the goal is to understand and recreate the Shazam algorithm, a method for accurately identifying music in real-life recordings. The programming language Python is used both to design and implement the discrete signal processing concepts behind the algorithm and to build a functioning application that can be used on a smaller scale in a real environment.

The thesis examines the theoretical concepts on which the Shazam algorithm is built. The major components of the algorithm are highlighted, such as spectrogram generation, fingerprint identification, and fingerprint matching. By analyzing and implementing these components, we aspire to explore optimization techniques to match the performance of the original Shazam algorithm in terms of speed.

Furthermore, this research aims to analyze the algorithm's accuracy in environments with varying noise levels. By training the algorithm on a library of songs spread throughout history and from diverse genres, we hope to develop an application that respects different cultures and is inclusive, showing that it would work well in a real-life scenario.

Fundamentally, our thesis helps spread the understanding of the applications of advanced signal processing techniques, especially in the area of music identification. By analyzing and recreating the Shazam algorithm, we aim to highlight its inner workings, limitations, and potential further improvements.

Keywords: Acoustic fingerprint, sound recognition, Python algorithm, digital signal processing

Table of Contents

Abstract
1 Introduction
  1.1 Background
  1.2 Related technology
  1.3 Purpose
  1.4 Goals and requirement specification
  1.5 Boundaries and changes from project plan
2 Theory
  2.1 Spectrogram
  2.2 Spectrogram maxima
  2.3 Hashing the maxima
  2.4 Matching
  2.5 Signal to noise ratio
3 Method
  3.1 Implementation and development of the algorithm
    3.1.1 The audio recorder
    3.1.2 Useful dataclasses
    3.1.3 Connection to a fingerprint database
    3.1.4 Creation of spectrogram and peaks
    3.1.5 Pairing and hashing of peaks
    3.1.6 Algorithm class
    3.1.7 Certainty score
    3.1.8 User interface
    3.1.9 Main program
  3.2 Data collection, data management and the database
  3.3 Testing
    3.3.1 Testing without noise
    3.3.2 Real-life noise testing
    3.3.3 Computer automated noise testing
    3.3.4 Hyperparameter testing
    3.3.5 Code tests
4 Result
  4.1 Results for tests without noise
  4.2 Real-life noise test results
  4.3 Computer automated noise test results
  4.4 Results of hyperparameter testing
  4.5 Results for code tests
5 Discussion
  5.1 Verification of requirements
  5.2 Analyzing heatmaps and hyperparameters
  5.3 Future improvements
  5.4 Scalability
6 Conclusion
7 Bibliography

1 INTRODUCTION

In today's society, music is all around us. Whether you are walking down a street, listening to the radio, or watching a movie, you are prone to hear it. Music is being played either in the background or foreground of almost all of our surroundings. Music can also convey emotions and feelings through rhythm, melody, tone and more [1]; therefore, music can be a cultural bridge between people. Regardless of genre or origin, whether it is the catchy beat of a pop song, the soulful charm of country music or the calm chaos in jazz, music unites and inspires people. However, when one hears an unknown song that one wants to listen to again, it would be difficult to search through all of the songs existing today to find it. Spotify, the audio streaming service, disclosed that, on average, 27 000 new tracks were added to their service each day in 2019, 55 000 in 2020 and 33 000 in 2021 [2]. In today's society, most people carry their personal mobile phones everywhere they go, and these could be used to identify unknown songs.

The following sections introduce the background to music identification using algorithms, followed by the project's purpose, goals and boundaries.
1.1 Background

Shazam, a music recognition service created by Shazam Entertainment, Ltd., was initially an idea in the year 2000 to give people a tool to recognize and identify songs using their mobile phones [3]. The Shazam algorithm had to identify songs from a short audio sample, for example from the radio, a bar, or a movie, without being affected by background noise. The algorithm could also filter out distortion, reverb and background conversations, and even identify songs slightly altered in remixes [3]. The same source stated that with almost two million songs in their database, Shazam also had to focus on performance to be able to quickly give their users accurate answers. When launched in 2003, users dialed in to the service via a phone number and up to 15 seconds of the music was sampled and sent to Shazam's server. Thereafter a text message was sent to the user with information about the artist and song title [3]. Today the technology has moved into an app that can be downloaded on a smartphone, rather than being a dialing service.

1.2 Related technology

With increasing digital music consumption and technical development, users need quicker music identification to find songs. The Shazam service has made it possible for the public to quickly identify and discover songs as well as receive information about the artist and album. Music identification in complex environments has been streamlined in conjunction with advancements in signal processing and machine learning [4]. Several programming libraries have made it easier to develop audio programs and have simplified access to processing power for developers [5].

Signal processing is used widely throughout many different industries and applications. In other fields of research, for instance mental illness, signal processing is used to understand the connection between the brain and physiological responses, as well as to develop diagnostic instruments that rely on the same principles as music identification [6]. Advancements in digital signal processing have led to these technologies becoming useful in more areas than just those related to the music industry.

In summary, for music streaming to continue growing - and to meet users' ongoing expectations for improvement in these areas - music identification and signal processing have become integral parts of the digital landscape. This topic is highly relevant as the foundation of the technology can also be applied in other industries.

1.3 Purpose

The purpose of this project is to create a functioning application for computers that can identify songs by listening to recorded samples of them and display the name of the song, artist, album and timestamp. By developing this application, the aim is also to obtain more knowledge about implementing the principles of the Shazam algorithm, which would result in a deeper understanding of the functionality and applications of sound recognition. The study also examines how the performance of the algorithm is affected by different modifications.

Lastly, the purpose of this project is also to improve the group's skills in programming and signal processing in an applied setting.

1.4 Goals and requirement specification

This section addresses the goals of this project. Splitting them into four parts gives a clearer picture of the project's challenges and of what is needed to achieve the aforementioned purpose.
• Sound recognition via software – The initial challenge is creating a reliable algorithm that can recognize songs from a short acoustic sample of between three and twelve seconds. The total time from starting the algorithm to finding the song should be at most 30 seconds.

• Database – The database should contain at least 100 songs representing various aspects of the music industry. It should be diverse and representative, including music from different genres and artists from around the world.

• User Interface – To make the application appealing and easy to use, it needs a simple and straightforward interface, providing information on how the application works and displaying metadata for identified songs.

• Software Testing – To ensure the functionality of the algorithm, thorough testing is necessary. Tests are also needed to ensure the fingerprint database works correctly and returns accurate information.

A concrete set of requirements is displayed in this section. Table 1.1 below shows them in detail, together with their priority (must/want) and the method of verification. These requirements are used in section 5.1, where they are compared to the results of the project.

Table 1.1: The requirements specification for the project. The requirements are grouped into sections, each regarding a separate aspect of the application.

1.5 Boundaries and changes from project plan

Throughout the project, the team encountered various unforeseen difficulties that resulted in the need to update both the project plan and scope. In the planning report, it was stated that artificial intelligence and machine learning would be explored as an alternative to the Shazam algorithm. This would have been an interesting approach to implement and compare with the initial algorithm. However, as the project progressed and became more complex, the group decided to fully focus on the initial signal processing approach instead of delivering two semi-functioning applications, due to the time constraint. To keep the project within its time limit, and also to make sure there were no issues with data storage, the following boundaries were set:

• The project only needs to be stored locally on the team's computers to avoid any copyright inconveniences associated with sharing music information. This means that no web API needs to be developed.

• The real Shazam service is today a mobile application. Since the focus of the project is to recreate the mechanics of the identification capabilities rather than the user interface, the application's target operating systems are PC and Mac. We do not have to create a mobile or web application of any sort.

• Even though the goal is to match Shazam's performance in terms of identifying songs quickly, there is no need to match Shazam in terms of the number of songs that can be identified. Being able to identify slightly more than 100 songs is enough for this thesis.

• During the project it was also assumed that all noise affects the performance equally, because this makes testing easier and some of the results easier to interpret.

In this section, the project has been presented, together with its goals and boundaries. The next section gives a description of the theory needed to understand the Shazam algorithm itself and the rest of this report.

2 THEORY

The algorithm outlined in this section is based on a scientific paper authored by one of the founders of Shazam, Avery Wang [3].
The core concept of the Shazam algorithm revolves around processing audio tracks, also known as sampling, into what are called fingerprints. These fingerprints contain a small amount of unique data that effectively represents a part of a song. By utilizing fingerprints, comparisons between songs become more efficient, as opposed to analyzing entire tracks. The resulting fingerprints are assembled in a database, serving as points of reference during the identification process. A notable feature of the Shazam algorithm is its robustness against various sources of noise and filtering during sampling. Even in scenarios where a song is recorded in a noisy environment using a low-quality microphone, the resulting fingerprint is expected to be close enough to resemble the corresponding fingerprint stored in the database. Once recorded, these fingerprints are compared with those in the database, and the song that shows the closest match is returned by the algorithm. This means that the algorithm can only detect songs that have previously been processed into fingerprints and stored in the database. A song that the algorithm has never seen before cannot be detected, in which case the algorithm either outputs the most similar known song, or doesn't find any resemblance at all.

For a more detailed examination of the algorithm, it can be deconstructed into the following key components: spectrogram generation, spectrogram maxima, hashing the maxima, and matching. The first three components concern the construction of the fingerprints. This process is used both when analyzing clean songs to establish a reference database and when identifying songs from user recordings. The final component is used to determine how similar fingerprints from two songs are, enabling the algorithm to determine which song best matches the recording. See figure 2.1 for a visual description of the algorithm steps.

Figure 2.1: A block diagram giving a visual interpretation of how the algorithm works, both for registering songs in the database and for song identification.

In the following sections, in-depth explanations of each of the four key algorithm components are presented. Then, section 2.5 introduces theory relevant for comparing loudness between wanted and unwanted elements of sound recordings.

2.1 Spectrogram

A spectrogram is a visual depiction illustrating the frequency spectrum of a signal over time. To generate a spectrogram, an acoustic signal is transformed to the frequency domain using the Short-Time Fourier Transform (STFT). Essentially, the STFT splits the time-domain signal into smaller sections of equal length and then takes the Fourier transform of each of them separately [7]. The resulting Fourier-transformed sections are lined up next to each other as columns to create a two-dimensional image. Since the time-domain signal is split into separate sections, frequency components of one section of the song will not spread to all other sections when taking the STFT. Compare this to the regular Fourier transform, where a frequency component will be included in the frequency-domain signal regardless of when it occurred in the time-domain signal. The following definition shows how the discrete STFT depends on two input variables, corresponding to time and frequency, and generates an output that represents the amplitude of the original signal at that time and frequency [7].
\[
\mathrm{STFT}\{x[n]\}(m,\omega) \equiv X(m,\omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-i\omega n} \tag{2.1}
\]

The discrete function x[n] represents the time-domain signal undergoing transformation, while w[n] is a window function centered around zero. The inputs m and ω are the desired time (in number of samples) and frequency, respectively, at which the STFT is applied. The window function is responsible for splitting the time-domain signal into smaller sections. To explain what happens in the rightmost part of equation 2.1, the input signal x[n] is multiplied element-wise by a window function w[n], which is shifted m steps to the right. The result is an isolated section of x[n] at n = m, where all other parts of x[n] have become zero. See figure 2.2 for a visual explanation of this process. Now that a short section of x[n] has been isolated, it is Fourier transformed. This corresponds to the sum and the e^{-iωn} part of the equation. X(m,ω) is essentially the discrete Fourier transform of x[n]w[n − m] (note that the frequency ω and the window function w are different).

Figure 2.2: An example audio signal x[n] with 1000 samples is shown in the first plot. The second plot shows a shifted window function centered at n = m = 500. The third plot shows how the window function has been applied by multiplying the samples in the first two plots pairwise with each other. The result, x[n]w[n − m], is an isolated part of x[n] that is the input to the Fourier transform step in the STFT.

The window function directly defines the characteristics of all sections. For example, notice that if the window function is symmetric around n = 0, each isolated section will be centered around n = m. Also, the width of the window function directly affects the width of each section. Depending on how many sections we decide to use and the width of each section, they may overlap with each other. If the window function is a symmetric gate function three samples in width, for example, and we take the STFT at each sample in the input signal, then there will be one section centered at each sample. Each section will overlap with its neighboring sections, since each section extends one sample forward and backward in time. Overlap isn't necessarily unwanted. No overlap could cause elements of the input signal that lie just between sections to be filtered out of the STFT result. Too much overlap would cause less precision in the result, since the frequency components would be smeared out in time. A window function that works well in almost all cases is the Hann window, see figure 2.3.

Figure 2.3: A plot of a Hann window of width 256 samples. This would create a section with a width of 256 samples in the STFT.

Applying the STFT to a wide range of combinations of m and ω, then plotting each result as a pixel on a two-dimensional plot where the axes are m and ω, we can visualize the spectrogram of the input signal x[n], see figure 2.4. In short, the STFT reconstructs the sound wave, described by the sound intensity over time, into a new signal representing sound intensity over both frequency and time. This is sometimes favorable, as key frequency components in the original signal can be clearly seen in the spectrogram and, equally important, at what time the components occurred.
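To make equation 2.1 and the windowing procedure concrete, the sketch below computes a magnitude spectrogram column by column with a Hann window. It is a minimal illustration rather than the implementation used in this project; the function name and the choice of a 256-sample window with 32 samples of overlap are assumptions made for the example.

import numpy as np

def stft_spectrogram(x, window_size=256, overlap=32):
    # Hypothetical helper illustrating equation 2.1; not the project's own code.
    w = np.hanning(window_size)              # Hann window w[n], as in figure 2.3
    hop = window_size - overlap              # step between neighboring sections
    columns = []
    for m in range(0, len(x) - window_size + 1, hop):
        section = x[m:m + window_size] * w   # x[n] w[n - m]: isolate one section
        spectrum = np.fft.rfft(section)      # Fourier transform of the section
        columns.append(np.abs(spectrum))     # keep the magnitude of X(m, omega)
    # Rows correspond to frequency and columns to time, as in figure 2.4.
    return np.array(columns).T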
Noise usually consists of a broad range of frequencies, which means that the addition of noise to an audio recording, to put it simply, increases the output of the STFT-transformed recording by a roughly constant amount for all inputs m and ω. Looking at the spectrogram, where the brightness of each pixel corresponds to the output of the STFT at specific values of m and ω, the result of added noise is that the entire image becomes brighter. How much brighter the image becomes depends on the strength of the noise. If the noise isn't large enough, the characteristics of the spectrogram will be roughly the same with and without added noise in the input signal. The difference is that the spectrogram of the input signal with added noise will be slightly brighter than the other one. In the time domain, however, the addition of noise would make the signal more or less unrecognizable, especially to a computer. This is why the STFT is used in the Shazam algorithm.

Figure 2.4: An example spectrogram of a song. In this image, frequency is on the Y-axis (ω) and time is on the X-axis (m). The pixel intensity is on the Z-axis and directly corresponds to the amplitude of the song at frequency ω and time m, or in other words, the STFT output X(m,ω). In this spectrogram, brighter colors indicate larger STFT outputs. The range of values used for ω in this spectrogram is 128 values taken linearly between 0 and 20,000. Each section is 256 samples wide, and the overlap between each pair of neighboring sections is 32 samples.

What characterizes most musical pieces is their combination of pronounced frequencies, what we humans perceive as distinct tones [8]. The pronounced frequencies are still prominent in the frequency domain even with the addition of noise, so long as the noise doesn't drown them out. With the spectrogram, it is easy to spot these pronounced frequencies and when they occurred. Extracting these pronounced parts of the spectrogram is a useful step in the process of identifying a song from a noisy recording. In the next section, the pixels in the spectrogram that are brighter than all their neighbors are extracted via a maximum filter algorithm.

2.2 Spectrogram maxima

The local maxima in the spectrogram, or peaks, as they will be referred to hereinafter, have the highest chance of staying unaffected by noise and other types of disturbances. A maximum filter is used to separate these peaks from the rest of the values in the spectrogram. The maximum filter consists of two steps. To explain it visually, the first step involves scanning each pixel in the spectrogram and copying to it the value of the strongest nearby pixel in a square of given side length R. This results in a sort of blur effect where bright parts seem to leak onto nearby pixels, see figure 2.5. The second step is to compare this resulting image with the original spectrogram image.

Figure 2.5: The first step of the maximum filter applied to the same spectrogram as in figure 2.4.

Each pixel in one of the images is compared to the corresponding pixel in the other image. When the pixels differ in brightness, they become black. When the pixels have equal brightness values, they become white. This results in a new image, which we call a constellation map, where the peaks of the spectrogram are visible (usually not noise) and everything else (usually noise) is left out. See figure 2.6 for an example plot of a constellation map. In theory, if a maximum filter were applied both to the spectrogram of a clean audio signal containing music and to a spectrogram of the same audio with small amounts of noise added, the resulting constellation maps would be the same.
The constellation map not only filters out noise, it also represents the key parts of a piece of audio with very little data.

Figure 2.6: A constellation map of a real song. The name comes from its resemblance to the night sky. Note that each peak (bright pixel) corresponds to a certain time and frequency.

Note that the value of R, with which the maximum filter is applied, directly affects the density of peaks in the constellation map. Assuming two nearby pixels in the spectrogram don't have exactly equal values, only one peak will be extracted in each square of side length R pixels (containing R² pixels). This means that the number of expected peaks in a constellation map consisting of S pixels in total is S/R². The number of peaks generated in this step is thus inversely proportional to R². This will be useful to keep in mind in section 3.1.5, where it is discussed that the performance of the algorithm greatly depends on the number of peaks.

Now that the most pronounced parts of the recording have been extracted, the next section goes through the step of compiling the constellation map into easily comparable fingerprints. The fingerprints will be useful when trying to identify a song, as later explained in section 2.4.

2.3 Hashing the maxima

The constellation map generated in the last step is used as the input to this next step, which is hashing the maximum values. In this step the fingerprints are extracted from the constellation map. In theory, to identify a song from a recording, we would now compare the constellation map of the recording with the constellation maps of a set of candidate songs. All there is to do is find which of the candidates lines up best with the recording when looking at their constellation maps. Algorithmically this process would be really slow, since it would require trying to line up the recording's constellation map at all possible discrete time steps in all the reference songs. Then the algorithm would have to count the number of peaks lining up.

A way to speed this up would be to create a table that shows, for each frequency, which songs include a peak at that frequency. When trying to match a recording to the entire set of songs, we could look at the peaks in its constellation map. All the frequencies of these peaks could be noted, and the table could then be used to quickly find a subset of candidate songs that is known to certainly contain the song we are trying to identify. This would mean that only a subset of the constellation maps needs to be compared more precisely, speeding up the process. The issue here is that since the frequency axis is discrete, there is only a finite number of frequencies in the spectrogram which can hold peaks. Furthermore, songs usually contain many different frequencies, so the subset of candidates would be only marginally smaller than the entire set of songs. The speedup would in practice be insignificant.

An insight that builds on the above idea is to put pairs of peaks into the table. It is much more uncommon for songs to contain a specific combination of two peaks than to contain just one peak at a specific frequency. The subset of candidate songs that contain a certain pair of peaks is therefore greatly reduced in size, assuming peaks occur rather uniformly across the spectrogram. For a given pair of peaks, we represent it in the table by the two frequencies belonging to the respective peaks, as well as the difference in time between them. These three numbers can be hashed together to be used in a hash table, see figure 2.7. We call each of these hashes fingerprints. Throughout the rest of the study we will use the terms fingerprints and hashes almost interchangeably. The table now maps fingerprints to the set of songs that contain that fingerprint.
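The table described here is essentially an inverted index from fingerprint hashes to their occurrences. A minimal sketch of the idea follows, with assumed names and a plain Python dictionary standing in for the database table used later in section 3.2.

from collections import defaultdict

# Maps a fingerprint hash to every (song, timestamp) where that peak pair occurs.
fingerprint_table = defaultdict(list)

def register(song_name, peak_pairs):
    # peak_pairs: iterable of (hash_value, timestamp) tuples for one reference song
    for hash_value, timestamp in peak_pairs:
        fingerprint_table[hash_value].append((song_name, timestamp))

def candidates(recording_pairs):
    # Collect every reference occurrence that shares a fingerprint with the recording.
    hits = []
    for hash_value, recording_time in recording_pairs:
        for song_name, song_time in fingerprint_table.get(hash_value, []):
            hits.append((song_name, song_time, recording_time))
    return hits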
The pairing process is carried out as follows: for each peak pi in the constellation map, find F other nearby peaks if possible. For each of those nearby peaks pj, create a pair consisting of the peaks pi and pj. To avoid creating duplicate pairs, only look for the F nearby peaks to the right of pi. Note that if the constellation map has P peaks, one extracts at most F · P peak pairs. Also note that each peak can be included in more than one pair. See a visual example of how nearby points are used to create pairs in figure 2.8.

Figure 2.7: A close-up of a pair of peaks from a constellation map. The hash value is generated using a hashing algorithm that takes the frequencies of the peaks, and the difference in time between them, as input. The result can be a 32-bit number, which would allow a few billion different fingerprints to be added to the hash table.

Figure 2.8: A point pi circled in blue, and F = 6 nearby points circled in red. Each red line between two points represents a pair. Note that this figure is a zoomed-in version of a constellation map similar to figure 2.6, to improve visibility.

When trying to match a recording to the entire set of songs, we extract all peak pairs from the recording. The peak pairs are all hashed into single integers. Using the table and indexing it with the extracted fingerprints, a small subset of song candidates can quickly be found. Combining peaks pairwise is not only useful for finding a small subset of song candidates, but also for the way we compare the recording to these candidates in the matching process, described in the next section.

2.4 Matching

When a sample has been recorded and the song candidates have been found, the next step is to determine which candidate matches the recording best. For each peak pair in the constellation map of the recording, the table is used to find the set of songs that also contain that peak pair, as well as the location of those peak pairs in those songs. All the peak pairs are grouped by the song they were found in. For each song, a similarity score is calculated. This is done by comparing the difference in time between each matching peak pair in the recording and the reference song. Each matching peak pair is then sorted into buckets based on this time difference. The score for the song candidate is the number of matches in the largest bucket. The most similar song is the one with the highest similarity score and should be presented as the best match by the algorithm.

When comparing a small sample of a song with a full-length clean reference version of the same song, imagine the recording's constellation map sliding across the reference's constellation map. At some point, the peaks - and therefore also the peak pairs - should line up perfectly. Now imagine that we connect the lined-up peak pairs from both constellation maps with rubber bands. If we continue sliding the recording's constellation map over the reference's, the rubber bands will stretch, but they will always have the same length as each other. This is the main idea behind the matching algorithm. When we find a lot of equal distances between the matches, we can be certain that many peaks line up somewhere during the sliding process described earlier. This suggests that the candidate is very similar to the recording. On the other hand, when the distances between the matches differ a lot in length, the peaks will not line up nicely anywhere along the reference constellation map. This suggests that the candidate is dissimilar to the recording. With this algorithm, we take a shortcut and skip the entire "sliding" process, which would be very expensive to simulate. To summarize, it is enough to find many matching peak pairs with the same distance in time with respect to each other in the constellation maps to know that the two corresponding songs are similar.
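A compact sketch of the bucket-based scoring described above, reusing the hit-list format from the earlier table sketch; the names are assumptions, not the project's exact implementation.

from collections import Counter, defaultdict

def score_candidates(hits):
    # hits: list of (song_name, song_time, recording_time) for matching fingerprints
    offsets_per_song = defaultdict(Counter)
    for song_name, song_time, recording_time in hits:
        # Matching pairs from the correct song share the same time offset.
        offsets_per_song[song_name][song_time - recording_time] += 1
    scores = {}
    for song_name, offset_counts in offsets_per_song.items():
        best_offset, count = offset_counts.most_common(1)[0]
        scores[song_name] = count          # size of the largest bucket
    return scores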
2.5 Signal to noise ratio

Since the recordings that the algorithm works on contain background noise, it is useful to quantify how strong the noise is compared to the music. The concept of signal-to-noise ratio (SNR) can be used for this purpose. The SNR is the ratio between the average power level of the signal and the average power level of the noise [9]. Assuming the signal and the noise are separated into two functions of time, the SNR can be calculated as follows:

\[
\mathrm{SNR} = \frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}} = \left(\frac{A_{\mathrm{signal}}}{A_{\mathrm{noise}}}\right)^{2} \tag{2.2}
\]

where A is the root mean square (RMS), which is calculated using the following formula on a discrete signal a with N samples:

\[
A = \sqrt{\frac{\sum_{n=1}^{N} a[n]^{2}}{N}} \tag{2.3}
\]

The RMS is an indicator of how loud a signal is on average, which is useful for calculating the SNR. The SNR is often expressed in decibels (dB) and can thus be defined as:

\[
\mathrm{SNR_{dB}} = 10 \cdot \log_{10}\left[\left(\frac{A_{\mathrm{signal}}}{A_{\mathrm{noise}}}\right)^{2}\right] \tag{2.4}
\]

Applying a law of logarithms, the square inside the brackets can be moved out as a factor in front of the logarithm:

\[
\mathrm{SNR_{dB}} = 20 \cdot \log_{10}\left(\frac{A_{\mathrm{signal}}}{A_{\mathrm{noise}}}\right) \tag{2.5}
\]

This value tells us how many decibels louder the signal is compared to the noise. To show this, we use equation 2.4 and the first part of 2.2 and apply another law of logarithms:

\[
\mathrm{SNR_{dB}} = 10 \cdot \log_{10}\left(\frac{P_{\mathrm{signal}}}{P_{\mathrm{noise}}}\right) = 10 \cdot \log_{10}(P_{\mathrm{signal}}) - 10 \cdot \log_{10}(P_{\mathrm{noise}}) \tag{2.6}
\]

Then, using the definition of the power of a signal in decibels,

\[
P_{\mathrm{dB}} = 10 \cdot \log_{10}(P) \tag{2.7}
\]

we find an alternative definition of the SNR in decibels:

\[
\mathrm{SNR_{dB}} = P_{\mathrm{signal,dB}} - P_{\mathrm{noise,dB}} \tag{2.8}
\]

Thus, for instance, when the signal is at 60 dB and the background noise is at 40 dB, the SNR is 20 dB. When the signal and the noise have the same loudness, the SNR is 0 dB.
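As a small worked example of equations 2.3 to 2.5, the following sketch computes the SNR in dB from two separate time-domain arrays; the function name and the use of NumPy are assumptions, while the formulas follow the equations above.

import numpy as np

def snr_db(signal, noise):
    # RMS of each part, equation 2.3
    rms_signal = np.sqrt(np.mean(np.square(signal)))
    rms_noise = np.sqrt(np.mean(np.square(noise)))
    # Equation 2.5: 20 * log10 of the RMS ratio
    return 20.0 * np.log10(rms_signal / rms_noise)

# Example: a signal with ten times the RMS of the noise gives an SNR of 20 dB.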
3 METHOD

The following sections outline the approach taken to use acoustic fingerprinting for music identification. Firstly, the structure of the algorithm is described in section 3.1. Thereafter, in section 3.2, the choice of data management is explained, highlighting the use of a database. Lastly, the methods for testing and verification of the algorithm's success rate are elaborated in section 3.3.

3.1 Implementation and development of the algorithm

The group utilized a modular and object-oriented programming pattern to enhance code readability and facilitate development in later stages of the project. This also made code reviews easier and ensured a higher standard in the code. The entire algorithm is written in Python because of its ease of use and prebuilt libraries for signal analysis.

After researching different methods to recreate the Shazam algorithm, the group decided to use classic signal processing first and foremost. The short-time Fourier transform was applied to all the songs in our music library to create spectrograms. The biggest amplitudes were then identified and saved in pairs as fingerprints in our database. Then, for every recording, the same process was repeated to create new fingerprints. These were then matched against the fingerprints from known songs in our database.

The implementation of the Shazam recreation could be separated into a few subtasks. To be able to run the algorithm in the first place, a user interface was needed. Another task was to implement the code for the recorder, which would be responsible for capturing audio data and providing it to other parts of the code in a readable format. A further subtask was to set up the database, and the code responsible for communicating with it effectively. Then there is the algorithm itself, containing the implementation of the concepts explained in section 2. Last but not least, a main program is needed to integrate the different project components and make sure that they communicate well with each other.

3.1.1 The audio recorder

The Recorder class utilizes a Python library called sounddevice. When a Recorder object is instantiated, a recording channel is opened on the computer and the program begins capturing audio on the default microphone device. The audio is saved into a memory buffer which is regularly read from and copied into a Python list by a separate thread in the program. With this library, it's easy to access microphone metadata such as the recording sample rate, which is important for interpreting the audio data.

Since the task was to recreate the Shazam algorithm, which can find a song in real time while it's still recording, we needed to implement the Recorder class in a way that ensures two things. The first is that we had to be able to read from the buffer at any time, even while the audio capturing is still going on. The second is that the audio capturing can continue even after the audio data has been accessed. These problems were solved simply by using multiple threads that each have their own responsibility. One thread extends the audio data list from the channel buffer. Another reads from it.

The final implementation of the Recorder class has the following API. The method start() starts the audio capturing and begins writing audio data in the background. The method is non-blocking. The method stop() stops the audio capturing. The class has an attribute called waveform, which is a list containing the audio data as an array of integers describing the recorded signal in the time domain. The method get_waveform() returns the waveform attribute along with the sample rate of the recording as an integer.
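A minimal sketch of how such a Recorder could be built on top of sounddevice is shown below. The class and method names follow the API described above, but the stream configuration, the sample rate and other details are assumptions rather than the project's exact implementation.

import sounddevice as sd

class Recorder:
    # Sketch of the Recorder described in section 3.1.1; details are assumed.
    def __init__(self, sample_rate=44100):
        self.sample_rate = sample_rate
        self.waveform = []
        # The callback runs on a separate thread and extends the waveform list.
        self._stream = sd.InputStream(samplerate=sample_rate, channels=1,
                                      callback=self._callback)

    def _callback(self, indata, frames, time_info, status):
        self.waveform.extend(indata[:, 0].tolist())

    def start(self):
        self._stream.start()   # non-blocking: capturing continues in the background

    def stop(self):
        self._stream.stop()

    def get_waveform(self):
        return list(self.waveform), self.sample_rate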
3.1.2 Useful dataclasses

Before the key parts of the algorithm implementation are discussed, it can be useful to understand some other classes that don't have much functionality but are still convenient. It is common to utilize the built-in Python library dataclasses to simplify the writing of these classes. With dataclasses, Python classes can be written like a struct would be written in the programming language C. By decorating a Python class with @dataclass, an initializer is automatically created for the class attributes, meaning no methods have to be written (although they can be).

In total, three dataclasses were used in the project. The first one regards peaks. It is useful to represent a peak as a type of its own. The dataclass Peak contains two attributes:

• time (integer): the time at which the peak exists in a constellation map.
• freq (integer): the frequency of the peak.

The second dataclass regards peak pairs. The PeakPair class contains four attributes and one method:

• timestamp (integer): the sample index at which the leftmost of the two peaks occurs in a constellation map.
• freq1 (integer): the frequency of the leftmost peak.
• freq2 (integer): the frequency of the rightmost peak.
• delta_time (integer): the difference in time between the two peaks, counted as the number of samples between them.
• __hash__() (integer): computes the hash/fingerprint for the PeakPair, using freq1, freq2 and delta_time as inputs.

prime1 = 52711
prime2 = 1000000007
hash = int(freq1 + freq2 * prime1 + delta_time * prime1 * prime1) % prime2

The third and last dataclass regards song metadata and is called SongInfo. It is used to represent songs when inserting and retrieving songs from the database. It contains four fairly self-explanatory attributes:

• name (str): the name of the song.
• artist (str): the artist's name.
• album (str): the song album's name.
• release_date (str): the song's release date, in any text format.

3.1.3 Connection to a fingerprint database

The Connector class works as an interface between the rest of the code and the database. It plays an important role in the program since it is responsible for configuring and managing the database tables. It handles insertion, retrieval and manipulation of data. The tables are configured with relevant primary keys, foreign keys and datatypes for all attributes. More details about the tables and their attributes are discussed in section 3.2, but a general overview is that tables for fingerprints and song metadata exist in the database. The class incorporates functions both for retrieving matching fingerprints and for adding new fingerprints to the database.

This paragraph explains the API of the Connector class and goes over its five public methods. The first method, insert_song(), takes a SongInfo object and a list of PeakPair objects as arguments. The method is responsible for adding the song to the database, populating it with the song metadata, as well as inserting the hashes of all peak pairs. This is done with two simple INSERT SQL queries. The insert_song() method also accounts for songs that already exist in the database. This means that if the song to be added and its hashes already exist, they won't be added twice. The second public method, get_matching_hashes(), has an integer parameter hash_value and is responsible for retrieving all fingerprints matching hash_value. Remember that a hash in this context is synonymous with the term fingerprint. The SQL query selects all fingerprints that match the hash value, after which the function calling the query returns them in a list of tuple objects. Then there are some extra methods. The delete_all_songs() method completely wipes all songs and fingerprints from the database. This is useful during development when we occasionally want to reset the working environment. Furthermore, there is a method song_exists() that takes a SongInfo object as the only parameter and returns a boolean indicating whether or not the song described by the SongInfo object currently exists in the database. Lastly, the public method get_song_info() takes two strings as parameters, namely the artist name and song name. The function fetches the rest of the song metadata from the database and returns it as a SongInfo object.
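A rough sketch of what two of the Connector methods could look like is given below. Only the method names, their roles and the table columns come from the text and figure 3.2; the use of the psycopg2 driver, the connection string and the exact SQL statements are assumptions made for illustration.

import psycopg2

class Connector:
    # Sketch of the Connector interface from section 3.1.3; details are assumed.
    def __init__(self, dsn="dbname=shazam"):
        self.conn = psycopg2.connect(dsn)

    def insert_song(self, song_info, peak_pairs):
        with self.conn, self.conn.cursor() as cur:
            cur.execute(
                "INSERT INTO songInfo (name, artist, album, releaseDate) "
                "VALUES (%s, %s, %s, %s) ON CONFLICT DO NOTHING",
                (song_info.name, song_info.artist,
                 song_info.album, song_info.release_date))
            for pair in peak_pairs:
                cur.execute(
                    "INSERT INTO fingerprints (hash, timestamp, songName, songArtist) "
                    "VALUES (%s, %s, %s, %s) ON CONFLICT DO NOTHING",
                    (hash(pair), pair.timestamp, song_info.name, song_info.artist))

    def get_matching_hashes(self, hash_value):
        with self.conn.cursor() as cur:
            cur.execute("SELECT songName, songArtist, timestamp FROM fingerprints "
                        "WHERE hash = %s", (hash_value,))
            return cur.fetchall()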
3.1.4 Creation of spectrogram and peaks

The Spectrogram class was created using the prebuilt libraries SciPy and NumPy. Both of these are external but very popular Python libraries. This enabled quick generation of spectrograms for songs and recordings.

frequencies, times, spectrogram = scipy.signal.spectrogram(waveform, sample_rate, window='hann')

The spectrogram() function takes the waveform, its sample rate and the selected window function as arguments. The Hann window was chosen simply because it is symmetric and some of the group members had used it before. The resulting frequencies variable is a list of integers describing which frequencies each row in the spectrogram corresponds to. The times variable is a list of floating point numbers describing the timestamp in seconds of each column in the spectrogram. The spectrogram variable itself is a two-dimensional NumPy array of floats, which describes the spectrogram itself. The spectrogram is calculated in the initializer method of the Spectrogram class. With the SciPy spectrogram function, you can choose the width of each section (window size) and the overlap between sections. We used the default values, where each section is 256 samples wide and the overlap between each pair of neighboring sections is 32 samples.

The public method get_peaks() takes the already calculated spectrogram, stored in the object itself, and applies a maximum filter to it. This is done using the function scipy.ndimage.maximum_filter() from the SciPy library. This function takes the two-dimensional spectrogram NumPy array as an argument, along with an integer argument for the radius. The radius of the maximum filter is explained in section 2.2 (referred to as R) and has a direct impact on the number of peaks generated. Initially, a conservative value was chosen for the radius, balancing execution time against the rate of successful answers. The maximum filter returns a new two-dimensional NumPy array containing the maximum-filtered spectrogram. Peaks are found by comparing the spectrogram object with the maximum-filtered spectrogram and retaining the values where they are equal. This can be done quite elegantly using the equality operator between two NumPy multidimensional arrays.

constellation_map = self.spectrogram == maximums

The resulting constellation_map variable is a NumPy array of zeroes and ones, where the ones indicate equality. After this, we simply loop through all these ones and zeroes. Whenever a one is found, its time and frequency are used to create a Peak object, which is then appended to a list. When all peaks have been added to the list, it is returned from the get_peaks() method.
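Putting the two snippets above together, the get_peaks() step could look roughly like the sketch below. The loop over the boolean array and the Peak construction follow the description in the text, while the function wrapper, the default radius and the inline Peak definition (a stand-in for the dataclass of section 3.1.2) are assumptions.

from dataclasses import dataclass

import numpy as np
from scipy import ndimage

@dataclass
class Peak:
    time: int
    freq: int

def get_peaks(spectrogram, radius=20):
    # Apply the maximum filter (R from section 2.2) and keep pixels equal to the original.
    maximums = ndimage.maximum_filter(spectrogram, size=radius)
    constellation_map = spectrogram == maximums
    peaks = []
    # Rows of the spectrogram are frequencies and columns are times.
    for freq_index, time_index in zip(*np.nonzero(constellation_map)):
        peaks.append(Peak(time=int(time_index), freq=int(freq_index)))
    return peaks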
3.1.5 Pairing and hashing of peaks

To extract the fingerprints from the list of peaks, we need to create pairings between them. We only want to create pairings between peaks that are nearby to the right. A first naive approach was to loop through all P peaks and then, for each of them, loop through all other P − 1 peaks. For each pair of peaks in the nested loop, the distance between them is calculated, and the closest F peaks are then used to create pairings. This approach, looping through on the order of P peaks twice in a nested loop, has an algorithmic time complexity of O(P²). When trying different maximum filter radii, we could see that P lay between around 10,000 and 500,000 peaks. A quadratic time complexity is not efficient enough to process the list of peaks, so we needed something better.

A first step was to realize that we could sort the list of all peaks by their time values. When looping over each peak Pi, nearby peaks Pj to the right of Pi now always have the property that i < j, so the second loop doesn't need to look at values j <= i. This certainly speeds up the process, since approximately half of the comparisons vanish. However, we are looking for a speedup factor of about 1000x, not 2x.

A second insight was that once we have found F peaks for peak Pi to pair up with, we don't need to keep looking for more. Since the list of peaks is sorted in time, these F nearby peaks will be found rather quickly. Breaking the inner loop once F peaks have been found meant that the inner loop only had to iterate through the peaks between Pi and Pi+K, where K in fact was a relatively small number compared to the total number of peaks. This results in a speedup factor of P/K, which ended up being a lot closer to 1000.

Even if the subset of nearby peaks had been narrowed down, we hadn't yet decided on the best way to find the F closest peaks. Paradoxically, the subset of nearby peaks wouldn't be known until F nearby peaks had been found. A good enough approach was to define a bounding box just to the right of each peak. Peaks inside this box were considered nearby, and peaks outside were considered not nearby. When looping through peaks Pj starting from j = i + 1, and a peak was found inside Pi's box, we created a pair using it. This approach made sure we would find F of the leftmost peaks inside each box. The box was made sufficiently large that it would not contain fewer than F peaks, regardless of the density of the constellation map. To find a reasonable box size, the density of the constellation map (or, in the more data-efficient format, the list of peaks) was first calculated. Then a simple calculation was done to find a box size that statistically would fit at least F peaks.

Although it wasn't necessary, the peak pairing algorithm could have been sped up further. Note that although the tricks mentioned above sped up the algorithm by constant factors, the time complexity stayed quadratic. There is, for example, a data structure called a k-d tree that can query the F nearest neighbours in an n-dimensional space in logarithmic time. In our case, this data structure could be applied with n = 2 and would reduce the time complexity to O(P · log(P)).
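A sketch of the time-sorted, bounded pairing loop described above is shown below. The fan-out F, the box dimensions and the minimal PeakPair stand-in (the real dataclass of section 3.1.2 also defines a custom __hash__) are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class PeakPair:
    timestamp: int
    freq1: int
    freq2: int
    delta_time: int

def pair_peaks(peaks, fanout=6, max_dt=100, max_df=200):
    # peaks: list of Peak objects with .time and .freq attributes
    peaks = sorted(peaks, key=lambda p: p.time)        # sort by time value
    pairs = []
    for i, anchor in enumerate(peaks):
        found = 0
        for other in peaks[i + 1:]:                    # only look to the right
            if other.time - anchor.time > max_dt:
                break                                  # outside the box: stop early
            if abs(other.freq - anchor.freq) <= max_df:
                pairs.append(PeakPair(timestamp=anchor.time,
                                      freq1=anchor.freq,
                                      freq2=other.freq,
                                      delta_time=other.time - anchor.time))
                found += 1
                if found == fanout:                    # stop after F pairs per anchor
                    break
    return pairs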
3.1.6 Algorithm class

The spectrogram generation, extraction of peaks, peak pairing, and matching process are all managed by a single class responsible for the signal processing algorithm itself. This class is called ShazamRecreation and communicates with other classes like Spectrogram and Connector. The class has two major public methods. The first one is add_song(), which takes a SongInfo object, a waveform and its sample rate as arguments. This method calls other pieces of code to find and pair peaks into fingerprints, then sends these fingerprints together with the SongInfo object as inputs to Connector.insert_song(), which in turn adds the computed fingerprints and the song to the database. The other major public method, find_song(), takes a waveform and its sample rate as inputs, finds the best matching song, and returns a SongInfo object for the match as well as the timestamp at which the match was found. This method also first calls other pieces of code to generate fingerprints, but this time for a recording instead of a clean song. Nevertheless, the fingerprint extraction process is the same. Then the method calls Connector.get_matching_hashes() to find the candidate songs and subsequently calculates the score for each of those songs. This is done by putting fingerprint distances into bins and finding the largest bin for each song, as described more closely in section 2.4. The song metadata, encapsulated in a SongInfo object, the timestamp at which the match was found, and the certainty score are returned together from the method as a tuple object. The certainty score is explained thoroughly in the next section. The timestamp can be known thanks to the fact that each PeakPair object stores the timestamp at which it occurs. When matching fingerprints, the time difference between them is the same as the time in the song at which the recording was most likely taken. One minor detail is that this difference in time has to be converted from sample index to seconds. This can be done easily by dividing by the spectrogram sample rate, which is calculated as the sample rate of the initial recording divided by its waveform length (sample count) and then multiplied by the number of columns in the spectrogram.

3.1.7 Certainty score

As explained in section 2.4, when matching a recording to candidates, each candidate song gets a score based on the similarity between it and the recorded sample. These scores are the basis on which the song prediction is made. It is relevant to know how confident a prediction is, meaning how large the score for the predicted song is compared to the other song scores. If multiple songs get similar scores, it indicates that the algorithm isn't very confident about which of them is the best choice. Conversely, if one song gets a score a hundred times larger than all other songs, it is very likely that the song is in fact the correct candidate. For the purpose of knowing how confident a prediction is, we have invented what we call the certainty score. The certainty score is defined as the ratio between the scores of the two highest-scoring songs. A certainty score of one means that two songs had equal scores, and both were equally good candidates for being chosen as the answer. This is the worst certainty score possible, since the algorithm still has multiple candidates even after the matching process. A certainty score larger than, say, three would indicate a quite good candidate that most likely is correct.
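The certainty score can be expressed in a few lines on top of the score dictionary from the earlier matching sketch; the function name and the handling of the degenerate one-candidate case are assumptions.

def certainty_score(scores):
    # scores: dict mapping song name to its similarity score
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) < 2 or ranked[1] == 0:
        return float("inf")            # only one plausible candidate
    # Ratio between the two highest-scoring songs; 1.0 means a tie.
    return ranked[0] / ranked[1]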
3.1.8 User interface

Given that the project's primary focus revolves around recreating the Shazam algorithm, our group opted to craft a user-friendly interface that offers simplicity and functionality rather than anything fancy or complex. This approach aims to make the application easier to use and test, and to make it easier to display results.

To get started with the interface development, we utilized the Python library Tkinter. With Tkinter we could easily spawn a window on-screen and draw labels and buttons inside it. The buttons could be hooked up to functions that were called when they were pressed by the user. Although the Tkinter library offers minimal design choices, it was perfect for our needs. From the beginning, we thought that it would be easiest to have a start and a stop button to control the audio capturing process. The initial user interface contained these two buttons, a blue background, and a title text. The issue was that we hadn't yet figured out how to make the recorder non-blocking, meaning we couldn't stop the recording using the stop button. In fact, the whole user interface was completely unresponsive while we were capturing audio data. This was because the main thread was busy reading the audio channel buffer. To temporarily fix this, we had to specify the duration of a recording beforehand. When we figured out how to record audio with a background process, we could start using the stop button. The real Shazam application doesn't have a stop button. It only has a start button, because it listens to the surroundings until a song is found or a timeout occurs. Eventually, our stop button was removed for good as we focused on replicating the real-time identification algorithm, which is explained in section 3.1.9.

The final interface is still very simple. It contains just the start button and a title (see figure 3.1). When the start button has been pressed by a user, the button disappears and is replaced by a text displaying "listening...". To improve the user interface while the application is listening for sound, we thought about the feedback design principle. By animating the number of dots shown at the end of the "listening..." text, a user won't worry that the program has become unresponsive. Animations usually give a sense that something is still happening, which is a good way to convey feedback to the user. When a song has been found, the "listening..." text is swapped out for the starting title. The song title, artist name, album and year are displayed in a small information box. The start button also reappears so that the user can go again if they want to. In case no song is found, an information message about it is displayed instead of song metadata.

Since we wanted to make the application accessible to more people, we wanted to include at least one more language in the user interface. For every label in the user interface, we made sure to add text for both English and Swedish, depending on the user preference. We also implemented a blind mode, which causes each label in the user interface to be read out loud by a computer-generated voice. This was done by utilizing the external Python library gTTS. In this mode, the start button is also pressed automatically by the program. This way, a blind user could use the application to identify a song without having to press any button or read any text.

Figure 3.1: The three different states the user interface can display: the starting state shown when the program is launched (a), the listening state shown as a loading screen during the search (b), and the match found state showing the matching song with metadata (c).

The final theme used is a built-in theme from the Tkinter extension ttkbootstrap. It gives the interface a clean and simplistic look, which is exactly what was sought after. In the next section the driver code is explained, which integrates the previous parts into the full application.

3.1.9 Main program

The main program, as said before, is responsible for piecing together the different parts of the puzzle. It integrates all the components described in this chapter into a working product. The main program is a Python script that, when run, executes the code that causes a window to appear, showing the graphical application interface to the user. The main program also hooks up the start button to the logic that causes the recorder and algorithm to start working.
When the start button has been pressed by the user, Recorder.start() is first called, starting the microphone recording on the computer. Then, an algorithm is used to find the song in seemingly "real time". The algorithm in itself is just a loop that every three seconds takes the recording and sees if a match is found. More precisely, every third second the so-far recorded waveform is copied and sent as an argument into ShazamRecreation.find_song(), which tries to find the best match and returns the answer. Running the algorithm takes a few seconds, usually longer than three. This means that when it is once again time to run the algorithm on the now three seconds longer recording, the last algorithm call is probably still running. This causes a problem if find_song() is called on the main thread, because a new algorithm call cannot be made before the old one is complete. The solution is to spawn a new thread to run each algorithm call on. Once a call is complete, the response is saved into a Python object that the main thread can access. The main thread's responsibility is merely to regularly spawn new threads that run the algorithm on the recording, then wait until one of them has terminated. The first resulting match can then be provided to the user via the graphical interface.

Although this approach could identify songs quickly and in seemingly real time, the answers were often incorrect. After some investigation, we could see that the first matching attempt (after three seconds of recording) wasn't very accurate. Presumably the recording matched a lot of incorrect songs due to its small length. It was at this moment the need for the certainty score arose. The first matching attempt usually had a particularly low certainty score, meaning the prediction wasn't likely to be correct. As subsequent attempts were made, the certainty score rose. A decision was made to introduce a certainty threshold, to separate song results based on the algorithm's confidence. If a song result had a certainty score below 1.5, it was ignored and the main process kept waiting for other results. As soon as a song had a certainty score equal to or above 1.5, it was displayed to the user and the matching was considered complete.

When a song has been identified and displayed to the user via the user interface, we wanted to continue playing the song at the timestamp at which it was found. This would better showcase the timestamp feature, which is supposed to accurately detect where in the song the recording is matched. In the same way that text was generated into speech and played to the user in blind mode, the songs could be played to the user as well.

It is unwanted to leave the user waiting for too long when trying to find a song. If a song is particularly hard to find, we want the algorithm to stop after a long enough time. The main process makes sure that if 30 seconds have passed since the user first pressed the start button, all following results are invalid. Also, an information message is shown to the user about the song taking too long to find.
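The polling loop described above can be sketched as follows. The three-second interval, the 1.5 certainty threshold and the 30-second timeout come from the text, while the threading details, the return format of find_song and the function names are assumptions.

import threading
import time

def identify(recorder, algorithm, interval=3.0, threshold=1.5, timeout=30.0):
    # Hypothetical driver loop; algorithm.find_song is assumed to return
    # (song_info, timestamp, certainty) as described in section 3.1.6.
    results = []

    def worker(waveform, sample_rate):
        results.append(algorithm.find_song(waveform, sample_rate))

    start = time.time()
    while time.time() - start < timeout:
        waveform, sample_rate = recorder.get_waveform()
        threading.Thread(target=worker, args=(waveform, sample_rate), daemon=True).start()
        time.sleep(interval)
        for song_info, timestamp, certainty in list(results):
            if certainty >= threshold:
                return song_info, timestamp      # confident enough: matching is complete
    return None                                  # no confident match within the timeout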
3.2 Data collection, data management and the database

The reason a database is used in the first place is to store the fingerprints of each processed reference song in a permanent place. This way, they can be accessed across application runs without having to be recalculated every time. Since the number of hashes is in the millions with approximately a hundred reference songs, a database management system is used to store the fingerprints and song metadata in a robust and streamlined fashion. We opted to use the SQL-based database system PostgreSQL for our data storage needs. This decision was influenced by its support for handling asynchronous queries, a crucial requirement for our application since it operates multiple threads concurrently, constantly querying the database. Moreover, PostgreSQL offers straightforward setup and local deployment, alleviating concerns about setting up a server.

The database contains two tables. The first table stores the metadata for each song inserted by the algorithm. This table is called songInfo and contains the same pieces of information as the dataclass SongInfo. Since we assumed that a song can be uniquely identified by its name and artist, we selected the combination of those two attributes as the primary key of the songInfo table. The other table, called fingerprints, contains all hashes generated by the maxima hashing step in the algorithm. Not only are the hashes stored, but also the timestamp at which their corresponding peak pair occurred in the song, and a reference to songInfo indicating which song the fingerprint belongs to. Since the song name and artist form the primary key of the songInfo table, the reference consists of both the song name and artist in the fingerprints table as a foreign key.

Initially, the hash alone was chosen as the primary key of the fingerprints table, since we thought the hashes would be unique, but this caused problems. A primary key comes with a uniqueness constraint, meaning that there can only exist one row for every value of the primary key. Only allowing unique hashes would defeat the purpose of matching fingerprints, since the identification process depends on finding identical fingerprints. With the hash as the only primary key we were not able to add identical fingerprints. Consequently, it was decided to make the combination of all attributes, songName, songArtist, hash and timestamp, the composite primary key. This allows identical fingerprints from different songs at the same timestamps, as well as from the same song at different timestamps. An advantage of PostgreSQL is that indices are automatically created for primary keys. An index is a data structure that greatly reduces the search time of database queries. The database ended up having the schema displayed in figure 3.2.

Database schema for the fingerprints and songInfo tables:
fingerprints(hash, timestamp, songName, songArtist)
    (songName, songArtist) → songInfo.(name, artist)
songInfo(name, artist, album, albumcover, releaseDate)

Figure 3.2: The schema describes the tables in the database, which attributes they have, and how the primary and foreign keys are set up. In other words, the schema shows the relations and the relationships between them.

After adding more and more songs to the database, it ended up housing metadata and fingerprints from 101 songs. The songs stored in the database were taken from already downloaded music files stored on our own computers. We deliberately refrained from storing the actual songs in the database, firstly due to uncertainty surrounding the legal implications of handling downloaded music files, and secondly because the audio files are large.
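As an illustration of the schema in figure 3.2, the following sketch creates the two tables from Python. The column types and the use of the psycopg2 driver are our assumptions; the thesis does not specify them.

# Sketch of how the schema in figure 3.2 could be created from Python.
# Column types and the psycopg2 driver are assumptions, not the project's code.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS songInfo (
    name        TEXT NOT NULL,
    artist      TEXT NOT NULL,
    album       TEXT,
    albumcover  TEXT,
    releaseDate DATE,
    PRIMARY KEY (name, artist)
);

CREATE TABLE IF NOT EXISTS fingerprints (
    hash       BIGINT NOT NULL,
    timestamp  INTEGER NOT NULL,
    songName   TEXT NOT NULL,
    songArtist TEXT NOT NULL,
    -- composite primary key: identical hashes may repeat across songs/timestamps
    PRIMARY KEY (hash, timestamp, songName, songArtist),
    FOREIGN KEY (songName, songArtist) REFERENCES songInfo (name, artist)
);
"""

with psycopg2.connect(dbname="shazam", user="postgres") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)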
Utilizing the index on the fingerprints table in the database, lookup times are greatly reduced. The index essentially works like a hash table. A hash table allows inserting and retrieving key-value pairs in amortized constant time, and the keys inserted into the hash table are the hash values. With most hash tables, the size is dynamically updated so that it does not become too large. When inserting or retrieving values for a key, the modulo operator is applied to the key using the table size. When roughly half the table is filled with key-value pairs, the table is swapped out for a new table of double the size, and all old key-value pairs are reinserted into the new table at new locations. This is also the way Python hash tables (dictionaries) work. This means that the hash table will never be more than roughly double the size it needs to be, but most importantly, all of this is handled by the database management system. The only thing one has to care about when using the index is inserting and retrieving key-value pairs. The hash table is dynamically kept at the smallest address space possible, while still offering amortized constant time complexity, O(1). Furthermore, since the database management system constrains the key to the space of the table itself, it does not really matter how big our fingerprint hashes are. However, when calculating the fingerprint hash, we still take it modulo one billion to reduce the fingerprint to a number that fits in four bytes.

There are multiple ways to create a good hash from the three attributes of a PeakPair object. We could concatenate the binary representations of the three values, and the result would always be a fixed number of bits, unique for each unique combination of the three attributes. Alternatively, we could hash the three values together using the so-called polynomial rolling hash function, which does not guarantee uniqueness for each unique attribute combination, but in practice works just as well, because collisions are rare. We implemented the latter approach for the simple reason that we thought of that approach first. Which approach we take affects neither the time complexity of inserting or finding hashes in the database, nor the address space of the database hash table. The hash function implementation is presented in section 3.1.2, where each of the three attributes is multiplied by a prime number raised to the power zero, one, and two, respectively. The terms are summed, and the result is taken modulo one billion. If we were to change any of the three values slightly, the hash value would be different. Similarly, if we were to swap the order of the three input values, the hash value would also be different.
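A minimal sketch of such a polynomial rolling hash is shown below. The attribute names of the peak pair and the choice of prime are placeholders; the exact constants used in the project are the ones given in section 3.1.2.

# Sketch of a polynomial rolling hash over the three attributes of a peak pair,
# following the description above. Attribute names and the prime are placeholders.
P = 1_000_003          # an example prime
MOD = 1_000_000_000    # keep the hash within four bytes

def fingerprint_hash(freq_a: int, freq_b: int, time_delta: int) -> int:
    """Combine the three peak-pair attributes into a single hash value."""
    return (freq_a * P**0 + freq_b * P**1 + time_delta * P**2) % MOD

# Changing any attribute, or swapping the order of the inputs, gives a different hash:
assert fingerprint_hash(100, 200, 5) != fingerprint_hash(200, 100, 5)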
3.3 Testing

To ensure that the program worked as intended, tests were performed. First, tests were conducted to ensure that the algorithm could match clean recordings of songs. Then, noise was added and new tests were conducted. Since almost all real-life applications of the Shazam algorithm involve noisy recordings, we tested the algorithm against a wide range of noise levels. We tested the application in real environments, but also in a simulated environment where noise was artificially added to clean music. In addition, some of the algorithm's hyperparameters were optimized.

3.3.1 Testing without noise

In the requirement specification for this project (see table 1.1), there are two requirements regarding recognition of songs without noise. To verify these requirements, a Python script was written to generate 101 recordings. Since the requirement was to test without noise, these recordings were taken directly from our music library. For each of the 101 songs, a five-second recording from a random part of the song was taken and then passed to the ShazamRecreation.find song() method automatically. The script then displayed the number of correct identifications out of the 101 trials. The test was performed 20 times on all 101 songs in the database, and each time a new five-second sample from each song was used. The maximum number of hashes per peak (F) and the radius of the maximum filter (R) were both set to ten in this test.

3.3.2 Real-life noise testing

The real-noise tests were conducted simply by pressing start in the application while playing music from an external audio source with talking in the background. To vary the noise between tests, the talking was done closer to add more noise and further away to add less noise. The distance from the music source to the microphone was kept the same. The test was conducted on ten different songs and the result of each test was either a success or a fail (shown in section 4.2). As in the previous test, F and R were both set to ten, and no timeout was used during this test.

3.3.3 Computer automated noise testing

A disadvantage of testing the application against real-life recordings was firstly that they were slow to perform, and secondly that it was hard to maintain constant noise levels across tests. With computer-automated tests, hundreds of recordings could be generated in under a second. We could also add equal noise levels to all the generated recordings, making sure the tests had similar conditions.

One computer-automated test stressed the algorithm across a wide range of noise levels. For this we utilized the concept of SNR explained in section 2.5. For SNR values from 60 dB all the way down to -20 dB, the algorithm was tested on generated recordings with those SNR values. To artificially add noise to a clean recording so that the result has a desired SNR, we rearranged equation 3.1 in the following way. The equation solves for the desired RMS value $A_{noise}$ of the noise signal, given the desired $\mathrm{SNR}_{\mathrm{dB}}$ value and the computed $A_{signal}$:

\[
\mathrm{SNR}_{\mathrm{dB}} = 20 \cdot \log_{10}\!\left(\frac{A_{signal}}{A_{noise}}\right)
\quad\Leftrightarrow\quad
A_{noise} = \frac{A_{signal}}{10^{\,\mathrm{SNR}_{\mathrm{dB}}/20}}
\tag{3.1}
\]

To generate noise with a desired RMS value $RMS_{desired}$ (which in our case equals $A_{noise}$), a signal of random amplitudes was first created, since such a signal is considered noisy. This signal's length exactly matched the clean music signal's length, since the two were added together in a later step. The RMS value of this noise signal, $RMS_{initial}$, is calculated using equation 2.3. To give the noise signal the desired RMS value, each sample in the signal is scaled by the factor $RMS_{desired}/RMS_{initial}$. The noise was then added to the clean music signal.

In the test, 200 random five-second recordings were generated for each of the SNR values, using a Python script. The percentage of correctly matched songs was automatically calculated and stored. Some multiprocessing techniques were utilized to run the test in parallel on a computer with 24 cores: a process pool was created so that 24 instances of the test could run at the same time, each with their own set of SNR values.
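The noise-addition step can be sketched as follows, assuming NumPy is used for the signal processing; the function names are illustrative and the project's test script may be organized differently.

# Sketch of the noise-addition step described above: scale white noise so that
# the mixture reaches a target SNR, following equation 3.1.
import numpy as np

def rms(signal: np.ndarray) -> float:
    """Root-mean-square amplitude of a signal (equation 2.3)."""
    return float(np.sqrt(np.mean(signal ** 2)))

def add_noise(clean: np.ndarray, snr_db: float,
              rng: np.random.Generator = np.random.default_rng()) -> np.ndarray:
    """Return `clean` with white noise added so that the result has the given SNR."""
    a_signal = rms(clean)
    a_noise = a_signal / (10 ** (snr_db / 20))        # rearranged equation 3.1
    noise = rng.uniform(-1.0, 1.0, size=clean.shape)  # random-amplitude signal
    noise *= a_noise / rms(noise)                     # scale to the desired RMS
    return clean + noise

# Example: a 5-second, 44.1 kHz sine tone degraded to 5 dB SNR.
t = np.linspace(0, 5, 5 * 44100, endpoint=False)
noisy = add_noise(np.sin(2 * np.pi * 440 * t), snr_db=5)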
This resulted in a speedup that made the test run in approximately five hours instead of the estimated 50. The resulting percentages of correctly matched songs for each SNR value were compiled into a plot, which is shown in section 4.3. Again, the parameters F and R were both set to ten in this test.

3.3.4 Hyperparameter testing

To refine the parameters and enhance the performance of our algorithm, we conducted computer-automated tests with different hyperparameter values. The optimization aimed at refining the hyperparameters F and R: the maximum number of hashes per peak, and the radius of the maximum filter applied to the spectrogram, respectively. Both of these parameters have a direct effect on the number of fingerprints extracted from a piece of audio, and thus on the execution time of the algorithm. On the other hand, more fingerprints can lead to more effective identification. These hyperparameters therefore needed to be selected as a good trade-off between speed and accuracy.

During the optimization testing, we systematically varied the values of F and R within predefined ranges. In the test sections, F is referred to as Pairvalue and R is referred to as Filtervalue. To find the best values, we created a test bench that evaluated each parameter combination: for each combination, the algorithm was tested with those parameter values on clean music recordings, and the resulting performance was automatically stored to be analyzed later. The goal of the optimization was to find the combination of hyperparameters that yielded the highest certainty in the test bench, the reasoning being that a higher certainty indicates a more confident answer. Our primary focus was on identifying parameters that maximized confident results rather than algorithm speed; later on, we could pick among these parameters in a way that also maximized the algorithm's efficiency. The test bench's running time is primarily determined by the number of songs in the database and by the execution time of ShazamRecreation.find song(), which partly depends on the hyperparameter values.

The decision was made to optimize only the two parameters F and R, which we found most important since they have a great effect on the number of fingerprints generated. Increasing the number of parameters would result in much longer testing times, because all combinations of the parameters have to be tested. To further increase the test bench execution speed, we also decided to limit the number of songs in the database to only eight. To represent the result of each test bench run, the average certainty for each combination of parameters was calculated and displayed in a heatmap with Pairvalue and Filtervalue as the two axes. The brightest pixels in the heatmap reflect the best parameters in terms of algorithm accuracy. The results of the optimization and the heatmaps themselves are presented in section 4.4.

A major oversight in this test was that, while we varied the hyperparameters when trying to match songs, the fingerprints already in the database had been generated with specific, fixed values of those hyperparameters. In practice, the fingerprint extraction process uses the same hyperparameter values when inserting songs as when matching them, so an ideal test would use the same hyperparameters for loading songs as for identifying them.
However, loading in all songs with all combinations of hyperparameters would take far too long. Thus, we decided to load songs into the database using three different combinations of the hyperparameters, and to generate heatmaps just for those three cases.

3.3.5 Code tests

The previous sections mostly discuss testing the whole algorithm against recordings. This section instead covers more specific tests that verify the functionality of smaller code blocks. In particular, the maximum filter and the peak pairing functions are tested.

The first type of test made on both functions is a unit test. A unit test verifies the functionality of a piece of code by comparing its output to an expected value. In the unit test for the maximum filter, an eight-by-eight matrix of integers between zero and ten is created, with values hand-picked by us. Another eight-by-eight matrix is created, containing the expected result of running the maximum filter on the first matrix, with values calculated by hand. In this test, the radius parameter of the maximum filter is R = 3. Finally, the maximum filter is applied to the first matrix and the result is compared with the second matrix. Using the assert keyword in Python, we make sure that the program terminates with an error message if the test fails. The same was done for the peak pairing function: some peaks were created manually, the expected peak pairs were calculated by hand, and the result of running the function on the peaks is compared to the expected values, terminating with an error if they are not equal. In this test, the parameter F was set to one for simplicity.

The second type of test made on both functions is a functional test. In such a test, a more precise and often slower reference function is written to compare the result against. For the maximum filter, such a function was written that loops through each element in the matrix and compares it to all values in the surrounding box given by the radius R. This function has four nested loops, since we loop through each row and column of the matrix and, for each such position, through each row and column of the surrounding box. Although a lot slower, this function quite clearly does exactly what we want it to. An input matrix was generated and sent as input to both the fast and the slow versions of the maximum filter, and their outputs were compared to make sure they were equal. The same was done for the peak pairing function: a slower version was written that did not utilize the sorting tricks for skipping peaks, and the resulting peak pairs from both functions were compared with each other.

Unit tests and functional tests were not the only tests made on the code. The interface against the database was tested using an integration test, which essentially consisted of running the main program and verifying that it could call the methods in the Connector class to store and retrieve data from the database. Lastly, an end-to-end test was performed. An end-to-end test in our case means running the entire application and using it to identify a song correctly. Since some of the tests explained in previous sections (for example section 3.3.2) were end-to-end tests, no extra test had to be made.
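As an illustration, a functional test of the maximum filter along these lines could look as follows. Here scipy.ndimage.maximum_filter stands in for the fast implementation, which is an assumption on our part; the project's own fast version and its edge handling may differ.

# Sketch of the functional test described above: a slow, obviously-correct
# maximum filter with four nested loops is compared against a fast version.
import numpy as np
from scipy.ndimage import maximum_filter

def slow_maximum_filter(matrix: np.ndarray, radius: int) -> np.ndarray:
    """For every element, take the maximum over the surrounding (2*radius+1) box."""
    rows, cols = matrix.shape
    out = np.empty_like(matrix)
    for i in range(rows):
        for j in range(cols):
            best = matrix[i, j]
            for ii in range(max(0, i - radius), min(rows, i + radius + 1)):
                for jj in range(max(0, j - radius), min(cols, j + radius + 1)):
                    best = max(best, matrix[ii, jj])
            out[i, j] = best
    return out

rng = np.random.default_rng(0)
matrix = rng.integers(0, 11, size=(8, 8))   # random 8x8 matrix of values 0..10
R = 3

# mode="nearest" replicates edge values, matching the clamped window above.
fast = maximum_filter(matrix, size=2 * R + 1, mode="nearest")
assert np.array_equal(fast, slow_maximum_filter(matrix, R)), "maximum filter mismatch"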
These tests are of course not sufficient evidence that everything works correctly, since there may be edge cases that do not appear in the test inputs. However, they do guarantee that if they fail, something is definitely wrong with the implementations. The further we develop our test bench, the higher confidence we have in the robustness of the program.

4 RESULT

The following sections present the results of all tests described in the previous chapter. The tests are partly a verification of the program's functionality, and they also showcase its performance.

4.1 Results for tests without noise

The result of the test described in section 3.3.1, where clean recordings were fed directly to the algorithm, is presented here.

Test  Result
1     101/101
2     101/101
3     101/101
4     101/101
5     101/101
6     101/101
7     101/101
8     101/101
9     100/101
10    101/101
11    101/101
12    101/101
13    97/101
14    101/101
15    101/101
16    101/101
17    101/101
18    101/101
19    101/101
20    100/101

Table 4.1: Number of songs correctly identified from different five-second samples.

Over all 20 iterations, the algorithm failed to recognize only 6 of the 2020 recordings, which corresponds to an accuracy of approximately 99.7%.

4.2 Real-life noise test results

Testing with low levels of noise gave: Fail: 5, Success: 5. Testing with a significantly larger amount of noise gave: Fail: 7, Success: 3. Success means that the correct song was identified, and fail means either that an incorrect song was identified or that no song was found. The hyperparameters used were Pairvalue equal to four and Filtervalue equal to ten, chosen in order to meet the time criterion stated in table 1.1.

4.3 Computer automated noise test results

The result of the automated test described in section 3.3.3 is shown in the following figure. Note that the SNR axis intentionally goes from positive to negative.

Figure 4.1: A plot showcasing the algorithm's accuracy as it was tested on recordings with different signal-to-noise ratios. The algorithm maintains a high accuracy even when the noise is almost as loud as the input signal. The accuracy starts to decline steadily around SNR ≈ 5 dB, below which the signal's RMS is less than about 1.78 times that of the noise.

4.4 Results of hyperparameter testing

The heatmaps are the result of the parameter optimization test in section 3.3.4, and describe how the mean certainty of the matched songs depends on the hyperparameters Filtervalue and Pairvalue. Figures 4.2, 4.4 and 4.6 illustrate heatmaps, which represent how the certainty (z-axis) depends on Pairvalue (y-axis) and Filtervalue (x-axis) when matching tests are performed. Due to the mistake we made, where songs were inserted using different hyperparameters than those later used when testing, three plots have been generated. In each of the three plots, a specific Filtervalue and Pairvalue were used to add the songs, while the plot axes show the Filtervalue and Pairvalue used to identify songs. Next to each heatmap, a graph shows the column containing the highest certainty value; its y-axis represents certainty and its x-axis represents Pairvalue.

(a) Heatmap when Pairvalue equals ten and Filtervalue equals ten. (b) Graph over column ten in Figure 4.2.a. (c) Heatmap when Pairvalue equals eight and Filtervalue equals ten. (d) Graph over column ten in Figure 4.2.c. (e) Heatmap when Pairvalue equals four and Filtervalue equals ten. (f) Graph over column ten in Figure 4.2.e.
Figure 4.2: Heatmaps and the corresponding graphs over the highest-certainty column, for three different combinations of hyperparameter values used when preprocessing the songs. Higher certainty gives a brighter color in the heatmaps and lower certainty gives a darker color.

4.5 Results for code tests

The unit tests and functional tests explained in section 3.3.5 were executed. If any of the four tests had failed, the program would have terminated with an error message; if all tests succeeded, it would terminate with a message indicating success. After executing the tests, the success message appeared, meaning all tests had passed. The integration test and the end-to-end tests were also verified, since multiple songs were successfully found using the complete application, as stated in section 4.2.

5 DISCUSSION

In this chapter, the results are evaluated in relation to the requirements specification (see table 1.1) and the overall project objectives. Then, some improvements and ideas for future development are raised.

5.1 Verification of requirements

The first group of requirements concerns the database containing fingerprints and information about songs. The database is filled with fingerprints from 101 songs, which means that requirement 1.1, about having enough songs, is verified. Requirement 1.2 says that the songs should be diverse genre-wise, by covering at least five different genres. Many would say that genres are hard to count since they are not always well defined. However, we can safely assume that we have at least five genres in our music database; to name a few, we have pop, electronic music, disco, hip hop, and Swedish classics. Another requirement for the music database is that the music is spread out in time. The verification for this was that at least three songs from the 1980s, 1990s, 2000s, 2010s, and 2020s should exist in the music database. Since the information we store for each song contains not only the song and artist names but also the release date, we could easily check in the database that it contained at least three songs from these decades. To conclude, all three requirements regarding the database have been successfully verified.

The second group of requirements concerns the recording part of the project. Requirement 2.1 states that the recorded samples cannot be too long. There are two reasons for including this criterion: firstly, it is undesirable that the application keeps listening forever, and secondly, the algorithm should be able to identify songs from short recordings. The main script of the application sends the recording to the algorithm when it is three, six, nine, and twelve seconds long. The longest recording the algorithm receives as input is thus twelve seconds, which means that this criterion is fulfilled. The next requirement is about the recorder code being non-blocking. As explained in section 3.1.1, separate threads were used for copying the microphone audio buffer into main memory and for the rest of the code. This was done exactly for the purpose of concurrency, because it allows the algorithm to try to find the song at the same time as it is listening to it. With criteria 2.1 and 2.2 fulfilled, the second group of criteria in the requirements specification is complete.

The third group of requirements concerns the performance of the algorithm itself. Requirement 3.1 declares that the identification process cannot be too slow.
When running the application and pressing the start button, the identification result has to be shown to the user within 30 seconds of the button press; whether a song is found or not, some result has to be ready before 30 seconds have passed. One way to achieve this time limit would have been to optimize the algorithm to always finish matching within a short time. However, since the matching time depends on many factors, for example the computer it runs on and the characteristics of the recording, there is no clear goal in terms of speed. Rather, we consider a slow identification as a failure and stop the algorithm prematurely. As explained in section 3.1.9, this is done by displaying to the user after 30 seconds that the identification failed because it took too long. By handling it this way, we fulfill the requirement.

The second and third subrequirements for performance concern identifying songs that do not contain noise. As shown in section 4.1, 101 out of 101 songs were correctly identified most of the time when no noise was added. It was a must that at least 80 songs could be identified, and we hoped for 95. With the worst run at about 96% accuracy and the typical run at 100% accuracy, it is clear that both these criteria have been met.

The following three requirements, 3.4, 3.5 and 3.6, concern testing with real noise. The tests were conducted by simply recording a song from a speaker while talking normally in the background. When conducting the tests for 3.6 specifically, which states that at least one song should be recognized with strong background noise, additional talking was done closer to the music source. As shown in section 4.2, the tests with low levels of noise resulted in a success rate of 50% and those with higher levels of noise in a success rate of 30%. Thereby, we deemed that the requirements were passed. Due to the nature of the test, it was hard to determine the level of noise in each trial, but by analyzing the graph in section 4.3 we estimate that the overall noise in the low-noise test was about 5 dB louder than the music, and in the high-noise test about 15 dB louder than the music.

Lastly, requirement 3.7 is about the algorithm being concurrent, in the sense that two different recordings can be identified simultaneously by one computer. As described in section 3.1.9, new threads are spawned as soon as it is time to run ShazamRecreation.find song() with a longer recording. With multiple threads active, the desired concurrency is achieved, and thus this criterion is fulfilled as well.

The fourth group of requirements concerns the graphical user interface of the application. Criterion 4.1 says that the application should be similar to the real one; the verification for this is to have a single button that starts the recording and identification process. In section 3.1.8 it is explained that the final interface contains just a start button and a title text. With a simple interface like this, the user experience is very close to that of the real Shazam application. After finding a song, the song and artist names should be displayed, as well as the release year and album name, which is what requirement 4.2 is about. As described in section 3.1.8, all four of these pieces of information are shown to the user if a match is found. In case no match is found, a notification message about no match being found is displayed instead of the song metadata. Since this is the criterion for requirement 4.3, it is also fulfilled.
Requirements 4.4 and 4.5 regard accessibility. According to the verification criterion for requirement 4.4, the user has to be able to select either English or Swedish for the graphical user interface. The criterion for requirement 4.5 is that the program must be able to start in blind mode, where all displayed text is also played as audible speech to the user. In the section about the user interface, it is stated that both English and Swedish versions of the application exist, and that blind mode has been implemented. In blind mode, not only is all text presented as audible speech, but the start button is also pressed automatically, so the user does not have to do it. Comparing these facts with the verification criteria, it can be concluded that both requirements 4.4 and 4.5 were successfully met, even though they were not mandatory. Criterion 4.6 states that after a song has been found, a user can simply continue using the application and find another song if they would like to. In the same section it is stated that the start button reappears once the result has been shown, so that a user can try again. The final criterion regarding the user interface is that it should work on both the Windows and macOS operating systems. Since one member of the group uses a Mac computer, this has been verified continuously while developing the application. To conclude the part of the requirements specification about the user interface, all seven sub-criteria have been met.

The final part of the requirements specification regards software testing. The tests for all five requirements were explained in section 3.3.5, and the results were presented in section 4.5. The tests were constructed according to the verification descriptions in the requirements table. Since all tests were successful, the criteria for requirements 5.1 through 5.5 have been met, which means that the fifth group of requirements was also verified.

5.2 Analyzing heatmaps and hyperparameters

The testing of the program consisted of multiple tests, mainly to make sure that our program and algorithm worked smoothly, but also to refine some of the parameters. The generated heatmaps (see section 4.4) display how the certainty changes with different values of the parameters. By analyzing the maps we can see that for both low and high values of Filtervalue the certainty is low and less affected by the Pairvalue. The reason is that the lower the Filtervalue, the higher the number of peaks in the constellation map (see section 2.4). With a large number of peaks present in the search area (see section 2.5), fingerprints get extracted not only from the music but from the noise as well, which can result in many incorrectly matched hashes. For higher values of Filtervalue, fewer peaks are created, and the reason that no matches are found here is the lack of peaks in the search area: there are not enough peaks to pair, and therefore not enough data to search with, which leads to wrong answers.

The column corresponding to a Filtervalue of ten contains the highest certainty values. Looking at the graphs, which plot the certainty of that column against Pairvalue, we observe that the curve begins to flatten out at a certain Pairvalue. As Pairvalue increases further, the certainty does not improve, and in some cases it slightly decreases.
The reason for this is that when the songs were processed in their entirety and their hashes were added to the database, the Pairvalue and Filtervalue were almost the same as the values at which the maximum certainty was later found. Looking at all the heatmaps, we can see a similar pattern. We did not get the chance to test different values of Filtervalue when preprocessing songs, due to lack of time. In conclusion, the heatmaps show that matching works best when the same hyperparameters are used during testing as when preprocessing songs, which is expected. For larger Pairvalue there is an exception, where the certainty becomes somewhat better with a slightly higher Pairvalue than the value used when adding the music files. The likely reason is that it does not hurt to generate a few more fingerprints from a recording to match with.

5.3 Future improvements

For future projects, it would be interesting to explore the alternative approach of using artificial intelligence (AI) and machine learning (ML) for music identification. It is possible that such a path could result in a faster identification process and recognize songs with more added noise. This was something we had in mind to explore and compare our algorithm with, as stated in section 1.5. An AI- or ML-based approach to music recognition could perform better when trained to adapt to different surroundings and environments. However, it is important to recognize that implementing an AI-based solution brings challenges of its own, such as data acquisition, model training, and computational requirements, which would have to be addressed in future projects.

Many of the tests documented in section 4 took substantial time to execute. One way to be more time efficient in further studies could be to use the graphics processing unit (GPU) for faster calculations. The central processing unit (CPU) of the computer typically handles all calculations, and does so very well. Although CPUs can handle sequential operations with great speed, they are less efficient at processing large volumes of data, because a CPU does not have very many cores and is therefore not very good at massive multitasking. A GPU, on the other hand, has far more cores, although they run at lower clock speeds. The GPU uses these cores for parallel computations, and it can also pull and push data through several different in- and outlets at the same time. Thus, for a GPU to outperform a CPU, the calculations have to be simple and there should be many of them. For instance, a GPU could probably be used both to pair peaks together and to calculate scores for candidate songs faster.

Some parallelization could also have been done on the CPU. For example, instead of spawning a new thread that runs the algorithm on a recording, we could spawn a new process. Threads in Python offer concurrency but not parallelism, due to the global interpreter lock: although multithreading allows a CPU core to context switch between multiple pieces of code, it cannot truly run them simultaneously. Using multiprocessing instead allows multiple cores to run code at the same time, which can lead to a performance boost.

The spectrogram is generated following the procedure outlined in section 3.1.4, utilizing the theory described in section 2.1. However, some parameters used to generate the spectrogram were left unchanged.
It would be interesting to experiment with different section widths as well as different levels of overlap; investigating these as additional hyperparameters in future testing could be worthwhile. A window function is applied in the spectrogram generation (see section 3.1.4) to dampen leakage in the frequency domain. Since the spectrogram class was one of the first classes created in the project, the Hann window was selected without much thought, and it would be interesting to explore how the algorithm behaves when other window functions are applied.

As covered in section 3.1.9, multiple threads are used to run the algorithm on different recording lengths concurrently. To examine and optimize the effect of the threads on matching accuracy, we would like to test the recorder with more threads and a tighter sleeping interval between them. This would probably increase the accuracy, for the simple reason that more recordings from the same song are matched than before, but it would also introduce more computation. Future testing would include finding the number of threads that maximizes accuracy without slowing down the search too much, as well as the best time interval between starting threads.

In the tests displayed in figure 4.2, we saw that the lower the Pairvalue, the lower the average certainty, but what is not shown is that a lower Pairvalue also results in fewer fingerprints, which decreases the search time. One aspect we would have liked to investigate further is the possibility of finding the best trade-off between speed and certainty by making a more complete search through the hyperparameter values. Since the optimal hyperparameters, as discussed in section 5.2, depend on the hyperparameters used when the database was created, the database should then also contain fingerprints created with those hyperparameters. It is also important to note that the heatmaps were created with noise-free samples, which implies that for recordings with background noise, the ideal parameters might differ from those illustrated in the current heatmaps. The major problem with testing the program is that it takes a lot of time; therefore, we were not able to conduct all the tests and optimizations that we wanted to.

5.4 Scalability

As stated in section 3.2, the database contained 101 songs during most of the tests. If the database were scaled up by adding more songs, it is interesting to discuss how the algorithm's performance would scale with the number of songs. Let us assume there are S songs in the database and that a constant N fingerprints are extracted from a recording. There are two ways the performance can scale as S increases. In the first case, we assume that only fingerprints from the correct song are matched in the database. If so, the post-processing (matching) is done on the same number of matched fingerprints, which is on the order of N and independent of S. In the other case, we assume that not only fingerprints from the correct song are matched, but also incorrect ones. Assuming that a fingerprint from an incorrect song is matched with probability p, and that each song has a constant M fingerprints in the database, the number of matched fingerprints will be on the order of N + SMp.
To clarify, we match roughly N correct fingerprints and roughly SMp incorrect ones. As S grows, the term SMp grows linearly in S. How the total sum N + SMp grows in relation to S depends on p, since N and M are considered constants. The post-processing time depends on the number of matched fingerprints, N + SMp. For large enough values of p, the post-processing execution time therefore scales linearly with the number of songs S. In practice, the second case is probably the better model, with a rather low value of p. From our experience testing the algorithm with S between a dozen and a hundred songs, the matching time does seem to increase as S increases; however, it does not appear to increase tenfold when the number of songs increases tenfold. The conclusion is that some of the processing time comes from searching through the correctly matched fingerprints, which stays the same, while some of it comes from iterating through the matched fingerprints from incorrect songs, which grows with S. Assuming that fingerprints rarely match incorrect songs, the time complexity of finding a recording is $O(TF/R^2)$, since that is on the order of the number of fingerprints extracted from the recording. Each fingerprint is processed in amortized constant time, since a hash map is used in the database.

6 CONCLUSION

This thesis aimed to create a functioning application that uses acoustic fingerprinting for music identification. Both clean songs and songs with added noise could be identified by the algorithm. The successful implementation of the project underscores its feasibility and that our approach was sui