Music Recommendations Based on
Real-Time Data
Bachelor of Science Thesis in Computer Science and Engineering

MARCUS AURÉN
ALBIN BÅÅW
TOBIAS KARLSSON
LINNEA NILSSON
DAVID HAGERMAN OLZON
PEDRAM SHIRMOHAMMAD

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden, May 2018


Music Recommendations Based on
Real-Time Data
Bachelor of Science Thesis in Computer Science and Engineering

MARCUS AURÉN
ALBIN BÅÅW
TOBIAS KARLSSON
LINNEA NILSSON
DAVID HAGERMAN OLZON
PEDRAM SHIRMOHAMMAD

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden, May 2018

Abstract

This thesis describes the development, implementation and results of a music recommender system that utilizes
real time data, namely time and heart rate, for the recommendations. The recommender system was made by
combining two systems, the recommender system which predicts a number of song features for a specific user
and a ranking system which finds the best matching tracks for these features. Three implementations of the
recommender system were implemented for comparison, namely Deep Neural Network, Contextual Bandit and
Linear Regression. These implementations were tested with offline evaluation which showed that for our prob-
lem, a contextual bandit model had the best accuracy.

Keywords: Recommender system, music recommendations, neural network, deep learning, reinforcement learn-
ing, contextual bandit, linear regression, deep neural network

i


Acknowledgements

First of all, we want to thanks our supervisor K V S Prasad for his guidance and pep talks. We also want to
thank Mikael Kågebäck for his insightful suggestions regarding machine learning, Niklas Broberg for taking
the time to help us get back on track and Lars Norén for assisting with hardware. Finally, we want to thank
Fackspråk for helping us with insight in the art of technical writing.

Marcus Aurén
Albin Bååw

Tobias Karlsson
Linnea Nilsson

David Hagerman Olzon
Pedram Shirmohammad

Gothenburg, Sweden, May 2018

iii


Sammanfattning

Den här rapporten beskriver utvecklingen, implementeringen och evalueringen av ett musikrekommendations-
system som använder sig av realtidsdata så som tid och puls som en faktor i sina rekommendationer. Rekommendations-
systemet består av två separata delsystem, ett som rekommenderar ett antal sångattribut och ett rankingsystem
som sorterar alla låtar i databasen baserat på de rekommenderade attributen. Tre olika typer av rekommenda-
tionssystem har implementerats för utvärdering. Ett baserat på ett djupt neuralt nätverk, en kontextuell bandit
och en linjär regression. Alla tre implementationer utvärderades sedan med hjälp av offline-evaluering som
visade på att den kontextuella banditen gav oss bäst träffsäkerhet.

v


Contents

List of Figures ix

List of Tables x

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Heart Rate and Music . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Theory 5
2.1 Recommender Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Content-based Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.2 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 Hybrid and Other Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 Machine learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.1 Deep Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.2 Reinforcement Learning and Contextual Bandits . . . . . . . . . . . . . . . . . . . . . 10
2.2.3 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.1 Offline Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.3.2 Online Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Implementation 14
3.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 Mobile Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Web Server . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.5 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.6 Recommender System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.6.1 Deep Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.6.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6.3 Contextual Bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.6.4 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.6.5 Ranking System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4 Results 23
4.1 Evaluation System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Deep Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.3 Contextual Bandit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

vii


4.4 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

5 Discussion 28
5.1 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.2 Scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.3 Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.5 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.6 Ethical Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.7 Impact on Society . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.8 Experience . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

6 Conclusion 32

7 Bibliographic Notes 33
7.1 Recommendation Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

7.1.1 Neural Networks and Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.2 Connecting Music to Heart Rate, Activity or Mood . . . . . . . . . . . . . . . . . . . . . . . . 34
7.3 Machine Learning and Neural Networks in General . . . . . . . . . . . . . . . . . . . . . . . . 35
7.4 Resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

A Privacy Policy I

viii


List of Figures

2.1 An example of how content-based filtering works. . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 An example of how collaborative filtering works. . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 A deep neural network with two hidden layers . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 A diagram over a Neural Network Node . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 Contextual bandit process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.6 Example of linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Example of an A/B-test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3.1 An overview of the different parts in the recommender system . . . . . . . . . . . . . . . . . . 14
3.2 Flow chart of the mobile application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 A screenshot of the mobile application’s user interface. . . . . . . . . . . . . . . . . . . . . . . 17
3.4 A sequence graph for a song recommendation with an empty server cache . . . . . . . . . . . . 18
3.5 The feature weight function for loudness and BPM plotted in a graph . . . . . . . . . . . . . . . 22

4.1 The blue dots represent the score of a recommendation at each iteration and the red line high-
lights the trend of every 10th score by the deep neural network. . . . . . . . . . . . . . . . . . . 25

4.2 The blue dots represent the score of a recommendation at each iteration and the red line high-
lights the trend of every 10th score by the contextual bandit. . . . . . . . . . . . . . . . . . . . 26

4.3 The blue dots represent the score of a recommendation at each iteration and the red line high-
lights the trend of every 10th score by the linear regression. . . . . . . . . . . . . . . . . . . . . 27

ix


List of Tables

3.1 The hardware specification for the system server . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Feature column types used for our deep neural network and linear regression models . . . . . . 20
3.3 Label and action buckets for tempo and loudness . . . . . . . . . . . . . . . . . . . . . . . . . 21
3.4 Heart rate, time value and rating buckets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1 Wanted song attributes by user during different times and heart rates . . . . . . . . . . . . . . . 23
4.2 A sample of the recommended songs by deep neural network. . . . . . . . . . . . . . . . . . . 25
4.3 A sample of the recommended songs by contextual bandit. . . . . . . . . . . . . . . . . . . . . 26
4.4 A sample of the recommended songs by linear regression. . . . . . . . . . . . . . . . . . . . . . 27

x


Chapter 1

Introduction

With today’s music streaming services, enjoying a wide array of music is more accessible than ever. You don’t
have to buy your own copies of the records, but can explore whole libraries of music from your computer, tablet
or smartphone, and at least in theory this makes it easier to find new favourites. The vast number of songs
may be hard to survey, which is why most streaming services offer recommender systems that aim to figure out
what a specific user might be interested in listening to. Our main thought when starting this project was that
these recommendations are seldom very accurate, therefore we started thinking about ways of making better
predictions of songs that the user actually wants to hear.

In the project, we explore if we can get better recommendations by using real-time data, specifically a user’s
heart rate and the time of day, when making recommendations. The system making the recommendations will
be implemented using several machine learning techniques and will be presented through a mobile application.
The system uses a smart watch to recognize the user’s heart rate in order to give recommendations of songs
according to what kind of music is usually associated with that heart rate and time of day for that specific user.
For instance, if a user is out running, the user’s heart rate is probably higher than normal. The intention is that
the system will learn that this user often takes runs in the morning, and that they prefer to listen to e.g. fast and
loud music while exercising. To get this recommendation the user presses the play or next button in the mobile
application which captures the heart rate and current time and sends it to the system. The recommender system
has been trained to recognize patterns in the user’s listening behaviour. It takes in these parameters and matches
them with other metrics such as the audio features of a song in order to make a recommendation to the user.
The best matching song is then sent back to the mobile application which plays it to the user.

This thesis covers how we have implemented three versions of this type of recommender system and the re-
sults of the different approaches. The thesis starts with a brief background based in current research and moves
on to explaining further why this project is relevant. The following sections in the introduction consists of the
purpose and scope of the project, explaining the goals and delimitations when developing the system. Chap-
ter 2 consists of the theory behind the ideas, concepts and algorithms applied in this project. In chapter 3 the
implementation is described with detailed descriptions of all parts of the system as well as how they are all con-
nected. In chapter 4 the results of the project is presented. In chapter 5 the results are discussed and analyzed,
and the project as a whole is discussed. In chapter 6 the discussion is narrowed down to form a conclusion
encompassing the entire project.

1.1 Background

The volume of data on the web is growing at an increasing rate and filtering through the data can be overwhelm-
ing, thus the need for personalized recommendations is greater than ever. With the rise of the Internet of Things

1


and because a growing amount of people are connected to the internet, a large amount of information can be
gathered. For instance, the GPSes in our cars and phones gathers information about our location and movement
patterns. Another example is that phones together with pulse monitoring devices, such as smart watches, or
connected medical devices gather medical data such as heart rate or blood pressure. Even data of eating habits
can be gathered by smart fridges and kitchen appliances. This continuously changing information can infer
activity and mood which affect our preferences on a real-time basis.

Recommender systems today are usually based on a user’s feedback history regarding a specific group of items.
Such a service presents its users with a number of recommendations but will give the same recommendations
regardless of the users’ current activity, mood or location. Attempts at developing recommender systems that
take contextual information into consideration have been made. Two examples of this includes systems that
considers the day of the week, location and the user’s companions, as well as systems that are aware of the
user’s activity. Both approaches have shown success compared to systems that do not consider these aspects
[38] [36]. Therefore it is likely that taking the aforementioned parameters into account which suggests the user’s
mood and activity gives more accurate recommendations.

Real-time data changes constantly but data connected to a person based on real-time parameters often fol-
low certain patterns. Humankind is a creature of habit and as such we do not tend to deviate from these patterns
randomly. As real-time data can change quickly and an algorithm based on real-time data must be efficient.
We want recommendations that are appropriate for the current state and not previous states. Machine learning
techniques such as neural networks are currently the focus of a lot of research, and they are also becoming
increasingly popular in the field of recommender systems [33]. They can not only handle the ever growing
amount of data, but due to the learning algorithms, actually improve in quality in correlation to the amount of
data considered. As the amount of data has grown and computational power of computers has increased, ma-
chine learning has become a far more viable field than it has been historically. The algorithms are designed to
find patterns and therefore a feasible choice for a recommender system that incorporate real-time data parame-
ters. As the interest in the field of machine learning has increased developer tools and libraries have been created
which has led to machine learning becoming a more approachable subject for developers without expertise in
machine learning.

Real-time sources of data are especially relevant for music recommendations because the correlation between
our music preferences and our current emotional state is high. Certain songs or types of music can alter our
moods in different ways and our preferences regarding music are therefore often connected to our mood [20][3].
Music choices also relate to the listeners current activity. Although we might have a certain taste in music, our
preferences will change according to what we are doing. For instance a person will probably listen to different
songs when they are working out at the gym compared to when they are trying to fall asleep at night [8].

Our heart rate is connected to both our emotional state and our activity level and studies have shown that
listening to different types of music affects a person’s heart rate [35] [12]. This, together with the fact that heart
rate is an easy to access real-time parameter makes it an optimal source of real-time data to incorporate into this
project.

1.2 Heart Rate and Music

There are several studies on how heart rate is affected when listening to music with different tempos which
indicate that heart rate changes according the tempo of the music your are listening to [35] [22] [20]. The fact
that a user’s heart rate is increased when listening to high tempo music is not an indication of that the users
music preferences is changing. However, there is additional research that indicate that there is a relationship
between a person’s heart rate and preferred tempo. In 1995, Iwanaga M found an harmonic relationship between
the heart rate and preferred tempo that showed that the preferred tempo increased as the heart rate increased
[25]. Several studies has since then been made on preferred music tempo while exercising and the results are

2


quite conclusive and support the previously mentioned relationship between heart rate and preferred tempo [8]
[21]. The intent of our system is not to change a user’s heart rate but instead use it as one of the points of data
we base our recommendations on.

1.3 Purpose

The purpose of this project is to create a service which recommends music to a user. The system gives user-
specific recommendations using machine learning and real-time data, mainly the users current heart rate but
also the time of day. The project also serves to examine a few machine learning techniques to find which gives
the most accurate prediction for this specific problem. To be able to examine the machine learning techniques in
a meaningful way, extensive research on the application of machine learning in recommender systems has been
done. In an effort to examine different machine learning techniques, three techniques has been implemented
and evaluated in this project.

1.4 Scope

We have used an already existing streaming service in order to retrieve and play music as it is not in the scope
of the project to develop our own music streaming service. We decided to use Spotify because of their extensive
API:s and SDK:s which are easy to use. Spotify also offers access to a large amount of audio feature data
connected to the music which we retrieve and use in the recommender system.

On the other hand, by using Spotify we affect how our application functions. For instance, we use the song
ids associated with the music which we retrieve from Spotify. The song ids are used to identify the songs, this
leads to problems in case we would want to switch to another streaming service in the future. Other services
most likely generates their song ids in a different way which would mean that we would have to rewrite the part
of the code that connect a song id to a specific song. As we also use the audio features provided by Spotify,
using another service would require us to change how we get audio features. Exactly what changes would be
necessary depends on how, and if, that service provides audio features. Another downside of using Spotify is
that they are utilizing a paywall, i.e. the user needs to own a Spotify "Premium"-account to be able to play the
music in our application. Lastly, to be able to perform experiments we have decided to restrict the amount of
songs played by the app to a single playlist containing 1124 tracks. The songs are from a wide range of genres
and types of music so that it mimics the entire Spotify library. A smaller library of songs allows us to easier
see whether the recommender system is learning and in what direction which would be difficult with the huge
library of music that Spotify has.

When it comes to the mobile application we decided to only develop a mobile application for the iOS-platform.
The reason for this is because of availability and because it’s not currently in the scope of the project to develop
and maintain two separate mobile applications. On the other hand, the backend will in no way be limited to iOS
only which makes it possible to develop similar mobile application to other platforms in the future. To measure
heart rate we are using the Apple Watch due to both availability and because of how the functionality of the
watch connects with an iPhone as it allows us to easily get the current heart rate. Even though the Apple Watch
does not normally measure heart rate continuously it is possible to force the clock to do so.

As we do not have an existing data set of user data that is applicable for this problem and only limited re-
sources for creating our own user data, the amount of data will be limited. Machine learning algorithms often
grow more accurate when there is more data to process and our lack of data might affect the performance of our
algorithms. In an effort to mitigate this problem, the recommender system will only be predicting three audio
features namely mode (major or minor), loudness and tempo. Training the recommender system for fewer audio
features will require less data and make it possible to get more accurate results despite our lack of data. The
main focus of the project is to implement the recommender system and evaluate how different machine learning

3


techniques would perform in a larger system.

There are several other real-time parameters that could be taken into account apart from heart rate and time
in order to possibly achieve better music recommendations, for instance the users location. However, we have
decided to not implement more than specifically heart rate and time based on the assumption using more param-
eters would not increase the accuracy enough to warrant the time needed for such an implementation. As one
of the goals of the project is to test a few different machine learning techniques there will be three implemen-
tations of the recommender system which are contextual bandit, deep neural network and linear regression. We
decided to use these three algorithms because of their wide use, both in general and in recommender systems in
particular, and because they are rather different from each other.

4


Chapter 2

Theory

To give a deeper understanding of the subject, this chapter will present recommender systems in general as
well as the most common filtering approaches they use. Methods of evaluating recommender systems will also
be presented. The recommender systems in this project builds upon content-based filtering, but collaborative
filtering is also explained for further depth and because it is relevant for our discussion. Due to our lack of users,
collaborative filtering can not be implemented in a meaningful way although we would have liked to implement
this alongside content-based filtering.

We have used three different machine learning techniques to implement our recommender system, namely a
deep neural network, a contextual bandit and linear regression. This section will introduce these techniques and
explain the theory behind them as well as give an understanding as to why we chose to implement them into
our system. This is done by presenting examples of successful uses of the three techniques in recommender
systems and by highlighting their advantages and disadvantages in such systems. Both contextual bandit and
linear regression are based on content-based filtering while the deep neural network utilizes both content-based
filtering as well as collaborative filtering. However the collaborative filtering plays a small role even in the deep
neural network and the system is mostly built around content-based filtering as previously mentioned.

2.1 Recommender Systems

A recommender systems might gather a user’s history and other information such as gender, location, age and
other user specific data to be able to give personalized item recommendations to each user. The items could
be anything that a user could buy, use, read, watch or listen to. Recommender systems are used by numerous
companies featured on the web today that aim to sell products or services, including retailers, media services
and social media companies.

Recommendations can be given based on information collected via indirectly monitoring a user and its brows-
ing behaviour, for instance personalized ads based on previously visited sites. Other more direct ways to give
recommendations are, for instance, built upon a user’s explicitly stated review of a product. For instance, if a
user gives a movie a high rating the recommender system would recommend the sequel. Echo Nest, which is
a company that works with music recommendation, looks at both analysis of the actual music as well as text
analysis where they look at how artists are mentioned around the internet [40]. The recommender system in this
project is built around the analysis from Echo Nest, as Spotify uses Echo Nest for their music analysis, as well
as real-time data which is used to personalize the recommendations further. Web-based services and companies
can increase their revenue if the recommender system works well by presenting other relevant items to the user
so that the user will buy more, stay longer or in other ways keep utilizing the service provided. If recommender
systems are well constructed they benefit not only the companies behind them but studies show that users also

5


find accurate recommender systems helpful [28].

There are two major classes of filtering items for recommendations namely collaborative filtering, which takes
into account the users’ histories and their similarities, and content-based filtering, which as the name implies,
recommends items based on the contents of the item such at attached labels and features of the item [15].

2.1.1 Content-based Filtering

Content-based filtering draws from the idea that if a user is interested in an item, the user is also likely to be
interested in similar items. As such content-based filtering uses labels and attributes to filter and group items to-
gether. Depending on the type of the item, different attributes can be extracted and items with similar attributes
are grouped together. This way, only the history of the target user is needed. The content of the items in the
user’s history is matched with the content of the other items in the domain and the items in the domain with
highest similarity to the ones in the user’s history are recommended. One instance where content-based filtering
is commonly used is when recommending news article, web sites or other text based items. This is because it is
easy to extract words from textual data and apply labels for filtering [15].

The main challenge of content-based filtering is extracting relevant information about the items in such a way
that it can provide good recommendations. New techniques to extract information are constantly being created
which is makes content-based filtering a viable choice for recommender systems. As an example, Spotify has
looked into extracting the frequency ranges of parts of songs and comparing them to find similar sounding songs
[10]. Another problem with content-based filtering is to determine what features are relevant and how they are
weighted against each other for a specific set of items or users. When recommending a movie to a user based
on a previously watched movie, the genre and movie director might be more relevant than the release year or
the length of the movie.

Content-based filtering does not have the same issues with new items that collaborative filtering has which
will be discussed in the next section. However a downside of content-based filtering is that the system will
recommend songs that are similar to what you already listen to and as such will not recommend songs that are
different from those tracks. For a new user that has only listened to a few songs, the recommendations will be
influenced solely on these few listenings and as such might be very predictable and unsatisfying for the user.
Although if the extracted content is diverse enough and based on a lot of different parameters the system might
find connections between songs that do not sound very alike to a human ear but might still attract the same
listeners.

To complement user id and real-time parameters the project utilizes content-based filtering to find similar tracks.
The challenge of extracting data does not apply to this project since Spotify provides a vast number of features
connected to a track which will be used as the content used to find similarities between songs in our system.
However, we will only be able to use three of the features, namely loudness, mode and tempo, because we
do not have enough user data. It is not possible to know if these features are the optimal ones to describe a
track although tempo and loudness have been shown to have a connection to heart rate [13][35]. Furthermore,
tempo and mode have been shown to infer how we perceive a song as "happy" or "sad" [16]. Because of the
aforementioned reasons, we made the assumption that these features are well suited for the approach of this
project.

2.1.2 Collaborative Filtering

A system that uses collaborative filtering operates on a user-item matrix to find correlations between users based
on their preferences regarding the items in the system. There are different ways to gather preferences, either
explicitly by feedback given from the users such as "like/dislike-buttons" or numerical ratings, or implicitly
gathered information like number of views, listenings or clicks. The system builds on the notion that if two

6


Figure 2.1: An example of how content-based filtering works.

users have similar ranking histories they are likely to have similar preferences in the future as well. For instance,
if user X and user Y have ranked a lot of items similarly the connection will be high between them and when
user X gives a new high rating on an item that user Y has not yet evaluated the system will recommend that item
to user Y. A group of users with similar taste is usually called a neighbourhood and by observing the neighbours
rankings and user history, predictions and recommendations can be made. The similarity of two users could be
calculated using Pearson’s correlation coefficient [39]:

ρX ,Y =
∑i(Xi− X̄)(Yi− Ȳ )√

∑i(Xi− X̄)2
√

∑i(Yi− Ȳ )2

Where Xi and Yi are the ratings of user X and user Y regarding item i and X̄ and Ȳ is the mean value of their
ratings. A prediction whether user X will like or dislike an item is based on the weighted average of the neigh-
bours’ recommendations of this item.

Collaborative filtering methods can also be used to find correlations between different items as opposed to
different users. User X gets recommended items which have a high correlation to the higher ranked items in
user X’s history. Amazon’s patented recommender system which has been in place since 1998[24] use item-
based collaborative filtering to group items which are often bought together. This way multiple items can be
recommended to new customers.

A problem with collaborative filtering approaches is the so called cold start problem where not enough data
is available for a user or an item. When a new user is added to the system it will not initially have enough
ratings for the system to find sufficiently similar users and thus the accuracy of the predictions will be limited.
Another example of cold start is when a new item is added to the system. The item will not have any ratings
so it won’t be recommended to any users. There are many ways so solve this problem, asking new user for
initial ratings or recommending the most popular items while gathering more information are two approaches
to overcome the new user cold start problem. An approach to solve the new item problem is to implement
content-based filtering in a hybrid approach which will be discussed in the coming section. However the cold
start problem for new users exists for content-based filtering as well because a new user will not have any songs
from which to filter and recommend music.

As mentioned in the scope, our system does not rely on collaborative filtering as we do not have the user
base to support using this way of filtering. The only one of the three implementations that uses collaborative
filtering is deep neural network and even then the usage is limited. Preferably we would have used collaborative
filtering together with content-based filtering and the real-time data however this proved not to be viable.

2.1.3 Hybrid and Other Approaches

Collaborative Filtering and Content-based Filtering each have their own advantages and disadvantages as previ-
ously discussed. A widely used approach is to use these two filtering approaches together in a hybrid approach

7


Figure 2.2: An example of how collaborative filtering works.

to gain the advantages of both approaches and mitigate their disadvantages [15]. As previously mentioned a
hybrid approach mitigates the problem in Collaborative Filtering when a new item is introduced by utilizing
content-based filtering for that item. An advantage over content-based filtering is that the collaborative aspects
of the systems can provide a wider range of recommendations as it is not limited to the features of a specific item.

A solution to the cold start problem for a new item in collaborative filtering is to apply a hybrid recommender
system that utilizes a content-based filtering approach on the new items. This way the system can group the
new items together with other items with existing ratings based on other properties than user interactions and
ratings. This grouping can then be used to recommend the new items to users who are likely to be interested in
them.

A hybrid approach also allows for more parameters to be taken into consideration for each recommendation.
Other filtering methods such as demographic filtering, which is a type of collaborative filtering that groups users
in the same demographic together, can also be implemented into such a system. Our recommender system fil-
ters both content-based as well as using real-time data and therefore would qualify as a hybrid approach. As
previously mentioned we would have liked to use a collaborative approach as well to try to mitigate some of
the mentioned issues with content-based filtering. Using real-time data does not eliminate the problems with
content-based filtering but is intended to refine the recommendations further by personalizing them based on
the current state of the user.

2.2 Machine learning

In this section the three implementations of our feature recommendation system will be introduced and dis-
cussed. The theory behind the implementations as well as why we chose to implement those specific machine
learning techniques will be covered.

2.2.1 Deep Neural Networks

A neural network is a set of algorithms, designed to find numerical patterns in large sets of data. The patterns are
used to classify and cluster the data which in turn can be used to classify new incoming data based on patterns
from previous data. The network is trained by considering large amount of data, called training data. The more
data you can feed into a neural network, the more accurate that network will become. A neural network consists
of an input layer, an output layer as well as one or more hidden layers. The hidden layers are layers of nodes,

8


called artificial neurons, where the computations are made. The programmer does not interact with these layers
which is why they are explicitly called hidden. Instead the neural network tweak the nodes and connections
between said nodes based on training and it is this principle that allows the network to learn on its own. A deep
neural network [17] is a neural network with more than three total layers meaning that it has at least two hidden
layers. See figure 2.3 for an overview of the layers in a deep neural network.

Figure 2.3: A deep neural network with two hidden layers

The nodes perform all computations in a neural network. They receive input either directly from the data
set, if the node is in the input layer, or from other nodes in the network. The inputs are then weighted and the
product of all inputs and their corresponding weights are summed. This sum is then used in an activation func-
tion that is what ultimately decides how much the node will be activated. There are several different types of
activation functions and which one that is the most optimal depends on what kind of data you have, but almost
all of them are non-linear. The output from the node is then either passed to another layer of nodes or used as
part of the output from the neural network if the node is in the output layer. See figure 2.4 for an overview of
how a node works.

Figure 2.4: A diagram over a Neural Network Node

A neural network trained on labeled data, which is called supervised learning, can be used to classify unla-
beled data. As an example, a neural network trained on a data set with pictures of human faces, labeled with
either ’smiling’ or ’not smiling’, could be used to give an accurate prediction on whether an unlabeled picture
of a human face is smiling or not. Supervised learning works by giving the network an input and asking for a
prediction. The labels are then used to determine whether the network was correct or not. If the prediction was
correct the weights that allowed the network to make a correct guess are increased and if the prediction was
incorrect the weights are lowered to make that guess less likely.

9


Neural networks can also be trained on unlabeled data, also known as unsupervised learning. Since the neural
network has no labels to work with, it clusters the data instead of classifying it. This can then effectively be used
to detect similarities or anomalies in large data sets. Using the same example as above where an unsupervised
neural network is trained on a data set of human faces, this time with no labels. The neural network would then
create multiple clusters for the faces, each face belonging to several clusters at the same time. By inspecting
those clusters you could possibly identify labels by inspecting at the data in each group. Groups could be any-
thing from ’smiling’ or ’brown-haired’ to things much harder to classify for a human researcher. The advantage
of supervised learning over unsupervised learning is that the training for the network is predictable. The net-
work will be able to classify exactly what you trained it to classify as opposed to unsupervised learning which
makes its own groupings that can be unpredictable. However the largest downside of supervised learning is that
it requires a large amount of labeled data which is often hard to get and time-consuming to make as it requires
a person to look through the data and add appropriate labels.

Due to their ability to classify and make predictions as well as clustering items, deep neural networks are
becoming increasingly popular to incorporate into recommender systems [33]. For instance, the recommenda-
tion service on YouTube has improved since utilizing neural networks, which consider both user history and
content of the videos to give personal recommendations [27]. In music recommendations deep neural networks
have been successfully used to cluster songs together due to audio features[1] to improve content-based rec-
ommendations as well as incorporate music content in collaborative filtering approaches to avoid the cold-start
problem for new tracks [9][14].

There are several different way to implement a neural network such as multilayer perceptron, autoencoder,
convolutional neural network, recurrent neural network and several more [17]. How each of these implemen-
tations work will not be discussed in this paper. Each of these implementations are viable implementations to
be used within a recommender system, there is not one single method to use for all problems as each have their
own advantages and disadvantages [33]. From our research, we could not find anything indicating that a specific
network would perform better for our problem. Therefore a basic feed forward deep neural network was chosen
for evaluation.

In our project, we have evaluated DNNs trained on labeled data. The networks’ features are the user data,
and its labels are the values of a song feature such as tempo, loudness or mode. The intention is that with
enough data our DNNs will be able to accurately predict what tempo or mode a user prefers at a specific heart
rate and time of day.

2.2.2 Reinforcement Learning and Contextual Bandits

Reinforcement learning (RL) is a type of machine learning in which an agent is trained to maximize a reward
by observing the environment, its state and then taking some kind of action. This can be modelled as a Markov
decision process (MDP) where we have a set of agents and environment states called S, a set of possible actions
A, Pa(x,y) which is the probability that you go from state x to state y with action a and Ra(x,y) which is the
immediate reward gained from going from state x to state y with action a.

The goal of a RL agent is to maximize the cumulative reward for a set of MDPs in which the reward is not
known before the action is taken. A RL agent is therefore using a combination of both exploratory random ac-
tions combined with learned and estimated reward probabilities. The balance between these random exploratory
actions and using the current knowledge is the key to having a high performing RL agent. The exploratory ran-
dom actions help prevent the agent from getting stuck on a single action if said action gives a good reward. For
instance, if the agent takes an action and a good reward is given it might not try another action in the action space
that would give an even better reward. This is what makes RL agents well-suited for recommender systems [4],
their exploratory actions enables the agent to explore.

10


RL differs from supervised learning in that it takes a reward which can be a number in a range. In super-
vised learning a prediction is either correct or incorrect while in RL the reward is a number signifying how good
the prediction was. One type of RL is the multi-armed bandit where the goal is for the network to learn which
of the arms give the best reward or payout. The multi-armed bandit problem can be explained as k one-armed
bandit slot machines that all have a different hidden chance of winning. The task is to find the slot machine that
has the highest chance of winning. A RL agent is a viable choice for solving such a problem by testing each
of the arms in a random manner and then trying the arm that has given the best reward. However this problem
does not take the environment into consideration and instead focuses solely on maximizing the profit of a static
set of rewards. Real world problems rarely have a static set of rewards and the environment or context plays
a role in what the reward of an action is. An example of an agent that takes actions based on the environment
is the contextual bandit, most commonly used to recommend news and articles on websites [23]. A contextual
bandit problem can be stated as: we have k one-armed bandits with a reward chance that changes depending on
the time, these one-armed bandits represent the actions we can take and the time represents the context and state
of our environment. From reading the state the bandit can then make predictions for the most suitable action
to take for that specific state i.e. which bandit the agent should pull based on the rewards received for previous
actions taken in the same context. This is accomplished similarly to the multi-armed bandit problem by testing
actions for each state and then learning which action gives the highest average cumulative reward for each state.

The math behind a contextual bandit is as follow:
A contextual bandit runs t = 1,2, ...,T rounds, the context for our environment is xt ∈ X . Based on the context
the bandit selects an action at ∈ A and the environment returns with a reward rt ∈ R.
The goal of the bandit is to maximize the cumulative reward ∑

T
t=1 rt . To accomplish that the bandit has a set of

policies Π⊆ {X → A} and attempts to find the policy π ∈Π that grants a the largest reward.

Figure 2.5: Contextual bandit process

We decided to implement a contextual bandit as the contextual bandit problem is similar to the one in our
project. The problem involves that we have a state which is determined by the user’s current heart rate and the
time of the day, and we want the system to determine the best audio features for this state. Another advantage of
the contextual bandit is that it mitigate the issue of using content-based filtering to some extent, as it is capable
of taking random action which increases it chances of finding an optimal set of features.

2.2.3 Linear Regression

Linear regression is used to model the relationship between an independent variable x, and one or more depen-
dent variables y. This is done by fitting a straight line to observed data in a way that minimizes the errors, as
can be seen in figure 2.6. For instance if we want to find the relationship between a person’s height and weight
we can consider observed data and plot the weights depending on the heights and find a linear equation with the
aim that the distance from the data points to the line is as small as possible. The standard approach to find the
best fitted line ŷ = b0 +b1x is the least square method which minimizes the sum of the squared offsets from the
data points to the line. To use linear regression for predictions the fitted line is used to look up corresponding
ŷ-values to an input value where no previous data exists. In recommender systems linear regression models
have been used to find relationships between different users in collaborative filtering approaches and use these
relationships to efficiently make predictions [5] [37]. We chose to implement a linear regression algorithm into

11


Figure 2.6: Example of linear regression

our recommender system because it is widely used and it would be interesting to compare a linear approach to
the other non-linear ones.

2.3 Evaluation Methods

There are many approaches to evaluate recommender systems such as user studies, offline evaluation and online
evaluation. In this section the two evaluation methods online and offline evaluation will be described, including
their advantages and disadvantages, as we chose to attempt conducting these method to evaluate our system.
However due to the low amount of user data accessible in this project online evaluation did not prove viable,
which will be discussed in the section about online evaluation.

2.3.1 Offline Evaluation

An offline evaluation is conducted by testing a system on a number of simulated or manually created users. By
simulating a user you are able to evaluate your system on a specific hypothesis. To create and simulate real user
behaviour the offline user is often created on data gathered from real users. Thus the evaluation relies heavily on
data gathered from real users to be able to simulate a behaviour as close to real user behaviour as possible [41].
Offline evaluation is cheap and easy to conduct since it doesn’t require real users and is thus often conducted in
the earlier stages of development. Offline evaluation of recommender systems can for instance be conducted in
an early stage to compare two algorithms and find the most appropriate one. [31].

When conducting an offline evaluation there are three common steps [31]:

Hypothesis:
Similar to other evaluation systems it is important to create a clear hypothesis that presents the test with a clear
purpose in the first step. This could be comparing algorithm A and B to determine which algorithm gives the
best recommendations.

Controlling Variables:
It is important that only the variables being evaluated are changed and all other variables remain static. So for
our previous example of comparing two algorithms, if the algorithms were trained on different data it would be
difficult to prove that the result was not affected by the data the network was trained on.

Generalization Power:

12


Because we evaluate our system in a closed environment we are only answering a narrow set of questions and
thus it is important to make various evaluations to cover all possible scenarios.

A disadvantage of offline evaluation is, as mentioned, that it can only answer a narrow and predefined set
of questions. Another disadvantage is that it is not possible to ask a specific question and monitor why a user
chooses to make a specific choice or give a certain answer, because a simulated user can not be asked about
its intention or reasons for making a specific choice. The advantage of this evaluation system is that it is much
cheaper and faster than other evaluation methods [31].

2.3.2 Online Evaluation

A common method for measuring a recommender system’s impact on user behaviour is through using a method
called online evaluation. When conducting an online evaluation the user is put in contact with a functioning
version of the recommender system to evaluate, and can therefore receive actual recommendations from the
system. During the experiment, the user’s actions are recorded and processed in order to evaluate the system.
According to some researchers the only true way of measuring user satisfaction is through online evaluation
[41][7].

User behaviour can sometimes be hard to measure, as the effect of a recommender system is dependant on
plenty of different variables, such as the user’s intent (i.e. what their needs are), the user’s context (i.e. what
their past experiences look like), and the looks of the system (i.e. the user interface) [32]. For this reason, when
you perform an online evaluation of a system you often run multiple versions of the same systems in order to
see how the different systems perform compared to each others, as it can be hard to interpret numbers alone.
This type of evaluation method is often referred to as A/B-testing [32]. When conducting an A/B-test it is of
importance that the users are assigned at random to a recommender system to test as you otherwise risk getting
a biased result [32].

Figure 2.7: Example of an A/B-test

Advantages of online evaluation in comparison to other evaluations methods include that it is capable of di-
rectly measuring overall system goals, such as user retention or click-through-rate. This allows the person
conducting the experiment to understand how system goals are affected by properties like the accuracy and
diversity of recommendations. On the other hand, a disadvantage of using online evaluation is that it can be
hard to gain a complete understanding of these relationships as properties are varying independently. Online
evaluations are also relatively expensive in contrast to other methods. [32]

In this project online evaluation was supposed to be used in order to evaluate how well the system recom-
mends songs to the users, but due to a lack of data a proper online evaluation could not be conducted even
though it would have been the best method for evaluating the system.

13


Chapter 3

Implementation

In this chapter the implementation of our entire system is described. Each subsystem and all components is
described in detail as well as a system overview which explains the main flow and how all parts of the system
is connected.

3.1 System Overview

The system consists of a server and one or more clients connected over internet. The server keeps a database of
user data and server statistics, a web server that manages client communication, and the recommender system
that is used to give song recommendations to the connected clients. The recommender system contains two
significant parts: a set of feature recommenders and a ranking system. A feature recommender returns a value
for a specific audio feature based on user data. While the ranking system recommends a set of songs based on
these features. Each client in the system is a mobile application that is used to play the recommended music and
measure the user’s heart rate. In figure 3.1 an overview of the system and the data flow directions can be found.

Figure 3.1: An overview of the different parts in the recommender system

14


3.2 Data

We have two large data sets that our recommender system uses as its base for its recommendations. The first is a
set of songs, each song is paired with its respective tempo, mode and loudness, i.e. the features that we selected
as variables for our content-based filtering. At the initial stage of the project, a set of 1124 songs was selected.
The songs were chosen from a variety of different playlists to get a broad mix of artists and genres in order to
mimic the Spotify library, although in a much more compact form. Spotify uses a unique song id to identify
each song. We also use this id to represent the songs in our system. The audio features for all of these songs
was downloaded from Spotify on the 17th April 2018 using their developer API [34] and saved to the database
on our server.

• Tempo - The average estimated tempo of a song measured in beats per minute.
• Mode - The mode of a song which can be major or minor, represented by 1 and 0 internally.
• Loudness - The loudness of a song measured in decibels averaged over the entire song, ranges between

-60 and 0.
The second data set is our user data. We have 233 data points from two real life users, that has been gathered
over several months. There is also three automatically generated data sets with 500 data points each that has
been used to evaluate the different recommender systems. Each data point represents a song being played or
skipped. They are created when a user has finished listening to a song and consists of the variables below.

• User id - A unique user name
• Heart rate - Average heart rate for the user during the song
• Time value - The number of minutes from midnight. Used to identify what part of the day it was when

the song was listened to.
• Song id - A unique song identifier
• Rating - The time the user listened to that song in percentage of that songs total time. Full skip is a ’0’,

half the song is ’0.5’ and the entire song is ’1.0’

15


3.3 Mobile Application

The mobile application is the part of the system which the user interacts with. The application mainly has two
functions: gathering information from the user and to act as a music player. In order to achieve this the applica-
tion has been divided into three parts: data collection, server communication, and music playing.

Figure 3.2: Flow chart of the mobile application

The application gathers both direct and indirect feedback from the user. Direct feedback comes from user
actions such as starting a session or skipping a song. Indirect feedback on the other hand comes from active
measuring the user’s behaviour in the application, which includes continuous measuring of the user’s heart rate
and the time the user listens to each song. In order to measure the heart rate of the user the mobile application
is in need of a smart watch and a corresponding application for the watch. In the current implementation the
mobile application is built for iOS 11 (the latest version of the iPhone- and iPad operative system) and the type
of smart watch used is an Apple Watch with a watchOS-application (built for watchOS version 4) installed.

There are two scenarios where the mobile application and the server communicate. The first scenario is when
the mobile application requests a song recommendation from the server in order to play it to the user. This
scenario plays out every time the user starts a new listening session and when the currently playing song has
ended. The second scenario is when the mobile application sends feedback on a song to the server, this happens
when has song has ended, either because the song has played through, the user decided to skip the song, or the
user stopped playing music all together.

Measuring the heart rate continuously isn’t a pre-built feature for iOS, so in order to accomplish this the iOS-
application needs to implement a few things. First, the application needs to establish means of communication
between the phone and the watch, this can be achieved by using the in-built messaging service for iPhone and
Apple Watch, which enables us to send simple messages between the two devices. Using the service we then
send a message to the watch to start measuring the heart rate of the user. The actual measuring of the heart rate
is accomplished by starting a exercise on the watch, which starts a continuous measurement of the user’s heart
rate. Then every time we receive a new measurement (approx. every 5 sec) we send back the measurement to
the phone, using the messaging service, which process this in order to send it back as feedback to the server.

The music player part of the application is heavily dependant on Spotify’s iOS framework. The framework
is used both to retrieve the songs and their data, as well as for playing the songs. On top of the framework we
have built extra functionality to be able to gather information and in order to build a user interface where the
user can control the music player.

16


Figure 3.3: A screenshot of the mobile application’s user interface.

CPU RAM HDD
Server i5-4570S @ 2.90GHz 16 GB DDR3 1600 MHz 1TB 64MB Cache SATA 6.0Gb/s

Table 3.1: The hardware specification for the system server

3.4 Web Server

The web server’s task is to handle client requests, forwarding them to any relevant back-end system and sending
a response back to the client once it’s been calculated. The server is built with the web framework Django
version 2.0.2 [11] with Python 3.5.2 and it runs on the hardware found in table 3.1.

There are two types of client requests that the web server is configured to accept. Data collection and song
recommendations. A data collection request is initiated from a client when a song has finished or has been
stopped. It is sent as a http POST request with a JSON-encoded data package containing user data as described
in section 3.2. When the server receives the request, the packaged data is parsed with a serializer and saved to
our database if the contained data is valid. The client then receives a standard http 201 response if the data was
valid, otherwise a http 400 response is sent to inform the client that no data has been saved.

A song recommendation request is initiated from a client when a new song needs to be played. The request
comes in the format of a http GET request with a JSON-encoded data package containing the user id, heart rate
and time value.

Once the data is deemed valid the web server then:
1. Checks if a new trained feature recommender is available

(a) If that is the case it then switches to most recently trained model
2. Checks if the song cache for that user is empty

(a) If the cache is empty, it asks the feature recommenders for the recommended song features
(b) Then use the features to populate the cache with the top songs from our ranking system that match

17


those features, adjusted for user history
3. Pop a song from the users cache and returns it to the client.

A sequence graph for how a song request is handled with an empty cache and no new trained model can be
found in figure 3.4.

Figure 3.4: A sequence graph for a song recommendation with an empty server cache

3.5 Database

All data that requires persistency is saved in a postgreSQL 9.5.12 [29] database that runs on the same server as
the web server and the recommender system.

The database has tables for:
• User listening data - User data as described in section 3.2. Used by feature recommenders for training

and by the ranking system to calculate user bias.
• Song audio analysis data - Song feature data as described in section 3.2. Used by the ranking system.
• User information - How many songs a user has played and a unique index for each user. Used by the web

server and the ranking system.
• Django - Data required by Django. Used by the web server.
• Last played - Table to track the last time a user played a specific song. Used by the ranking system.
• Action ids - Table to track the actions taken by the contextual bandit so they can be correctly paired with

the corresponding reward at a later time.

18


3.6 Recommender System

The recommender system consists of two subsystems where the output from the first system becomes the second
systems input. The first subsystem is our three feature recommenders. Each feature recommender takes user
information as the input and returns with the recommended song feature that it’s been configured for. We
have three different sets of feature recommenders, each based on a different machine learning technique and
implemented with Tensorflow which is a machine learning library developed by Google. All three has been
evaluated and trained on the same set of data. Our second subsystem is our ranking system. The ranking system
takes the output from our feature recommenders and combine it with the users song ratings and recent listens to
rank all our songs in our database based on that input. The top ranking songs are returned to the web server.

3.6.1 Deep Neural Network

The deep neural network (DNN) that was evaluted is a set of three almost identically configured DNNs, each
trained on the same user data but with different labels. The model was implemented using a DNNClassifier
estimator in Tensorflow and configured to use a gradient descent optimizer with a static learning rate of 0.01.
The feature columns are identical for all three DNNs and can be found in table 3.2. The label is either mode,
tempo or loudness for the song that was played. Tempo and loudness are bucketized in to the buckets in table
3.3 while mode was represented by either 0 or 1e. Two hidden layers, the first with 5 neurons and the second
with 1 neuron, was used for all three DNNs, which was an estimated middle ground based on the formula 3.1.

When a song recommendation request is sent to the web server, only data for user id, heart rate and time
value are sent with that request. To generate the predictions, the web server then bundles the user data with
rating 1.0 and sends it to our trained models. In order to generate song features that are paired with as high
rating as possible. Once predictions for each of our three song features has been calculated, they are returned to
our web server that forwards them to our ranking system.

Training is done either manually by calling our training script or automatically through a scheduled function
which is configured to run the training every 2 minutes on the server. The training script extracts all user data
3.2 from our database. The song id is used to lookup the mode, tempo and loudness for each entry from our
song data table. All that data is then organized in a format that our DNN model could read and the user features
was sent together with the respective labels to each DNN for training. After the training the models are saved to
a checkpoint, so that the web server can use them to instantiate new trained models to generate predictions from.

To determine the size of the hidden layers we used the following equation as a rule of thumb. It is mainly
a recommendation as there is no single way of determining the optimal size of hidden layers.

Nh =
Ns

(α · (Ni+No))
(3.1)

Nh = upper bound of hidden neurons
Ni = number of input neurons.
No = number of output neurons.
Ns = number of samples in training data set.
α = an arbitrary scaling factor usually 2-10.

19


Name Type
User id Categorical

Heart rate Bucketized numerical
Time value Bucketized numerical

Rating Bucketized numerical

Table 3.2: Feature column types used for our deep neural network and linear regression models

3.6.2 Linear Regression

A linear regression (LR) model was implemented with Tensorflow as a LinearClassifier estimator which is then
instantiated once for every song feature in our scope 1.4. The feature columns are identical for all three LR
objects and can be found in table 3.2. The label is either mode, tempo or loudness for the song that was played.
Tempo and loudness are sorted in to the buckets in table 3.3, while mode only can take a value of 0 or 1.

Predictions are made in a similar way as with our DNN model. The server receives user data, bundles that
data with a rating of 1.0 and sends that to every LR object to get the predicted song features. To train the
LR model we use the same method as mentioned in the previous section about DNN due to the similarity in
implementation of the two.

3.6.3 Contextual Bandit

The contextual bandit setup that was evaluated was a set of three contextual bandits for each user. Each bandit
was then used to predict an action that corresponds to a specific value for the song feature that the contextual
bandit was configured for.

The set of actions that was used by each bandit was the same as the buckets used by our DNN and LR configu-
ration 3.3. As an example, action 6 for the tempo bandit corresponded to a tempo of 131 to 150 BPM. The set
of states is the same for each bandit and each state a unique integer that matched to the unique combination of
bucketized heart rate and time value 3.4. The reward received is calculated from the rating of each song with a
basic reward formula:

y(x) =

{
1 if rating ≥ 0.8
−1 else

Each contextual bandit was configured as a feed-forward neural agent with a gradient descent optimizer. A pre-
diction is done by either by selecting an exploratory random action or the action with the currently maximum
weight for the state received as input. The ratio of selected actions between the exploratory and value maximiz-
ing actions was set to 1:9. Once an action had been chosen, an action id was created and sent with the request.
The id, chosen action and state is saved to a lookup table and then used in training to backtrack the rating re-
ceived to the bandits choices. Training was done either manually by calling our training script or automatically
through a scheduler that was configured to run the training every 2 minutes on the server. The training script
extracted all user data 3.2 that was not already flagged as data has been trained on from our database. The action
id’s for all the extracted data was then used to lookup the chosen action and state for each rating. All that data
was then organized in a format that our DNN model could read and the action, state and rating was then sent
to the model for training. The models was then saved to a checkpoint after the training. The contextual bandits
can also be trained on data that has no action id attached, generated by the DNN or linear regression models.
To train the bandit on that data, the state was directly calculated from the heart rate and time value in the data
point. The song id was then used to lookup the song features and the actions for each bandit was calculated by
bucketizing all the features. The bandits can then be trained as normal with tuples of action, state and rating.

20


Label/Action Tempo (BPM) Loudness(db)
Bucket 0 0↔ 30 −60↔−20
Bucket 1 31↔ 50 −19↔−18
Bucket 2 51↔ 70 −17↔−16
Bucket 3 71↔ 90 −15↔−14
Bucket 4 91↔ 110 −13↔−12
Bucket 5 111↔ 130 −11↔−10
Bucket 6 131↔ 150 −9↔−8
Bucket 7 151↔ 170 −7↔−6
Bucket 8 171↔ 190 −5↔−4
Bucket 9 191↔ ∞ −3↔−2

Bucket 10 - −1↔ 0

Table 3.3: Label and action buckets for tempo and loudness

Bucket Heart rate Time value Rating
Bucket 0 0↔ 40 300 < t < 660 0↔ 0.2
Bucket 1 41↔ 60 659 < t < 960 0.21↔ 0.4
Bucket 2 61↔ 80 959 < t < 1320 0.41↔ 0.6
Bucket 3 81↔ 100 1319 < t or t < 300 0.61↔ 0.8
Bucket 4 101↔ 120 - 0.81↔ 1.0
Bucket 5 121↔ 150 - -
Bucket 6 151↔ 180 - -
Bucket 7 181↔ ∞ - -

Table 3.4: Heart rate, time value and rating buckets

3.6.4 Linear Regression

A linear regression (LR) model was implemented with Tensorflow as a LinearClassifier estimator which is then
instantiated once for every song feature in our scope 1.4. The feature columns are identical for all three LR
objects and can be found in table 3.2. The label is either mode, tempo or loudness for the song that was played.
Tempo and loudness are sorted in to the buckets in table 3.3, while mode only can take a value of 0 or 1.

Predictions are made in a similar way as with our DNN model. The server receives user data, bundles that
data with a rating of 1.0 and sends that to every LR object to get the predicted song features. To train the
LR model we use the same method as mentioned in the previous section about DNN due to the similarity in
implementation of the two.

3.6.5 Ranking System

In order to complete the recommender system, a ranking system is introduced with the task of sorting all songs
in our database. This sorting is based on the recommended song features from the feature recommenders
together with the user id for the user making the request. With these as inputs, the algorithm then proceeds to
iterate through all songs in the database, placing all of them in a dictionary and giving each song a weight value
between 0 and 1. The weight values are calculated based on how well certain features of the songs correspond
to the desired feature values output by the recommender system. Each feature gets a value between 0 and 1, and
the total weight is then calculated by adding all the feature weights together and dividing them by the number
of features used. The individual feature weights are calculated in different ways. Since mode only can assume a
value of either 0 or 1 (minor or major), the mode weight can also only be either 0 or 1, depending on if the mode
is correct or not. Loudness and BPM can assume many different values and we want the weight for them to be

21


higher based on how close to the desired value they are. This is done using the following quadratic equation:

y(x) =

{
−0.01x2 +1 if −10≤ x≤ 10
0 else

Where y represents the feature weight and x represents how close to the desired value the songs feature value
is. This means that the feature weight for these features will be 0 if the feature values of a song is more than 10
BPM or decibel away from the wanted feature value. Figure 3.5 shows a plot of this function.

Figure 3.5: The feature weight function for loudness and BPM plotted in a graph

The weighting algorithm also takes into account how many other songs that has played since the song that
is currently being weighted was last played. If the song was recently played this is converted to a variable close
to 0. The variable then increases linearly until it reaches 1 when it was more than 40 songs since last played,
after that it stays at 1 until it is played again. The total weight of all other parts of the song is then multiplied
with this variable to get the final song weight.

Weight multiplier =

{
Number of songs since last played/40 if Number of songs since last played < 40
1 else

To get the songs weights adjusted for individual preference, a user bias is set by reading all times a specific song
has been played by the current user and calculating an average rating. For songs that have never been played by
the user, user bias is exempted and the song is only weighted on how similar it is to the requested song features.

Song weight =
Loudness weight+Mode weight+Tempo weight+User bias

Number of Features
∗Weight multiplier

If the song has never been played previously by the user, the number of features is set to 3 and user bias is set
to 0. Otherwise the number of features is set to 4 and user bias is included. This is because no user bias exists
before the song has been played and to avoid that this non-existent bias influences the weight we omit the user
bias. Once all songs have been weighted the ranking system finishes by sorting them all and returning a list of
the 10 song id:s with the highest weights.

22


Chapter 4

Results

4.1 Evaluation System

To conduct an evaluation on our networks we must train them first with large amount of user data. Since we
do not obtain that amount of data and are not able to produce it by ourselves, we chose to make an offline
evaluation. As described in section 2.3.1 an offline evaluation should be conducted on real user data and one
should avoid a biased evaluation. However due to our scarce user data we choose to a simulate a user with very
specific music preferences. In our first evaluation we created a user that listens to a song with specific attributes
at five different times’ at day. The attributes used for this user can be seen in table 4.1.

Our hypothesis for this test was to examine if the network would recognize the user’s music preferences during
different times of the day, hence the very specific music preferences to be able to measure the correctness of the
recommendation. So we trained the network on this user 500 and then plot the satisfaction of each recommen-
dation made by the network.

To evaluate how satisfying the recommendations made by the network are we created an algorithm that gener-
ates a "score" for each recommendation based on the preferences of the user. Since our user exclusively listens
to one song at a specific time with precise attributes we know what we can expect from the network depending
on what time it is. We compare the expected attributes with the recommended songs’ attributes and divide it by
3 to obtain the mean. However since each attribute have different ranges, e.g. a songs tempo could be between
40 - 210 BPM, we divide each attribute with its range. This results in each attribute being less than or equal to 1
and thus each attribute affects the score equally. However since the attribute Mode can either be 0 or 1 it affects
the score greater than the other attributes, we see this fit because of mode’s low range it would be a considerable
error from the network to get it wrong. Thus we see fit that it would decrease or increase the score greater than
the other attributes. Another complication was that the difference in range between tempo and loudness( tempo
range: 0 - -60 , BPM range : 0 - 270) was too great. This would mean that tempo would affect the outcome of
the algorithm more than tempo, however because we have "bucketized", i.e split the range of an attribute into

Time Heart rate Tempo(BPM) Loudness(db) Mode
07:00 60 126 -27.5 0
13:00 90 193 -4.9 0
15:30 120 101 -25.8 1
18:00 150 126 -5.8 1
22:00 55 151 -15 0

Table 4.1: Wanted song attributes by user during different times and heart rates

23


10 intervals, there is an equal chance for the tempo as loudness to result in half of its range.

a1 = wanted tempo a2 = recommended tempo
b1 = wanted loudness b2 = recommended loudness
c1 = wanted mode c2 = recommended mode
x = 270 = range of tempo y =−60 = range of loudness

z =
∣∣∣∣x− (a1−a2)

x

∣∣∣∣+ ∣∣∣∣y− (b1−b2)

y

∣∣∣∣+ ∣∣∣∣1− (c1− c2)

1

∣∣∣∣ (4.1)

f (z) = Score =
z
3

(4.2)

The contextual bandit will be trained using a clean database, i.e. when there is no entries in the user data
tables. While the LR- and DNN-model need to have at least one entry in the user data table in order to fulfill
their first training. As we evaluated the contextual bandit first we decided to keep its first entry for the other
models, which is the reason to that all the graphs start off the same.

24


4.2 Deep Neural Network

Graph 4.1 represent the score of each recommendation made by the deep neural network. The Y-axis represents
the score where 1 is the highest score a recommendation could receive and the X-axis represents the number
of iterations. The graph also highlights the trend of every 10th score. The table 4.2 represents a sample of the
recommended songs by the neural network.

Figure 4.1: The blue dots represent the score of a recommendation at each iteration and the red line highlights
the trend of every 10th score by the deep neural network.

Due to the large amount of songs recommended by the network we chose to highlight a sample of the rec-
ommended songs by the network. The order of the table does not have any intent. Score for each song is based
on what time they are played and "Times played" displays how many times overall the song was played at a
specific time of the day.

Time Name Tempo(bpm) Loudness(db) Mode Score Times played
07:00 Warren G - Regulate 95 -13 0 0.87 5
13:00 2Pac - I Get Around 96 -13.9 0 0.79 6
13:00 Bruce Springsteen - I’m on Fire 88 -14.5 0 0.78 4
15:30 Shallo - Lie 108 -11.6 0 0.57 5
15:30 Johann Bach - Violin Concerto BWV 1042 99 -14 1 0.93 7
18:00 Dan Hartman - I can dream about you 113 -14 1 0.93 9
22:00 Diamond D - Day One 91.5 -11 0 0.88 14

Table 4.2: A sample of the recommended songs by deep neural network.

25


4.3 Contextual Bandit

Similar to Graph 4.1 the graph below represents the score for each recommendation made by the contextual
bandit network and the trend line for every 10th score. The table 4.3 represents a sample of the recommended
songs by the contextual bandit.

Figure 4.2: The blue dots represent the score of a recommendation at each iteration and the red line highlights
the trend of every 10th score by the contextual bandit.

Time Name Tempo(bpm) Loudness(db) Mode Score Times played
07:00 Destiny’s Child - Say My Name 67 -3.5 0 0.79 89
13:00 Destiny’s Child - Say My Name 67 -3.5 0 0.77 65
15:30 George Frideric - Messiah, HWV 56 107 -16 1 0.93 64
18:00 ASAP Ferg - Plain Jane REMIX 170 -4.3 1 0.92 75
18:00 Jimi Hendrix - Hey Joe 170 -2.8 0 0.58 1
18:00 U2 - With Or Without You 110 -2.2 0 0.86 7
22:00 Anton Bruckner - Symphony No. 4 61 -19.5 1 0.82 9
22:00 D-Block - Promised Land 150 -1.7 0 0.92 60
22:00 CLC - Hobgoblin 110 -2.2 0 0.86 6

Table 4.3: A sample of the recommended songs by contextual bandit.

26


4.4 Linear Regression

Similar to section 4.2 and 4.3 the graph and table below represent the result for our linear regression system.

Figure 4.3: The blue dots represent the score of a recommendation at each iteration and the red line highlights
the trend of every 10th score by the linear regression.

Time Name Tempo(bpm) Loudness(db) Mode Score Times played
07:00 Sofia Karlberg - Blue Jeans 50 -9.5 1 0.44 6
13:00 Destiny’s Child - Say My Name 67 -3.5 0 0.79 7
13:00 Drake - Diplomatic Immunity 75 -5.2 1 0.47 5
15:30 Anton Bruckner - Symphony No. 4 61 -19.5 1 0.9 5
15:30 Destiny’s Child - Say My Name 67 -3.5 0 0.48 14
15:30 Luke Bryan - Huntin’, Fishin’ ... 77 -4.2 1 0.84 9
18:00 Monsta X - SHINE FOREVER 80 -3.5 1 0.91 7
22:00 Destiny’s Child - Say My Name 67 -3.5 0 0.8 11
22:00 VIXX - Shangri-La 77 -3.5 0 0.82 11

Table 4.4: A sample of the recommended songs by linear regression.

27


Chapter 5

Discussion

5.1 Scope

During the project we realized that, even with the use of a limited list of songs, we would run in to problems
related to a lack of users to gather data from as well as a lack of usage. A large amount of users and data is
needed to train a system like this, which is something we have not had access to. This problem was amplified
by the lack of a stable production build at several stages of the project, limiting the amount of data that could
be collected. The low amount of real life user data was the primary driving factor behind the usage of offline
evaluation.

While we early on suspected that the lack of data could be an issue, we had initially planned to use a ma-
chine learning system that could give us a predicted song id directly. This meant that the number of possible
combinations for a given rating was affected by both the number of songs in our playlist and the user data. Even
with our limited number of songs this meant that we needed

Number o f songs ·Timebuckets ·Heart ratebuckets = 31 472

31 472 number of plays to explore all combinations of songs together with our buckets for time, heart rate and
getting a rating for them for a single user. Not all of them need to be explored to get decent recommendations
since many of those combinations will always be left unexplored in real life usage. A user might never play
songs in the middle of the night or he might never exercise and thus never reach the higher buckets of heart rate.

By trying to generalize the songs in to features, the problem was somewhat alleviated. Instead of having to
play every song, we now needed to explore all combinations of features at every heart rate and time bucket.

Tempobuckets ·Modebuckets ·Loudnessbuckets ·Heart ratebuckets ·Timebuckets = 6160

The system can then also be scaled up easier, adding more songs to the database does not affect what tempo
a user prefers at a certain heart rate and time. This method of identifying what songs that should be played
comes with its own disadvantages. Generalizing songs down to the values of three different attributes is not
sufficient to classify songs properly. For instance if a user prefers high tempo music at high heart rate, he
might be recommended both speed metal songs together with speedcore songs due to their similarity, even
though he might dislike one of those genres. While the user bias we implemented in our ranking system could
identify specific songs that a user dislikes, we have no way of excluding entire genres and having to skip all
3731 speedcore songs that exist on Spotify is not a very user friendly system. More features could have been
introduced, this would have increased the amount of data required and the time to train and make prediction.
But there is also no guarantee that including more features would enable us to identify all different types of
music nor give more accurate predictions. Using more types of real time data, such as location, could have

28


improved recommendation accuracy. As an example, a user might always like high tempo music when located
at the gym, regardless of heart rate. Using that as an extra input together with the other user information might
been beneficial to the results. We did not use it as we decided that the possible gain was not worth the extra
complexity, training and data required. Static data such as age, sex and country could be interesting to introduce
but was deemed an unviable option for us due to the limited number and variation of users we had at hand.

5.2 Scaling

Training the models took up towards 12 hours at the end of the project even with our low amount of user data.
While better hardware and optimizing the code could make the process significantly faster it is unlikely that it
could be made fast enough in a real system with millions of songs and users. A long training time makes the
system close to unusable as the feedback from the users need to be implemented in to our recommendations as
fast as possible. If a user has skipped a song to indicate that he does not like high tempo music at his current
time and heart rate, the system should not continue to recommend high tempo music.

The way our ranking system is built is also a roadblock for scaling up the system. We currently iterate through
every song in our database and calculate an individual weight for them. This is already an issue with the current
number of songs in our database, the ranking takes close to 3.5 seconds which makes for a very unresponsive
application. This could most likely be optimized heavily in our code. For instance with better caching to reduce
the number of queries to our database or by excluding large parts of the song data set that is unlikely to be
ranked highly.

To make our system more responsive, caching was introduced to avoid users from making too many CPU
time expensive requests to our recommender system. This introduced another problem, by the time a user got
to the last song in his cached recommendations, his or her heart rate might have changed. This would mean that
a recommendation would be inaccurate due to change in heart rate and a very different set of songs should have
been recommended for that heart rate.

5.3 Recommendations

By having a recommender system with two different sub systems we introduced a possible weakness. If our
system returns with low-rated song recommendations to the user it is hard to pin point which part of the system
that is to blame. We could have a good audio feature recommendation but a bad ranking, a good ranking but a
poor audio feature recommendation, or both of the parts could be performing poorly. This also leads to a greater
issue when training the machine learning systems, as if the ranking system performs poorly the algorithms might
receive low rewards for good recommendations which would teach it not to give that recommendation again.
Currently, we have no proper solution to this problem as combining the two parts into one would demand
extremely large data sets and in the case of the contextual bandit probably would not work at all as it would
have too many possible actions in order to be trained properly. Another solution to the problem could involve
splitting the ranking system into multiple smaller parts where each part gets a specific task. This sort of solution
could make it easier to find malfunctions in the system but could also make it more complex as, for instance, a
poor ranking could have multiple different sources.

5.4 Results

The evaluation displays that in our application the contextual bandit preforms best. Unlike linear regression
and deep neural network the contextual bandit understands the connection between what time of day the user is
listening and preferred tempo at that specific time. The LR and DNN models recommends songs with a good

29


average score without consideration of the time. Thus leading to many repeated recommendations. Because
contextual bandit identifies this connection, we see a much larger variation in recommendations. The fact that
our contextual bandit is configured to use random exploratory actions 10% of the time also accounts for a greater
variation in song selection. One possible reason for the high average rating the contextual bandit scored might
be because our limited amount of songs and our moderate amount of iterations.

Due to using few attributes it was difficult for all network to recommend song similar to each other. The
networks were not able to understand the different genres and thus a classical song and a Hip Hop song could
receive equally high score if their song features were similar. This is displayed in the recommended song Des-
tiny’s Child - Say My Name and VIXX - Shangria-La where these two songs have very similar attributes but
are very unlike.

Our evaluation did not evaluate the influence of the heart rate on the recommendation which was one of the
main purposes of this project. This might have been possible if conducting an evaluation on real users. How-
ever this would require that we would have someone use the application for a very long time enabling the
network to have an appropriate training but due to the projects short deadline this was not possible. Another
possible solution was that one of the team members would use the application, however as mentioned in 5.1 this
would not be viable.

5.5 Workflow

At the initial stage of the project, a lot of time was spent on researching machine learning due to the group
members inexperience with the subject. This delayed the actual implementation of the product and our data
collection. While a prototype was completed in reasonable time, bugs and the lack of a stable production build
also hampered data collection and user testing. Stricbibliotter rules for submissions to the main branch of our
git repository should have been discussed and agreed upon at an early stage.

The capacity of our server also limited how fast we could iterate on our recommender system. Since training
was slow, group members often had to wait a day to see how their code changes affected the recommendations.
This was debilitating to progress when many small changes had to be made.

5.6 Ethical Aspects

There are definitely ethical aspects to consider for this project. The most obvious aspect is the matter of data
collection and data storage. Our service will handle some possibly sensitive data, such as the user’s heart rate
at a given time and the user’s music preferences. It is important that people who use the service are made aware
of which data is gathered and stored, as well as how this data is handled. Since the data that would be gathered
if our service were to be released to the public could be of interest to various third parties, it is important that
we as a group discuss our views on matters like data gathering for commercial use and make our views clear to
eventual users of the service. Since some sensitive data is to be handled, it is also important to make sure our
database and the software connecting to the database are secure and that important data is encrypted, making
it not vulnerable to data breaches. As of writing this paper, the intention is not to release the application to the
public which is why there is no concrete conclusion on how data storage and collection will be handled. We do
however have a privacy policy in use, which is referenced in appendix A.

30


5.7 Impact on Society

Better music recommendations might not change our world but the use of real time data in automated systems
is something that is becoming more and more prevalent. Using more data to give better assessments and predic-
tions is something that can be applied to a multitude of problems. Machine learning is a powerful tool coupled
with the cheaper and more powerful processors and GPUs we have available to us due to Moore’s law and
the enormous amount of data that is currently being generated on the internet. While it is wrong to think that
machine learning techniques is a magic solution that can be applied to any problem, with careful planning and
consideration it can be used to huge success.

5.8 Experience

As mentioned in the section 5.5, the group lacked experience and knowledge of machine learning as well as
recommendation services. This proved to be a hindrance for the project as a lot of time was spent initially
researching these subjects. If we had that knowledge before the project, the implementation could have been
done earlier and with less issues along the way, specifically for the machine learning part. As we did not
know what implementation to initially go with for the feature recommendation system, a lot of time was spent
researching specifically what implementation would be viable. When we decided on a implementation, largely
due to the insight of Mikael Kågebäck, actually getting the system implemented took longer than expected. This
meant that we had less time for experimentation than we would have liked. Therefore, if someone decides to
build upon this project we would recommend previous knowledge of both machine learning and recommender
systems.

31


Chapter 6

Conclusion

While basing music recommendations on a combination of real time data and user history is not in itself a bad
idea and using machine learning as a way to recommend items is a well-proven technique, our implementation
has a plethora of issues. To have a system that could be scaled to realistic levels, generalization between users
is needed, perfect personalization is not feasible. While we could see that our recommendations got better over
time, one large reason was that our simulated user had completely static music preferences. Machine learning
cannot handle quick variations in user preferences due to the long time of training and is better used as a way
to give more generalized recommendations. Using offline evaluation also prevented us from exploring if there
was a clear link between the selected song features and a users heart rate since the simulated user was created
with a set song feature preference at different times and heart rates.

Due to our limited ability of testing the application with online evaluation, our result has limited use. As
we could only evaluate the system with simulated users we do not know how an actual user would rate the
system. Our result says that the feature recommender system actually improves but we do not know for sure if it
improves in such a way that it increases user satisfaction. It would have been interesting to try using other audio
features than the ones implemented. However, that would require online evaluation to determine whether these
audio features gave a better result. Because offline evaluation can only answer whether the application can find
the optimal value for the implemented audio features. Another interesting experiment would be to exchange
heart rate for another real-time parameter to see whether heart rate is actually a viable parameter to utilize. But
as this would also require online evaluation, it was not possible for us to experiment with this.

If this project was to be repeated we would suggest limiting the scope to one part of the system, either making
a good ranking algorithm or making feature suggestions. For the other part we would recommend to use an
already existing library or similar work. This would allow the work to be more in depth as both parts of the pro-
gram ended up being rather time-consuming and broad subjects. We would also recommend using a contextual
bandit as it was the one that seemed most suitable for this problem. As mentioned in the discussion the amount
of data proved to be a limiting factor and having access of a data set with usable data, or the ability to gather
such data, would enable better training and testing.

32


Chapter 7

Bibliographic Notes

7.1 Recommendation Systems

Recommender systems: Principles, methods and evaluation [15] – Thorough review of different recommenda-
tion techniques and their strengths and weaknesses, useful in background to the approach of this project.

Evaluating Recommendation Systems. [31] – Paper on how to test and evaluate recommendations. Very useful
for evaluation of result of this project.

A User-Centric Evaluation Framework for Recommender Systems [28] – A paper that evaluates users expe-
riences regarding quality of recommendations.

A Contextual-Bandit Approach to Personalized News Article Recommendation [23] – Paper that presents a
contextual bandit algorithm for making personal recommendations. Gives insight in to how contextual bandits
can be implemented and how to evaluate them.

Reinforcement Learning based Recommender System using Biclustering Technique [4] – A paper where a rein-
forcement learning algorithm for giving recommendations are developed. A biclustering technique is used to
reduce the state and action space. Relevant as it shows an example of how reinforcement learning can be useful
in giving recommendations.

Restricted Boltzmann Machines for Collaborative Filtering [30] – A paper that describes how RMB can be
used for giving recommendations, which is applicable on very large data sets. The article is a bit dated because
there have been much development in machine learning since 2007 but could still be relevant.

A smartphone-based activity-aware system for music streaming recommendation [38] – Music recommenda-
tion based on the users current activity and mood. The study does not use heart rate but still possesses useful
information on activity recognition, machine learning techniques and classification.

Clu-PoF-A Novel Post Filtering Approach for Efficient Context Aware Recommendations [36] – A paper on
how to utilize contextual information about user and item when giving recommendations.

Two Decades of Recommender Systems at Amazon.com [24] – Article that describes Amazon’s item based
collaborative filtering recommender system and its development. Useful example.

A User-Centric Evaluation Framework for Recommender Systems [28] – Paper that presents a framework that
evaluates different recommender systems based on user experience. Useful in how it narrows down what con-

33


stitutes a satisfying recommender system.

How music recommendation works — and doesn’t work [40] — Article that describes how Echo Nest, which is
the company that Spotify uses for recommandations, analyze and label music.

7.1.1 Neural Networks and Deep Learning

Deep Learning based Recommender System: A Survey and New Perspectives. [33] – Extensive survey of recent
years research and advancements in recommender systems using deep learning. The paper also points out some
open problems in the field and describes the newest trends in deep learning techniques for recommender sys-
tems.

Wide and Deep Learning for Recommender Systems [19] – Paper that present a wide and deep learning that
utilizes both wide linear models and deep neural networks to give recommendations.

Content-aware collaborative music recommendation using pre-trained neural networks. [9] – Paper that shows
that incorporating music content in a collaborative filtering approach can solve the "cold-start problem". Rele-
vant for understanding of music recommender systems.

Deep content-based music recommendation. [1] – Conference paper that shows that a deep neural network
based on audio signals can be used to predict latent factors for music recommendations that user data can’t
show.

Recommending music on Spotify with deep learning [10] – Sander Dieleman interns at Spotify and uses deep
learning to cluster songs together based on the model from the paper [1] where he was co-author. Interesting
with an industrial perspective and connects to the data (Spotify’s song data) used in this thesis.

Deep Neural Networks for YouTube Recommendations [27] – Paper that describes on a high level how YouTube
recommendations are being made. Helpful with real world examples of neural network recommender systems.

Hybrid Collaborative Filtering with Neural Networks [14] – Paper that introduce a neural network to perform
collaborative filtering with side information to avoid cold start.

DeepPlaylist: Using Recurrent Neural Networks to Predict Song Similarity [6] – A paper that uses a DNN
to see if two songs are similar based on lyrics and/or sound. Very relevant in recommender systems to find
similar songs so not only the most popular songs get recommended.

7.2 Connecting Music to Heart Rate, Activity or Mood

Relationship Between Exercise Heart Rate and Music Tempo Preference [8] – Article on how music tempo
preference correlates to exercise heart rate, supports the choice of heart rate as real time parameter for music
recommendation.

Music can make the heart beat faster [12] – Article on how music can affect the heart rate and blood pres-
sure, not very relevant and we will probably not use it.

Stress-relieving music [20] – Study of the development of personalized music recommendations to lower stress
levels. It is probably not rel