Comparison of Database Management Systems

A comparison of common Database Management Systems
Bachelor’s thesis in Computer Science and Engineering

Simon Arneson, Joakim Berntsson, Victor Larsson,
Simon Larsson Takman, Erik Nordmark, Pedram Talebi

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2016


Bachelor’s thesis 2016:DATX02-16-16

Comparison of Database Management Systems

A comparison of common Database Management Systems

Simon Arneson, Joakim Berntsson, Victor Larsson,
Simon Larsson Takman, Erik Nordmark, Pedram Talebi

Department of Computer Science and Engineering
Chalmers University of Technology

Gothenburg, Sweden 2016


Comparison of Database Management Systems
A comparison of common Database Management Systems

SIMON ARNESON
JOAKIM BERNTSSON
VICTOR LARSSON
SIMON LARSSON TAKMAN
ERIK NORDMARK
PEDRAM TALEBI

The Author grants to Chalmers University of Technology and University of Gothen-
burg the non-exclusive right to publish theWork electronically and in a non-commercial
purpose make it accessible on the Internet. The Author warrants that he/she is the
author to the Work, and warrants that the Work does not contain text, pictures or
other material that violates copyright law.

The Author shall, when transferring the rights of the Work to a third party (for
example a publisher or a company), acknowledge the third party about this agree-
ment. If the Author has signed a copyright agreement with a third party regarding
the Work, the Author warrants hereby that he/she has obtained any necessary per-
mission from this third party to let Chalmers University of Technology and Uni-
versity of Gothenburg store the Work electronically and make it accessible on the
Internet.

© SIMON ARNESON, 2016.
© JOAKIM BERNTSSON, 2016.
© VICTOR LARSSON, 2016.
© SIMON LARSSON TAKMAN, 2016.
© ERIK NORDMARK, 2016.
© PEDRAM TALEBI, 2016.

Supervisor: Morten Fjeld, Department of Applied Information Technology
Examiner: Niklas Broberg, Department of Computer Science and Engineering

Bachelor’s Thesis 2016:DATX02-16-16
Department of Computer Science and Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Department of Computer Science and Engineering
Gothenburg, Sweden 2016

iii


Abstract

Data can be referred to as objects or events stored on digital media. The evolution
of IT solutions has sprung an interest in developing systems to store this type of
data more efficiently. The modern method of storing general data is by utilizing a
database, which is an organized collection of logically related data.

Different types of databases suit various needs, therefore the purpose of this project
is to compare the leading database solutions with each other. The databases chosen
for this project are: MySQL, Neo4j and MongoDB.

Furthermore we collaborate with two companies in order to acquire hands-on in-
formation of how companies work with data. This is done by developing modules
for their web applications. The insight and data obtained through these companies
is used to create test cases which is then benchmarked on the different database
management systems.

The test results indicates that MongoDB has overall the best performance. However,
during a few complex test cases MongoDB lacks in performance. Neo4j performs
worse on more test cases than MongoDB, and MySQL is by far the most consistent
of these three technologies. The outcome is similar to our expectations and confirms
parts of the related work that has been researched.

Keywords: database, benchmarking, test cases, company analysis, nosql, dbms,
relational

iv


Sammanfattning

Ordet data innefattar flera tolkningar och kan förklaras som antingen objekt eller
som händelser lagrade på digitala medier. Utvecklingen av IT-lösningar har främ-
jats av en strävan efter att lagra data så effektivt som möjligt. Idag lagrar man
huvudsakligen data genom att använda en databas, vilket i grund och botten är en
organiserad samling av logiskt relaterad data.

Eftersom data kommer i många variationer, lämpar sig olika databasteknologier
för olika situationer. Syftet med detta projekt är därför att jämföra några av de
ledande databaslösningarna. Valet av databaser blev därför följande: MySQL, Neo4j
och MongoDB.

I vårt projekt samarbetar vi med två företag för att förvärva praktisk information
om hur företag arbetar med data. Detta görs genom att utveckla moduler för deras
webbapplikationer. Den kunskap som erhålls genom att analysera företagen används
för att skapa verklighetsbaserade testfall som sedan genomförs empiriskt på de olika
databassystemen.

Testresultaten visar att MongoDB presterar generellt bäst. Under några komplicer-
ade testfall underpresterade dock MongoDB. Neo4j presterar sämre på fler testfall
än både MongoDB och MySQL. Testresultaten visar även att MySQL är den mest
konsekventa av de tre teknikerna. Resultaten stämmer överens på flertalet punkter
med våra förväntningar och bekräftar delar av det relaterade arbete som vi under-
sökt.

Nyckelord: databaser, tester, testfall, företagsanalyser, sql, nosql, relationsdatabaser

v


Definitions

Data Objects and events stored on digital media
Database An organized collection of logically related data
Database refactoring A change to a database schema that improves its

design
Entity A certain type of data in a model
General data Commonly occurring, non-complex data such as

user profiles, products, orders etc.
Information Data processed to increase the knowledge of the

observer
Large-scale
applications Applications handling big data sets and many

queries
Model Generally defined structure of data and its’ rela-

tions

vi


Names & Abbreviations

Agile Collection name for iterative methodologies
API Application Programming Interface
Auto-sharding Automatically horizontal partitioning of a

database
Benchmarking Something that can be used as a way to judge the

quality of similar things
CMS Content Management System
CPU Central Processing Unit
CRM Customer Relationship Management
CRUD Create Read Update Delete
CSV Comma-separated value
Cypher Neo4j’s query language which is based upon SQL
Daemon process A process that runs in the background and per-

forms a specified operation at predefined times or
in response to certain events

Database Storage solution for data
Database Schema Schematic describing constraints to be applied on

a database model
DBMS Database Management System
IoT Internet of Things
JavaScript Lightweight, prototype based programming lan-

guage
JSON JavaScript Object Notation
Map reduce Query and aggregation technique
NoSQL Generic name for databases which is not based on

SQL
OS Operating System
Python Multi-paradigm programming language
RAM Random Access Memory
RDBMS Relational Database Management System
Schemaless The documents stored in the database can have

varying sets of fields, with different types for each
field

SKU Stock Keeping Unit
SQL Structured Query Language
SSD Solid State Drive
Startup Newly emerged, fast-growing business
URL Uniform Resource Locator
VM Virtual machine

vii


Acknowledgements

We would like to take the opportunity to express our sincere gratitude to the de-
partment faculty members Niklas Broberg and Graham Kemp, for the productive
discussions during this project. We would also like to express gratitude to our su-
pervisor Morten Fjeld, for all of the support both before and during the project
phase.

We also want to thank Marco Schirone, librarian at Chalmers University of Tech-
nology, who helped us during the project with questions regarding the reference
management system.

We are grateful towards all of the companies that have taken their time do interviews
with us and answered our questions. We are especially grateful towards Raceone
and Skim for the great collaboration and supplying us with test data.

Finally, we would like to thank Stefan Svensson, our contact person at Skim, for
always taking the time to set up weekly meetings and always being eager to give us
feedback.

viii


Contents

List of Figures xi

List of Tables xiii

1 Introduction 1
1.1 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Problem definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Scope & Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Theory & Tools 3
2.1 Types of Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2.1.1 Relational . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 NoSQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.2 Structured Query Language . . . . . . . . . . . . . . . . . . . . . . . 4
2.3 Database Management Systems . . . . . . . . . . . . . . . . . . . . . 5

2.3.1 Relational - MySQL . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.2 Graph - Neo4j . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.3 Document - MongoDB . . . . . . . . . . . . . . . . . . . . . . 8
2.3.4 Key-value - CouchDB . . . . . . . . . . . . . . . . . . . . . . . 9

3 Related Work 13
3.1 Data Management in Cloud Environments by Katarina Grolinger et al. 13

3.1.1 Key-value stores . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.1.2 Document stores . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.3 Graph databases . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.4 Conclusion of the Study . . . . . . . . . . . . . . . . . . . . . 14

4 Approach 17
4.1 Market Analysis Company . . . . . . . . . . . . . . . . . . . . . . . . 17

4.1.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . 17
4.1.2 Suggested Software Models . . . . . . . . . . . . . . . . . . . . 18
4.1.3 Analysis of Application Usage . . . . . . . . . . . . . . . . . . 19
4.1.4 Suggested Test Cases . . . . . . . . . . . . . . . . . . . . . . . 20
4.1.5 Suggested Data Collection . . . . . . . . . . . . . . . . . . . . 21

4.2 Race Sharing Solution . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.2.1 System Description . . . . . . . . . . . . . . . . . . . . . . . . 22

ix


Contents

4.2.2 Suggested Software Models . . . . . . . . . . . . . . . . . . . . 22
4.2.3 Analysis of Application Usage . . . . . . . . . . . . . . . . . . 25
4.2.4 Suggested Test Cases . . . . . . . . . . . . . . . . . . . . . . . 25
4.2.5 Suggested Data Collection . . . . . . . . . . . . . . . . . . . . 26

4.3 Reference Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.4 Development of Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.4.1 Test Framework . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.4.2 MySQL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.4.3 Neo4j . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.4.4 MongoDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.5 Benchmarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5.1 Virtual Machines . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5.2 Execution of Test Cases . . . . . . . . . . . . . . . . . . . . . 33
4.5.3 Compiling of Results . . . . . . . . . . . . . . . . . . . . . . . 33

5 Results 35
5.1 Skim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.1.1 Tables of Average Times . . . . . . . . . . . . . . . . . . . . . 35
5.1.2 Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.2 Raceone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
5.2.1 Tables of Average Times . . . . . . . . . . . . . . . . . . . . . 38
5.2.2 Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.3 Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.3.1 Tables of Average Times . . . . . . . . . . . . . . . . . . . . . 41
5.3.2 Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

6 Discussion 45
6.1 Interpretation of Results . . . . . . . . . . . . . . . . . . . . . . . . . 45
6.2 Expectations of Test Cases . . . . . . . . . . . . . . . . . . . . . . . . 46

6.2.1 Skim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
6.2.2 Raceone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

6.3 Similarities to Related Work . . . . . . . . . . . . . . . . . . . . . . . 49
6.4 Social Aspect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

7 Conclusion 51

8 Outlook 53
8.1 Additional Companies . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.2 CouchDB . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.3 Analyzing CPU Load . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
8.4 Displaying of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

Bibliography 55

A Appendix 1 I
A.1 Screenshots of the company modules . . . . . . . . . . . . . . . . . . I
A.2 Additional Raceone Test Cases . . . . . . . . . . . . . . . . . . . . . II

x


List of Figures

2.1 Example of Car data stored in a table. . . . . . . . . . . . . . . . . . 3
2.2 Abstracted overview showing the position of a DBMS in an applica-

tion environment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Seven nodes with different relationships. . . . . . . . . . . . . . . . . 7
2.4 Visual representation of a collection with documents in MongoDB. . . 9
2.5 Graphical illustration of a B tree [1]. . . . . . . . . . . . . . . . . . . 10

4.1 Modelling of Skim in MySQL. . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Modelling of Skim in Neo4j. . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Modelling of Skim in MongoDB and CouchDB. . . . . . . . . . . . . 19
4.4 Modelling of Raceone in MySQL. . . . . . . . . . . . . . . . . . . . . 23
4.5 Modelling of Raceone in Neo4j. . . . . . . . . . . . . . . . . . . . . . 23
4.6 Modelling of Raceone in MongoDB and CouchDB. . . . . . . . . . . . 24
4.7 Graphical illustration of the test environment. . . . . . . . . . . . . . 29

5.1 Average times for Skim’s test cases for data scale 1. . . . . . . . . . . 36
5.2 Average times for Skim’s test cases for data scale 10. . . . . . . . . . 37
5.3 Average times for Raceone’s test cases for data scale 1. . . . . . . . . 39
5.4 Average times for Raceone’s test cases for data scale 10. . . . . . . . . 40
5.5 Average times for Reference’s test cases for data scale 1. . . . . . . . 42
5.6 Average times for Reference’s test cases for data scale 10. . . . . . . . 43

A.1 Screenshot of the Skim module - SKU list view. . . . . . . . . . . . . I
A.2 Screenshot of the Skim module - Image SKU matching view. . . . . . I
A.3 Screenshot of the Raceone portal - Create an event view. . . . . . . . II
A.4 Screenshot of the Raceone portal - Event list view. . . . . . . . . . . II

xi


List of Figures

xii


List of Tables

2.1 Relationship between RDBMS and MongoDB terminology. . . . . . . 8

4.1 Test cases for Skim . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4.2 Skim entities and their suggested amount. . . . . . . . . . . . . . . . 21
4.3 Test cases for Raceone . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.4 Raceone entities and their suggested amount. . . . . . . . . . . . . . 26
4.5 Reference entities and their suggested amount. . . . . . . . . . . . . . 27
4.6 Reference entities and their suggested amount. . . . . . . . . . . . . . 33

5.1 Skim: Average execution times for data scale 1. . . . . . . . . . . . . 35
5.2 Skim: Average execution times for data scale 10. . . . . . . . . . . . . 36
5.3 Raceone: Average execution times for data scale 1. . . . . . . . . . . 38
5.4 Raceone: Average execution times for data scale 10. . . . . . . . . . . 38
5.5 Reference: Average execution times for data scale 1. . . . . . . . . . . 41
5.6 Reference: Average execution times for data scale 10. . . . . . . . . . 41

6.1 Expectations and their outcome for Skim . . . . . . . . . . . . . . . . 46
6.2 Expectations and their outcome for Skim . . . . . . . . . . . . . . . . 47
6.3 Expectations and their outcome for Skim . . . . . . . . . . . . . . . . 48
6.4 Expectations and their outcome for Skim . . . . . . . . . . . . . . . . 49

xiii


List of Tables

xiv


1
Introduction

Data is a concept with many definitions where books, images, and videos can all con-
tain different forms of data. Data refers to objects or events stored on digital media
[2], and the evolution of IT solutions has created a large interest in managing data
for the web. The modern way of storing general data, such as texts and numbers, is
within a database. Databases are divided into different branches which are suitable
for different types and amounts of data. The storage, security and accessibility of
data requires a great deal of attention and competence. These requirements can be
fulfilled by some companies, but it can be a difficult task for small companies due
to the lack of resources.

While the subject of databases and data is growing in complexity, more work has
to be done to compile the information to a comprehensible amount. Different com-
panies needs different approaches to store their particular data, especially in order
to avoid reworking their entire model in a near future.

1.1 Purpose

The focus of this project is to compare and analyze the performance of different
database management systems, DBMS’s. Two companies are chosen as collaborators
in order to understand what type of data that is stored and how it is used.

The insight and data derived from the companies are used to create test cases of
data usage and analyze these on the most used database technologies. The result is
meant to be easily understandable and help in the selection of a database strategy.

1.2 Problem definitions

There are several obstacles when benchmarking and comparing DBMS’s, the first is
the selection of technologies to analyze. The chosen technologies have to be widely
used and represent the majority of DBMS’s. There are numerous database categories
containing different variations of database technologies, therefore the selection has
to be done with great care.

1


1. Introduction

Another aspect to consider is the hardware to host the systems on. Some systems
are meant to run on certain hardware and by choosing the same type of machine for
all DBMS’s, some issues may be solved but new might arise.

The next problem is the libraries chosen to communicate with the databases. By
using different networking protocols, the results of the benchmark can be misleading.

The final problem is related to the human factor, where even if the problems above
are solved, the implementations of clients using the technologies and libraries may
be done insufficiently or incorrectly.

1.3 Scope & Delimitations

Three types of databases will be considered during this project. In each of these
types, one implementation will be chosen which will represent the whole category.
The chosen implementation can also be used in multiple ways, which limits this
project to researching a few of the implementations and evaluating the most promis-
ing version.

A substantial area that will not be considered in this research, is different sets
of hardware. Databases can behave differently on different hardware, however, to
include this would make this project too broad. Therefore we will choose a single
reference machine to be used by all databases.

In the beginning of the project CouchDB was included as a candidate to be com-
pared. However, it was later excluded due to our priorities.

2


2
Theory & Tools

The theory behind this project is focused on the implementations of the chosen
database management systems and the technologies they are based upon.

2.1 Types of Databases

A database can be defined as a collection of logically related data [2]. There are
several types of databases and two widely used technologies are: relational and not
only Structured Query Language, NoSQL.

2.1.1 Relational

Since the first commercial implementation of the relational database was released
by Oracle in 1979, the choice of database solutions within companies has been
dominated by the relational database model [3].

A relational database organizes data items into tables, called entities, where each
column in the table represents an attribute of that entity. Every row in the table
represents one data item, for example one product, and each row must have an
unique key. An example of a table can be seen in Figure 2.1. Values in a table can
be a reference to a value in a column in another table, and this can be seen as a
relationship between columns. [4]

Figure 2.1: Example of Car data stored in a table.

In relational databases there is a term called constraint, which is a rule that cannot
be violated by database users [2].

3


2. Theory & Tools

2.1.2 NoSQL

Many companies run their business and interact with customers primarily through
web- and mobile applications, which imposes greater demands on performance and
creates new technology requirements. Below is a list of common technological re-
quirements of modern applications [5]:

• Support large numbers of concurrent users (tens of thousands, perhaps mil-
lions)

• Deliver highly responsive experiences to a globally distributed base of users

• Always be available – no downtime

• Handle semi- and unstructured data

• Rapidly adapt to changing requirements with frequent updates and new fea-
tures

Due to the new technological requirements, enterprise technology architecture have
to be renewed and adapted to manage real time data in an efficient way. Unprece-
dented levels of scale, speed, and data variability are some of the new technical
challenges which must accommodated. For instance, "over 1 billion people use Face-
book every month, and every day there are more than 2.7 billion Likes and over 2.4
billion content items shared with friends" [6]. These circumstances made companies
lean towards a new type of database technology, NoSQL. NoSQL is a term for vari-
ous types of databases, each with their own specific characteristics. Below is a list
of common types of NoSQL databases:

• Graph database - Neo4j

• Document database - MongoDB

• Key-Value store - CouchDB

The main difference between relational databases and NoSQL is how their data
models are designed. Unlike relational databases where a fixed model is based on
static schemas and relationships, the data model in NoSQL is more flexible and
does not require predefined schemas. By using the schema-less model, users are
able to make modifications in their database in real-time, which fits well with agile
development. Due to the flexible data model, modifications in an application leads
to fewer interruptions in the development and less database administration time is
needed.

2.2 Structured Query Language

A query language can be described as a tool to write predefined functions for ex-
tracting data from a database. As early as 1970, Edgar F. Codd published a paper
where he explained how to access information in databases without knowing the

4


2. Theory & Tools

structure of the information. This was the start of structured query language, SQL.
[7]

SQL is the most well known query language, and extensions of SQL is used by both
relational database management systems, RDMS’s, and NoSQL databases. There
are several implementations of the SQL standard and all of them share the CRUD
cycle. CRUD is an acronym for: Create, Read, Update, and Delete, which are the
main operations in the SQL standard.

SQL statements are used to extract information from a database. A statement in
SQL is composed out of several keywords, each with its own purpose. A select
statement fetches data from the database, it represents the Read operation of the
CRUD cycle. The following list consists of the most common keywords used in a
select statement to read and structure data [8].

• FROM - Which tables to retrieve data from.

• WHERE - Compares the result and eliminates everything that evaluates as
false.

• GROUP BY - Group the result by one or more columns.

• HAVING - Filters groups based on criteria specified in the query.

• ORDER BY - Sorts the result in ascending or descending order.

• DISTINCT - Eliminates all duplicates in the result.

Below is an example of querying the id, price and name of a product stored in the
Products table ordered by the price followed by the name [8].

1 SELECT prod_id, prod_price, prod_name
2 FROM Products
3 ORDER BY prod_price, prod_name;

2.3 Database Management Systems

A database management system, DBMS, is a layer between an application and
data, which allows the usage of a database approach through CRUD operations.
The central positioning of a DBMS is crucial to allow multiple applications and
users to interact with the same data, and thus establishing a communication tunnel
[2]. This is illustrated in Figure 2.2.

5


2. Theory & Tools

Figure 2.2: Abstracted overview showing the position of a DBMS in an application
environment.

Furthermore, a DBMS is responsible for enforcing data integrity, managing concur-
rency and restoration of data. These three areas are important when expanding the
usage of a DBMS to allow more connections and users [2].

2.3.1 Relational - MySQL

MySQL is the second most popular database management system [9]. The MySQL
software is released as open source under the GPL licence and it is available on
over 20 platforms [10]. It is a relational database management system and uses its
own implementation of the SQL standard [11] which differs in some aspects from
Transact-SQL, an implementation created by Microsoft and Sybase used in Micosoft
SQL Server [12], and PL/SQL, an implementation used in Oracle Database [13]. The
main difference between these different implementations is how they handle local
variables and procedural processes, such as stored procedures and triggers. [12] [13].

2.3.2 Graph - Neo4j

Graph databases are built upon the idea of graphs, where a graph is a set of vertices
and edges which forms nodes and relationships. In our everyday life we can see
several similarities to graphs and this common occurrence is why it can be easier to
model some things in a graph database. An area where graph databases stands out
is during more complex querying of relationships, e.g. for shortest path, minimal
spanning tree and other similar algorithms. The most used NoSQL graph database
is Neo4j. [14]

There are several differences between Neo4j and RDBMS, where: Neo4j is built
on relationships, not entities; Neo4j does not slow down from increasing data sets,
since only the number of traversed nodes make a difference; and Neo4j is schema-less,
which means it is more flexible to make additions to its model. [14]

6


2. Theory & Tools

According to Ian Robinson et al. [14], the most adopted graph data model is called
labeled property graph model, where a model is defined as an abstraction or sim-
plification of the real world. This data model is made up of nodes, relationships,
properties and labels. Nodes contain properties, in form of key-value pairs, and
roles, in form of tags. A relationship has a direction, name, start node and an end
node. Like nodes, relationships may also have properties. An example of seven
nodes with different relationships can be seen in Figure 2.3.

Figure 2.3: Seven nodes with different relationships.

Ian Robinson et al. also mentions Cypher, the standard query language in Neo4j.
Cypher is a graphical querying language, with arrows and tags to indicate direc-
tion and explanation of a relationship. Cypher’s keywords are very similar to the
keywords found in SQL and includes the following:

• WHERE - Similar to the SQL WHERE keyword.

• CREATE and CREATE UNIQUE - Used to create nodes and relationships.

• DELETE - Used to delete nodes and relationships.

• MERGE - Updates a node if it exists, otherwise it creates a new node.

• SET - Adds a new property to an existing node or relationship.

• FOREACH - Allows for iteration over a result set.

• UNION - Combines two result sets into one set of results.

Below is an example for querying friends of Jim who know each other as well [14].

1 MATCH (a:Person {name: ’Jim’})-[:KNOWS]->(b)-[:KNOWS]->(c),
2 (a)-[:KNOWS]->(c)
3 RETURN b, c

7


2. Theory & Tools

Validation of a graph model is pretty loosely defined by Ian Robinson et al. and
there are no clear strategies to validate a model. One alternative is to begin at the
start node and follow its path, while reading the name of each node and relationship
along the way out loud. If this creates sensible sentences, then the model is likely
to be a good representation. [14]

Common use cases for Neo4j and graph databases in general is for social networking,
recommendations and geospatial calculations. A considerable reason for the effec-
tiveness in these areas is the graph and tree related algorithms that can be applied.
[14]

2.3.3 Document - MongoDB

MongoDB is the leading NoSQL database [9]. MongoDB is a schema-less docu-
ment oriented database where data is stored in Binary JSON, a binary-encoded
serialization of JSON-like documents [15].

MongoDB is composed of collections, documents, fields and embedded documents
in contrast to relational databases, which consists of the traditional tables, rows,
columns, and table joins. The Table 2.1 shows the relationship between RDBMS
and MongoDB terminology.

RDBMS MongoDB
Database Database
Table Collection
Tuple/Row/Record Document
Column Field
Table Join Embedded Documents
Primary Key Primary Key (Default key _id provided by MongoDB)

Table 2.1: Relationship between RDBMS and MongoDB terminology.

A collection can be seen as a container for the MongoDB documents and since it is
a NoSQL database, no predefined schemas are required. MongoDB stores its data
as documents and documents have dynamic schemas, which means that documents
in the same collection can have different sets of fields. However, every document in
MongoDB has a special key, called id, that is unique within a collection. [16]

MongoDB is delivered with a JavaScript shell which allows a user to query a Mon-
goDB database without the use of a library. MongoDB shell uses the javascript
dot notation syntax to expose several different types of database operations to the
user, such as insert, update, and remove. Each of these functions accept queries
formatted as JSON. The MongoDB query language can therefore be considered to
be JSON based. [16]

The first of the two code snippets below demonstrates how to add a document with
four different fields to the users collection. The second is updating the username

8


2. Theory & Tools

field of the document that was added in by the previous snippet. In addition, Figure
2.4 is attached, showing a visual representation of the concepts of documents and
collections.

1 db.users.insert_one({
2 _id: "507f1f77bcf86cd799439010",
3 username: "user_1",
4 email: "user1@gmail.com",
5 password: "SuperHash"
6 })

1 db.users.update(
2 {’_id’:’507f1f77bcf86cd799439010’},
3 {$set:{’username’:’user_1_new’}}
4 )

Figure 2.4: Visual representation of a collection with documents in MongoDB.

2.3.4 Key-value - CouchDB

CouchDB is a schema-less NoSQL database with a focus on modularization and scal-
ability. CouchDB uses JSON to represent data. Opposed to MongoDB, CouchDB
uses a B+ tree storage engine which allows searches, insertions and deletions in log-
arithmic time. Additionally, CouchDB supports MapReduce, which is a query and

9


2. Theory & Tools

aggregation technique. [17]

A large number of data structures are linear, since each node can have either zero
or one predecessor and successor. Trees are defined differently, where each node can
have several successors, or children, yet still only one predecessor, or parent. As
stated by Koffman and Wolfgang [1], when searching for an element, linear models
have to check each element while sorted trees can limit the continued search to one
of the sub-trees. Due to this, linear models have an access time of ordo n while trees
can achieve log(n).

B+ trees are also self balancing which means that they try to keep the height of
parents’ sub trees equal or near equal. While binary trees only allow up to two
children, B+ trees supports n number of values for, and children of, a node. Figure
2.5 shows a B tree with children on a per value basis, where each node has multiple
values.

Figure 2.5: Graphical illustration of a B tree [1].

Another aspect of CouchDB is MapReduce. MapReduce consist of two methods, a
map function and a reduce function. The mapping function handles the selection of
data while the reducing function aggregates the documents. Both methods return
key-value pairs, which are very efficiently queried from a B+ tree. [17]

1 function map(doc) {
2 if (doc.type === ’RACE’)
3 emit(doc.date, 1);
4 }
5 function(keys, values) {
6 return sum(values);
7 }

Above is an example of a map and reduce function. When enabling the grouping
setting, these methods will return all dates and the number of races taking place on
that particular date.

When developing with CouchDB, the recommended approach as stated by Ander-
son et al. [17] is to declare views with pre-defined mapping and reducing functions.
These views can be accessed with HTTP requests to http://COUCH_IP:COUCH_
PORT/DB_NAME/_design/VIEW_NAME. This allows for easy GET/POST/DELETE/PUT

10

http://COUCH_IP:COUCH_PORT/DB_NAME/_design/VIEW_NAME
http://COUCH_IP:COUCH_PORT/DB_NAME/_design/VIEW_NAME


2. Theory & Tools

requests with built in authorization and validation in all languages, since the only
requirement is a standard HTTP request library.

Finally, Anderson et al. [17] mentions the deployment of a CouchDB database,
which covers scaling through replication, load balancing and clustering. Replication
offers synchronization between copies of the same database on different machines.
This will enable a distributed network of access points to lower the latency for users
all over the world. The access points can also include load balancing, so the machines
are used evenly across the network. Load balancing and replication can be used to
create clusters and thus induce fail tolerance and a higher performance cap.

11


2. Theory & Tools

12


3
Related Work

In order to achieve a broader understanding and obtain a glance of what to expect,
the group studied some related works of database analysis and evaluated the most
promising study below.

3.1 Data Management in Cloud Environments by
Katarina Grolinger et al.

The project has three major focus areas. The first area concerns getting a general
perspective of the database management domain. This is done by summarizing and
categorizing leading NoSQL and NewSQL solutions.

Another focus area of the research is analysing the characterizing differences between
the selected data store solutions. The comparison is done in order to act as a
guideline towards developers and researchers when they are in the position to choose
a data store solution.

The final focus area is about taking all the previous work into consideration and
present opportunities and challenges that may be relevant for future iterations, in-
volving the same subject matter.

What we found particularly interesting about this paper was the comparison of the
different NoSQL databases. The Results are compiled to help practitioners to choose
the appropriate data stores for different use cases. [18]

As stated in Section 2.1.2, NoSQL databases can further be sub-classified based on
their data models, and this paper follows the classification provided by Hecht and
Jablonski [19].

The three major categories are key-value stores, document stores and graph database
[18].

3.1.1 Key-value stores

The study claims that since key-value stores like CouchDB are schema-less, they
are well suited to store distributed data, but are not suitable for scenarios requiring

13


3. Related Work

relations or structures. It is also stated that key-value stores are appropriate when
applications access a set of data using a unique key. Three typical examples are
raised in the study where a key value database suits well: storing web session infor-
mation; user profiles and configurations; and shopping cart data. These use cases
suit CouchDB well since the data is accessed through user identification and never
queried based on the data content.

Furthermore, the study also found that key-value databases are not the best solution
when it comes to highly interconnected data, due to all the relationships that needs
to be explicitly handled in the client applications instead. [18]

3.1.2 Document stores

As mentioned in Section 2.3.3 MongoDB store its data as documents. Therefore,
applications dealing with data that can be easily interpreted as documents, such as
blogging platforms and CMS, are well suited for MongoDB.

Furthermore, it is also stated that a second use case for document data stores is ap-
plications storing items of related nature, but with different structure. Consider, for
example, where the document data stores are used to log events from enterprise sys-
tem. The events are represented as documents and events from different sources log
different information. However, since the documents use dynamic schemas, chang-
ing log formats does not need any database refactoring. Same task for relational
database would be tedious, in which a new table needs to be created for each new
format or new columns needs to be added to existing tables. [18]

3.1.3 Graph databases

Graph database is the final category that is mentioned. The study found that graph
databases, such as Neo4j, is typically used in location-based services, recommenda-
tion engines, and complex network-based applications including social, information,
technological, and biological networks. A use case that is mentioned in the study
is that user location history data, which is used to generate patterns that associate
people with their frequently visited places, could be efficiently stored and queried
with Neo4j.

Closely related data and complex queries similar to multiple joins in relational
databases are also scenarios were graph database is suitable according to the study.
[18]

3.1.4 Conclusion of the Study

The study of Katarina Grolinger et al. concludes that NoSQL databases are espe-
cially appropriate as alternatives to traditional relational databases in areas where
huge volume of data is going to be handled.

14


3. Related Work

In addition, their study claims that a relational database is appropriate in scenarios
where the structure is known in advance and unlikely to change.

Finally, the researchers of the study claim that it is essential to consider the in-
vestments already made in tools and personnel training prior to selecting the most
appropriate database. [18]

15


3. Related Work

16


4
Approach

This project was divided into four sections: pre-study, implementation, testing and
compilation of results. The pre-study consisted of: researching the database tech-
nologies; conduction interviews with and developing applications for companies; and
analyzing the type of data being used by companies.

Creating test cases was done by discussing the observations of the different compa-
nies. Then a benchmarking suite was built with all test cases represented. Finally
the cases were run in an interference free environment.

4.1 Market Analysis Company

We collaborated with the company Skim, which was founded in 1979 and has more
than 100 employees.

Skim is an international market analysis company which offers pricing and portfolio
management to other companies [20]. They majority of Skim’s data consist out of
images from their clients products. These images require a lot of processing and
matching to be useful in an analysis.

To facilitate for their employees they have a small group of developers working with
creating an internal customer relationship management, CRM, tool to solve these
problems. The CRM platform is a cloud based web portal and has several modules
to simplify the communication between Skim and its clients.

4.1.1 System Description

Our task was to develop an image uploader module, which will be part of their
CRM platform. The image uploader is a convenient way for Skim to define rules
and requirements for the images sent by Skim’s clients. By helping the clients
fulfill the rules set by Skim, time and effort can be saved for both parts, as well as
communicational improvements between them. The main purpose of our module
is to upload images, and then match them to a corresponding stock keeping unit,
SKU. A SKU contains multiple attributes, which together defines a product. The
image uploader consist of the views: upload, quality assurance, match and confirm.

17


4. Approach

The finished application is shown in Figure A.1 and A.2.

4.1.2 Suggested Software Models

Figure 4.1, 4.2, and 4.3 are the suggested models for this domain. The models follow
the best practices based on external resources, and was based on the use cases that
we received from Skim [2] [21]. The developed models was used throughout the
development, as well as later in our database tests.

Figure 4.1: Modelling of Skim in MySQL.

18


4. Approach

Figure 4.2: Modelling of Skim in Neo4j.

Figure 4.3: Modelling of Skim in MongoDB and CouchDB.

4.1.3 Analysis of Application Usage

The upload phase is responsible for uploading SKU-lists and images. SKUs can be
created either by adding them manually, or uploading a CSV-file. The headers for

19


4. Approach

a SKU-list can be different between projects, which lead to a more complex model
than initially planned. By introducing headers and values for each SKU, i.e each row
value specifies its value with the header, which will give some redundancy. Several
image formats can be uploaded to the system and the meta data is extracted and
stored in the model before saving the file on the file system. A URL for the image
is also stored in the image model.

After uploading the necessary entities, Skim needs to assure that the images upholds
the requirements and rules set for the project. At this stage it is possible to comment
on images, mark images as invalid and delete images. Deleted images is stored in
a separate collection in the project. The rules for a project is defined as simple
attributes in the project.

To ensure the validity of images, they need to be matched to a SKU. At this point,
only images that are valid will be shown. When an image is matched, it is removed
from the general image list of the project and stored in a list on the SKU. This
means that the images are stored in three different lists: the trash list, the image
list or within a SKU.

Finally, a confirmation stage is presented. This phase will allow you to get an
overview of all SKUs with the images matched to them. When a SKU is deemed
done, it can be marked as approved and will not be shown in the system anymore.

4.1.4 Suggested Test Cases

The suggested application usage led to the creation of test cases. Our purposed
cases are shown in Table 4.1.

20


4. Approach

Test Cases
Name Description
Pair Image with SKU Pairing, or matching, an image with a SKU by removing the

image from the project’s image list and inserting it into a
SKU. This will be a common occurrence in the portal.

Fetch Users This case will make a request for all users in the database.
Fetching users is a typical case for CRM systems.

Add rows to SKU Adding rows to a SKU-list will take a predefined SKU row
and add that to a project. This will occur frequently in Skim’s
portal and can be problematic with the dynamic SKU headers.

Fetch SKU Fetching a SKU includes querying all values and headers as-
sociated with the SKU. This will be done in several parts of
the portal.

Comment on image Commenting on an image will require: the image id; the id
of the commenting user; and the message text as parameters.
In Skim’s portal this will be their main feature for communi-
cating with clients regarding the image quality.

Fetch a user’s comments Fetching all the comments made by specific user will take a
user id as a parameter. This does not occur in Skim’s portal
at the moment, but could for example be used in a My Profile
view where a user could keep track of all current conversations.

Table 4.1: Test cases for Skim

4.1.5 Suggested Data Collection

We purpose the amount of test data specified in Table 4.2.

Entity Amount
User 100
Project 10
Collaborator 10
Project Image 100
SKU 20
SKU Value 15
SKU Image 2
Image Comment 5

Table 4.2: Skim entities and their suggested amount.

As indicated by Table 4.2, each project will have 10 collaborators, 100 images, and
20 SKUs. All SKUs have 2 images and 15 values each, and all images will have 5
comments.

21


4. Approach

4.2 Race Sharing Solution

Raceone is a startup company founded and located in Gothenburg. Their product
is a mobile application for sharing and following different sport races live [22]. The
participant registers on a race, and during the race the application will send coor-
dinates to a backend, which will analyze the movements. Positional updates will be
continuously sent to followers in real time. The coordinates and other information
will be saved in their database to enable analysis after a race’s completion.

4.2.1 System Description

Our task was to develop Raceone’s platform for organizers of various sports race
events. Organizers are able to register and handle their events in a simple and
convenient way. The application consists of different steps for the organizer to go
through, in order to provide sufficient amount of information to set up a new race.
Below is a list of the main parts of the application:

• Race registration

• Map registration: Upload, Optimize, Start/Goal and POI’s

• Confirmation

The finished application is shown in Figure A.3 and A.4.

4.2.2 Suggested Software Models

Figure 4.4, 4.5, and 4.6 are the suggested software models for this domain. The
models follow the best practices based on external resources [2] [21]. The developed
models was used throughout the development, as well as later in our database tests.

22


4. Approach

Figure 4.4: Modelling of Raceone in MySQL.

Figure 4.5: Modelling of Raceone in Neo4j.

23


4. Approach

Figure 4.6: Modelling of Raceone in MongoDB and CouchDB.

24


4. Approach

4.2.3 Analysis of Application Usage

The model and application contains the following main components:

• Organizer

• Event

• Race

• Map

The enclosing layer is the user or more specifically the organizer. The organizer
entity holds information such as name, email and password, but primarily it holds
the different events. Therefore, the first step in the application is to register an
organizer.

The organizer can either choose to create a new event or duplicate a previous event.
When creating a duplicate, the new event will have a reference to the predecessor
and the old event will have a reference to the newly created successor. The duplicate
contains the same information as the previous event.

Every event contains one or more races. If the organizer choose to copy a previous
event, all the races in that edition will be copied as well. However, since the new
copy is a new edition of that event, the races will also become new editions. Just like
events, races contains links to predecessor and any future successor. A race contains
all race-related information such as name, date, type of race, maximum number of
participants and start groups with associated start time. It also holds information
about the race-map.

A race-map is made up of several points which consists of longitude, latitude and
altitude. The organizer uploads a race-map to the specific race, and then has the
opportunity to optimize and adjust the map. The adjustments could for instance
be to change different points of interests, POI, for instance start or finish line.

4.2.4 Suggested Test Cases

The suggested application usage and interviews with Raceone has led to the creation
of test cases. Our purposed cases are shown in Table 4.3. Some additional test cases,
which will not be explained in detail in this report, are shown in Table A.2.

25


4. Approach

Test Cases
Name Description
Duplicate event The duplicate of an event will contain the original event’s

races and maps, without duplicating the connected partici-
pants and followers.

Fetch race This case will return a race with the properties: the names
of the participants’ and their number of followers; the map
and; the start and end points of the race. This includes some
special querying, counting the number of followers, and also
fetching many entities, i.e coordinates.

Follow participant Following a participant will create a relationship between two
users, more precisely, between a user and another user’s ac-
tivity.

Fetch participant This test case consists of two similar queries, finding the top
ten participants which are: participating in the most races;
and the participant with the most followers. This is used
when displaying followers in the Raceone application and is a
common occurrence.

Insert coordinates Inserting coordinates will be done in batches of one hundred
and linked to an activity of a race. For Raceone this is done in
the backend to avoid writing to the database too often during
a race.

Table 4.3: Test cases for Raceone

4.2.5 Suggested Data Collection

We purpose the amount of test data specified in Table 4.4.

Entity Amount
User 100
Organizer 10
Event 10
Race 5
Race Coordinate 100
Activity 10
Activity Coordinate 50

Table 4.4: Raceone entities and their suggested amount.

As indicated by Table 4.4, each event will have 5 races. A race will include a map of
100 coordinates and 10 activities, where each activity has one participant and one
follower, and 50 coordinates as the route traveled so far.

26


4. Approach

4.3 Reference Point

In Section 1.2 we mention the issue of equal conditions for each technology. When
libraries, hardware and human factor plays a part, it is complex to get a point
of reference. We have created a simple model and two test cases which are very
minimal. This is meant to give an indication of the intermediates between the
databases and our test cases, since the cases should be close to identically performed
by all technologies.

For a reference, we have chosen a very simple data model and amount of test data.
The only entity will be a mock entity, called blob, with only one attribute and no
relationships. Furthermore, we purpose the amount of data shown in Table 4.5.

Entity Amount
Blob 1000

Table 4.5: Reference entities and their suggested amount.

The first test case is not linked to any data sets. Its purpose is to fetch nothing, or a
minor package, and return it. This should indicate what the response times are for
the different databases. The second case is connected to the model, which consists
of one entity and no relationships, and will return all entities found. This can be a
reference point for the storage strategy of each database technology.

4.4 Development of Tests

Tests have been developed using the programming language Python and the libraries
for DBMS communication as mentioned earlier. The logic is derived from the sug-
gested test cases for all companies. We have developed our own test framework
since none exist with the capabilities we were looking for. This framework has the
responsibility of automating the benchmarking and synchronizing the results to a
spreadsheet for later analysis.

During the development of the test cases, we felt that more time was needed than
originally planned. Therefore, a choice was made not to include CouchDB and use
the additional time for Neo4j, MySQL and MongoDB. The theory and preparations
are finished, so CouchDB can easily be added at a later time.

27


4. Approach

4.4.1 Test Framework

An internal structure was built to modularize the process of writing the test cases,
which allowed distribution of work load while keeping consistency between tests.

The first obstacle was writing tests which realistically extracted data from the
DBMS, meaning that because test cases can not use the randomness induced by
a user selecting something in a real application, this had to be substituted by a
process imitating that behaviour.

The following strategies were considered and tested:

1. Creating the entity before extracting it.

2. Finding the first entity in the database.

3. Extracting all entities and a randomly chosen.

The strategy that best suited our test purposes was number (3).

Intuitively, fetching and randomly choosing an entity would take time, which should
not be included in the total time of a test, as it was only done to prepare for the
test. This was solved by taking inspiration from unit testing, where each test can
include setup, run and teardown methods. With this particular structure, the time
elapsed could be measured exclusively for the running part of the test case.

Setup creates, fetches and saves entities along with relationships that are necessary
to run the test and to recover the initial state when the test is done. The running
method performs the bare minimum to accomplish the purpose of the test case and
the teardown cleans up any potential residue to keep the database in its initial state.

A test class was then created for each technology with all cases as methods, where a
method would return that case object with the three methods mentioned above. The
test collections would also include initialization methods for the different companies.

We also implemented data scaling, as a feature for our Benchmarker. This means
that we defined a standard data set, defined in section 4.1.5, 4.2.5 and 4.5, and
added methods for expanding the data set with a multiplier. The multiplier can be
supplied as an argument, with 1 as default, and will change the initialization of data
in the test collections.

Finally there is a benchmarking class which is the backbone of the test environ-
ment. This class initializes the data and executes the tests one by one while timing
each execution. When the tests are done, the results are synchronized to a Google
Spreadsheet for analysis.

Figure 4.7 is a graphical illustration of the test framework. The test collections in
red, synchronization module in green, case in purple and the Benchmarker in blue.

28


4. Approach

Figure 4.7: Graphical illustration of the test environment.

We used the test framework by running the benchmarking class with arguments for
what database and company to test, and an integer for the data scaling.

1 python benchmarking.py neo4j skim 10
2 python benchmarking.py all all 10

4.4.2 MySQL

The tests written for MySQL used the Python library mysql.connector to commu-
nicate with the MySQL database. This library utilizes the TCP protocol to create
a dedicated connection with a MySQL database. The library connects via TCP/IP
sockets to increase the communication performance and SSL to unsure communica-
tion security [23].

To connect a MySQL database using this library is rather simple, all needed is this
connect statement:

1 cnx = mysql.connector.connect(user="username", password="secret",
host="ip-address", database="database_name")

29


4. Approach

When a connection has been established you use the context returned by the connect
statement to fetch a cursor which you can use to run queries. You can then fetch
the results in form of a cursor which you then step through to fetch the data and
print it. It is also important to close the cursor when it is no longer in use, since
the library only allows one active cursor per connection to the database.

1 cursor = cnx.cursor()
2 cursor.execute("SELECT id FROM sku")
3 result = cursor.fetchall()
4 for row in result:
5 print(row)
6 cursor.close()

The most time consuming part of writing these tests were the table creation and
initialization of the data.

All queries were written beforehand and tested in an application called DbVisual-
izer. DbVisualizer is a universal database management tool allowing the user to:
connect to a SQL database; run queries against it; and browse the data stored in the
database. The application is partly built on open source software and is available
to students both as a free version with fewer features as well as through educational
discounts on the professional version [24]. This application allowed us to connect to
our MySQL server and run queries against it as well as browse the data stored in
the database.

4.4.3 Neo4j

To interact with Neo4j through Python, there are several libraries and drivers that
can be used. A minimal solution is the Neo4j Python Driver, which supports writing
Cypher statements, sending them to the database, executing them, and returning
the results [25]. There are several wrappers for this driver, an example is Py2neo,
which utilizes the restful API service for Neo4j [26].

Below is a usage example of the driver [25].

1 from neo4j.v1 import GraphDatabase, basic_auth
2

3 driver = GraphDatabase.driver("bolt://localhost",
auth=basic_auth("neo4j", "password"))

4 session = driver.session()
5

6 session.run("CREATE (a:Person {name:’Arthur’, title:’King’})")
7 result = session.run("MATCH (a:Person) WHERE a.name = ’Arthur’ RETURN

a.name AS name, a.title AS title")
8 for record in result:
9 print("%s %s" % (record["title"], record["name"]))

10 session.close()

30


4. Approach

The first version was written with the wrapper Py2neo, but for performance reasons,
we changed to the standalone driver. Py2neo uses Neo4j’s restful API service, which
was troublesome for us since the libraries for MySQL and MongoDB use the TCP
protocol and it performs faster. After the switch to the Neo4j driver, the test cases
performed better, and more on par with MySQL and MongoDB.

Another thing to note on the performance of the Neo4j Python library is the usage
of parameters. Our first version of the tests were using Python’s format string to
insert the parameters directly into the query, as shown in the code below.

1 out = self.session.run(
2 ’START sku=Node(%d) ’
3 ’MATCH (value:SKU_VALUE)-[of:OF]->(sku:SKU) ’
4 ’RETURN value’ % inner_self.sku_id
5 )

This was later converted to what the documentation encourages, sending a dictio-
nary with the parameters.

1 out = self.session.run(
2 ’START sku=Node({sku_id}) ’
3 ’MATCH (value:SKU_VALUE)-[of:OF]->(sku:SKU) ’
4 ’RETURN value’,
5 dict(sku_id=inner_self.sku_id)
6 )

There are two main benefits of using a dictionary for the parameters. The first
one is that queries becomes more efficient thanks to the reusable plans of the query
planner. The second benefit is that it reduces the risk of malicious query clauses
that derives from inputs that are not properly filtered. These two benefits are stated
by Neo Technology [25].

The second benefit is not relevant for our use cases, but it is worth mentioning.

4.4.4 MongoDB

In order to be able to work with MongoDB through Python, a Python driver called
PyMongo was utilized. PyMongo is a tool recommended by the official MongoDB
website. By using this tool it is possible to connect to the running Mongod in-
stance, which is the primary daemon process of the MongoDB system. The Mongod
handles data requests, manages data access, and performs background management
operations [27].

1 from pymongo import MongoClient
2 client = MongoClient()
3 db = client.test

31


4. Approach

The example above shows how to connect to a test database with help of the Mon-
goClient tool which is imported from PyMongo.

During the process of implementing the data models for the test cases, some flaws
in the structure of the models were noticed. Instead of embedding a complete doc-
ument with corresponding fields inside another document, the embedded document
was moved to a separate collection, and only an id reference was embedded in the
document. This resulted in more effortless queries and fewer complex embedded
documents, and more reads could be done with fewer criteria.

1 db.images.insert_one({
2 "name": "image_" + str(nbr),
3 "originalName": "original_name",
4 "extension": "jpg",
5 "encoding": "PNG/SFF",
6 "size": 1024,
7 "height": 1080,
8 "width": 720,
9 "verticalDPI": 40,

10 "horizontalDPI": 50,
11 "bitDepth": 15,
12 "createdAt": "2016-03-03",
13 "accepted": False,
14 "comments": comments
15 }).inserted_id

An example of using references is embedding an image as an object inside another
document. The above snippet shows how image documents in our MongoDB test
model were added to an image collection. A unique id is added to each image
document and this results in less complex query when fields of images are requested.

4.5 Benchmarking

Our benchmarking was run on virtual machines hosted by Digital Ocean, which is
a cloud computing service [28]. A test run included all companies and DBMS’s,
though only one data scale, and ran to completion. There were a total of three runs,
one for each data scale, before the compiling of results began.

4.5.1 Virtual Machines

To run our tests we used virtual machines hosted by Digital Ocean. Three different
machines were setup with MongoDB, MySQL and Neo4j, with equal specifications,
and are specified in Table 4.6.

32


4. Approach

CPU Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz (utilizing 2 out of 8 cores)
RAM 2 GB (DIMM)
OS Ubuntu 14.04.4 x64
SSD 40 GB

Table 4.6: Reference entities and their suggested amount.

In addition to our DBMS machines, we had a test machine, whose sole purpose was
to run the test cases. This was done to separate the execution of Python scripts
from the DBMS so they would not interfere with each other. All machines were
located in the same private network, to avoid latency issues.

4.5.2 Execution of Test Cases

We ran all tests multiple times during the development to analyze preliminary results
and evaluate our implementations. After all test collections were done, we ran all
tests in batches, one batch for each data scale. As mentioned in Section 4.4.1, we
built our test framework to allow data scaling. During our execution we used two
different data scales, 1 and 10.

We ran the batches in ascending data scale order, and when the first was completed
we analyzed the results to see if there were any issues before continuing with the
subsequent batch.

4.5.3 Compiling of Results

The results were analyzed in Google Spreadsheet, where we went through the total,
average and peak times. The average times were put in separate spreadsheets for
each data scale and we created diagrams for each company. On the horizontal axis
of the diagram we had the test cases and execution time average in milliseconds on
the vertical axis. We used bar diagrams for the purpose of displaying the results.

33


4. Approach

34


5
Results

The results consists of tables of average execution times for the test cases. There are
two tables per company, where the first table shows the results for data scale 1, and
the second for data scale 10. The tables are used as source data for the following
diagrams, which gives an overview of the resulting average times.

In cases where the resulting time is -1 milliseconds, ms, the test was too demanding
to write and it was therefore discarded.

5.1 Skim

The results are for all suggested test cases for Skim and each case was run 1000
times.

5.1.1 Tables of Average Times

Table 5.1 and 5.2 are extracts from the spreadsheet of Skim’s benchmarking results.

Data Scale 1 [ms]
Case MySQL Neo4j MongoDB
Pair image and SKU 5,87 5,71 1,33
Fetch users 7,47 27,40 2,12
Add rows to SKU 15,42 7,56 2,81
Fetch SKU 3,78 2,15 1,60
Comment on image 4,89 10,63 1,36
Fetch all user’s comments 6,57 18,62 -1,00

Table 5.1: Skim: Average execution times for data scale 1.

35


5. Results

Data Scale 10 [ms]
Case MySQL Neo4j MongoDB
Pair image and SKU 4,12 5,09 1,19
Fetch users 5,71 170,97 9,63
Add rows to SKU 12,18 6,16 2,92
Fetch SKU 2,74 1,98 1,61
Comment on image 3,79 10,89 1,22
Fetch all user’s comments 6,22 16,93 -1,00

Table 5.2: Skim: Average execution times for data scale 10.

5.1.2 Diagrams

Figure 5.1 and 5.2 are diagrams created with the data from Table 5.1 and 5.2. Figure
5.2 has been split into two diagrams for clarity.

Figure 5.1: Average times for Skim’s test cases for data scale 1.

36


5. Results

Figure 5.2: Average times for Skim’s test cases for data scale 10.

37


5. Results

5.2 Raceone

The results are for all suggested test cases for Raceone and each case was run 1000
times.

5.2.1 Tables of Average Times

Table 5.3 and 5.4 are extracts from the spreadsheet of Raceone’s benchmarking
results.

Data Scale 1 [ms]
Case MySQL Neo4j MongoDB
Follow 4,37 6,11 1,76
Unfollow 4,37 6,20 1,77
Unparticipate 4,85 9,48 2,10
Fetch participants 2,67 7,44 -1,00
Fetch participants 2 18,51 10,15 -1,00
Fetch coordinates 3,09 8,92 1,43
Remove coordinates 6,56 10,84 2,62
Fetch hot races 3,95 14,31 39,78
Remove race 4,96 -1,00 1,32
Duplicate event -1,00 -1,00 -1,00
Fetch race 17,30 35,98 3,28
Insert coordinates 85,83 913,06 183,89

Table 5.3: Raceone: Average execution times for data scale 1.

Data Scale 10 [ms]
Case MySQL Neo4j MongoDB
Follow 3,71 6,73 2,35
Unfollow 3,64 7,34 2,14
Unparticipate 5,38 14,21 2,54
Fetch participants 8,72 38,24 -1,00
Fetch participants 2 63,28 58,37 -1,00
Fetch coordinates 4,97 33,80 1,83
Remove coordinates 5,80 11,81 2,57
Fetch hot races 18,18 55,62 306,97
Remove race 7,28 -1,00 1,50
Duplicate event -1,00 -1,00 -1,00
Fetch race 11,97 193,55 3,57
Insert coordinates 145,31 940,03 206,41

Table 5.4: Raceone: Average execution times for data scale 10.

38


5. Results

5.2.2 Diagrams

Figure 5.3 and 5.4 are diagrams created with the data from Table 5.3 and 5.4. Both
figures have been split into two diagrams each for additional clarity.

Figure 5.3: Average times for Raceone’s test cases for data scale 1.

39


5. Results

Figure 5.4: Average times for Raceone’s test cases for data scale 10.

40


5. Results

5.3 Reference

The results are for all suggested test cases for Reference and each case was run 1000
times.

5.3.1 Tables of Average Times

Table 5.5 and 5.6 are extracts from the spreadsheet of Reference’s benchmarking
results.

Data Scale 1 [ms]
Case MySQL Neo4j MongoDB
Tiny Get 1,54 1,91 1,49
Small Get 47,04 151,25 11,52

Table 5.5: Reference: Average execution times for data scale 1.

Data Scale 10 [ms]
Case MySQL Neo4j MongoDB
Tiny Get 1,39 3,32 1,10
Small Get 274,98 1164,35 58,48

Table 5.6: Reference: Average execution times for data scale 10.

41


5. Results

5.3.2 Diagrams

Figure 5.5 and 5.6 are diagrams created with the data from Table 5.5 and 5.6.

Figure 5.5: Average times for Reference’s test cases for data scale 1.

42


5. Results

Figure 5.6: Average times for Reference’s test cases for data scale 10.

43


5. Results

44


6
Discussion

The purpose of this project is to compare different database management systems.
The derived results indicate many interesting factors which will be discussed below,
however the majority of the initial expectations comply with the final results.

6.1 Interpretation of Results

The results indicate that the different technologies have individual strengths and
weaknesses.

The majority of MongoDB’s measurements are close to two milliseconds, which is
considerably faster than the results of MySQL and Neo4j. Test cases that differenti-
ates from the low results are complex relationships and joining of data. A summary
of MongoDB’s results would suggest that it has fast querying overall, with some
unimplemented test cases, indicated with a value of -1. The unimplemented test
cases were hard, or impossible, to implement considering the chosen model. Cur-
rent model for MongoDB is optimized for the most common use cases for Skim and
Raceone, and is therefore not optimal for some of the other test cases.

The results for Neo4j are generally lacking in performance compared to both MySQL
and MongoDB. There is only one test case where Neo4j is the fastest, fetch partici-
pants type 2. There are some cases, fetch hot races, Add rows to SKU and fetch SKU,
where Neo4j is faster than MySQL or MongoDB. However, Neo4j’s main weakness
throughout the results is when queries are fetching large data sets, where in some
cases Neo4j is up to 20 times slower than MongoDB.

The results indicates that MySQL is the most consistent of the three database
technologies. While MongoDB and Neo4j fluctuate between test cases, MySQL was
more stable.

Considering the scaling of data, MongoDB handles the increase without any major
spikes in the results. However, Neo4j slows down considerably on some cases such
as fetch users and small get. This indicates that the performance of Neo4j decreases
when scaling up the data size.

45


6. Discussion

6.2 Expectations of Test Cases

Expectations were developed with the help of the related work and theory of this
project.

6.2.1 Skim

Table 6.1 and 6.2 contains individual test cases for Skim with their expectation and
outcome.

Test Cases
Name Evaluation
Pair Image with SKU Expectation: The expectation was in favor for MongoDB,

since the model for MongoDB has a separate collection for
the SKUs.

Outcome: It was proven that the expectation was correct.
MongoDB is in favor due to the model for MongoDB which
has SKUs as a separate collection, instead of a nested. The
only operation which has to be performed is adding the image
ID to the SKU and removing it from the list of image IDs in
the project. For MySQL, it is quite similar, aside from one
big difference - reference constraints. MySQL will have to
find the image and the SKU to perform the update. Finally
Neo4j is not based upon constraints, although it needs to
find the nodes to be able to connect them, thus the same
events as for MySQL will occur.

Fetch Users Expectation: The expected outcome for fetching users was
that MySQL and MongoDB will have similar response times.
Mainly because they only need to return a single collection
while Neo4j have to work with a larger data set, which will
be slower even with indexing.

Outcome: After the test run it turned out that the
expected outcome for Fetch users was not correct. From
the results it is clear that MongoDB has a shorter average
execution times than MySQL, the outcome for Neo4j was
correct.

Table 6.1: Expectations and their outcome for Skim

46


6. Discussion

Test Cases
Name Evaluation
Add rows to SKU Expectation: Since the model for MongoDB has a separate

collection for SKUs, expectations were that it has the
potential of having the best performance for this case.

Outcome: Outcome of this test case proved that the
expectation was correct.

Fetch SKU Expectation: The expectation for this test case was that
it will have similar querying times for all technologies, since
the models are comparable in this regard.

Outcome: The results indicated once again that Mon-
goDB performs best, which contradicts the expectation. This
may be due to the fact that collections in MongoDB are less
heavy to fetch than tables in MySQL, or nodes in Neo4j.

Comment on image Expectation: Regarding the performance for this test case,
the expectations were that MySQL will perform the best,
this due to the fact that MySQL holds the comments in a
separate table and only needs to do an insert into this table
to perform the task. In addition, it was also expected that
the performance of MongoDB will be very dependent on the
implementation of the code running this query. Expectations
regarding MongoDB was that if this test was separated into
commenting on images within SKUs and commenting on
images not connected to a SKU the performance will be
acceptable. If the test was not split up and one query is
to cover both these cases, the performance will decrease
drastically.

Outcome: Results for this test case was as expected.
Important to mention is that the MongoDB test was based
on having two different commenting functions as suggested
in the expectation.

Fetch a user’s comments Expectation: The expectation was that MySQL will be
best suited for this task, since the comments are stored in
a separate table and each comment contains the user id.
MongoDB and Neo4j shared issues with each other since they
have to look through every image for comments and then
look at the creators of these comments. MongoDB had yet
another issue, since it has to search both the Image collection
and the SKU collection to find all comments. This is because
some of the images are stored within a SKU and some of
them are stored in a separate collection.

Outcome: MySQL performed best as it was expected
and from the results it is stated that it was not possible to
implement the MongoDB solution for the test case.

Table 6.2: Expectations and their outcome for Skim
47


6. Discussion

6.2.2 Raceone

Table 6.3 and 6.4 contains individual test cases for Raceone with their expectation
and outcome.

Test Cases
Name Evaluation
Duplicate event Expectation: The expected outcome for duplicating an

event was that it will be very simple to solve for MongoDB,
perform reasonably well in Neo4j and lack performance for
MySQL. This is due to all of the nested models this deep
copy would have to go through. In MongoDB all of the data
connected to a race is stored within a race document, which
makes it easy to duplicate.

Outcome: The expectation can be neither approved
nor disapproved, since the test case was not evaluated for
any of the technologies.

Fetch race Expectation: Similarly to duplicating events, this case’s
expected outcome is that it will perform well in MongoDB
by returning a whole document. In Neo4j and MySQL the
query has to fetch properties from multiple locations and
combining them, thus increasing the complexity of the task.

Outcome: The results indicated that the expectation
was correct for all threee technologies.

Follow participant Expectation: The expectation for this test case is that this
case will be very fast for all chosen databases, and possibly
slightly slower for MySQL, as a result of its referential
constraints.

Outcome: The outcome for this test case confirmed
the expectation.

Table 6.3: Expectations and their outcome for Skim

48


6. Discussion

Test Cases
Name Evaluation
Fetch participant Expectation: MongoDB’s weaknesses includes complex

sorting and joining queries. Therefore, the expected outcome
for this case was that Neo4j and MySQL will be the fastest
for this case.

Outcome: The expectation can not neither be approved nor
disapproved, since the implementation of MongoDB’s is not
completed.

Insert coordinates Expectation: This sort of batch creation should favor
MongoDB.

Outcome: The expectation is partially confirmed, since
MongoDB performed well on this test case. However, the
technology that performed the best was MySQL, which was
not mentioned in the expectation.

Table 6.4: Expectations and their outcome for Skim

6.3 Similarities to Related Work

In Section 3.1.2 it is stated that an appropriate use case for document data stores
is for applications storing items of related nature, but with different structure. This
could be compared to the test case add rows to SKU, from Section 4.1.4, where each
SKU has different columns, and amount of columns.

By analyzing the average time per operation for the test case add rows to SKU in
the diagram, 5.1.1, it is demonstrated that MongoDB and its document model is the
fastest of all technologies in this regard. Thus, the results of the test case accords
with the study by Katarina Grolinger et al.

The research conducted by Katarina Grolinger et al., which is presented in the
related works Section 3.1.3, argues that graph databases performs well when it
comes to complex queries. Neo4j performed best of all database systems in the test
case fetch participants which is an example of such a query.

The same section also states that a common usage of graph databases is when han-
dling location history of a user. The test case insert coordinates stores coordinates
synced from a race participant’s movements along the track. In contradiction to
what is argued by Katarina Groliner et al. Neo4j had a very high average time per
operation in this test and the other compared database systems performed better.

49


6. Discussion

6.4 Social Aspect

For several years, Chalmers has put a lot of time and effort into its innovative
program, Chalmers Ventures. They supply new companies with supervisors and
early investments to get them past the initial startup phase. [29]

An issue with the current concept is the neglecting of competence assistance. While
the supervisors aid in business and legal aspects, they do not always provide enough
competence assistance in the extent that is necessary.

The test cases and results of this project can be used as a tool for companies in
their startup phase to steer the solution in the right direction. This in turn can be
further developed to be more comprehensive and cover more technological areas for
companies.

50


7
Conclusion

The related work done by Katarina Grolinger et al. in Section 6.3 is mostly confirmed
by our results. An aspect in which our results differ from their results is regarding
storing location history in a graph database. The reason for this could be insufficient
modelling and test implementation during our project. Assuming that our test
inserting coordinates would execute each insertion by itself instead of doing the
insertion in batches, Neo4j would most likely perform better.

For smaller applications with general data, we would recommend the use of Mon-
goDB for its agile compatibility and fast querying. In large-scale applications and
solutions where the integrity of data is of greater importance, MySQL or other re-
lational DBMS’s would be more compatible. A relational DBMS allows constraints
to be added to a model, which in turn leads to a more controlled and defined envi-
ronment for the data. Our results for MySQL were also consistent and satisfactory
in general.

The test results have proven Neo4j to be effective on complex queries with highly
coupled data, which was suggested in Section 3.1.3. Further, we have concluded
that Neo4j is not suited to fetch large amounts of general data. As mentioned in the
interpretation of the results in Section 6.1, the test performance of Neo4j decreases
when scaling up the data size. This may have been caused by a lack of correct
indexing.

A conclusion which can be made from what is stated above is that different tech-
nologies are better suited for different purposes and database models. Therefore, it
is an advantage for companies to examine what type of technology is appropriate
for their application area. As a final conclusion, we would like to emphasize the im-
portance of evaluating the existing knowledge within a company to avoid mistakes
being made because of insufficient expertise, while still having the future in mind.

51


7. Conclusion

52


8
Outlook

Considering the scope and limitations of this project, there are several areas that can
be improved or added. At a later stage in the project, during the implementation
of the DBMS, the same decision was regarding CouchDB.

Another thing to mention is the resulting measurements for the test cases. Initially,
there was plans to include load percentage in addition to execution times, though
due to issues such as synchronization it was omitted.

8.1 Additional Companies

Adding more companies with different types of data sets would be very beneficial for
continuous work in this area. Social networks would work as an example of closely
connected data sets, which is where graph databases are good to utilize. Examples
of social networks that were discussed are: Facebook, and Reddit.

Aside from social networking, companies with data for recommendation services or
big data would be interesting additions.

8.2 CouchDB

In Section 2.3.4 a short introduction of CouchDB is presented. The research suggests
that its strengths in fast access to simple objects and the pre-calculation of views
would result in interesting results for the test cases. The preparations are made in
the benchmarking model and for the test cases, so to add this technology, minimal
additions have to be made to the entirety.

8.3 Analyzing CPU Load

In addition to timing the execution of the tests, it would be beneficial to see the load
on the CPU during the execution. If two technologies have similar performance, this
knowledge could indicate if one technology is less demanding while being equally

53


8. Outlook

fast. The issues that were run into with this feature were that the collection of CPU
loads would have to be conducted on the DBMS machines and then sent to the test
machine. This would require some logic on the DBMS machines as well, to keep
track of the current test case and send the results back to the test machine.

8.4 Displaying of Results

An interesting topic throughout this project has been how to present the results in
the most efficient manner. This could be done by developing a simple website with
a step-based process to establish what a user needs. The step process could include
questions to guide a user through the website.

54


Bibliography

[1] E. B. Koffman and P. A. T. Wolfgang, Data Structures: Abstraction and Design
Using Java, 2nd ed. Wiley, 2010.

[2] J. A. Hoffer, V. Ramesh, and H. Topi, Modern Database Management, 10, Ed.
Pearson, 2011.

[3] Db-engines, “DB-Engines Ranking per database model category,” 2014.
[Online]. Available: http://db-engines.com/en/ranking_categories

[4] R. Ramakrishnan and J. Gehrke, Database Management Systems, 2nd ed.
McGraw-Hill Companies, 2000.

[5] CouchDB, “Why NoSQL ?” 2016. [Online]. Available: http://www.couchbase.
com/nosql-resources/what-is-no-sql

[6] Jay Parikh, “A New Data Center for Iowa | Facebook News-
room,” 2013. [Online]. Available: http://newsroom.fb.com/news/2013/04/
a-new-data-center-for-iowa/

[7] E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,”
1970. [Online]. Available: http://www.acm.org

[8] B. Forta, Sams teach yourself SQL in 10 minutes, 4th ed. Pearson, 2013.

[9] Db-engines, “DB-Engines Ranking - popularity ranking of database
management systems,” 2015. [Online]. Available: http://db-engines.com/en/
ranking

[10] Oracle, “MySQL :: MySQL Community Edition,” 2016. [Online]. Available:
https://www.mysql.com/products/community/

[11] ——, “MySQL :: MySQL 5.7 Reference Manual :: 1.3.1 What is
MySQL?” 2016. [Online]. Available: http://dev.mysql.com/doc/refman/5.7/
en/what-is-mysql.html

[12] Microsoft, “Tutorial: Writing Transact-SQL Statements,” 2016. [Online].
Available: https://msdn.microsoft.com/en-us/library/ms365303.aspx

[13] S. Moore, “Oracle Database PL/SQL Language Reference, 11g Release 1
(11.1),” 2009. [Online]. Available: http://docs.oracle.com/cd/B28359_01/
appdev.111/b28370/overview.htm#

55

http://db-engines.com/en/ranking_categories
http://www.couchbase.com/nosql-resources/what-is-no-sql
http://www.couchbase.com/nosql-resources/what-is-no-sql
http://newsroom.fb.com/news/2013/04/a-new-data-center-for-iowa/
http://newsroom.fb.com/news/2013/04/a-new-data-center-for-iowa/
http://www.acm.org
http://db-engines.com/en/ranking
http://db-engines.com/en/ranking
https://www.mysql.com/products/community/
http://dev.mysql.com/doc/refman/5.7/en/what-is-mysql.html
http://dev.mysql.com/doc/refman/5.7/en/what-is-mysql.html
https://msdn.microsoft.com/en-us/library/ms365303.aspx
http://docs.oracle.com/cd/B28359_01/appdev.111/b28370/overview.htm#
http://docs.oracle.com/cd/B28359_01/appdev.111/b28370/overview.htm#


Bibliography

[14] I. Robinson, J. Webber, and E. Eifrem, Graph Databases: New opportunities
for Connected Data, 2nd ed. O’Reilly Media, 2015.

[15] MongoDB Inc., “BSON - Binary JSON,” 2013. [Online]. Available:
http://bsonspec.org/

[16] K. Chodorow, MongoDB: The Definitive Guide, 2nd ed. O’Reilly Media, 2013.

[17] J. C. Anderson, J. Lehnardt, and N. Slater, CouchDB The Definitive Guide,
1st ed. O’Reilly Media, 2010.

[18] K. Grolinger, W. a. Higashino, A. Tiwari, and M. A. Capretz, “Data
management in cloud environments: NoSQL and NewSQL data stores,”
Journal of Cloud Computing: Advances, Systems and Applications, vol. 2, no. 1,
p. 22, dec 2013. [Online]. Available: http://www.journalofcloudcomputing.
com/content/2/1/22

[19] A. B. M. Moniruzzaman and S. A. Hossain, “NoSQL Database: New
Era of Databases for Big Data Analytics - Classification, Characteristics
and Comparison,” Nosql database: New era of databases for big data
analytics-classification, characteristics and comparison, vol. 6, no. 4, pp. 1–14,
2013. [Online]. Available: http://arxiv.org/pdf/1307.0191v1.pdf

[20] SKIM, “SKIM | About us.” [Online]. Available: http://skimgroup.com/history

[21] H. Garcia-Molina, J. Ullman, and J. Widom, Database Systems: The Complete
Book 2nd Edition, 2nd ed. Pearson, 2009.

[22] Raceone, “HOME | RaceONE,” 2015. [Online]. Available: http://www.raceone.
com/

[23] Oracle, “MySQL :: MySQL Connector/Python Developer Guide :: 1 Introduc-
tion to MySQL Connector/Python,” 2016. [Online]. Available: https://dev.
mysql.com/doc/connector-python/en/connector-python-introduction.html

[24] DbVis Software AB, “About Us - DbVis Software,” 2016. [Online]. Available:
https://www.dbvis.com/company/

[25] Neo Technology, “Drivers,” 2016. [Online]. Available: http://neo4j.com/docs/
developer-manual/current/#driver-manual-index

[26] N. Small, “Py2neo v3 Handbook.” [Online]. Available: http://py2neo.org/v3/

[27] MongoDB Inc., “Python Driver (PyMongo) — Getting Started With MongoDB
3.0.4,” 2016. [Online]. Available: https://docs.mongodb.com/getting-started/
python/client/

[28] DigitalOcean, “What is Cloud Hosting?” 2016. [Online]. Available:
https://www.digitalocean.com/what-is-cloud-hosting/

[29] Chalmers Ventures AB, “About the Programs,” 2015. [Online]. Available:
http://www.chalmersventures.com/offer

56

http://bsonspec.org/
http://www.journalofcloudcomputing.com/content/2/1/22
http://www.journalofcloudcomputing.com/content/2/1/22
http://arxiv.org/pdf/1307.0191v1.pdf
http://skimgroup.com/history
http://www.raceone.com/
http://www.raceone.com/
https://dev.mysql.com/doc/connector-python/en/connector-python-introduction.html
https://dev.mysql.com/doc/connector-python/en/connector-python-introduction.html
https://www.dbvis.com/company/
http://neo4j.com/docs/developer-manual/current/#driver-manual-index
http://neo4j.com/docs/developer-manual/current/#driver-manual-index
http://py2neo.org/v3/
https://docs.mongodb.com/getting-started/python/client/
https://docs.mongodb.com/getting-started/python/client/
https://www.digitalocean.com/what-is-cloud-hosting/
http://www.chalmersventures.com/offer


A
Appendix 1

A.1 Screenshots of the company modules

Figure A.1: Screenshot of the Skim module - SKU list view.

Figure A.2: Screenshot of the Skim module - Image SKU matching view.

I


A. Appendix 1

Figure A.3: Screenshot of the Raceone portal - Create an event view.

Figure A.4: Screenshot of the Raceone portal - Event list view.

A.2 Additional Raceone Test Cases

Unfollow a participant
Unfollowing a participant removes the relationship created by following a user in a
race.

Unparticipate from a race
This case will remove the relationship between a user and race, mainly an activity.

Fetch coordinates
This test will fetch all coordinates for a specific activity.

Remove coordinates
Removing coordinates will be done to optimize a route, or list of coordinates. This is
done in batches, where IDs of the coordinates to be removed is sent to the database.

Fetch hot races
Fetching hot races consists of returning the 10 races with the most number of par-

II


A. Appendix 1

ticipants and followers combined. This can be used by Raceone for sorting lists of
races.

Remove race
A race in linked to several entities in the database - coordinates, event and activities
where each activity has followers and coordinates. If this is done, it is usually when
there are none or a few participants since it is made to rollback a mistake.

III


	List of Figures
	List of Tables
	Introduction
	Purpose
	Problem definitions
	Scope & Delimitations

	Theory & Tools
	Types of Databases
	Relational
	NoSQL

	Structured Query Language
	Database Management Systems
	Relational - MySQL
	Graph - Neo4j
	Document - MongoDB
	Key-value - CouchDB


	Related Work
	Data Management in Cloud Environments by Katarina Grolinger et al.
	Key-value stores
	Document stores
	Graph databases
	Conclusion of the Study


	Approach
	Market Analysis Company
	System Description
	Suggested Software Models
	Analysis of Application Usage
	Suggested Test Cases
	Suggested Data Collection

	Race Sharing Solution
	System Description
	Suggested Software Models
	Analysis of Application Usage
	Suggested Test Cases
	Suggested Data Collection

	Reference Point
	Development of Tests
	Test Framework
	MySQL
	Neo4j
	MongoDB

	Benchmarking
	Virtual Machines
	Execution of Test Cases
	Compiling of Results


	Results
	Skim
	Tables of Average Times
	Diagrams

	Raceone
	Tables of Average Times
	Diagrams

	Reference
	Tables of Average Times
	Diagrams


	Discussion
	Interpretation of Results
	Expectations of Test Cases
	Skim
	Raceone

	Similarities to Related Work
	Social Aspect

	Conclusion
	Outlook
	Additional Companies
	CouchDB
	Analyzing CPU Load
	Displaying of Results

	Bibliography
	Appendix 1
	Screenshots of the company modules
	Additional Raceone Test Cases