Automatic Testing of Graphical User Interfaces
A comparison of using edge detection and neural networks to identify interactive areas in graphical user interfaces.

Master's thesis EX057/2017 in Systems, Control and Mechatronics

Joel Heinerud
Tomas Nilsson

Department of Electrical Engineering
Division of Systems and Control
Chalmers University of Technology
Gothenburg, Sweden 2017

© Joel Heinerud, 2017.
© Tomas Nilsson, 2017.

Supervisor: Mikael Lundgren, Benchnode Technology AB
Examiner: Fredrik Kahl, Department of Electrical Engineering, Chalmers University of Technology

Master's Thesis EX057/2017
Department of Electrical Engineering
Division of Systems and Control
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2017

Automatic testing of graphical user interfaces
A comparison of using edge detection and neural networks to identify interactive areas in graphical user interfaces
Joel Heinerud, Tomas Nilsson
Department of Signals and Systems (S2)
Chalmers University of Technology

Abstract

Automatic testing is a part of the development process when creating GUIs. Traditionally, this is a time-consuming process that is performed both by the supplier and by the company ordering the GUI. By using image processing, this process can be performed dynamically without having to specify rules manually. In this thesis, two methods for identifying interactive areas have been studied: identification using edge detection and identification using convolutional neural networks. The evaluation was done by comparing how accurate these methods were when faced with GUIs of varying complexity. The neural network approach was found to perform better in all aspects, which led to a testing framework that can navigate through a novel GUI automatically. The result of searching through a GUI is presented in the form of a graph, which can be used to study GUI complexity.

Keywords: Touch-Screen, GUI, Image Analysis, Computer Vision, Automation, Test Evaluation, Machine Learning, Deep Learning, Perceptual Hashing, GoogLeNet, Inception module.

Acknowledgements

This thesis was produced at Benchnode Technology AB, located at Lindholmen, Gothenburg. We would like to express our deepest gratitude to them for letting us conduct our work using the office of Benchnode as well as their equipment. The support from Mikael Lundgren, Mattias Fredriksson, Mikael Westerlind and other employees who have given us valuable insights and tips along the way has been very helpful. We would like to thank Fredrik Kahl, our examiner, for his insights and help with the report. Additional thanks to the wonderful community of the TensorBox Gitter page, who have shared their tips and tricks, as well as tales of troubles and tribulations, so that others may be spared from making the same mistakes. Finally, thanks to our friends and families for your support.
Joel Heinerud and Tomas Nilsson, Gothenburg, May 2017

Contents

Abbreviations
Nomenclature
1 Introduction
  1.1 Related Work
  1.2 Purpose
  1.3 Contributions
2 Preliminaries
  2.1 Problem Description
  2.2 General Objectives
  2.3 Limitations
3 Theory
  3.1 Perceptual Hashing
    3.1.1 Average Based Hash (aHash)
    3.1.2 Block Based Hash (bHash)
    3.1.3 Gradient Based Hash (dHash)
    3.1.4 DCT Based Hash (pHash)
    3.1.5 Hamming Distance
  3.2 Edge Detection
    3.2.1 Gaussian Filtering
    3.2.2 Sobel Filtering
  3.3 Neural Networks
    3.3.1 The Makings of a Neural Network
      3.3.1.1 Convolutional Layer
      3.3.1.2 Pooling Layer
      3.3.1.3 Concatenation Layer
      3.3.1.4 Dropout Layer
      3.3.1.5 Fully Connected Layer
      3.3.1.6 Softmax Layer
      3.3.1.7 Activation Function
    3.3.2 Training a Neural Network
      3.3.2.1 Training Set
      3.3.2.2 Validation Set
      3.3.2.3 Testing Set
      3.3.2.4 Overfitting
    3.3.3 GoogLeNet
      3.3.3.1 The Inception Module
4 Method
  4.1 Tools
    4.1.1 MATLAB
    4.1.2 Android Device
    4.1.3 Android Studio
      4.1.3.1 Android Debug Bridge
    4.1.4 Python
    4.1.5 TensorFlow
      4.1.5.1 TensorBoard
      4.1.5.2 TensorBox
  4.2 Data Acquisition
  4.3 Identification
    4.3.1 Edge Detection Method
    4.3.2 Identification Using TensorBox
      4.3.2.1 Preparing Training Data
  4.4 Classification
  4.5 System Modelling
  4.6 GUI Control
  4.7 Proposed Method
    4.7.1 The Algorithm
5 Results
  5.1 Identification
  5.2 Classification
  5.3 Test Runs
6 Discussion and Concluding Remarks
  6.1 Further Development
A Appendix 1
  A.1 Figures
  A.2 Hashing comparisons
B Appendix 2
  B.1 Python Packages
  B.2 Android Debug Bridge: Code samples
Bibliography

Abbreviations

adb - Android Debug Bridge
CNN - Convolutional Neural Network
CONV - Convolutional layer
DCT - Discrete Cosine Transform
GUI - Graphical User Interface
JSON - JavaScript Object Notation
MSE - Mean Squared Error
NN - Neural Network
ReLU - Rectified Linear Unit

Nomenclature

dh - Hamming distance
η - Learning rate
fk - Kernel size
fs - Stride length during a filter operation
IN - A normalized image
P - Padding size to add to an input volume
r - Result of comparison
u0 - The initial image from the GUI
uss - Steady state image of the GUI
x - Identified areas of interest
x̂ - Classified objects
x̂i - Selected object to try an interaction on
zI - Input volume size of an NN layer
zO - Output volume size of an NN layer

1 Introduction

The use of touch screens has become increasingly widespread since they have proven to excel in multiple applications over the last few years. The expansive smartphone and tablet markets have helped speed up the transition from traditional, physical knobs and switches to virtual, graphical user interfaces (GUI). In the automotive industry, this transition has enabled greater flexibility when designing the interior of the car. It also makes it possible to update and change the look and functionality of the dashboard even after the car has been manufactured, simply by updating the software.

Part of the late-stage testing of software, either completely new software or just a new version, is testing of the GUI. The GUI can vary greatly from version to version, and sometimes even within versions depending on user settings. It is necessary to verify that old functions are not broken by each new release. This is called regression testing. Regression testing is usually a trivial task for a human, although it can be rather time consuming. For an automated testing system, it is often a quite complex problem. Fonts, window placement, colors, graphics, sizes and a number of other things may have been altered in the software update, complicating testing between versions.

Currently there exist some automated GUI testing solutions, but they require precise and time-consuming configurations that can be complex, and they need to be changed every time an object in the GUI is changed, added or removed. Most existing solutions, such as the "Linux desktop testing project"[3], aim to test a predefined scenario and are not able to test unknown GUIs. Programs like this often require access to the event stream of the software to capture and send events to traverse the program. These system tests are generally performed manually to find pieces of the software that are not working as intended.
Benchnode Technology AB is currently supplying NEVS with various test strategies for verifying the integrity and function of the CAN network. A growing need to test the touchscreen in the vehicle in an orderly manner has been identified. As such, an efficient, automated and general testing solution for late-stage testing of an unknown touch screen GUI is desired.

1.1 Related Work

Evaluating a GUI using image analysis techniques has been suggested as a potential improvement over existing testing solutions. The article Automatic GUI Test by using SIFT matching [4] compares methods such as SIFT and Random FERN for identifying features in an image, and suggests a combination of the two for identification of image features.

Once objects have been identified, a method of conducting the tests is needed. Monkey testing[5] is, as the name suggests, a rather naive approach to testing. The system tries to find flaws in the GUI by applying more or less random inputs. A benefit of monkey testing is that little or no information about the system is needed, which makes it useful when testing unknown systems. A downside is that, due to the randomness of the test procedure, found bugs can be hard to reproduce.

If a specific function of the GUI should be tested repeatedly and with speed, a systematic approach is needed, along with information about how the GUI might look and behave. T. Daboczi et al.[6] proposed several approaches to GUI testing, one of which uses a state machine to keep track of the behaviour of the GUI. The testing is then carried out by using a planning algorithm based on the desired result, and the accuracy of the actual result is then verified.

1.2 Purpose

This thesis aims to explore the possibility of using image analysis and machine learning techniques to create an adaptive and flexible testing tool for touch screen user interfaces. The solution needs to be dynamic and suitable for many different test cases. This would reduce the need for specialized test rigs and brand new solutions for each new iteration of a product.

The main focus of the thesis is evaluating ways to determine the current interactive objects on the screen. This can then be used to interpret the current state of the system. An object is defined as an interactive (something happens if it is clicked) area of pixels within the GUI.

The thesis aims to answer the following questions:
• What is an effective way of identifying objects in a GUI?
• What is an effective way to distinguish between interactive objects and passive areas?
• What is a suitable way to model the structure of a GUI?

A measurable goal that all other tasks revolve around is to automatically find objects in an arbitrary touch screen GUI and test whether they respond in an appropriate manner when clicked on. The algorithms behind identifying various interactive objects on the screen are the main focus of the thesis. Evaluating the efficiency and accuracy of performing this task automatically leads to an evaluation process to find the best method.

The task of deciding whether the response of a certain interactive object was appropriate or not is considered fulfilled if there was any visual response at all.

1.3 Contributions

With this thesis, it has been shown that it is possible to identify interactive objects and navigate through a novel GUI using a combination of a convolutional neural network and perceptual hashing.
Compared to a classical edge detection algorithm, the neural network performs with higher accuracy and, above all, greater flexibility. The network has not been trained on the GUI on which it is tested, but on training data gathered from other sources. The neural network was trained on generic images of mobile phone GUIs, which shows that the GUI on which the method is applied can indeed be unknown.

The results from building a graph structure could prove helpful when verifying the complexity of a GUI. When designing a car, it is important to provide the driver with simple and logical input controls. After all, the driver should be looking as little as possible at the touch screen, and instead be focused on the road and the surrounding traffic. This means that GUIs should be as shallow as possible, i.e. it should be possible to navigate to all important functions relatively quickly. However, when mapping the depth of a GUI, the identification process cannot be guaranteed to perform perfectly every time. This means that the results from mapping a GUI should be used with care.

2 Preliminaries

This chapter describes the prerequisites, defines the problem, goes through the assumptions and limitations that have been made, and briefly describes the setup necessary to run the proposed testing solution.

2.1 Problem Description

The task of testing a touch-screen GUI can mean a great number of things. In the scope of this thesis, it is defined as the task of systematically finding and testing each and every clickable area. When clicked, the GUI should react in some (preferably meaningful) way. If that is the case, the transition should be mapped into a graphical representation of the GUI. In the scope of this thesis, any kind of response will count as meaningful.

2.2 General Objectives

In order to overcome the problem of finding GUI objects and testing how they affect the system, several subproblems have to be solved. These problems are in turn broken down into smaller, more manageable parts.

Identification - Finding possible clickable areas. Ultimately, this translates to a number of (x,y)-coordinates.
Data acquisition - The task of collecting data about the current state of the GUI. The data consists of images that will be used in multiple stages of the project.
Testing - The feat of automatically making the touch screen react to an input.
Control - In order to test the GUI, a way of interacting with the GUI is needed.
Modelling - To enable efficient search procedures and to help understand how the system is designed.
Evaluation - Collecting data from the previous steps and presenting it in a meaningful way to the user. This is also needed to answer the questions proposed in the preliminary aim.

These subproblems and how they were approached are described in full detail in Chapter 4.

2.3 Limitations

These limitations mainly focus on simplifying the problem into separate, controllable elements, and on reducing complexity.

Fonts can vary greatly in both style and size, adding complexity to the identification methods without adding much value to the problem. Because of this, only a single font, or a small set of fonts, is considered.

Buttons can also vary a lot, from a small, simple gray button to an animated graphic or icon. The button could be tied to a menu, list or another multistate function requiring multiple navigational steps to access.
The buttons were assumed to be relatively symmetrical, with a sharp edge and a contrasting color.

Size differences of the interactive objects were assumed to be limited. Tiny objects (only a few pixels wide) are unlikely to appear in a real system.

Menu depth is assumed to be small, since deeper menus tied behind previous menus quickly result in a combinatorial explosion of test cases. If handling a menu with two layers is successful, the solution should be expandable to menus with greater depths.

Image quality, such as the amount of noise, highly impacts the success rates of image analysis techniques and can introduce an element of randomness that should be avoided during the development of the algorithms intended for this thesis.

Image resolution can greatly affect computational complexity, thus a limit to the image resolution and size was established.

Input methods for a touchscreen device can vary from a simple single tap with one finger to more complex multi-touch gestures, all with different effects depending on the state of the device. Thus, a limitation to the number of gestures was established. Only the simplest of inputs, in the form of a one-finger tap, is considered.

Device platform is limited to Android due to the wide array of open source tools available for developers. This made it possible to quickly establish a communication channel between an Android device and a PC.

GUI complexity is limited. A simple GUI with only a few object types is considered.

Interactive area is limited. Parts of the screen area are excluded to prevent the phone from trying to make calls or open other unwanted applications.

3 Theory

To understand how the GUI testing tool works, explanations of the theory behind the methods used are provided in this chapter. The chapter consists of brief overviews of the key theory components of the solution: perceptual hashing, edge detection and neural networks. More in-depth explanations, especially on the matter of convolutional neural networks, can be found in the references.

3.1 Perceptual Hashing

Being able to identify and quantify similarities between, or even within, images is an important part of the image analysis toolbox. There are a multitude of different ways to measure image similarity. Examples of such methods are: keypoint matching, where SIFT[7] is arguably the most commonly used, different image histogram[8] methods, keypoint recognition using random ferns[9], or perceptual hashing[10]. This project has used perceptual hashing because it is simple to implement while providing a solution that can calculate a similarity measurement with enough speed and accuracy for the scope of this thesis.

Perceptual hashing is a way to identify and compare various forms of multimedia content, such as images, music or text. It is done by creating fingerprints of the media that are analogous if the original features are similar. This is commonly used to find content that is similar, or that has been modified slightly. Most perceptual hashing algorithms try to look at the general structure of the content, instead of the details, to find matches that are similar. If the hashes are stored in a database, computing the perceptual hash is a one-time calculation per image, and the subsequent search for a matching hash is relatively quick.
Perceptual hashing differs from cryptographic hashing, where a small adjustment to the data results in a completely different hash value; with perceptual hashing, small adjustments to the data only result in small differences in the hash value, so the hash can be used as a similarity measurement.

3.1.1 Average Based Hash (aHash)

The average hash algorithm is one of the more basic perceptual hashing algorithms. It uses averages of the low frequency components of an image. The resulting hash is static even if the input image is scaled or if the aspect ratio changes. Altering brightness, contrast or colors will not change the hash value dramatically; however, minor modifications such as changed text or moved objects can be missed. The basic algorithm is described below, simplified from a post on Dr. Neal Krawetz's blog "Hacker Factor"[11]. The average hash is very quick, but compared to other perceptual hashing methods it is rather inaccurate, since it easily misses minor modifications to an image. The algorithm operates according to:

1. Reduce color by converting the image, I, to grayscale.
2. Reduce and normalize the image to a predetermined size.
3. Compute the mean value, M, of the image.
4. Compute the hash values by comparing each element in the image to the mean, according to h(i) = 0 if I(i) < M, and h(i) = 1 if I(i) ≥ M.

3.1.2 Block Based Hash (bHash)

The basic block hash function is generally performed as described by the first algorithm proposed by Bian Yang et al.[12] and is based on the mean values of blocks of the image. Block hash methods can incorporate rotational operations to better handle cases where the images being compared are rotated, or overlap parts of the selected blocks. This has been left out of the algorithm below; it is described in more detail in the previously mentioned article. The block hashing algorithm adds quite a few computations. Thus, it is slower than the average hash, but in turn it is more accurate.

1. Reduce color by converting the image, I, to grayscale.
2. Reduce and normalize the original image into a predetermined size.
3. Divide the normalized image, IN, into a sequence of blocks as Ia = {I(1), I(2), ..., I(n)}.
4. Encrypt the indices of the previous sequence to obtain a new order of the blocks, Ib.
5. Compute the mean value sequences of the image blocks as Ma,b(m) = median(Ia,b(m)), for m = 1, 2, ..., n.
6. Normalize the mean value sequence into a binary form and obtain the hash values as h(i) = 0 if Ma(i) < Mb(i), and h(i) = 1 if Ma(i) ≥ Mb(i), for i = 1, 2, ..., m.

3.1.3 Gradient Based Hash (dHash)

The dHash algorithm works by tracking the gradient of the image[13]. It utilizes very few operations, is very fast and is simple to implement. As with the aHash algorithm, dHash is not affected by scaling or aspect ratio operations. The algorithm also robustly handles other image operations such as gamma correction and color profile changes. Changes in brightness or contrast will not significantly change the resulting hash either. The simplicity of the dHash algorithm makes it very fast to generate the hashes, about as fast as the average hashing algorithm. Its accuracy is better than the block hash algorithm[14], and nearly as good as the DCT based hash[13], which is covered in the next section.

1. Reduce color by converting the image, I, to grayscale.
2. Reduce and normalize the original image into a preset size, I.
3. Compute the relative gradient direction and obtain the hash values as h(i) = 0 if I(i) < I(i+1), and h(i) = 1 if I(i) ≥ I(i+1).
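To make the steps above concrete, a minimal Python sketch of the dHash computation is given below. It is an illustrative sketch rather than the thesis's implementation: the Pillow imaging library and its default resampling filter are assumptions, while the 9 x 8 downscaling matches the size later chosen in Chapter 4.

from PIL import Image

def dhash(path, hash_width=8):
    # Steps 1-2: grayscale, then downscale to (hash_width + 1) x hash_width pixels
    # (e.g. 9 x 8), so every row yields hash_width horizontal comparisons.
    img = Image.open(path).convert("L").resize((hash_width + 1, hash_width))
    pixels = list(img.getdata())
    # Step 3: compare each pixel to its right-hand neighbour, row by row.
    bits = []
    for row in range(hash_width):
        for col in range(hash_width):
            left = pixels[row * (hash_width + 1) + col]
            right = pixels[row * (hash_width + 1) + col + 1]
            bits.append(1 if left >= right else 0)
    return bits

def hamming(a, b):
    # Number of differing bits between two hashes of equal length (Section 3.1.5).
    return sum(x != y for x, y in zip(a, b))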
3.1.4 DCT Based Hash (pHash)

The pHash open source perceptual hash library[15] implements an algorithm that creates the hash by evaluating color frequency patterns in the images using a Discrete Cosine Transform[16]. When computing the mean of the DCT, the first term is removed, which excludes flat, solid-color image information from the hash. The computational time of a DCT based hash is quite high, but the accuracy of the algorithm is high[17]. A simplification of the algorithm is provided below.

1. Reduce color by converting the image, I, to grayscale.
2. Reduce and normalize the image to a preset size.
3. Compute the DCT to separate the image into a collection of frequencies and scalars as Idct.
4. Reduce the DCT to just the lowest frequencies in the image by selecting the top-left 8 × 8 values of Idct.
5. Compute the mean, M, of Idct, excluding the first term.
6. Reduce the DCT further by comparing each value of Idct to the mean value and turning it into binary form, obtaining the hash as h(i) = 0 if Idct(i) < M, and h(i) = 1 if Idct(i) ≥ M.

3.1.5 Hamming Distance

The Hamming distance (dh)[18] for two strings of symbols of equal length is the number of substitutions required for one string to be equal to the other. For two binary strings of length n, this is equal to the number of flipped bits between the two. For example, A = 0101 and B = 1010 have a Hamming distance of dh = Σ_{i=0}^{n−1} |A_i − B_i| = 4, since all bits between the two are different.

The Hamming distance can be used as a similarity measurement between two perceptual hash strings of equal length, provided the two hashes were constructed using the same algorithm. This is useful in cases where small changes between the compared objects should be allowed, as long as the distance stays under a certain threshold.

3.2 Edge Detection

Edge detection is a set of mathematical methods that aim to identify changes in intensity (brightness) in an image. In this section, Gaussian filtering and the Sobel operator are briefly described.

3.2.1 Gaussian Filtering

When performing edge detection, it is often the major gradients that are the most interesting. Usually some kind of lowpass filtering has to be performed before applying the edge detection filter in order to obtain good results. Otherwise, it is hard to discriminate between the actual edges in an image and edges appearing due to noise.

The value of each pixel in an image is recalculated using a filter according to the Gaussian function

G(x) = (1 / √(2πσ²)) · exp(−x² / (2σ²)).   (3.1)

In the two-dimensional case, seen in Figure 3.1, the Gaussian function is instead defined as

G(x, y) = (1 / (2πσ²)) · exp(−(x² + y²) / (2σ²))   (3.2)

where (x, y) is the horizontal and vertical distance to the center of the filter for that particular coordinate, and σ is the standard deviation of the Gaussian distribution. This yields a circular filter that is converted to an n by n matrix, which can then be convolved with the image. The size n(σ) of the matrix is calculated using

n(σ) = 2 · ceil(2σ) + 1   (3.3)

which always produces an odd filter size. The ceil() function rounds up to the nearest integer.

Figure 3.1: Visualization of the Gaussian filter with σ = 10.

A Gaussian filter with σ = 1 will produce a 5 × 5 matrix according to

(1/100) ·
[ 0.2969   1.3306   2.1938   1.3306   0.2969
  1.3306   5.9634   9.8320   5.9634   1.3306
  2.1938   9.8320  16.2103   9.8320   2.1938
  1.3306   5.9634   9.8320   5.9634   1.3306
  0.2969   1.3306   2.1938   1.3306   0.2969 ].
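The construction in Equations (3.2) and (3.3) can be illustrated with a short NumPy sketch (not code from the thesis). Normalising the weights to sum to one, which is what the matrix above does, makes the constant prefactor cancel.

import numpy as np

def gaussian_kernel(sigma):
    # Filter size from Equation (3.3): always an odd number.
    n = 2 * int(np.ceil(2 * sigma)) + 1
    half = n // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Equation (3.2); the constant prefactor cancels in the normalisation below.
    g = np.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
    return g / g.sum()

print(np.round(100 * gaussian_kernel(1.0), 4))  # the 5 x 5 matrix above, in percent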
3.2.2 Sobel Filtering

The Sobel operator is the filter that performs the heavy lifting of the edge detection process. The result of a Sobel operation is an approximation of the pixel intensity derivative. It consists of two 3 × 3 matrices, Sx and Sy (one for each axis), where

Sx = [ 1   0  −1
       2   0  −2
       1   0  −1 ]   and   Sy = [  1    2    1
                                   0    0    0
                                  −1   −2   −1 ].   (3.4)

In order to produce an approximation of the intensity derivatives, Sx and Sy are convolved with an image I, resulting in two new matrices Gx and Gy according to

Gx = Sx ∗ I   and   Gy = Sy ∗ I.   (3.5)

Combining Gx and Gy as

G = √(Gx² + Gy²)   (3.6)

gives the resulting total gradient magnitude.

3.3 Neural Networks

Artificial neural networks are a computational model used in the field of machine learning, inspired by the way biological neural networks process information. Just like humans, artificial neural networks need to learn from a set of known data before they can be expected to solve a problem successfully. The goal is to infer different levels of abstract rules from these training sets, and by learning these rules, the accuracy of the neural network classifier hopefully increases.

Neural networks and machine learning have been used to solve a wide array of different problems, not limited to image processing, even though image classification is arguably the most common application of neural networks at the time of writing this thesis. This section aims to give a brief explanation of the core concepts when using a neural network as an image classifier.

3.3.1 The Makings of a Neural Network

The elemental part of the neural network is the artificial neuron, Figure 3.2a. The mathematical model of the neuron, seen in Equation (3.7), sums up each weighted input along with a bias, b. This sum is then fed through an activation function, f, to produce the output, y. The activation function is described further in Section 3.3.1.7.

y = f( Σ_{i=1}^{n} (w_i · x_i) + b )   (3.7)

These neurons are then arranged in so-called layers; a common representation of this can be seen in Figure 3.2b, where each node represents an artificial neuron. The weights of the neurons in the layers are adjusted when training a network to work on a certain problem.

(a) An artificial neuron. (b) A neural network layer.
Figure 3.2: The basics of a neural network, the neuron, and a layer of neurons[19].

These layers can vary in their form and function, and different types of layers will be discussed in the coming sections. A large part of designing a network is arranging all these different layers. Traditionally, the classical arrangement has been to simply put one layer after the other, such as in the 8-layer Zeiler-Fergus architecture[20], seen in Figure 3.3. Lately, more complex architectures have emerged, such as the GoogLeNet discussed in Section 3.3.3.

(a) As visualised by M. Zeiler and R. Fergus[20]. (b) A simplified tree structure.
Figure 3.3: The 8-layer Zeiler-Fergus architecture neural network.
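As a minimal illustration of Equation (3.7), the forward pass of a single neuron, and of a whole layer of neurons sharing the same input, can be written directly in NumPy. This is an illustrative sketch, not code from the thesis, and the choice of tanh as activation function is arbitrary; activation functions are discussed in Section 3.3.1.7.

import numpy as np

def neuron(x, w, b, f=np.tanh):
    # Equation (3.7): weighted sum of the inputs plus a bias,
    # passed through an activation function f.
    return f(np.dot(w, x) + b)

def layer(x, W, b, f=np.tanh):
    # A layer is a stack of neurons: one row of weights per neuron.
    return f(W @ x + b)

y = layer(np.array([0.5, -1.0, 2.0]), np.random.randn(4, 3), np.zeros(4))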
3.3.1.1 Convolutional Layer

A convolutional layer (CONV) is the main part of a convolutional neural network (CNN), and it is in essence a feature identifier. Early in the network, the features are simple characteristics such as edges, lines, curves and colors. The features they describe get more abstract further into the network.

It is only after training that the CONV weights correspond to the different features. An example visualization of a trained layer can be seen in Figure 3.4, where some low level features can be identified, such as straight lines, colours and the like.

(a) Visualization of the weights. (b) The activation map looking at a picture of a cat.
Figure 3.4: Typical visualization of a trained first layer CONV[21].

The CONV performs a convolution between the input from a previous layer and a filter, and outputs the result to the next layer in the network. The convolutional layer slides a filter (a kernel) over the input. The region covered by the kernel at a certain position on the input is called the local receptive field. Each local receptive field position corresponds to a hidden neuron in an array as the kernel slides over the input; this results in a so-called activation map. It is sometimes also referred to as a feature map, as it corresponds to a certain feature. An illustration of two steps of the convolution can be seen in Figure 3.5.

(a) The first step. (b) The second step.
Figure 3.5: The local receptive field (5 × 5, fk = 5) slides through the input (15 × 15, zI = 15) with a stride length of one and connects to a corresponding hidden neuron.

A CONV has three so-called hyperparameters that define it: kernel size, fk, stride, fs, and padding, P. Together with the input volume size, zI, they dictate the size of the output volume, zO. These parameters can be tuned to achieve different results, such as a reduction in output volume to reduce complexity. Figure 3.6 shows an example of the effect of stride length on output size.

Figure 3.6: Visualization of the effect of stride length. Stride is here set to three, and with a kernel size of 5, the input volume (14 × 14) becomes a 4 × 4 matrix.

3.3.1.2 Pooling Layer

Once a specific feature has been identified in the previous layer, the feature's exact location is not as relevant as its position relative to other features in the input volume. To accomplish this, a pooling layer is applied, where the most commonly used is a max pool. By applying a maximum filter, only the largest response in each subregion is considered and sent through to the next layer. It is a sliding filter with the same stride length as the filter size, outputting the maximum value in every subregion. An example of this can be seen in Figure 3.7.

Other pooling functions are sometimes used, such as average pooling or ℓ2-norm pooling. In average pooling, the average of each subregion is fed forward through the layer, x̄ = (x1 + ... + xn)/n. Likewise, ℓ2-norm pooling feeds the Euclidean norm, ‖x‖ = √(x1² + ... + xn²), to the next layer.

This layer also reduces the dimension of the input, where the reduction factor depends on the filter size and stride length. This helps keep computational cost down, as well as providing a way to reduce the risk of overfitting the network on the training set.

Figure 3.7: Visualization of a maximum pool with a filter size and stride length of two. This reduces the input volume dimension by 75%, while keeping the features' relative positions.
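A 2 x 2 maximum pool with stride two, as in Figure 3.7, reduces to a few lines of NumPy. This is an illustrative sketch that assumes the input height and width are divisible by the pool size.

import numpy as np

def max_pool(x, size=2):
    # x is a 2-D activation map; height and width are assumed divisible by `size`.
    h, w = x.shape
    # Group the input into non-overlapping size x size blocks and keep the maximum
    # of each block, shrinking each spatial dimension by a factor of `size`.
    blocks = x.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

x = np.arange(16).reshape(4, 4)
print(max_pool(x))  # 2 x 2 output containing the block maxima [[5, 7], [13, 15]]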
3.3.1.3 Concatenation Layer

The concatenation layer is very self-descriptive: it concatenates the different filter outputs into a single output. This is for example used as the last layer in an Inception module, as described in Section 3.3.3.1, to gather the output from the parallel filters into one output, as seen in Figure 3.8.

Figure 3.8: Visualization of a concatenation of three parallel filters into one output.

3.3.1.4 Dropout Layer

The dropout layer[1] is a layer that is only used during the training of the network, not during testing or the actual use of the resulting filter. The main purpose of the dropout layer is to help reduce overfitting, described in Section 3.3.2.4, by forcing the network to have built-in redundancies. It does this simply by "dropping out" some activations. That is, it sets a random set of activations from the previous layer to zero during the forward pass part of the training of the network, see Section 3.3.2. By doing this, the network should be able to better classify the input even if some activations are missing. Figure 3.9 shows an example of a training iteration of a neural network when dropout is applied.

(a) Standard Neural Net. (b) After applying dropout.
Figure 3.9: Visualisation of dropout's effect on a neural network[1].

3.3.1.5 Fully Connected Layer

The fully connected layer is used to get probabilities for the different classes by looking at which high level features correlate the most with a particular class. The layer takes the output volume from the layer preceding it as input and outputs a vector with a size equal to the number of classes the network is designed to classify. For example, if the network is designed to classify digits, the vector would have a size of 10, one element for each digit. Each element in the output vector is connected, with a weight, to each neuron from the previous layer. The layer tries to determine which activations in the previous layer correlate the most with a particular class. This allows nonlinear combinations of features to be used in the classifier.

3.3.1.6 Softmax Layer

The softmax layer takes the output of the previous network layers and outputs normalized class probabilities. It does this by using a softmax function

σ(z)_j = exp(z_j) / Σ_{k=1}^{K} exp(z_k),   for j = 1, ..., K   (3.8)

where K is the number of possible outcomes. The softmax layer highlights the largest value while suppressing values that are below the maximum; the further below the maximum a value is, the more strongly it is suppressed. The softmax layer is often used as the final layer of a neural network to provide a simple output to the classifier.
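Equation (3.8) translates directly into NumPy. Subtracting the maximum before exponentiating is a standard numerical-stability trick added in this sketch; it does not change the result of Equation (3.8).

import numpy as np

def softmax(z):
    # Equation (3.8): exponentiate and normalise so the outputs sum to one.
    e = np.exp(z - np.max(z))  # shift by max(z) for numerical stability
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # roughly [0.659, 0.242, 0.099]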
3.3.1.7 Activation Function

The activation function part of a neural network is a nonlinear layer that is added throughout the network. This adds nonlinearities to the output of a layer, which is important as most real world data is not linear. Adding nonlinearities also helps to reduce network training times, as they reduce the computational complexity without a significant decrease in the resulting accuracy of the network[22]. They can also help counter the vanishing gradient problem[23], where training of the lower layers in the network is very slow due to an exponentially decreasing gradient as it moves through the network.

There exist a number of common activation functions that result in different ranges of output. The three most commonly used ones today are the sigmoid function

σ(x) = 1 / (1 + exp(−x))   (3.9)

which limits the output to the range from 0 to 1, the tanh function

tanh(x) = 2σ(2x) − 1   (3.10)

where σ is the sigmoid function from Equation (3.9), which fits the output into the (−1, 1) range, and the so-called ReLU function

f(x) = max(0, x),   i.e. f(x) = 0 for x < 0 and f(x) = x for x ≥ 0,   (3.11)

which only outputs the positive inputs. ReLU is an abbreviation of "Rectified Linear Unit"; the function simply changes negative inputs to zero while forwarding positive real-valued inputs, limiting the output to positive values.

3.3.2 Training a Neural Network

To have a useful network, the network weights need to be trained so that the network learns what to look for. This is done through a training process called backpropagation. The training starts by initializing the network weights as small random numbers to promote variation through the network. Then an input is passed through the whole network, a so-called forward pass. The first time this is done, the output will basically not give any preference to any certain result. The output is then passed to a loss function comparing it to the target. A common loss function is the mean squared error (MSE)

E_total = (1/n) Σ_i (target_i − output_i)²   (3.12)

which, as the name suggests, is the mean of the square of the errors. Initially this error will be large, and the network's goal is to minimize it.

To minimize the error, the contribution of each weight to the error needs to be identified. This is done during the so-called backward pass through the network, where stochastic gradient descent is performed by computing the gradients in reverse order (backwards) through the network. Once the derivatives are computed, the weights can be updated accordingly so that they change in the direction indicated by the gradient.

The weight update has a parameter called the learning rate, η, that essentially scales the length of each step. A large learning rate may make the model converge in less time, but if the value is too high it could result in steps that are too large for the model to reach an optimal point. With too small a value, the model might get stuck in a local optimum that is not close to the global optimum; this is visualized in Figure 3.10.

Figure 3.10: Visualization of different learning rates. Loss function in blue. High η in red, which does not reach the global minimum. Low η in yellow, which gets stuck in a local minimum and does not reach the global minimum.

There are different techniques to reduce the risk of this, such as using a changing learning rate that starts out large but gets progressively smaller, or applying a momentum to the gradient to create a tendency to keep moving in the same direction.

3.3.2.1 Training Set

The training set is what the neural network trains and learns from. The neural network adjusts its weights by propagating the error through the network and calculating the accuracy over the training set.

3.3.2.2 Validation Set

By validating the accuracy of the neural network between training iterations against a set that is not used to adjust the weights, it is possible to verify that an increase in accuracy on the training data yields an actual improvement on data not yet seen by the network. If the accuracy on the training data increases while the accuracy on the validation set decreases, the neural network is overfitting to the training set.

3.3.2.3 Testing Set

The testing set is used much like the validation set to verify the accuracy of the network. The testing set is, however, used for verification of the network after training is complete, whereas the validation set is used during training to verify an increase in accuracy.
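The division of labour between the three sets can be summarised in a small bookkeeping sketch. train_one_epoch and accuracy are hypothetical placeholders for whatever framework is used; the point is only that the validation set is consulted during training while the testing set is touched once, at the end.

def train(model, train_set, val_set, epochs=100):
    prev_train, prev_val = 0.0, 0.0
    for epoch in range(epochs):
        train_one_epoch(model, train_set)       # hypothetical: one pass of backpropagation
        train_acc = accuracy(model, train_set)  # hypothetical: fraction of correct predictions
        val_acc = accuracy(model, val_set)      # the validation set never adjusts the weights
        if train_acc > prev_train and val_acc < prev_val:
            print("warning: possible overfitting at epoch", epoch)
        prev_train, prev_val = train_acc, val_acc
    return model

# Only after training is complete is the testing set used:
# test_acc = accuracy(model, test_set)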
3.3.2.4 Overfitting

Overfitting[24] describes a phenomenon where a complex model has been made that describes the training data very well but does not function well on new data. Such a model tries too hard to explain what can be considered noise in the data, and that results in a model that is generally unfit for use on new data. Figure 3.11 shows an example of three different models trying to model the same data. Figure 3.11a shows an example of an undertrained model that tries to fit a linear model to nonlinear data. Figure 3.11c shows an example of overfitting, where the model is trying to describe what can be considered random noise. While the error on the training data is low, when applied to new, previously unseen data, the model's predictive performance can be poor.

Neural networks are generally prone to overfitting, and there are many techniques that can help reduce it. Increasing the training set reduces the risk of overfitting, as there is more data to train on. The training data can sometimes be artificially expanded from the existing training set. For example, if the training set is a set of images, new training data can be generated by rotating, scaling, cropping, adding noise or otherwise transforming the existing images. One could also average multiple trained models, provided the resources to produce multiple models exist. Something that is becoming increasingly common is using a dropout layer, such as the one described in Section 3.3.1.4. In the dropout layer, a portion of the neurons are randomly deactivated, i.e. set to 0, during training. This forces the network to have built-in redundancies.

(a) Underfitting model. (b) Well-fit model. (c) Overfitting model.
Figure 3.11: Visualisation of the concept of under- or overfitting a model.

3.3.3 GoogLeNet

GoogLeNet, seen in Figure 3.12, was the winning architecture in ImageNet's "Large Scale Visual Recognition Challenge" of 2014[25], where it introduced the so-called Inception module. It was one of the first architectures that strayed from the classical approach of sequentially stacking layers. It adds multiple parallel layers in sections called Inception modules. This can be seen in Figure 3.12.

Figure 3.12: Overview of the GoogLeNet Inception v.3 architecture.

3.3.3.1 The Inception Module

The GoogLeNet is built on the idea that in images, correlations tend to be both local and slightly spread out around these local areas. To capture this, filters of different sizes (1×1, 3×3 and 5×5) are added in parallel to cover larger areas while keeping a fine resolution for the smaller information in the images, as done in Figure 3.13a. To summarize the content from the previous layer, a pooling layer is added. Figure 3.13b shows this as the naïve idea of the Inception module.

(a) The naïve idea to stack filters of different sizes. (b) Inception module, naïve version.
Figure 3.13: The Inception module[2].

However, doing it this way will not work, as it leads to an explosion of computations and outputs. The solution is to add a 1×1 CONV operation before the 3×3 and the 5×5 layers; one is also added after the pooling layer, so the full Inception module is as shown in Figure 3.14. This provides a dimensionality reduction to keep the computational cost of the network down. ReLU layers are added after each CONV to add nonlinearities to the network. Stacking these modules allows for simple management of parameters when designing the network, as each module can be individually tweaked.

Figure 3.14: Inception module with dimension reductions[2].
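A sketch of the dimension-reduced module in Figure 3.14, written against the Keras API bundled with current TensorFlow rather than the exact tooling used in the thesis, could look as follows. The filter counts are arbitrary placeholders; GoogLeNet uses different counts in every module.

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3_reduce=96, f3=128, f5_reduce=16, f5=32, pool_proj=32):
    # 1x1 branch
    b1 = layers.Conv2D(f1, 1, padding="same", activation="relu")(x)
    # 1x1 reduction followed by a 3x3 convolution
    b2 = layers.Conv2D(f3_reduce, 1, padding="same", activation="relu")(x)
    b2 = layers.Conv2D(f3, 3, padding="same", activation="relu")(b2)
    # 1x1 reduction followed by a 5x5 convolution
    b3 = layers.Conv2D(f5_reduce, 1, padding="same", activation="relu")(x)
    b3 = layers.Conv2D(f5, 5, padding="same", activation="relu")(b3)
    # 3x3 max pooling followed by a 1x1 projection
    b4 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b4 = layers.Conv2D(pool_proj, 1, padding="same", activation="relu")(b4)
    # Concatenation layer: stack the four branches along the channel axis
    return layers.Concatenate()([b1, b2, b3, b4])

inputs = tf.keras.Input(shape=(224, 224, 3))
outputs = inception_module(inputs)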
4 Method

This chapter describes how the work during the thesis was carried out in order to obtain the results. The objectives described in Section 2.2 consist of several subproblems that all have to be solved in order to achieve the main goal. These subproblems were the basis of how the work was carried out. The approach was an iterative process of building algorithms and solutions and then testing them against GUIs of increasing complexity. Initial tests were performed on a simple, simulated GUI to ensure full control of variables during the early stages of development. A conceptual view of the system, shown in Figure 4.1, was used as an initial model.

Figure 4.1: Schematic view of the early system.

The data that flows through the system is referred to as:
u0 - The initial image from the GUI
x - Identified areas of interest
x̂ - Classified objects
x̂i - Selected object to try an interaction on
uss - Steady state image of the GUI
r - Result of comparison

4.1 Tools

The development process has seen use of various third party tools, which are described briefly here.

4.1.1 MATLAB

Initial prototyping of the edge detection algorithm was done using MATLAB, since it offers off-the-shelf implementations of many methods such as Gaussian and Sobel filtering. However, the filtering operations were reimplemented for this thesis in order to control pixel padding and performance.

4.1.2 Android Device

The Android device that the solution was developed on was a Sony Xperia L C2105 running CyanogenMod 12-20150616-NIGHTLY-taoshan, Android 5.0.2[26]. The device has a resolution of 480×854, and has three virtual navigation buttons (BACK, HOME and APP_SWITCH) replacing common hardware navigation buttons.

4.1.3 Android Studio

A test environment for Android has served as a convenient way to test the algorithms on an actual touch screen. At first, the idea was to test the algorithms in a simulated environment with full control of interactive events. This proved unnecessary, since it would basically be the same as running the tests on a phone hooked up to Android Studio.

4.1.3.1 Android Debug Bridge

Android Debug Bridge (adb) is a command line tool for communication between an Android device and a PC. The device can either be emulated or real. The adb command tool includes three main components:

A client that sends commands to the device via a command line terminal.
A daemon (adbd) which runs the actual commands on the connected device.
A server which manages communications between the client and the daemon.

This is the main communication protocol used to control and navigate the Android device. A complete list of the adb-related commands used in the project can be found in Appendix B.2.
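For illustration, the two interactions the rest of this chapter relies on, grabbing a screenshot and injecting a tap, can be issued from Python roughly as below. The screencap, input tap and input keyevent subcommands are standard adb functionality, but this wrapper is a sketch and not the code listed in Appendix B.2.

import subprocess

def screenshot(path="screen.png"):
    # 'screencap -p' prints the current screen as a PNG; exec-out streams the
    # binary output straight to the PC instead of saving it on the device.
    png = subprocess.run(["adb", "exec-out", "screencap", "-p"],
                         stdout=subprocess.PIPE, check=True).stdout
    with open(path, "wb") as f:
        f.write(png)

def tap(x, y):
    # Simulates a single one-finger tap at pixel coordinate (x, y).
    subprocess.run(["adb", "shell", "input", "tap", str(x), str(y)], check=True)

def back():
    # Presses the virtual BACK navigation button.
    subprocess.run(["adb", "shell", "input", "keyevent", "KEYCODE_BACK"], check=True)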
It is used extensively by Google to perform speech recognition, recommendation features and image classifi- cation. There is a Python API for TensorFlow that can help during development. TensorFlow is the base on which TensorBox rests. 4.1.5.1 TensorBoard TensorBoard is a tool accompanying TensorFlow to visualize the process of training neural networks. This is very helpful when training for a long time (several days). If the accuracy does not seem to approach expected levels after a couple of hours, it gives an indication that something might be wrong and it is possible to abort the training process and start over with better training data or tuned parameters. 4.1.5.2 TensorBox TensorBox is a framework based on TensorFlow for training neural networks to de- tect multiple objects in images. The basic configuration implements the simple and robust GoogLeNet-OverFeat algorithm. Furthermore, the framework also provide an implementation of the ReInspect algorithm, reproducing state of the art detec- tion results in highly occluded scenes. However, this is not in the scope of this thesis since the GUI is assumed to be completely visible in its entirety. The final version of the resulting neural network uses a modified version of Tensor- Box. Out of the box, TensorBox comes with a rather deficient tool for annotating the training data in an orderly fashion. A new tool to fill that need has since been created. 29 4. Method 4.2 Data Acquisition In order to solve any problem, data is needed, more data is better, even more so when using machine learning. Some data has been collected by the means of taking images of phones GUI with a camera, and taking screenshots with adb. But as the aim was for a general solution and the limited access to different GUI, the main source of image data has been the Internet with its wide array of open source images. 4.3 Identification Since one of the aims of the thesis was to explore the possibility to use image analysis techniques to distinguish between interesting areas of a GUI, two possible methods were explored. The first approach was using sobel edge detection to try and find objects. Buttons and other interactive elements tend to have sharp borders that run continuously around the interactive area which indicated that edge detection would be a suitable solution. The other method was identification using a neural network. It was deemed an interesting alternative after reading about TensorFlow and TensorBox in particular. 4.3.1 Edge Detection Method Starting with an image of the GUI, the edge detection was done in several steps, namely grayscale conversion, seen in Figure 4.2, Gaussian filtering, seen in Fig- ure 4.3, Sobel filtering, seen in Figure 4.4 and finally boundary detection, seen in Figure 4.5. Since the goal is to find intensity gradients, the color information was deemed unnecessary. Figure 4.2: Before and after grayscale converison. 30 4. Method Gaussian filtering is necessary to get rid of false positive edges. That is, edges that occur due to noise or other high frequency components in the image. The size of the Gaussian filter was calibrated manually using trial and error. A method to choose the filter size dynamically was considered but never finished due to the decision to abandon the method entirely in favour of a neural network. Figure 4.3: Before and after Gaussian filtering. When the image has been smoothed using a Gaussian filter, the intensity gradients are calculated using a Sobel process. 
Figure 4.4: The results of Sobel filtering (Gx, Gy and G).

Using the result of the Sobel filtering, bounding areas are found by identifying edges that enclose an area above a certain threshold. With this, the object identification process using edge detection is finished.

Figure 4.5: The results of boundary detection.

4.3.2 Identification Using TensorBox

When using a neural network to identify objects in a GUI, the majority of the work is spent finding suitable images to use as training data. This meant gathering as many images of Android and iOS GUIs as possible. The rather time-consuming task of manually annotating the images (selecting the areas which the neural net should find once trained) led to a rather small training set of 62 images. Before any meaningful results came out of this method, a lot of time was spent preparing the development environment to accommodate running TensorBox on the GPU.

4.3.2.1 Preparing Training Data

Before training can begin, it is necessary to collect and prepare the data to be used when training. Collecting images of GUIs is simple enough: by wielding the mighty sword that is Google Image Search, a collection of images of GUIs quickly began to take form.

On the matter of preparing the data, TensorBox does include some means of preparing training data. However, that application is slow and clunky, which is why a new way of preparing the data had to be devised. Firstly, in order for TensorBox to efficiently process the images, they need to be scaled to an appropriate size. If the resolution of the images differs from the correct resolution (640×480), they are scaled and/or padded with black pixels, as shown in Figure 4.6. Next, the manual labour of selecting the areas that are considered clickable has to be done with relative speed. Thus, an application to interactively annotate the images was created. The result of using this application is a JSON object, which is the format used throughout TensorBox to catalogue training data.

(a) Original image. (b) Resized and padded.
Figure 4.6: Training image before and after resizing.

The tool to aid the annotation process (manually selecting which areas the trained neural network should find) is simple but effective. First, the directory of the training images is selected. The tool then searches for a specific JSON file. If the file exists, the tool finds the images that have not yet been annotated and the process is continued. If the JSON file is not present, it is created and the process is started. When annotating an image, the user marks two opposing corners by clicking twice (Figure 4.7a-Figure 4.7b). This yields four coordinate pairs, (x1, y1), (x2, y1), (x1, y2) and (x2, y2), one pair for each corner of the rectangle. The process is then repeated as many times as needed for each image, as shown in Figure 4.7c.

(a) Step one: choose first corner. (b) Step two: choose second corner. (c) Annotating finished.
Figure 4.7: The three steps of the annotation process.

When all images are annotated, the JSON file is passed through a splitting function that randomly selects a predefined ratio (50% in our case) of images to be included in the training set, with the rest forming the validation set. With two sets of prepared data, the training can begin. The process was performed as described in Section 3.3.2. Several runs were performed using the Overfeat-Rezoom method, which produced a file containing the weights of the network. These weights were then used to evaluate new images.
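The annotation output and the subsequent 50/50 split can be sketched as follows. The key names (image_path, rects, x1 ... y2) follow the JSON layout used in TensorBox's published examples and should be read as an assumption rather than a verbatim description of the thesis's files.

import json
import random

# One annotated image: the image path plus one rectangle per clickable area.
example = {
    "image_path": "images/gui_001.png",
    "rects": [{"x1": 24, "y1": 310, "x2": 88, "y2": 374}],
}

def split_annotations(json_path, ratio=0.5, seed=0):
    # Randomly assign a predefined ratio of the annotated images to the
    # training set and the remainder to the validation set.
    with open(json_path) as f:
        annotations = json.load(f)
    random.Random(seed).shuffle(annotations)
    cut = int(len(annotations) * ratio)
    with open("train_boxes.json", "w") as f:
        json.dump(annotations[:cut], f)
    with open("val_boxes.json", "w") as f:
        json.dump(annotations[cut:], f)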
4.4 Classification

Areas in the GUI were divided into one of two categories: active or passive. If the user can interact with a given area, it is classified as active, or interactive; otherwise it is labelled as passive.

Verification that an identified object was interactive was done by comparing the screens before and after an action on a specified coordinate. If the screens differ by more than a threshold, the coordinate is classified as active.

The comparison was made using perceptual hashing. Only the dHash algorithm, as described in Section 3.1, was used. It was chosen for its simplicity to implement in Python, its relative speed and its relative accuracy. The scaling size of the image was chosen to be 9 × 8, with a Hamming distance of one or below indicating a matching image. This is to accommodate small changes, such as the clock changing digits, which should not affect the identity of the state.

4.5 System Modelling

To enable path planning, memory and other features in the system, a model of the system is critical. Since most GUIs can be considered discrete systems, a solution was created where the system is modelled as a state machine with a graph structure. By using a state model, states that are functionally identical can be marked as the same, and redundant states are thus removed. This is useful, for example, if two icons have switched place, if two icons link to the same function, or if the aesthetics of the GUI have been altered while the individual functions remain the same.

The individual screens of the GUI are modelled as states and a node is created for each state. Interactive elements are added as transitions between these states. The state graphs were generated in the DOT language via the Python library Transitions [28].

4.6 GUI Control

To navigate and control the GUI, a means of interacting with the hardware was needed. This was done from a PC by sending commands from a Python script via adb to the device. With adb, the touch event types relevant for this work can be simulated.

4.7 Proposed Method

The proposed method is a combination of using the GoogLeNet-OverFeat algorithm in TensorBox to produce a set of coordinates for the estimated interactive objects, and using perceptual hashing to verify that an object is interactive, that is, that clicking on the predicted coordinate produces a new screen.

4.7.1 The Algorithm

The proposed method starts by grabbing an image of the initial screen and calculating the hash value of the image. It then runs the image through the trained network to acquire a set of coordinates where the network believes interactive objects are located. All acquired coordinates are then acted upon in order, performing a breadth-first search. After each action on a coordinate, a new image is captured and a new hash is calculated. If this new hash differs from the previous hash, the coordinate that was acted on is assumed to be interactive. If the image-hash pair is new and not part of a previous set, it is assumed to be a new state, and a new node is created, connected to the previous node by a transition. If the image-hash pair is part of a previous set, a transition is created between that node and the previous node. A flowchart of the proposed system can be seen in Figure 4.8. A more detailed state machine of the system can be found in the Appendix, Figure A.2.

Figure 4.8: The proposed workflow.
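To make the hashing and comparison step of the algorithm concrete, the following is a minimal sketch of the dHash calculation described in Section 3.1.3 and of the Hamming distance used to decide whether two screens show the same state, assuming Pillow. The file names are placeholders and the sketch follows the textual description rather than reproducing the exact script used in this work.

# Minimal dHash sketch: scale to 9x8, convert to grayscale, and compare each pixel
# with its right-hand neighbour to form a 64-bit hash (illustrative only).
from PIL import Image

def dhash(path, width=9, height=8):
    img = Image.open(path).convert("L").resize((width, height))
    pixels = list(img.getdata())               # row-major grayscale values
    bits = 0
    for row in range(height):
        for col in range(width - 1):
            left = pixels[row * width + col]
            right = pixels[row * width + col + 1]
            bits = (bits << 1) | (1 if left > right else 0)
    return bits                                # (width - 1) * height = 64 bits

def hamming(h1, h2):
    """Number of differing bits between two hashes."""
    return bin(h1 ^ h2).count("1")

# Two screens are considered the same state if the distance is at most one.
same_state = hamming(dhash("before.png"), dhash("after.png")) <= 1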
5 Results

5.1 Identification

In this section, the results of identifying GUI objects using edge detection and a neural network, respectively, are presented.

Initially, the edge detection approach showed great promise. In the first test case, which consisted of an iOS home screen with a solid black background, the edge detection method performed consistently well when tuned. Some false positives appeared, as seen at the top of Figure 5.1a, as well as some missed icons (Videos, Calculator and Safari). This was due to a tuning problem: increasing the sensitivity enough to catch all icons also produced more false positives.

Table 5.1: Identification results, plain background.

                    Hits             False Positives
  Edge Detection    24/28 (85.7%)    3
  Neural Network    27/28 (96.4%)    6

From Table 5.1 and Figure 5.1, we can conclude that in the simple case of identifying objects against a plain background, the edge detection method was not as accurate as the neural network, but it resulted in fewer false positives. These false positives would not cause much trouble when running an automated test of a touch screen, since those areas are dead zones.

Figure 5.1: Identified icons on an iPhone GUI with a plain background, using (a) the edge detection method and (b) the neural network.

The next test case is slightly more difficult. Here, the background has a slight gradient, as seen in Figure 5.2. There are no sharp edges in the background, but the edge detection method is rendered completely useless when using the same tuning as in the previous example. The algorithm is simply too sensitive to color fluctuations in the background. This is partly due to the loss in color depth when converting the image to black and white; the algorithm uses too few levels of intensity. On the bright side, the neural network performs almost perfectly, with only one missed object here as well.

Figure 5.2: Identified icons on an iPhone GUI with a background that has a color gradient, using (a) the edge detection method and (b) the neural network.

As a final test of the neural network, a GUI with icons of varying size and shape was tested. Again, the edge detection method is useless, while the neural network manages to find all application icons of normal size, as well as some of the smaller navigational icons (arrows and such). Note that none of these particular images were included in the training set. The results can be seen in Figure 5.3.

Figure 5.3: Identified icons on a generic image with a slight gradient, using (a) the edge detection method and (b) the neural network.

The network was trained on a small dataset of only 62 images. A resulting accuracy of over 95% on the validation set was achieved after only about 4300 iterations of training, which took about 10 minutes on the computer used for training. The computer was fitted with 16 GB of RAM and an Intel Core i7-6700HQ processor with a clock frequency of 2.6 GHz. During training, the Nvidia GeForce GTX 960M graphics adapter handled much of the computations.

The reason for the small dataset was a combination of not prioritizing the time-consuming work of annotating the images and the realization that a decent result was achieved anyway. That the network was pretrained also helped reduce the training time required.
5.2 Classification

Table 5.2 shows the Hamming distances between the dHashes generated from the images in Figure A.4, with the images scaled to 7 × 6. The algorithm converts the images to grayscale before the hash is calculated; the grayscale images can be seen in Figure A.5. The Hamming distance is commutative, so the distances below the diagonal are not shown, since they mirror those above it. Cells marked in blue are correctly identified as the same image (dh = 0), cells marked in yellow are image pairs with a low Hamming distance (dh ≤ 3) that run the risk of being classified incorrectly, and cells marked in red are image pairs that were incorrectly classified as the same image (dh = 0).

Table 5.2: Hamming distance of the dHash values between the images in Figure A.4, downscaled to 7x6.

           1    2    3    4    Gray  Red  Blue  Green
  1        0    2    0    3    7     8    8     7
  2             0    2    3    7     8    8     7
  3                  0    3    7     8    8     7
  4                       0    8     8    8     8
  Gray                         0     1    1     0
  Red                                0    0     1
  Blue                                    0     1
  Green                                         0

The results of a larger image scaling, 11 × 10, can be seen in Table 5.3, with the same colour code as in the previous table.

Table 5.3: Hamming distance between the dHash values of the images in Figure A.4, downscaled to 11x10.

           1    2    3    4    Gray  Red  Blue  Green
  1        0    4    3    9    19    20   19    20
  2             0    7    9    20    20   20    20
  3                  0    11   19    20   19    20
  4                       0    20    20   19    19
  Gray                         0     3    4     5
  Red                                0    3     3
  Blue                                    0     3
  Green                                         0

5.3 Test Runs

In this section, two test runs of the complete algorithm are presented.

The first is a very simple run, with only four states and four tap transitions. The result can be seen in Figure 5.4. The program was allowed to run to completion from the start screen, with access to four application icons of which three are unique. None of the applications contained sub-functions.

The second run contains more advanced applications, with a multitude of states and transitions. These can be seen in Figure 5.5. The program was run from a start screen with access to three application icons, all of which were unique. One of the applications was a gallery with complicated substructures and several links to the same image. The process was terminated early, as the gallery contained over two hundred images at the time of the test.

Figure 5.4: Completed run with simple applications: (a) the four identified states, (b) the resulting state machine. The node names in the figure are the hash values of each state.

Figure 5.5: Early terminated run with one advanced application: (a) five excerpts from the identified states (the images in the gallery have been blurred for the report; the node names are the hash values of each state), (b) the resulting state machine. A larger version can be found in the Appendix, Figure A.3.

6 Discussion and Concluding Remarks

While building the system using machine learning techniques, discussions among ourselves, as well as with our supervisor at Benchnode Technology AB, Mikael Lundgren, raised some concerns regarding just how useful such a system might be in a real-world testing situation. As efficient as it may be to identify objects using a neural network, the system cannot guarantee 100% coverage. This is a major flaw, since it is essential to be able to depend on your testing methods when testing a novel system. However, it is not hard to imagine that such a system would save a lot of time in an otherwise manual process.

The feature that allows a GUI to be visualized as a graph shows more promise.
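Since this graph representation is central to the discussion, the following minimal sketch indicates how such a state machine can be expressed with the Transitions library mentioned in Section 4.5. The state and trigger names are hypothetical, and the GraphMachine extension with pygraphviz is one possible way of obtaining the DOT-based diagram, not necessarily the exact code used in this work.

# Minimal sketch of a GUI state graph built with the Transitions library
# (illustrative only; requires the GraphMachine extension and pygraphviz).
from transitions.extensions import GraphMachine

class Gui:
    pass

gui = Gui()
states = ["home", "gallery", "settings"]       # one state per identified screen hash
machine = GraphMachine(model=gui, states=states, initial="home",
                       show_conditions=False)

# Each verified interactive coordinate becomes a transition between two states.
machine.add_transition("tap_gallery_icon", source="home", dest="gallery")
machine.add_transition("tap_back", source="gallery", dest="home")
machine.add_transition("tap_settings_icon", source="home", dest="settings")

# Render the state machine to an image via DOT.
gui.get_graph().draw("gui_states.png", prog="dot")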
When ordering a new interface for a touch screen, the automotive industry has to make sure that the driver is not distracted by an overly complicated interface. Requirements such as "the graphical user interface should be intuitive and easy to navigate" are not very descriptive and are prone to misinterpretation. An alternative, less ambiguous way of conveying the desired result could be to reason in terms of interface depth. This kind of measurement is easily obtained if the GUI is represented as a graph. The requirement could then instead be written in terms of a maximum depth for a given function.

As for the choice to use TensorFlow, in retrospect the time invested in understanding the tool was worth it. However, it could easily have been an unpleasant experience had the results not been as good. Had the solution been developed on a Windows machine, as it started out, the experience might have been worse; TensorFlow is much easier to use on a Linux-based system. As such, a transition from Windows to Linux was made in the middle of the thesis work. This transition led to a slower development process, but in the end the choice to abandon Windows was rewarding, both in terms of results and for us as developers.

It also turned out to be a good choice to abandon the edge detection method at a relatively early stage, when it failed to produce good results on relatively simple GUIs, instead of spending time developing it further. In the end, making the edge detection method general enough was deemed a far too daunting task, due to the difficulty of tuning the Gaussian filter so that the method works on varying kinds of backgrounds.

Our neural network cannot be considered very general, as the number of training images is very low. However, as a proof of concept it does the job, but to be truly certain that this is an accurate method, it should have been tested on a wider variety of GUIs, as well as on different kinds of touch screens.

Perceptual hashing proved to be a very useful tool in the image analysis toolbox. dHash performed unexpectedly well, and its specificity could easily be adjusted to control how small or large the differences between images should be in order to be caught. It also acted as a verification of the elements identified by the neural network, since it exposed false positives. However, it does not help in finding the elements the neural network missed. Missed elements are harder to compensate for than false positives, and they are something that needs to be minimised.

6.1 Further Development

The machine learning approach to identifying objects in a GUI could be expanded to classify different types of areas, not only icons but, for example, menus, sliders or other GUI functions. This was discussed in the planning phase but left out due to shortage of time. It would also be interesting to investigate the possibility of including character recognition and natural language detection.

Another feature that was not studied is the possibility of evaluating how the GUI responds to input, that is, a system that compares the GUI's response to a known response. This would be interesting when performing regression tests on a GUI that is under development.

Naturally, there is always room for improvement when it comes to the neural network. As mentioned in the previous section, our network is not to be considered very general.
It performs well in our testing environment, but to work in real-world testing scenarios the training set would have to be expanded significantly.

Different hashing algorithms could be applied in unison, depending on what is needed. For example, the dHash algorithm could be used initially to find very similar images, and then another, more accurate but computationally heavier, algorithm could be applied, such as the DCT-based hashing algorithm pHash. In our case this was not needed, but in a large-scale setup where the input images are not as pristine and the volume is larger, computation time might be of interest.

As of now, the only measurement of the GUI is depth. For the accuracy and efficiency of the testing system to be measured completely, something more to compare the results against should be considered. The accuracy could be verified by comparing the result of interacting with an object to a screenshot that shows the correct result. Efficiency could be the time it takes to execute the testing procedure.

Another kind of quality measurement could instead be performed on the GUI itself. A score could be calculated from parameters such as the Euclidean distance travelled on the screen to arrive at a certain menu, the time spent navigating the menus, the number of button presses and so on.

To take the GUI evaluation further, the concept of usability could be introduced. It is a highly subjective concept and therefore difficult to measure. There are, however, some design standards for interactive media that could be checked. For example, the color green often indicates an affirmative behaviour, while red is the stopping color. It is also often confusing for the user if the OK button is to the right of CANCEL in a standard dialog box. Other things that could affect the general usability are button size, the presence of descriptive text and the number of objects appearing on the screen simultaneously. All of these can make the screen appear cluttered and distracting, which is not desired when navigating a GUI while doing something as inherently dangerous as driving.

The solution could be made truly general, and usable in a real-world testing scenario, if it could be applied to any touch-screen GUI. As of now, it is only possible to test Android tablets and phones. Consider a physical rig mounted above a screen. This rig could use a camera to grab images of the GUI and a pointing device acting as a mechanized finger to interact with the screen. This would yield a truly general testing solution. It would also introduce a new set of interesting problems to solve, both in the field of mechanical construction and in handling image quality issues and distortions such as screen glare.

A Appendix 1

A.1 Figures

Figure A.1: The proposed algorithm.

Figure A.2: State diagram and transitions of the used program.

Figure A.3: Resulting state machine of an early terminated run on a complex GUI structure.

A.2 Hashing comparisons

Figure A.4: Images used to test the perceptual hashing algorithms: (a) 1, (b) 2, (c) 3, (d) 4, (e) Gray, (f) Red, (g) Blue, (h) Green.

Figure A.5: Grayscale versions of the images used to test the perceptual hashing algorithms.
B Appendix 2

B.1 Python Packages

• fileinput [29]: Implements a helper class and functions to quickly write a loop over standard input or a list of files.
• glob [30]: Unix-style pathname pattern expansion.
• itertools [31]: Implements a number of iterator building blocks providing tools for fast and efficient "iterator algebra".
• json [32]: JSON encoder and decoder.
• math [33]: Provides access to the mathematical functions defined by the C standard.
• os [34]: Miscellaneous operating system interfaces.
• PIL [35]: Imaging library.
• re [36]: Provides regular expression matching operations.
• subprocess [37]: Allows spawning of new processes, connecting to their input/output/error pipes, and obtaining their return codes.
• sys [38]: System-specific parameters and functions.
• time [39]: Time access and conversions.
• tkinter [40]: Provides a robust and platform-independent windowing toolkit.
• transitions [28]: A lightweight, object-oriented state machine implementation.
• warnings [41]: Warning control.

B.2 Android Debug Bridge: Code samples

Listing B.1: List all connected devices
adb devices

Listing B.2: Take a screen capture and write it to "PATH"
adb -s deviceID exec-out screencap -p > "PATH"

Listing B.3: Send the keyevent for the HOME button
adb -s deviceID shell input keyevent KEYCODE_HOME

Listing B.4: Send the keyevent for the BACK button
adb -s deviceID shell input keyevent KEYCODE_BACK

Listing B.5: Send the keyevent for the APP SWITCH button
adb -s deviceID shell input keyevent KEYCODE_APP_SWITCH

Listing B.6: Perform a "tap" at location X Y
adb -s deviceID shell input tap X Y

Listing B.7: Perform a "swipe" from X Y to X Y over a duration of T milliseconds
adb -s deviceID shell input touchscreen swipe X Y X Y T
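In the Python script, commands such as these were issued through the subprocess module. The following is a minimal sketch of such a wrapper; the function names and the device ID placeholder are illustrative, not the exact code used in this work.

# Minimal sketch of driving adb from Python (illustrative only).
import subprocess

DEVICE = "deviceID"   # placeholder; see Listing B.1 for how to list attached devices

def adb(*args):
    """Run an adb command against the chosen device and return its raw output."""
    return subprocess.check_output(["adb", "-s", DEVICE] + list(args))

def screenshot(path):
    """Capture the current screen and save it as a PNG on the PC."""
    with open(path, "wb") as f:
        f.write(adb("exec-out", "screencap", "-p"))

def tap(x, y):
    """Simulate a tap at the given screen coordinate."""
    adb("shell", "input", "tap", str(x), str(y))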
Bibliography

[1] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, pp. 227–236, 2014.
[2] L. A. dos Santos, "GoogLeNet." leonardoaraujosantos.gitbooks.io/artificial-inteligence/, 2014. [Online; accessed 05-May-2017].
[3] V. (OpenSource), "Linux desktop testing project," 2016. [Online].
[4] X. Fang, B. Sheng, P. Li, D. Wu, and E. Wu, "Automatic GUI test by using SIFT matching," China Communications, vol. 13, no. 9, pp. 227–236, 2016.
[5] Exforsys, "What is monkey testing," 2017. [Online; accessed 24-January-2017].
[6] T. Daboczi, I. Kollar, and G. Simon, "How to test graphical user interfaces," IEEE Instrumentation & Measurement Magazine, 2003.
[7] Wikipedia, "Scale-invariant feature transform — Wikipedia, the free encyclopedia," 2017. [Online; accessed 25-April-2017].
[8] Wikipedia, "Image histogram — Wikipedia, the free encyclopedia," 2017. [Online; accessed 25-April-2017].
[9] M. Özuysal, M. Calonder, V. Lepetit, and P. Fua, "Fast keypoint recognition using random ferns," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2009.
[10] Wikipedia, "Perceptual hashing — Wikipedia, the free encyclopedia," 2017. [Online; accessed 25-April-2017].
[11] N. Krawetz, "Kind of like that." The Hacker Factor Blog, 2013. [Online; accessed 04-April-2017].
[12] B. Yang, F. Gu, and X. Niu, "Block mean value based image perceptual hashing," 2006 International Conference on Intelligent Information Hiding and Multimedia, pp. 1–4, 2006.
[13] N. Krawetz, "Looks like it." The Hacker Factor Blog, 2011. [Online; accessed 04-April-2017].
[14] E. Duda, "Image hash comparison — GitHub," 2015. [Online; accessed 20-April-2017].
[15] E. Klinger and D. Starkweather, "pHash - the open source perceptual hash library," 2010. [Online; accessed 22-February-2017].
[16] Wikipedia, "Discrete cosine transform — Wikipedia, the free encyclopedia," 2017. [Online; accessed 19-April-2017].
[17] C. Zauner, "Implementation and benchmarking of perceptual image hash functions," Master's thesis, FH Upper Austria, 2010. [Published on phash.org].
[18] Wikipedia, "Hamming distance — Wikipedia, the free encyclopedia," 2017. [Online; accessed 25-April-2017].
[19] Z. Blanco, "Neural networks and the backpropagation algorithm," 2015. [Online; accessed 18-May-2017].
[20] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," European Conference on Computer Vision, vol. 1, no. LNCS 8689, pp. 818–833, 2014.
[21] A. Karpathy and J. Johnson, "Understanding CNN," 2017. [Online; accessed 24-January-2017].
[22] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10) (J. Fürnkranz and T. Joachims, eds.), pp. 807–814, Omnipress, 2010.
[23] Wikipedia, "Vanishing gradient problem — Wikipedia, the free encyclopedia," 2017. [Online; accessed 22-May-2017].
[24] Wikipedia, "Overfitting — Wikipedia, the free encyclopedia," 2017. [Online; accessed 17-May-2017].
[25] ImageNet, "Large scale visual recognition challenge," 2014.
[26] Taoshan, "CyanogenMod 12," 2015. [Online; discontinued].
[27] Python, "Python 3.5.2," 2016. [Programming language].
[28] T. Yarkoni, "transitions," 2017. [Library package].
[29] Python, "fileinput," 2017. [Library package].
[30] Python, "glob," 2017. [Library package].
[31] Python, "itertools," 2017. [Library package].
[32] Python, "json," 2017. [Library package].
[33] Python, "math," 2017. [Library package].
[34] Python, "os," 2017. [Library package].
[35] Python, "PIL," 2017. [Library package].
[36] Python, "re," 2017. [Library package].
[37] Python, "subprocess," 2017. [Library package].
[38] Python, "sys," 2017. [Library package].
[39] Python, "time," 2017. [Library package].
[40] Python, "tkinter," 2017. [Library package].
[41] Python, "warnings," 2017. [Library package].