Evaluation of Background Subtraction in
Pan-Tilt Camera Tracking
Master’s thesis in Complex Adaptive Systems

MARKUS HÄGERSTRAND
HJALMAR KARLSSON

Department of Signals & Systems
Chalmers University of Technology

Gothenburg, Sweden 2016
Master’s Thesis EX040/2016



© The authors, 2016.

Supervisor: Harald Freij, SAAB AB

Examiner: Fredrik Kahl, Department of Signals & Systems, Chalmers

Department of Signals & Systems
Chalmers University of Technology
SE-412 96 Gothenburg
Sweden
Telephone: +46(0)31-772 1000

Abstract

Object tracking is the subfield of computer vision in which an object is to be located in
each frame of a video sequence. Automated tracking is useful wherever vision and cameras
are used: computers can assist with time-consuming tasks in television or surveillance,
and can stabilise tracking and increase its precision compared to manual operation. In a
system with a movable camera, such as a pan-tilt-zoom camera mounted on a robot,
information about the pan-tilt-zoom configuration can be used to locate a moving object
in successive frames, since the static background can be accounted for from one frame
to the next. In this thesis two state-of-the-art trackers, Adaptive Scale Mean Shift (ASMS)
and Kernelized Correlation Filter (KCF), as well as a tracker based on Shi-Tomasi corner
detection together with optical flow (OPTFLOW), are evaluated with a pre-processing
stage of online background subtraction based on the historical pixel value distribution.
All trackers were evaluated for robustness, with and without background subtraction, on
multiple scenarios containing either a circular unicoloured object or a multicoloured
polygon, each in front of two different backgrounds. Tracking performance was shown not
to benefit from this particular background subtraction, since the wrongly classified
background pixels did more harm than the correctly classified pixels did good. The
implemented background subtraction affected OPTFLOW the most, since it removed
important corner features, while ASMS and KCF were robust and unaffected by it. The
background subtraction routine for a static camera view was successfully adapted to
work for a translating camera, and may be of more use for trackers not evaluated here.

Keywords: Tracking, Background subtraction, PTZ


Acknowledgements

This thesis work has been conducted at SAAB AB located in Gothenburg. We would like
to express our great appreciation to the supervisor Harald Freij for guiding us throughout
the project. Also, we are grateful for the valuable input on object tracking given by
Martin Lillieborg and Henrik Söderström. Finally, we wish to thank our friends and
family for their support.

Markus Hägerstrand & Hjalmar Karlsson, Gothenburg, June 13, 2016


Contents

1 Introduction
1.1 Review of motion tracking
1.2 Background subtraction
1.3 Purpose and scope
1.4 Outline of the report

2 Theory
2.1 Background subtraction
2.2 Optical flow
2.2.1 Optical flow tracker
2.3 Mean shift
2.3.1 ASMS: Robust mean shift with scale adaption
2.4 Correlation filter
2.4.1 KCF: Kernelized correlation filter
2.5 Measures of tracking performance

3 Method
3.1 Materials
3.2 Testing
3.2.1 Background subtraction
3.2.2 Camera parameters
3.2.3 Test procedures for evaluating the trackers with background subtraction

4 Results
4.1 Camera configuration
4.2 Background subtraction
4.2.1 Stationary image
4.2.2 Rotating camera
4.2.3 Speed
4.3 Tracking
4.3.1 Speed
4.3.2 VOT Challenge
4.3.3 Tracking with background subtraction

5 Discussion
5.1 Camera
5.2 Background subtraction
5.3 Tracking
5.3.1 OPTFLOW
5.3.2 KCF
5.3.3 ASMS
5.4 Conclusions
5.4.1 Future work

Bibliography


Glossary

ASMS Adaptive Scale Mean Shift (tracker)

BGR Blue-Green-Red, colour space

BRW Background Ratio Weighting

DFT Discrete Fourier Transform

Egomotion Motion of the camera

HOG Histogram of Oriented Gradients

KCF Kernelized Correlation Filter (tracker)

KDE Kernel Density Estimation

MS Mean Shift

OpenCV Computer Vision library

OPTFLOW The optical flow tracker

PTZ Pan-tilt-zoom

VOT Visual Object Tracking



1
Introduction

Visual surveillance systems are in use in many places, with applications ranging from
passive monitoring to active use as an aid for operators, e.g. to increase their field of
vision. Automatic tracking of objects of interest decreases the need for human supervision
of the camera.

The task of tracking can generally be described as two steps: detecting objects
and updating the position of the objects in consecutive frames of a video sequence. The
focus of this thesis is on pan-tilt-zoom (PTZ) cameras, movable cameras restricted to
pan, tilt and zoom motions. This type of camera is suitable for keeping a single object
in focus, and thus only tracking of a single object at a time is studied. As only a single
object is to be tracked, the initialisation of the object of interest is done by the operator.
It is possible to construct a system that performs object detection and presents a list
of objects for an operator to choose from. Since that system also would require human
intervention, the system studied in this thesis is designed so that the initial detection is
done manually by selecting a region that contains only a single object.

Even if the initial detection is performed manually the next step of the tracking
(updating the position of the object) can benefit from some form of object detection,
segmenting foreground and background objects. By filtering out the background the
complexity of the image is reduced which may help with the tracking. For stationary
cameras the task of segmenting background and foreground for detecting objects of
interest is well-studied. For freely moving cameras or cameras with restricted motion
this is not the case.

Pre-processing videos by segmenting background or foreground has been done before
[1, 2, 3]. In particular, [2] presents a framework for performing background subtraction
with data from a PTZ camera by using recent frames and a known per-frame offset
to build a model for each pixel. A common approach to get knowledge of the camera
motion is through keypoint tracking, which is what is done in [2].

When the control of the PTZ is integrated with the tracker, the computation of the
camera egomotion can be simplified, as each frame comes with logged information about
the pan and tilt angles. The rotation of the camera causes a different image to be
projected onto the image plane; when the rotation is known, this can be compensated
for [4, 5]. Performing this transformation is computationally cheaper than matching
keypoints.
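
As a rough illustration of why known pan and tilt angles make this compensation cheap, the following sketch converts a small change in pan and tilt into an approximate pixel shift of the static background. It assumes an ideal pinhole camera whose focal length is known in pixels. The function name, the sign convention and the example numbers are assumptions made for illustration and are not taken from the system used in this thesis.

```python
import math


def background_shift_px(delta_pan_deg, delta_tilt_deg, focal_length_px):
    """Approximate image shift (dx, dy) in pixels of the static background.

    Assumes an ideal pinhole camera and rotations small enough that the
    effect near the image centre is close to a pure translation.
    """
    dx = focal_length_px * math.tan(math.radians(delta_pan_deg))
    dy = focal_length_px * math.tan(math.radians(delta_tilt_deg))
    return dx, dy


# Hypothetical example: 0.5 degrees of pan, no tilt, focal length of 1000 px.
dx, dy = background_shift_px(0.5, 0.0, 1000.0)
```

Only two tangent evaluations per frame are needed here, whereas keypoint matching requires detecting and matching features in every frame.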

To keep track of the detected objects many different algorithms are available [6], and
which to choose depends on a number of factors. Important factors for applications in
surveillance are speed and robustness. That means the tracker should be able to work
in real time, i.e. at least 30 frames per second, and should not lose track of the object
too often.

The majority of research in computer vision with focus on tracking tries to solve the
very general problem: given a video stream from an unrestricted camera, keep track of
the motion of some object(s). By restricting the motion of the camera to changes in pan,
tilt and zoom some simplifications are possible. Performing background subtraction is
intractable in the general case unless very detailed information about the movement is
known; usually no such data exist, and methods such as matching keypoints have to
be employed. Those methods come with their own problems: is the movement of a
keypoint caused by camera egomotion, or did the object move?

Performing background subtraction for tracking has been done before [5, 7]. What
is new in this work is the evaluation of the impact of using background subtraction before
tracking.

To compare the performance between trackers there are frameworks where trackers
are evaluated on a diverse dataset to obtain measures of accuracy and robustness as well
as a ranking relative to other trackers. One such framework is the VOT Challenge [8].

1.1 Review of motion tracking

The first step of the tracking, object detection, can be done in many ways. How to do
it depends on data available and whether the object is in motion or not. For objects
at rest some prior knowledge regarding the type of objects must be known. This can
be a single sample image of the object to track. Detecting moving objects in an image
sequence does not need prior knowledge but needs multiple consecutive images. Two
common methods for detecting moving objects are [9]:

Background subtraction: If the view of the camera is known it is possible to perform
background subtraction. The principle is that if a reference background image is
known, that image can be compared with the frame in which objects are to be
detected. The regions that are different contain moving objects.

Optical flow: By calculating the flow field of pixels in successive frames it is possible
to detect objects. Clusters of pixels moving together are likely to be part of the
same object.

When the location of the object to be tracked is known some features must be extracted
and recorded to make it possible to find the same object in new frames. Good segmentation
from the background ensures that only features that actually belong to the object
of interest are recorded. The problem is thus, given an area containing an object, to
determine which pixels belong to the object and which belong to the background. In
some cases a pixel-wise segmentation is not needed, but if too much background gets
incorporated in the object model the noise will make it very hard to keep track of the
target.

Objects can be represented in multiple ways: as a centroid point, multiple points,
primitive geometric shapes, or object contours and silhouettes. These can be combined
to get a good representation of the object that is to be tracked. Good features to track
are those that continue to look the same even if the scale changes or the object rotates
out of plane. Examples of such features are corners and edges. Another possible
representation of the object is the colour histogram of the object area [10].

When features have been extracted they need to be tracked. How this is done depends
to some extent on the features used. Two broad categories of tracking algorithms are [6]:

Point tracking: A detector selects some points that are located on an object.
These points are then tracked from frame to frame based on assumptions about
how much a given point can move and change between frames.

Kernel tracking: Kernel refers to the object shape and appearance. Usually this is
some bounding box for the object and a histogram. Objects are tracked by com-
puting the motion of the kernel.

Point tracking can be done by calculating the optical flow between images and keeping
track of the locations of interesting features, or Kalman filters can be used to predict the
motion of the feature points [11, 12]. Point based tracking performs recognition (and
keeps track of whether tracking is lost or not) by clustering the extracted points into
higher level features that can be matched between frames [9].

1.2 Background subtraction

In the most basic sense background subtraction is nothing more than what the name
implies: taking the absolute difference between a reference image (the background) and
an image of interest. At image positions where the difference is greater than some threshold the
position is classified as not belonging to the background, i.e. classified as a foreground
pixel. Most modern algorithms for performing background subtraction are more complex
than this and can be divided into a couple of categories. The main difference between
most methods is how the background model is represented. From simple to more complex
ones:

Running Gaussian Average: For each pixel the background is modelled indepen-
dently as a Gaussian probability density function. The Gaussian distribution is
fitted to the n latest pixel values and a pixel is classified by calculating the prob-
ability that the latest pixel value describes the same object as the earlier pixel
values did.

Mixture of Gaussians: Sometimes the part of an image that should be classified as
background is not entirely static; some parts might move a little (due to wind,
vibrations of the camera etc.) and should still be classified as background. To cope
with that kind of background, a background model with a single mode is insufficient. The
idea is to have different Gaussian models for different possible background objects.
If a pixel value is unlikely to come from any of the distributions, it is
classified as foreground.

Kernel density estimation (KDE): In this method a function is constructed that
gives the probability that a given pixel value belongs to the distribution of background
pixel values. In the running Gaussian average the previously known pixel values are
fitted to a Gaussian to model the distribution; in the kernel density estimator the
distribution is instead constructed from a sum of kernels.
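
To make the simplest of these models concrete, the following sketch fits a Gaussian to the latest values of a single greyscale pixel and flags a new value as foreground when it lies far from the mean. The tolerance of 2.5 standard deviations, the lower bound on the standard deviation and the function name are assumptions made for illustration.

```python
import statistics


def is_foreground_gaussian(history, value, k=2.5, min_std=1.0):
    """Classify one greyscale pixel value against a Gaussian fitted to its history.

    history: the n latest intensities observed at this pixel position.
    k: number of standard deviations tolerated (assumed value).
    min_std: lower bound on the standard deviation, so that a perfectly
             constant history does not reject every small fluctuation.
    """
    mean = statistics.mean(history)
    std = max(statistics.pstdev(history), min_std)
    return abs(value - mean) > k * std


history = [100, 102, 99, 101, 100]
small_change = is_foreground_gaussian(history, 101)   # within the noise level
large_change = is_foreground_gaussian(history, 180)   # far outside the noise level
```

A mixture of Gaussians repeats this test against several fitted distributions per pixel, and the kernel density estimator replaces the single fitted Gaussian with one kernel per stored sample.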

For a more detailed review of different methods and how they differ, see [13, 14]. A
variant of kernel density estimation for background subtraction is implemented in [15]
and is explained in more detail in section 2.1. An example of an implementation of a
Gaussian Mixture Model for background subtraction with a PTZ camera is described
in [2].

Another approach to background subtraction is to calculate a partial optical flow for
the video sequence and to classify the point trajectories as belonging either to the
foreground or to the background based on constraints on the movements [1].

1.3 Purpose and scope

The goal of this thesis is to study the impact of background subtraction on tracking
when the camera is restricted to pan-tilt motions. Restricting the motion
reduces the complexity of matching pixel locations between subsequent frames, and prior
knowledge of the changes in pan and tilt angle reduces the complexity further, as the
camera egomotion does not have to be computed. Tracking is only done at fixed zoom
levels, with no change in zoom level during tracking. Allowing zoom would require some
means of rescaling the bounding box for the trackers and updating the model. While this
can be done, it is outside the scope of this thesis.

To achieve the goal, a background subtractor must be implemented and trackers must be
found that are able to track both with and without the background subtraction. The
main measures used for the evaluation are computational performance
(frames processed per second) and robustness. Three different trackers that meet the
requirements of speed and robustness in tracking without background subtraction are
selected and evaluated with background subtraction.

The trackers are expected to be able to process new frames at a minimum of 30 fps given
that the resolution of the video stream is not higher than 720 × 480 pixels. The system
will only be evaluated at fixed zoom levels, no zooming will be done during a sequence,
and the tracked objects are assumed to be rigid. In addition it is assumed that motion
between successive frames (change in pan and tilt of the camera) is small enough that
it can be approximated as a pure translation of the scene.

The three trackers chosen are “Adaptive Scale Mean Shift” (ASMS) [16] and “Kernelized
Correlation Filter” (KCF) [17], both modified from publicly available code (footnotes 1
and 2), and a tracker we call OPTFLOW. The three track different features: ASMS represents
the target with a colour histogram, KCF calculates a histogram of oriented gradients that
is used to create a template, and OPTFLOW tracks feature points (corners).

1.4 Outline of the report

There are five chapters in this report: Introduction, Theory, Method, Results, and
Discussion. This first chapter introduced the concepts of tracking and background sub-
traction as well as the motivation and goal of the thesis. In the theory chapter the
algorithms and some technical details from the background subtraction and the three
trackers are presented, information later used in the discussion chapter. The method
chapter consists of a system overview and what was carried out in the project together
with the organisation of the test procedures. The results chapter presents the results
of the tests described in the method chapter. The last chapter discusses
relevant aspects of tracking and background subtraction, based on the theory
and results chapters.

1 https://github.com/vojirt/asms
2 https://github.com/joaofaro/KCFcpp


2
Theory

This chapter provides the theoretical background for the different tracking algorithms
and the background subtraction. It also contains some notes on how to measure
tracking performance.

2.1 Background subtraction

The introduction gave an overview of different ways to perform background
subtraction. This section explains the algorithm implemented and used in this
thesis. It is based on [15], with some modifications to make it suitable for use with a
non-stationary camera. First, an example demonstrates the simplest version of background
subtraction.

A simple and intuitive way to perform background subtraction is to calculate the
absolute difference between two consecutive frames. The images $F_n$ and $F_{n+1}$ are
converted to greyscale ($F_n(x)$ is the colour intensity of the image point at $x$) and the
subtraction produces a binary mask $K$ where

\[
K(x) =
\begin{cases}
1 & \text{if } |F_n(x) - F_{n+1}(x)| > \tau \\
0 & \text{otherwise.}
\end{cases}
\tag{2.1}
\]

The value 1 represents that the pixel belongs to the foreground and 0 that it is part of
the background; τ is the threshold for how large a difference in colour intensity is
allowed. An example of the result of this kind of background subtraction can be
seen in figure 2.1.
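
A minimal sketch of equation (2.1) for two small greyscale frames, stored as lists of rows, could look as follows. The function name and the example frames are illustrative assumptions.

```python
def difference_mask(frame_a, frame_b, tau):
    """Binary mask K from equation (2.1) for two greyscale frames.

    Frames are lists of rows of intensities (0 to 255) with equal size.
    """
    return [
        [1 if abs(a - b) > tau else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(frame_a, frame_b)
    ]


# A bright pixel moves from the centre to the bottom-right corner,
# while the top-left pixel only changes by a small amount of noise.
frame_1 = [[10, 10, 10], [10, 200, 10], [10, 10, 10]]
frame_2 = [[12, 10, 10], [10, 10, 10], [10, 10, 205]]
mask = difference_mask(frame_1, frame_2, tau=20)
# mask == [[0, 0, 0], [0, 1, 0], [0, 0, 1]]
```

Note that both the position the bright pixel left and the position it moved to are marked as foreground, which is a known limitation of plain frame differencing.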

Figure 2.1: Example of the result of taking the absolute difference between two frames.
Panels: (a) Frame 1, (b) Frame 2, (c) Absolute difference, (d) Thresholded. In 2.1c the
raw result is displayed and in 2.1d the result after setting all pixels with a luminosity
of over 20 (in any channel) to white and the rest to black. Pixels are represented as colour
intensities in the BGR channels with values from 0 to 255. Top figures from the “car”
sequence in the VOT2013 dataset [18].

A more sophisticated algorithm for background subtraction works by constructing
a more complex model of the background. The background model is constructed from
the N most recent frames, and the transformation caused by camera egomotion between
consecutive frames is assumed to be known. In the case of a PTZ camera this transformation
is calculated from the known absolute pan and tilt position for each frame. For an
illustration of how the background model is constructed, see figure 2.2.

A probability estimate for classifying pixels as background or foreground is calculated
as:

\[
\Pr(x_t) = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{d} \frac{1}{\sqrt{2\pi\sigma_j^2}}
\exp\left[ -\frac{1}{2} \frac{(x_{t_j} - x_{i_j})^2}{\sigma_j^2} \right]. \tag{2.2}
\]

For each pixel $x_t$ ($x_t = F_t(x)$) in the latest frame, $n$ previous values are known
($n \le N$ due to movement of the camera; for a stationary camera $n = N$). $\sigma_j^2$ is the
kernel bandwidth for a given colour channel $j$; for a greyscale image $d = 1$ and for a colour
image usually $d = 3$, e.g. blue-green-red (BGR). Equation (2.2) is equation (5) in [15].
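
Equation (2.2) can be written out for a single pixel as in the following sketch. The function name and the sample values are illustrative assumptions; a practical implementation would vectorise the computation over the whole frame.

```python
import math


def background_probability(x_t, samples, sigma2):
    """Kernel density estimate from equation (2.2) for one pixel.

    x_t: current pixel value, one number per colour channel.
    samples: the n previous values recorded for this pixel position.
    sigma2: kernel bandwidth (variance) for each colour channel.
    """
    total = 0.0
    for x_i in samples:
        product = 1.0
        for x_tj, x_ij, s2 in zip(x_t, x_i, sigma2):
            norm = 1.0 / math.sqrt(2.0 * math.pi * s2)
            product *= norm * math.exp(-0.5 * (x_tj - x_ij) ** 2 / s2)
        total += product
    return total / len(samples)


# Three stored BGR samples for one pixel and two candidate values.
samples = [(100, 50, 20), (101, 49, 21), (99, 51, 20)]
sigma2 = (4.0, 4.0, 4.0)
p_similar = background_probability((100, 50, 20), samples, sigma2)
p_different = background_probability((200, 50, 20), samples, sigma2)
```

Here `p_similar` is far larger than `p_different`, so a suitable threshold on the estimate separates the value that resembles the history from the one that does not.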

Figure 2.2: Background model constructed from three consecutive frames (labelled
Frame: t, Frame: t-1 and Frame: t-2 in the figure). Pixels in the black area have n = 3,
pixels in the grey area have n = 2 and pixels in the light grey area have n = 1. The other
pixels are outside of the current frame and are discarded. The blue rectangle is a
stationary background object.

The parameters that need to be set to segment the frame are thus a threshold $\tau$, such
that if $\Pr(x_t) > \tau$ then $x_t$ is a background pixel, and the kernel bandwidths
$\sigma_j^2$. The value of the kernel bandwidths limits the size of the pixel fluctuations
that are filtered out. The value should be set to allow small variations due to noise in the
image but not large changes in pixel intensity (from an object with a different colour moving
in front of something else). If the value is too high, only objects with a high contrast
compared to the background will be detected; setting it too low results in many false
foreground pixels. From [15] a good estimate is:

\[
\sigma_j^2 = \frac{m_j^2}{0.68^2 \cdot 2} \tag{2.3}
\]

where $m_j$ is the median of $|x_i - x_{i+1}|$ for colour channel $j$, with $x$ going over all
pixels and $i \in [0, n)$.
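
The estimate in equation (2.3) amounts to a median over absolute differences between consecutive samples, as in the following sketch for one colour channel. The function name and the example histories are illustrative assumptions.

```python
import statistics


def estimate_bandwidth(histories):
    """Estimate sigma_j^2 from equation (2.3) for one colour channel.

    histories: one list of consecutive values per pixel position, all for
    the same colour channel.
    """
    differences = [
        abs(a - b)
        for history in histories
        for a, b in zip(history, history[1:])
    ]
    m = statistics.median(differences)
    return m ** 2 / (0.68 ** 2 * 2)


# Two pixel positions with four consecutive samples each.
histories = [[10, 12, 11, 15], [50, 50, 53, 52]]
sigma2 = estimate_bandwidth(histories)
```

Because the median is used, a few large jumps caused by moving objects have little influence on the estimated bandwidth.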

A major challenge is how to update the background model so that moving objects are
detected well while incorrect detection of stationary objects (due to errors in
the background model) is avoided. Given a new pixel sample, one can do a blind update and
add it to the background model regardless of whether it has been classified as background
or not. Alternatively, one can choose to add only those pixels that have been classified as
background.

By blindly updating the background model, pixels that are part of the foreground
can accidentally get added to the model. This usually results in pixels in the centre of
the moving object getting classified as background.

If, on the other hand, the background model is updated by only adding pixels that have
been classified as background, one can end up in a deadlock where pixels
that have once been incorrectly classified as background make the real background get
classified as foreground. An example of the difference can be seen in figure 2.3, from the
sequence “car” in the VOT Challenge 2013 dataset (http://votchallenge.net) [18].

Figure 2.3: Panels: (a) Current frame, (b) Selective update result, (c) Blind update
result. When selective updating is performed the moving object is segmented with
very good precision, but the position from which the object starts to move in the first
frame is initially classified as background; thus when the object moves away the real
background gets classified as foreground. In the model performing blind update the
segmentation is not as good, but incorrect classifications do not persist. Top figure from
the “car” sequence in the VOT2013 dataset [18].

In order to get the best of both update models, two separate background histories
are recorded: one that is updated selectively, only adding values that are believed to be
background, and one that is updated blindly. Then, for a pixel to be classified as
foreground, both models must classify it as foreground. Two separate histories introduce
more parameters: how many frames (i.e. values for each pixel) should be
kept in each model, and whether all frames are included in each model or only every W:th
frame [15].
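
The interplay between the two histories can be sketched for a single pixel as follows. The class, the stand-in classifier and all parameter values are assumptions made for illustration. In particular, the classifier is a simple tolerance test rather than the kernel density estimate, and the selective history is updated based on the combined classification, which is one possible reading of the scheme.

```python
from collections import deque


class DualHistory:
    """Two sample histories for one pixel: selective and blind updates."""

    def __init__(self, n_selective, n_blind, every):
        self.selective = deque(maxlen=n_selective)
        self.blind = deque(maxlen=n_blind)
        self.every = every        # only every `every`-th frame enters the blind history
        self.frame_count = 0

    def update(self, value, is_background):
        """Classify `value`, update both histories, return True for foreground."""
        foreground = (
            not is_background(value, self.selective)
            and not is_background(value, self.blind)
        )
        if not foreground:
            self.selective.append(value)   # selective: background values only
        if self.frame_count % self.every == 0:
            self.blind.append(value)       # blind: regardless of classification
        self.frame_count += 1
        return foreground


def close_to_history(value, history, tolerance=10):
    """Stand-in classifier: background if the history is empty or holds a similar value."""
    return not history or any(abs(value - h) <= tolerance for h in history)


pixel = DualHistory(n_selective=5, n_blind=5, every=2)
results = [pixel.update(v, close_to_history) for v in [100, 101, 99, 180, 100]]
# results == [False, False, False, True, False]
```

Requiring both histories to report foreground means that a wrong entry in one history can be overruled by the other, which is what breaks the deadlock described for purely selective updating.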

To increase performance, in terms of both suppression of false detections and reduced
processing time, the algorithm is modified to run on image pyramids. The pyramid is
constructed by scaling down the original frame $M$ times, so that level 0 is the full-scale
image $[W \times H]$ and level $l$ has been downscaled to $\frac{1}{2^l}[W \times H]$; see
figure 2.4 for an illustration. Starting at level $l = M - 1$, the background subtraction is
performed with a threshold $\tau_l$. The result is then scaled up and used as a mask on level
$l \leftarrow l - 1$. Only pixels classified as foreground on the coarser level are considered
to be foreground candidates on the finer level.

Figure 2.4: Illustration of image pyramids. Source:
https://en.wikipedia.org/wiki/File:Image_pyramid.svg.

This reduces the computational load, as many pixels do not need to be evaluated on
the full-resolution image. Another benefit is suppression of false detections due to flicker
of single pixels, as those changes get blurred and thresholded away in the coarser images.
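
The coarse-to-fine evaluation can be sketched as follows. To keep the example short, the per-pixel test is a thresholded absolute difference against a single reference image instead of the kernel density estimate, and downscaling is done by averaging 2 x 2 blocks. Both choices, as well as all names and values, are assumptions made for illustration.

```python
def downscale(image):
    """Halve width and height by averaging 2 x 2 blocks (even sizes assumed)."""
    return [
        [
            (image[y][x] + image[y][x + 1] + image[y + 1][x] + image[y + 1][x + 1]) / 4
            for x in range(0, len(image[0]), 2)
        ]
        for y in range(0, len(image), 2)
    ]


def build_pyramid(image, levels):
    """Level 0 is the full-scale image, level l is downscaled by 2**l."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downscale(pyramid[-1]))
    return pyramid


def coarse_to_fine_mask(frame, background, thresholds):
    """Full-scale foreground mask and the number of pixel tests performed."""
    levels = len(thresholds)
    frames = build_pyramid(frame, levels)
    backgrounds = build_pyramid(background, levels)
    mask = None
    evaluated = 0
    for level in range(levels - 1, -1, -1):
        f, b = frames[level], backgrounds[level]
        new_mask = [[0] * len(f[0]) for _ in f]
        for y in range(len(f)):
            for x in range(len(f[0])):
                # Skip pixels whose parent on the coarser level was background.
                if mask is not None and not mask[y // 2][x // 2]:
                    continue
                evaluated += 1
                if abs(f[y][x] - b[y][x]) > thresholds[level]:
                    new_mask[y][x] = 1
        mask = new_mask
    return mask, evaluated


background = [[10] * 4 for _ in range(4)]
frame = [
    [200, 200, 10, 10],
    [200, 200, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 40],   # single flickering pixel in the corner
]
mask, evaluated = coarse_to_fine_mask(frame, background, thresholds=[20, 20])
# mask == [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]], evaluated == 8
```

In this example the bright block is detected with 8 pixel tests instead of 16, and the flickering corner pixel, which exceeds the threshold at full resolution, is averaged away on the coarser level.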
A complete list of the parameters for the background subtraction:

• $\sigma_j^2$, the bandwidth of the kernel for colour channel $j$, which can either be set to a
fixed value or estimated by the algorithm. It can be seen as a value for how much the
density estimate is smoothed.

• $N_{lt}$, number of frames in the background model that is blindly updated.

• $N_{st}$, number of frames in the selectively updated model.

• $W$, only every $W$:th frame is added to the blind update model.

• $n_{pyr}$, number of levels in the image pyramid.

• threshold, how high the background pixel probability must be for a pixel to be
classified as background.

How these parameters are chosen depends on the movement of the tracked object.
W and the size of the blind update history Nlt should be set so that incorrect updates
in the selectively updated model are suppressed.

2.2 Optical flow

Optical flow describes the motion field for objects in an image. When it has been
calculated it is possible to use it for image segmentation and tracking. The optical flow
problem can be formulated as follows: given an image point $u = [u_x, u_y]^T$ in frame $F_n$,
find $v = [u_x + \delta_x, u_y + \delta_y]^T$ in $F_{n+1}$ such that $F_n(u)$ and $F_{n+1}(v)$
are similar [11]. See figure 2.5 for an example of the optical flow of a sparse set of points
for a square undergoing a rigid transformation.

Figure 2.5: Panels: Frame 1, Frame 2, Optical flow 1-2. Changes in the grid pattern from
frame 1 to 2 can be represented by a vector field; that vector field is the optical flow. The
black square from frame 1 undergoes a translation and rotation between frames 1 and 2.
Only the optical flow for the corner points (marked with grey circles) is shown.

One way to formulate the problem (as done by Lucas and Kanade) is to consider a
small area around the image point (determined by $\omega_x$ and $\omega_y$), and define the
following function:

\[
\varepsilon(d, A) = \sum_{x=-\omega_x}^{\omega_x} \sum_{y=-\omega_y}^{\omega_y}
\left( F_n(\mathbf{x} + u) - F_{n+1}(A\mathbf{x} + d + u) \right)^2 . \tag{2.4}
\]

Here $\mathbf{x} = [x, y]^T$ is the offset within the window, $A$ is an affine transformation
matrix that takes into account that the motion between the two images might not be a pure
translation, and $d = [\delta_x, \delta_y]^T$. The goal is then to find $A$ and $d$ such that
$\varepsilon(d, A)$ is minimised [11, 19]. The optical flow (velocity) for the point $u$ is
thus $d$. This works under the assumption that the optical flow around a given pixel is
essentially constant.

This is a differential method for computing the optical flow. A pyramidal variant
of the Lucas-Kanade method is implemented in OpenCV and is described in detail by
Bouguet [11].
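
To illustrate what minimising equation (2.4) means, the following sketch evaluates the error for pure translations ($A$ fixed to the identity) and picks the best integer displacement by exhaustive search. This is deliberately not the gradient-based Lucas-Kanade solver, which reaches the minimum iteratively and with sub-pixel precision. All names and the synthetic frames are illustrative assumptions.

```python
def ssd(frame_a, frame_b, u, d, w):
    """Equation (2.4) with A fixed to the identity: sum of squared differences
    over a (2w + 1) x (2w + 1) window around u, displaced by d in frame_b."""
    ux, uy = u
    dx, dy = d
    return sum(
        (frame_a[uy + y][ux + x] - frame_b[uy + y + dy][ux + x + dx]) ** 2
        for y in range(-w, w + 1)
        for x in range(-w, w + 1)
    )


def best_displacement(frame_a, frame_b, u, w, search):
    """Integer displacement d within +/- search that minimises the SSD."""
    candidates = [
        (dx, dy)
        for dy in range(-search, search + 1)
        for dx in range(-search, search + 1)
    ]
    return min(candidates, key=lambda d: ssd(frame_a, frame_b, u, d, w))


def make_frame(size, x0, y0):
    """Black frame with a bright 2 x 2 block whose top-left corner is (x0, y0)."""
    return [
        [255 if x0 <= x < x0 + 2 and y0 <= y < y0 + 2 else 0 for x in range(size)]
        for y in range(size)
    ]


frame_n = make_frame(9, 3, 3)
frame_n1 = make_frame(9, 5, 4)   # the block has moved by (2, 1)
d = best_displacement(frame_n, frame_n1, u=(4, 4), w=1, search=2)
# d == (2, 1)
```

The point, window and search range are chosen so that all indices stay inside the 9 x 9 frames; a general implementation would need explicit border handling.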

2.2.1 Optical flow tracker

To construct a tracker around the Lucas-Kanade method for optical flow a couple of
additional components are necessary. Suitable feature points to track by calculation
of the optical flow need to be located. To get good tracking one should also try to
ensure that the points actually belong to the object which is to be tracked and not the
background. That a point is easy to track does not mean that it is one that is of interest.

A good candidate method for detection of suitable features is the Shi-Tomasi corner
detector, as it is designed to select features based on how suitable they are for tracking
[20]. Good features to track are in general regions which contain much motion
information. Straight edges and unidirectional patterns can only be used to determine
motion in one direction, whereas corners and salt-and-pepper textures are more suitable
as they provide more information about the motion [21]. See figure 2.6 for examples.

Figure 2.6: Examples of different textures that can be on objects: (a) salt-and-pepper,
(b) no texture, (c) corners, (d) edges. Salt-and-pepper and corners are good for tracking.
For the other textures it is hard to find the correct optical flow, and they are thus bad
for tracking.

The basic idea of corner detection is that, given a small image patch, shifting the
region slightly in any direction should give a large change in appearance. For a uniform
region there is almost no change regardless of the direction in which the patch is shifted;
for straight edges there is only a large change if the shift is orthogonal to the direction of
the edge. A mathematical approach to finding corners and edges is the Harris corner
detector (which the Shi-Tomasi corner detector is a variation of) [22].
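
The idea can be made concrete with the smaller eigenvalue of the gradient structure tensor, which is the quantity the Shi-Tomasi criterion is based on. The sketch below computes it for a small greyscale patch with central differences and no window weighting, and leaves out the thresholding and non-maximum suppression that a complete detector needs. All names and patches are illustrative assumptions.

```python
import math


def min_eigenvalue_score(patch):
    """Smaller eigenvalue of the gradient structure tensor summed over a patch.

    patch: list of rows of greyscale intensities. Gradients are central
    differences, so the outermost pixels are skipped.
    """
    sxx = sxy = syy = 0.0
    for y in range(1, len(patch) - 1):
        for x in range(1, len(patch[0]) - 1):
            ix = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            iy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    trace = sxx + syy
    root = math.sqrt((sxx - syy) ** 2 + 4.0 * sxy ** 2)
    return (trace - root) / 2.0


flat = [[50] * 5 for _ in range(5)]
edge = [[255 if x >= 2 else 0 for x in range(5)] for _ in range(5)]
corner = [[255 if x >= 2 and y >= 2 else 0 for x in range(5)] for y in range(5)]

flat_score = min_eigenvalue_score(flat)
edge_score = min_eigenvalue_score(edge)
corner_score = min_eigenvalue_score(corner)
```

The flat patch and the straight edge both score approximately zero, because the intensity varies in at most one direction, while the corner patch scores clearly higher.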

Making sure that features actually belong to the tracked object is harder. One
option is to perform image segmentation to get rid of the background. This can be
done through background subtraction or by simply finding contours and guessing which
contour belongs to the tracked object.

A simple tracker based on these ideas can then be formulated as follows:

1. Find the largest contour in a selected region of interest. This can be done by running
Canny edge detection followed by a border following algorithm.

2. Locate good features to track inside the contour using the Shi-Tomasi algorithm
for corner detection.

3. Calculate the optical flow for the corners to locate them in the next frame. If many
points are lost, try to redetect by going to step 1, using the last known tracking rectangle
as region of interest.

4. Centre the tracking rectangle on the centre point of all tracked features.

5. Scale the tracking rectangle based on how much the points have moved in relation
to the mean position of all points since the last frame.

6. Go to step 3.

This is the basic idea of the algorithm, in its simplicity it works quite well and is very
fast.
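Steps 4 and 5 can be sketched in a few lines of Python. The sketch assumes that the
scale change is measured as the ratio between the mean distance of the points to their
centroid in the new and in the previous frame. This is one possible reading of step 5,
and the function names are hypothetical.

```python
import math


def centroid(points):
    """Mean position of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_spread(points):
    """Mean distance from the points to their own centroid."""
    cx, cy = centroid(points)
    return sum(math.hypot(x - cx, y - cy) for x, y in points) / len(points)


def update_box(box, prev_points, new_points):
    """Update a box given as (centre_x, centre_y, width, height)."""
    _, _, w, h = box
    cx, cy = centroid(new_points)              # step 4: re-centre on the features
    spread_before = mean_spread(prev_points)
    if spread_before > 0:
        scale = mean_spread(new_points) / spread_before   # step 5: relative spread
    else:
        scale = 1.0                            # degenerate case: keep the size
    return (cx, cy, w * scale, h * scale)


# Four features that move away and spread out to twice the distance.
prev_points = [(0, 0), (2, 0), (0, 2), (2, 2)]
new_points = [(10, 10), (14, 10), (10, 14), (14, 14)]
new_box = update_box((1, 1, 4, 4), prev_points, new_points)
```

In this example the box is moved to (12, 12) and its width and height are doubled.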

2.3 Mean shift

Another approach to tracking is a technique commonly referred to as mean shift
(MS). It is a mode-seeking algorithm that works by mapping all pixels in an image
into some feature space and then locating clusters. A cluster will then correspond to an
image segment. The method is iterative and works by stepwise increasing the similarity
with a reference by shifting a window [23]. The mapping can be thought of as a greyscale
image where high pixel values (darker) correspond to a higher likelihood that the pixel at
that location belongs to the reference. The objective is then to centre the window on the
largest cluster, that is, to maximise the number of dark pixels in the window. This is
illustrated in figure 2.7.

The mapping of pixels to a feature space can be done by histogram backprojection.
The histogram for an initial area containing the object that is to be tracked is computed.
Pixels are then assigned a probability of belonging to the object based on how many
pixels of that colour exist in the reference [24]. If, for example, the histogram for the
tracked object shows that the object has 66 % blue pixels, 33 % red pixels and 0 % green
pixels, all blue and red pixels will be given a high probability of belonging to the
tracked object. For an example of this mapping see figure 2.8. By mean shift, the window
is then centred on the region with high-probability pixels.
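A minimal sketch of histogram backprojection, using colour names instead of quantised
colour values to keep the example readable. The function names are hypothetical, and a
real tracker would first map each pixel to a histogram bin.

```python
from collections import Counter


def colour_histogram(pixels):
    """Normalised histogram: fraction of reference pixels per colour."""
    counts = Counter(pixels)
    total = len(pixels)
    return {colour: n / total for colour, n in counts.items()}


def backproject(image, histogram):
    """Replace every pixel by the histogram value of its colour."""
    return [[histogram.get(pixel, 0.0) for pixel in row] for row in image]


# Reference area with 2/3 blue and 1/3 red pixels, as in the example in the text.
reference = ["blue", "blue", "red"]
histogram = colour_histogram(reference)

frame = [["green", "blue"],
         ["red", "blue"]]
probability_map = backproject(frame, histogram)
```

Green pixels, which do not occur in the reference, are mapped to probability zero, while
blue and red pixels are mapped to 2/3 and 1/3 respectively.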

Figure 2.7: Example of how mean shift works (panels a, b and c). Consider a) as the first
frame, where the black square in the centre is a cluster, and b) as a frame where the
cluster has moved. The largest cluster is at the far right of the tracking box and it is
most likely that the object sought is there, and thus the mean shift algorithm will
stepwise move the box right until the cluster is centred again.

The equation describing the mean shift is

$$m(x) = \frac{\sum_{s \in \Omega} K(s - x)\, s}{\sum_{s \in \Omega} K(s - x)} \tag{2.5}$$

where K(·) is a given kernel function and Ω a neighbourhood of x. In the example in
figure 2.7, x would be the centre of the rectangle, Ω all the pixels in the rectangle and
the movement of the rectangle to the right the update x ← m(x). This is iterated
until x converges.
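The iteration can be sketched as follows for a probability map such as the one obtained
by backprojection. The sketch assumes a flat kernel (1 inside a square window, 0
outside) and weights each pixel position by its map value. The window size, the rounding
to whole pixels and the function name are choices made for this example.

```python
def mean_shift(weights, cx, cy, half, max_iter=50):
    """Move a square window of side 2*half+1 towards the local weighted mean.

    `weights[y][x]` plays the role of the probability map; the kernel K in
    equation (2.5) is taken to be flat (1 inside the window, 0 outside).
    """
    height, width = len(weights), len(weights[0])
    for _ in range(max_iter):
        total = sum_x = sum_y = 0.0
        for y in range(max(0, cy - half), min(height, cy + half + 1)):
            for x in range(max(0, cx - half), min(width, cx + half + 1)):
                total += weights[y][x]
                sum_x += weights[y][x] * x
                sum_y += weights[y][x] * y
        if total == 0.0:
            break  # nothing to follow inside the window
        new_cx, new_cy = int(sum_x / total + 0.5), int(sum_y / total + 0.5)
        if (new_cx, new_cy) == (cx, cy):
            break  # converged
        cx, cy = new_cx, new_cy
    return cx, cy


# 9 x 9 map with a 3 x 3 cluster centred at (x, y) = (6, 4).
weights = [[1.0 if 5 <= x <= 7 and 3 <= y <= 5 else 0.0 for x in range(9)]
           for y in range(9)]
final_centre = mean_shift(weights, cx=3, cy=4, half=3)
```

Starting at (3, 4), the window only sees part of the cluster and is pulled to the right
until the cluster is centred at (6, 4).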

2.3.1 ASMS: Robust mean shift with scale adaption

Adaptive scale mean shift (ASMS) is an improved mean shift tracker described in detail
in [16]. It is one of the algorithms evaluated in this thesis for PTZ tracking, where we
check whether it benefits from pre-processing in the form of background subtraction.

In mean shift tracking the kernel is moved from a given location $y_0$ to a new location

$$y_1 = m(y_0) = \frac{\sum_{i=1}^{n} w_i\, g\!\left(\left\| \frac{s_i - y_0}{h} \right\|^2\right) s_i}{\sum_{i=1}^{n} w_i\, g\!\left(\left\| \frac{s_i - y_0}{h} \right\|^2\right)} \tag{2.6}$$

Using the same notation as in equation (2.5), $s_i \in \Omega$ and n is the number of
elements in Ω. Where x was a coordinate in some space, y is here a point in
two-dimensional Euclidean space. Furthermore, $g(x) = -\frac{d}{dx} k(x)$, where k(x) is
some isotropic, convex and monotonically decreasing kernel profile.

Figure 2.8: Example of using histogram backprojection to convert a colour image to a
probability map. Panels: (a) frame with selection rectangle, (b) colour histogram
(probability per colour: red 33 %, green 0 %, blue 66 %), (c) probability map. In 2.8a an
area is selected and a histogram describing the colour distribution is created. In 2.8c
the probability mapping based on the histogram is shown. Darker pixels correspond to
higher probability pixels.

The weights

$$w_i = \sqrt{\frac{\hat{q}_u}{\hat{p}_u(y_0)}}$$

where u is the feature that pixel $s_i$ maps to, represent the likelihood that $s_i$
belongs to the target. $\hat{q}_u$ is the probability of feature u in the target and
$\hat{p}_u(y)$ is the probability of feature u in a target candidate centred at y [16].
This is improved in [16] by changing the calculation of $w_i$ so that an estimated
background histogram is taken into account through background ratio weighting (BRW). A
guess for the background histogram is calculated from the neighbourhood of the tracked
object.
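The basic weights (before background ratio weighting) can be computed as in the sketch
below. The three-bin histograms are made-up numbers, and returning a weight of zero for
features that do not occur in the candidate is an assumption of this example.

```python
import math


def mean_shift_weights(pixel_features, q, p):
    """w_i = sqrt(q_u / p_u) for the feature u that each pixel maps to.

    q is the target histogram and p the histogram of the current candidate.
    Features that are absent from the candidate get weight 0 (assumption).
    """
    return [math.sqrt(q[u] / p[u]) if p[u] > 0.0 else 0.0
            for u in pixel_features]


q = [0.5, 0.5, 0.0]        # target histogram (hypothetical values)
p = [0.125, 0.5, 0.375]    # candidate histogram at y0 (hypothetical values)
weights = mean_shift_weights([0, 1, 2, 0], q, p)
```

Pixels whose feature is more common in the target than in the candidate (feature 0) get
a weight above one and pull the window towards them, while pixels with a feature that
does not occur in the target (feature 2) are ignored.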

The algorithm described in [16] is scale adaptive. A parameter describing the scale of
the target rectangle is introduced, and in much the same way as for finding the mean
shift a correct scaling is iteratively calculated. To prevent incorrect changes, the
calculated change from step t − 1 to t is validated by calculating the change from t to
t − 1. If they do not give the same result the change is rejected and a weighted
combination of the previous size, the estimated size and a “default size” is used instead.

The idea of the tracker is to take a simple mean shift tracker (as described in section
2.3) and enhance it by making it able to adapt to scale changes in a robust way. It is
further improved by a method for histogram colour weighting (BRW) that helps discriminate
the object from the background. The result of this is that features that have a high
probability of belonging to both the background and the foreground are suppressed.

As a simple example, suppose the feature space consists of just three colours and the
template target contains an equal distribution of all three. If the background is
dominated by one of these colours, BRW can compensate for this and downweight the
background colour in the template target. For an illustration of this example see
figure 2.9.

The complete algorithm for each frame is:

1. Calculate new position yt and scale ht.

2. Apply corrections to ht.

3. If the change in scale is larger than some threshold, perform a consistency check
(backward validation).

4. Goto 1.

2.4 Correlation filter

To find out whether a template is present in an image, a correlation filter is used to
correlate the template with the image; a high response indicates that the template is
present. For example, if the template consists of an ellipse, the correlation between the
filter and the image should produce a high response at the centres of ellipses in the
image. A template is the part of an image which includes the target object that is to be
tracked. It is defined as the manually chosen object bounding box in the first frame, but
can then be updated as a combination of how the object looks in subsequent frames.

A tracker is constructed around this idea by performing tracking by detection. Sam-
ples are selected from an image and a classifier decides whether a sample is similar to
the correct template or not (a sample can be any extracted part of the image). The
classifier is trained to correctly classify negative and positive samples [17].

The training of the filter is based on regression

$$\min_{w} \sum_i L(f(x_i) - y_i) + \lambda \|w\|^2 \tag{2.7}$$

where L is a loss function and λ controls the amount of regularisation. $x_i$ are the
samples and $y_i$ the regression targets. The goal is to find the w that best describes
the object. With $f(x_i) = w^T x_i$ the problem is called linear regression and, if in
addition a quadratic loss function is used, ridge regression.
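For the ridge regression case the minimiser has a well-known closed form. The matrix
notation below is introduced only for this illustration: X is assumed to hold one sample
$x_i$ per row and y to stack the regression targets $y_i$.

```latex
\min_{w} \sum_i \left( w^T x_i - y_i \right)^2 + \lambda \|w\|^2
\quad \Longrightarrow \quad
w = \left( X^T X + \lambda I \right)^{-1} X^T y
```

Solving this directly requires inverting a matrix whose size grows with the feature
dimension, which is what makes the structure of the sample matrix important for speed.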

Samples to be evaluated in the next frame can be chosen in different ways, e.g. randomly
or by a certain method. It has been shown that if every possible sample from the next
frame is chosen in the neighbourhood of the previous template position (dense sampling),
circulant matrices can be used in the calculations. Two overlapping patches will contain
some information that is the same in both due to the overlap [17]. A circulant matrix is
diagonalised by the Discrete Fourier Transform (DFT). Diagonal matrices are good for
fast calculations since several matrix operations are reduced to element-wise operations.
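The practical consequence of the diagonalisation can be checked numerically: multiplying
a vector by a circulant matrix gives the same result as an element-wise product in the
Fourier domain. The sketch below uses a naive DFT from the standard library only. The
convention that the first column of the matrix holds the generating vector is a choice
made for this example and may differ from the convention in [17].

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n)) for j in range(n)]
    return [z / n for z in out] if inverse else out


def circulant(c):
    """Matrix whose first column is c; every column is a cyclic shift of it."""
    n = len(c)
    return [[c[(i - j) % n] for j in range(n)] for i in range(n)]


def matvec(m, v):
    """Ordinary dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


c = [1.0, 2.0, 0.0, -1.0]
x = [3.0, 0.5, -2.0, 1.0]

# Dense product: O(n^2) multiplications.
direct = matvec(circulant(c), x)

# Same result through the DFT: an element-wise product in the Fourier domain.
product = [a * b for a, b in zip(dft(c), dft(x))]
via_dft = [z.real for z in dft(product, inverse=True)]
```

With a fast Fourier transform instead of the naive transform used here, the second
route costs O(n log n) operations instead of O(n^2).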

Figure 2.9: Example of “background ratio weighting”. An area around the template is
selected to get an approximation of the background colour distribution. This information
can then be used to weight the template histogram. Panels: (a) frame with selection
rectangle, where the enlarged outer rectangle is used when creating the background
histogram, (b) probability map, (c) colour histogram (red 33 %, green 33 %, blue 33 %),
(d) weighted probability map, (e) weighted colour histogram (red 50 %, green 0 %,
blue 50 %).

The Fourier transform is periodic, but to reduce the effects of a wrapped-around
image boundary, cosine windowing and zero-padding are used. In an n × n image, a cosine
window downweights pixel values $x \in [0,1]$ closer to the border according to [25]

$$x_{ij} = \left(x^{\mathrm{raw}}_{ij} - 0.5\right) \sin\!\left(\frac{\pi i}{n}\right) \sin\!\left(\frac{\pi j}{n}\right) \quad \forall i,j = 0, \ldots, n-1. \tag{2.8}$$
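Equation (2.8) translates directly into code. The following sketch applies the window to
a square image stored as a list of rows; the function name is hypothetical.

```python
import math


def cosine_window(raw):
    """Apply equation (2.8) to an n x n image with values in [0, 1]."""
    n = len(raw)
    return [[(raw[i][j] - 0.5)
             * math.sin(math.pi * i / n)
             * math.sin(math.pi * j / n)
             for j in range(n)]
            for i in range(n)]


# A uniformly white 4 x 4 image (hypothetical test input).
window = cosine_window([[1.0] * 4 for _ in range(4)])
```

For this input the first row and the first column become zero, and the largest value,
0.5, is found at i = j = 2 where both sine factors equal one.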

A Histogram of Oriented Gradients (HOG) is a way to represent an image area, such
as a 2 × 2-pixel square. The main idea is to calculate the image gradient in each
pixel and then build a histogram of these gradients, see figure 2.10. Doing this can
give better features to use, and the cost of computing them is countered by the reduction
in the number of features the tracker operates on. Using a cell size of 2 × 2 pixels for
the HOG reduces the number of features to a fourth compared with using the raw pixel
values.
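The histogram for a single cell can be sketched as below. The choice of nine unsigned
orientation bins and of magnitude-weighted votes is an assumption taken from common HOG
practice and is not stated in this thesis. The gradients are given directly as input to
keep the example short.

```python
import math


def cell_histogram(gx, gy, bins=9):
    """Orientation histogram of one cell.

    gx and gy hold the horizontal and vertical gradient of every pixel in
    the cell. Orientations are unsigned (0 to 180 degrees) and each pixel
    votes with its gradient magnitude (assumed convention).
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for row_x, row_y in zip(gx, gy):
        for dx, dy in zip(row_x, row_y):
            magnitude = math.hypot(dx, dy)
            angle = math.degrees(math.atan2(dy, dx)) % 180.0
            hist[int(angle / bin_width) % bins] += magnitude
    return hist


# Gradients of a 2 x 2 cell (hypothetical values).
gx = [[1.0, 0.0],
      [0.0, 1.0]]
gy = [[0.0, 1.0],
      [0.0, 1.0]]
hist = cell_histogram(gx, gy)
```

The four pixels of the cell are summarised by one histogram: the horizontal gradient
votes in bin 0, the diagonal one in bin 2 and the vertical one in bin 4.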

Figure 2.10: To the left, image gradients. To the right, Histograms of Oriented Gradients
(HOGs) for each corresponding area of 4 pixels.

When using more flexible and powerful non-linear regression functions $f(x_i)$, the
kernel trick can be used to make the calculations cheaper. The problem is solved as if
it were linear, but in a different set of variables, the so-called dual variables. The
kernel trick is that one never computes a data vector in the high-dimensional space
described by the kernel function, but always operates on inner products between data
vectors in the high-dimensional space [26].
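For ridge regression, the dual formulation takes the following standard form. The
symbols α, K and κ are introduced here for illustration: κ is the kernel function, K the
matrix of kernel values between all pairs of training samples and α the vector of dual
variables.

```latex
K_{ij} = \kappa(x_i, x_j), \qquad
\alpha = \left( K + \lambda I \right)^{-1} y, \qquad
f(z) = \sum_i \alpha_i \, \kappa(z, x_i)
```

Only kernel values between samples appear in these expressions, so the high-dimensional
feature vectors themselves are never formed.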

2.4.1 KCF: Kernelized correlation filter

KCF is a tracking-by-detection algorithm described in detail in [17]. It operates in
the Fourier domain and uses non-linear regression with a Gaussian kernel with the help
of the kernel trick. The dense sampling procedure makes it possible to calculate the
responses for all samples at the same time.

The algorithm is:

1. In the current frame, samples are taken from the image region at the previous template
position. The samples are all possible shifts of a window of the same size as the
template, with wrap-around at the edges. Cosine windowing and zero-padding are used.

2. The target position is calculated as the sample with the maximum response in this
window.

3. An updated model of the target is trained on the current frame and the classifier
is then updated.

KCF can be used on raw pixel values as well as multi-channel features such as His-
togram of Oriented Gradients (HOG), see figure 2.10. HOG features seem to improve
the performance of KCF relative to operating on raw pixel values [17].

Figure 2.11: An initial target is selected in frame 1, and an enlarged rectangle around
it is used as template patch. In frame 2 the blue rectangle is the starting guess for
finding the object; the regression function is evaluated for all cyclic shifts of that
patch to find the one most similar to the training patch. The position of the patch and
target is then updated. Panels from left to right: frame 1, frame 2, frame 2 updated.

2.5 Measures of tracking performance

Two dimensions of tracking performance can be expressed as accuracy and robustness.
Accuracy is how close the tracker is to the target while it is tracking. Robustness is
how good the tracker is at not losing track of the target. The region overlap is an
accuracy measure defined as the overlap between the ground-truth region and the region
proposed by the tracker. The region overlap captures the tracking accuracy in terms of
the position as well as the size of the tracker region. The failure rate is a robustness
measure defined as the number of frames where the tracking is lost and then reinitialised
(the tracker box is set to the ground truth). Frames must be annotated, and the accuracy
measure in particular is therefore influenced by the subjective opinions of the
annotators, especially for objects that do not have a well-defined boundary or centre. In
that case the frames must be annotated several times by different annotators and the
annotations averaged [27].
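A common way to turn the region overlap into a number is the ratio between the area of
the intersection and the area of the union of the two regions. The sketch below assumes
this definition and axis-aligned boxes given as (x, y, width, height); both are
assumptions made for this example.

```python
def region_overlap(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    inter_w = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
shifted = region_overlap(ground_truth, (5, 0, 10, 10))     # half a box width off
too_large = region_overlap(ground_truth, (0, 0, 20, 20))   # right place, wrong size
lost = region_overlap(ground_truth, (30, 30, 10, 10))      # no overlap at all
```

The measure penalises both a position error (overlap 1/3 for the shifted box) and a size
error (overlap 1/4 for the box that is twice as wide and high), and it is zero when the
target is lost.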

Visual tracking is done in different scenarios and in different ways. The tracking
that the VOT Challenge [8] aims to analyse is in the field of single-camera, single-
object, model-free and causal trackers. A model-free tracker only has the bounding box
in the first frame from which to learn the object to be tracked. A causal tracker
does not use future frames to track the object in the current frame.

To analyse tracker performance, an annotated dataset is beneficial. Annotation is
done by labelling each frame with attributes, such as illumination change, camera
movement and size change. Then it is easy to make sure that the sequences in the dataset
have a large variety of attributes, assuring a fair comparison of general tracking
performance. In the VOT Challenge the tracker is reinitialised when tracking has failed;
redetection is thus not evaluated. Trackers are run several times on the dataset
to get relevant measures, since trackers can be stochastic. The tracking performance is
measured in robustness and accuracy. The accuracy during the first frames after a
reinitialisation could be better than expected due to bias, and therefore the first ten
frames are not included in the calculation of the accuracy result. After a tracking
failure the tracker is reinitialised five frames later, since it might otherwise fail
instantly again due to the same attribute, such as occlusion, and this should not be
reflected in the robustness measure [28].
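The bookkeeping described above can be sketched as follows. This is a simplified
illustration and not the VOT toolkit: it takes a list of per-frame overlaps as input,
treats an overlap of zero as a failure and assumes that the ten-frame exclusion applies
only after reinitialisations. All names are hypothetical.

```python
def evaluate(overlaps, skip=5, burn_in=10):
    """Return (accuracy, failures) for a list of per-frame region overlaps.

    A frame with overlap 0 counts as a failure. The tracker is then
    reinitialised `skip` frames later, and the first `burn_in` frames
    after a reinitialisation are left out of the accuracy average.
    """
    failures = 0
    used = []
    since_init = burn_in  # assumption: no exclusion after the initial frame
    i = 0
    while i < len(overlaps):
        if overlaps[i] <= 0.0:
            failures += 1
            i += skip          # frames until the reinitialisation are ignored
            since_init = 0
            continue
        if since_init >= burn_in:
            used.append(overlaps[i])
        since_init += 1
        i += 1
    accuracy = sum(used) / len(used) if used else float("nan")
    return accuracy, failures


# Three good frames, one failure, then twenty frames with overlap 0.9.
overlaps = [0.8, 0.8, 0.8, 0.0] + [0.9] * 20
accuracy, failures = evaluate(overlaps)
```

In this example one failure is counted, and only the three frames before the failure and
the last six frames contribute to the accuracy.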



3 Method

This chapter serves to outline the work done during the thesis and describe how the
results were obtained. The first step of the work was to get acquainted with computer
vision as a field of research and the state of the art in motion tracking. This was
done by a literature review; in particular, [6, 8, 28, 29] were used to find suitable
candidates for evaluation. From this, two state-of-the-art trackers that seemed both fast
and quite robust were chosen (ASMS [16] and KCF [17], theoretical background described
in sections 2.3.1 and 2.4.1). In addition to these, one simple tracker based on keypoint
tracking through calculation of optical flow was implemented (see section 2.2.1).

A system for performing the testing of algorithms was developed, written in C++
and making use of the library OpenCV for the image processing. The implementations
of ASMS and KCF are slight modifications of publicly available code1,2. The code was
modified to give a consistent interface for all the tracking algorithms and to make it
possible to use them together with background subtraction.

For initial testing the trackers were evaluated in the VOT Challenge framework on
the dataset from the 2015 competition. The results from the competition are available
so it is possible to compare the results of the implementations from this thesis with that
of the original algorithm authors.

In addition to the framework for the tracking algorithms, software for controlling the
pan-tilt unit was developed to make it possible to control the camera and record video
sequences annotated with absolute pan, tilt and zoom positions. The PTZ camera and
robot can be controlled either manually by mouse or keyboard, or by one of the tracking
algorithms. The control is based on a feedback loop for the pan-tilt motions in order to
keep the tracked object centred in the image.

An algorithm for background subtraction was implemented based on an article by
Elgammal et al. [15]. One implementation of the original algorithm is available as a
part of the BGSLibrary3. The implementation used in the testing in this thesis is,
however, entirely based on the written article and modified to work with a moving camera.
Not all features described in [15] were implemented; for a description of the algorithm
used see section 2.1.

1 https://github.com/vojirt/asms
2 https://github.com/joaofaro/KCFcpp

The trackers were evaluated to determine whether they benefit from background
subtraction. Details of the testing procedure are described in 3.2.

3.1 Materials

The hardware of the tracker system consists of a camera with zoom, a pan-tilt unit and
a computer for control. A schematic image of the camera system is shown in figure 3.1.
The system has two axes of rotation, one for pan and one for tilt. Note that the pan and
tilt axes do not go through the entrance pupil of the camera but are a bit offset. This
means that a pure rotation of the camera cannot be done; the camera will also translate
when rotating.

Figure 3.1: Illustration of the camera and pan-tilt unit used. A and B are offsets from the
rotational axes and α is the rotation of the image sensor compared to the pan-tilt coordinate
system, exaggerated in this image.

3 https://github.com/andrewssobral/bgslibrary

3.2 Testing

The purpose of the testing is to determine the impact of background subtraction on
tracking and to evaluate how well it is possible to perform background subtraction given
a sequence where the camera rotates. A dataset was created for the evaluation by
manually controlling the pan-tilt unit and recording multiple sequences with per frame
annotation of pan-tilt position.

To get a baseline tracking performance each tracker was evaluated on the VOT2015
Challenge4 without using background subtraction (no prior knowledge about camera
movement available). The main reason to do this was to check that the implementations
of the algorithms we used performed as expected and to make it possible to compare
OPTFLOW with the others.

3.2.1 Background subtraction

The resulting background mask from the background subtraction is used in different
ways for the trackers. For ASMS and KCF a new image is created from the original
image by setting background pixels to black. This image is then input to the tracker.
For OPTFLOW, feature points on the background are removed before tracking.
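Both uses of the mask can be sketched in a few lines. Images and masks are represented
as nested lists here, and the function names are hypothetical; the thesis implementation
is written in C++ with OpenCV and may be organised differently.

```python
def mask_background(image, is_background, fill=0):
    """Return a copy of the image with all background pixels set to `fill`."""
    return [[fill if bg else pixel for pixel, bg in zip(row, mask_row)]
            for row, mask_row in zip(image, is_background)]


def keep_foreground_points(points, is_background):
    """Drop the feature points (x, y) that lie on background pixels."""
    return [(x, y) for x, y in points if not is_background[int(y)][int(x)]]


image = [[10, 20, 30],
         [40, 50, 60]]
is_background = [[True, False, True],
                 [True, False, False]]

# Input image for ASMS and KCF: background pixels are blacked out.
masked = mask_background(image, is_background)

# Feature points for OPTFLOW: only points on the foreground are kept.
points = keep_foreground_points([(0, 0), (1, 0), (2, 1)], is_background)
```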

The performance of the background subtraction is evaluated on different cases to
see what impacts its performance. Two cases are constructed. The first is a simple
sequence, taking a single large image and creating a video by sweeping over it with a
smaller window. In this case all pixels should be classified as background since there are
no moving objects in a static image.

The second case is to evaluate the result of the subtraction when there is no error
in the data for the camera movement. This was done by recording a sequence without
moving the camera and then constructing a new video using small parts of the original
sequence (moving a window over it, simulating a moving camera). By doing this we
minimise vibrations and we get perfect knowledge of the per frame movement.
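The construction of such a sequence can be sketched as follows: a window of fixed size
is moved over the frames of the static recording and the window position is stored
together with each cropped frame. The constant step per frame and the function name are
assumptions made for this example.

```python
def simulate_camera(frames, width, height, step_x=1, step_y=0):
    """Crop a moving window out of frames from a static camera.

    Returns a list of (cropped_frame, (x_offset, y_offset)) so that the
    per-frame translation is known exactly.
    """
    sequence = []
    for k, frame in enumerate(frames):
        x0, y0 = k * step_x, k * step_y
        if y0 + height > len(frame) or x0 + width > len(frame[0]):
            raise ValueError("window leaves the original frame")
        crop = [row[x0:x0 + width] for row in frame[y0:y0 + height]]
        sequence.append((crop, (x0, y0)))
    return sequence


# Three identical 4 x 6 frames in which the pixel value encodes its position.
static_frame = [[10 * y + x for x in range(6)] for y in range(4)]
sequence = simulate_camera([static_frame] * 3, width=3, height=2)
crop, offset = sequence[1]
```

Because the offsets are generated rather than measured, the alignment of the frames in
the background model is exact, which isolates the behaviour of the subtraction itself.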

Finally, the background subtraction is evaluated on sequences from the PTZ camera
under the conditions: only pan motions, only tilt motions, and both pan and tilt motions.
For this background model, small angle rotations are assumed and the camera movement
is approximated as a translation.

3.2.2 Camera parameters

To perform background subtraction when the source of the frames is moving, knowledge
about the pixelwise offset between the frames is needed. One way to get that without any
prior knowledge about the movement is to track points belonging to the background
from one frame to the next, for example by optical flow.

4 http://votchallenge.net

When knowledge about the camera movement is available, some way to relate changes
in pan and tilt angle to changes in pixel position in an image is needed. When the
changes in pan and tilt are small, the change in pixel position can be approximated to
be proportional to the change in pan and tilt:

$$\Delta x = c_1 \cdot \Delta p$$
$$\Delta y = c_2 \cdot \Delta t$$

The coefficients c1 and c2 can be estimated by, for example, using optical flow to get
an estimate of the pixel movement and comparing that with the change in pan and tilt.
They can also be found by manually matching images with known camera position and
calculating the coefficients from that. Both methods are evaluated.
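One way to estimate a coefficient from paired measurements is a least-squares fit of a
line through the origin, which matches the proportional model above. Whether the fit
used in this thesis includes an intercept is not stated, so the sketch below is an
assumption, as are the function names and the sample data.

```python
def fit_coefficient(angle_changes, pixel_changes):
    """Least-squares estimate of c in: pixel_change = c * angle_change."""
    numerator = sum(a * p for a, p in zip(angle_changes, pixel_changes))
    denominator = sum(a * a for a in angle_changes)
    return numerator / denominator


def pixel_offset(delta_pan, delta_tilt, c1, c2):
    """Predicted image translation for a small change in pan and tilt."""
    return (c1 * delta_pan, c2 * delta_tilt)


# Hypothetical measurements: pan changes and the resulting pixel movement.
delta_p = [1.0, 2.0, 3.0, 4.0]
delta_x = [5.5, 11.0, 16.5, 22.0]
c1 = fit_coefficient(delta_p, delta_x)

# Offset used to align a new frame with the background model.
offset = pixel_offset(2.0, -1.0, c1, c2=5.0)
```

With noise-free data, as in this example, the fit recovers the coefficient exactly
(c1 = 5.5). With real optical flow estimates, the fit averages out the measurement noise.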

3.2.3 Test procedures for evaluating the trackers with background subtraction

• Scenario 1: Tracking with background subtraction.
A sequence of frames is collected from a static camera. In the sequence there is
at least one moving object. A moving camera is simulated by constructing a new
sequence of frames where each frame is a fixed size region from the corresponding
static camera frame. The pixel position of the extracted region is logged in order
to simulate translation information from the robot. Then for each tracker:

– Initiate a bounding box on the object to be tracked.

– Track with and without background subtraction.

– If the tracking is lost, reinitialise by giving a new bounding box around the
object.

– Count the number of times the tracking is lost and the number of frames processed
per second.

• Scenario 2: Simple tracking with background subtraction and robot.
Sequences of frames are collected from the PTZ camera when it is moved manually,
in the sequence there is a moving unicoloured circle in front of a simple background.
Pan and tilt positions are logged to make it possible to align the frames. Sequences:

– a. The camera is only panned right and left.

– b. The camera is only tilted up and down.

– c. The camera is panned and tilted in an irregular pattern.

– d. The moving object is moved around in an irregular pattern, the camera is
manually controlled to keep the object centred in the image (as the tracker
would control it).

For each tracker:


– Initiate a bounding box on the object to be tracked.

– Track with and without background subtraction.

– If the tracking is lost, reinitialise by giving a new bounding box around the
object.

– Count the number of times the tracking is lost and the number of frames processed
per second.

• Scenario 3: Tracking with background subtraction and robot

– Same procedure as scenario 2 but with a more complicated background with
clutter.

As targets for the tracking, a yellow circle and a multicoloured polygon were selected.
The circle was chosen to test with a simple shape, invariant to rotation, and a single
distinct colour. The polygon was chosen to get an object as different as possible from
the circle, with several colours and many corners.

The initial bounding box was chosen as the smallest upright rectangle such that every
part of the object was inside the box. Sometimes the entire object may move outside the
camera view; this is not reported as a failure. The bounding box is initialised around
the object when it enters the view again. Example images from all the sequences can be
seen in figure 3.2.

Figure 3.2: All the different combinations of moving test objects and backgrounds used
for evaluating the trackers with background subtraction (panels a to d).



4 Results

In this chapter, system and test data are presented. An in-depth discussion of the
results follows in the next chapter.

4.1 Camera configuration

The camera is configured to have a resolution of 1280 × 720 pixels, and a step in
pan or tilt position corresponds to a change in angle of 1/30°.

By recording a sequence of frames and storing the absolute pan and tilt position
for each frame the relation between change in pan position and pixel movement in the
image was calculated. Another sequence was also recorded where in addition to the
absolute pan-tilt position for each frame an estimate of the change in pixel position
between successive frames was stored. The estimate was calculated by keeping track
of the optical flow for some features and taking the average change in position as an
estimate for the entire frame movement.

For both sequences the optical zoom was set to ×10. By manually matching frames the
following relation was found:

$$\Delta x = 5.48 \cdot \Delta p \tag{4.1}$$
$$\Delta y = 5.63 \cdot \Delta t \tag{4.2}$$

where ∆p is a change in pan position and ∆t is a change in tilt position.
From the sequence with the recorded estimates of ∆x and ∆y the following relation
was found by fitting a line to the data, minimising the mean squared error:

$$\Delta x = 5.39 \cdot \Delta p \tag{4.3}$$
$$\Delta y = 5.34 \cdot \Delta t. \tag{4.4}$$


In the testing the coefficients 5.48 and 5.63 were used. The other coefficients were
also tested but there was no distinguishable difference in the subtraction result.

4.2 Background subtraction

To use the images from a moving camera to build a background model, the images must
be positioned so that a pixel that has not moved is at the same place. An example of how
well this works for the camera used can be seen in figure 4.1a, together with how it
would have looked if the match was perfect (figure 4.1b).

Figure 4.1: The result of using the change in camera position to line up two images
with 30 frames between them (a) and of manually matching them (b). The angular distance
between the frames is (∆p,∆t) = (0.45°, 1.38°).

Constructing a video sequence from one single large image gave the expected result: all
pixels were classified as background. By calculating the absolute difference between
frames and setting the threshold to zero we noticed that some errors were introduced in
the sequence by the compression of the images when the sequence was constructed. Pixel
intensity values could differ by approximately 1 % when compared to the source image.


4.2.1 Stationary image

The background subtraction was evaluated on a sequence with simulated camera movement.
The camera was stationary and the movement was simulated by constructing a
new sequence from parts of the entire frames. This minimised errors in the background
subtraction due to possibly incorrect alignment of frames, since the per-frame offsets
were known and were pure translations.

Even if the alignment of the frames is perfect the background subtraction is not.
In figure 4.3 there are two examples of how it can go wrong. The object can accidentally
become a part of the background model; if the object then returns to the same
position it is erroneously classified as background. Another problem is that the
background subtraction algorithm only operates on a colour level: if a moving object
moves over a background that has the same colour, the overlapping part will be classified
as background. There is a tradeoff between making the subtraction robust against slight
variations in colour (for example a change in lighting) and making correct classifications
when there are similar objects with only a slight difference in colour.

Figure 4.2: Original image (a) and the background mask (b) from a sequence where the
camera movement is simulated.

4.2.2 Rotating camera

For the moving camera the same problems with background subtraction as with the
stationary camera exist. In addition there are errors due to incorrect alignment of
frames. This increases the number of pixels classified as foreground, since it is
difficult to separate the movement of objects from vibrations and actual robot movements.
Another problem is that details get smoothed out in the background model. A patch that is
quite small if you only look at a single frame can appear to cover a much larger area.
This is especially troublesome if the patch has the same colour as the moving object.
Figure 4.5 demonstrates this: a yellow circle moves over a background that contains a
small yellow rectangle.

The subtraction was evaluated for the camera moving freely as well as restricted to
only pan and only tilt, but no visible difference was noticed between the different cases.

Figure 4.3: Problems with the background subtraction that occur even though the frames
have been aligned perfectly. In 4.3a the object has become a part of the background model
and in 4.3b the object passes over a patch of background with the same colour. The object
is a yellow circle.

Figure 4.4: Frame 250 (a) and the corresponding background mask (b) from the sequence
star10. Some edges from background objects are classified as foreground due to
misalignment.

Figure 4.5: Behind the yellow circle is a small yellow square that has become smoothed
out and appears much larger in the background model due to errors in the alignment of
frames. The circle is in the same position as in figure 4.3b for comparison.

4.2.3 Speed

The impact of different parameters for the background subtractor was evaluated by
running the same sequence multiple times with different settings while monitoring the
frame rate. All parameters except W (how often frames are added to the background
model) affected how fast the subtraction was. The size of the image also heavily affects
the performance. This is as expected: the number of pixels directly corresponds to the
amount of computation, and W is the only parameter that does not influence this. Using
the same settings as in column 1 in table 4.1 but an image sequence at a lower
resolution (160 × 120 instead of 320 × 180), the frame rate was 130 fps, compared to 56
fps.

Table 4.1: Average frame rate
Impact of different parameter settings on frame rate, evaluated on a sequence with
1280 × 720 resolution downscaled to 320 × 180. Nst and Nlt are parameters for the number
of frames in the short and long term models, W governs how often the long term model is
updated, npyr sets the number of levels in the image pyramid and th is a threshold a pixel
probability must exceed to be classified as background.

Column                1      2      3      4      5      6      7      8
Nst                   15     15     15     2      25     20     20     20
Nlt                   5      5      5      0      0      5      5      5
W                     60     30     60     –      –      30     30     30
npyr                  2      2      1      2      2      2      2      2
th                    10^-6  10^-6  10^-6  10^-6  10^-6  10^-4  10^-6  10^-8
Time per frame [ms]   17.9   17.9   22.7   12.2   21.3   31.3   20.4   19.2

4.3 Tracking

The trackers were evaluated on four different object-background combinations: a
multicoloured polygon on a white background and on a cluttered background, and a
unicoloured circle on the same two backgrounds. The results in number of failures are
presented in table 4.2.

Multiple sequences were recorded for the different combinations, in total 21 sequences
of varying length were produced. The main difference between the recordings was how
the PTZ was allowed to move.

The backgrounds were chosen so that one was very simple with large contrast against
the tracking target without features that could appear as moving due to an incorrect
alignment of frames in the subtractor. The other was selected to be the opposite, con-
taining a lot of small and large features and areas with the same colour as the tracking
target. The objects were chosen in the same manner, one that was simple with uniform
colour and one with some texture.

4.3.1 Speed

The average time it took to process a frame for the different trackers for the different
sequences is shown in table 4.3. KCF and OPTFLOW showed quite stable performance
over all the sequences while ASMS had a drop in frame rate for the sequences with the

33


4.3. TRACKING CHAPTER 4. RESULTS

Table 4.2: Tracking without background subtraction. Stars indicate sequences with
cluttered background. Time is the time per frame.

            OPTFLOW              ASMS                 KCF
Sequence    Failures  Time [ms]  Failures  Time [ms]  Failures  Time [ms]  #frames  Size
circle1     0         3.1        0         8.2        0         21.3       1309     640×480
circle2*    6         3.5        1         10.0       0         21.7       1860     640×480
circle3     0         6.2        0         33.3       0         20.8       1227     1280×720
circle4*    2         6.8        0         25.0       0         14.9       2774     1280×720
circle5     0         5.3        0         25.0       0         20.4       1658     1280×720
circle6*    1         6.3        0         34.5       0         19.2       3443     1280×720
circle7     0         5.6        0         25.6       0         21.7       918      1280×720
circle8*    3         6.5        0         31.3       0         20.0       2727     1280×720
circle9     0         5.5        0         23.3       0         22.2       1437     1280×720
circle10*   1         6.5        0         34.5       0         23.3       2554     1280×720
circle11*   7         6.6        1         38.5       0         18.9       3784     1280×720
star1       0         3.4        0         14.7       0         17.9       971      640×480
star2*      0         3.2        0         14.5       0         18.2       1757     640×480
star3       0         3.6        0         83.3       0         23.3       1325     1280×720
star4*      0         3.6        0         142.9      0         20.0       1753     1280×720
star5       0         6.5        0         62.5       0         18.2       2064     1280×720
star6*      0         6.2        0         111.1      0         15.2       2348     1280×720
star7       0         6.5        0         55.6       0         15.9       1596     1280×720
star8*      0         6.2        0         62.5       0         16.7       3999     1280×720
star9       0         6.1        0         55.6       0         16.7       1446     1280×720
star10*     0         5.9        0         142.8      0         16.1       3277     1280×720
total       20                   2                    0                    44227

star-shaped object. Looking at the times per frame reported in table 4.2 for the
sequences with lower resolution, it seems that both ASMS and OPTFLOW are faster on
smaller images. KCF on the other hand shows stable performance regardless of resolution.

Table 4.3: Average time per frame [ms]

Combination        OPTFLOW  ASMS   KCF
circle on white    5.7      26.8   21.3
circle on clutter  6.5      32.8   19.3
star on white      5.7      64.3   18.5
star on clutter    5.5      114.8  17.0


4.3.2 VOT Challenge

The trackers were evaluated on the VOT2015 dataset. Table 4.4 shows the results with
respect to robustness. Robustness is measured by counting the number of times a tracker
fails on a sequence and then taking the average over multiple runs.
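The averaging explains why some robustness scores in table 4.4 are not integers. As a minimal sketch, assuming the failure count of every run is available as a list (the function name and the counts below are hypothetical, not data from the evaluation):

```python
def robustness(failures_per_run):
    """Average number of failures over repeated runs on one sequence."""
    if not failures_per_run:
        raise ValueError("at least one run is needed")
    return sum(failures_per_run) / len(failures_per_run)

# Hypothetical failure counts from two runs of a non-deterministic tracker.
print(robustness([2, 3]))
```

A deterministic tracker gives the same count in every run, so its score stays an integer.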

Looking only at robustness, all trackers performed well on the sequences bag, ball1,
birds2, blanket, bmx, car2, dinosaur, fernando, fish3, godfather, iceskater1, iceskater2,
marching, racing, sheep, singer1, singer3 and sphere. These sequences had in common that
the objects were quite constant in size, without abrupt changes in colour or form, and
that the sequences contained little occlusion.

Looking at the sequences where the trackers performed badly, almost all trackers had
problems with occlusion that lasted longer than one or two frames. KCF had trouble when
objects got blurred due to rapid motion. This did not affect ASMS as much, as the colour
content in an area is less distorted than the actual form during blurring. ASMS had
trouble when there were multiple objects with the same colour or when the entire frame
had roughly the same colour (see soccer1 for example, where red confetti obscures a
soccer team in red clothes). All trackers had problems when the contrast between the
background and the tracked object was low.

4.3.3 Tracking with background subtraction

The number of failures for tracking on a masked background is presented in table 4.5.
OPTFLOW had 91 failures, compared to 20 without background subtraction. ASMS had only
one failure in total, one fewer than without background subtraction. KCF had 13
failures, whereas it had none without background subtraction.

The background subtraction runs at about 50 fps (20 ms per frame), so calculating the
frame rate when combining a tracker and background subtraction is straightforward. The
two stages are run sequentially and not in parallel, so their times per frame can be
added.
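As a small worked example of this addition, the sketch below combines the 20 ms of the background subtraction with the average tracker times for the circle on white background from table 4.3. The function name is hypothetical.

```python
def combined_rate(tracker_ms, subtraction_ms=20.0):
    """Return (time per frame in ms, frames per second) for two stages run in sequence."""
    total_ms = tracker_ms + subtraction_ms
    return total_ms, 1000.0 / total_ms

# Average tracker times for "circle on white" from table 4.3.
for name, ms in {"OPTFLOW": 5.7, "ASMS": 26.8, "KCF": 21.3}.items():
    total_ms, fps = combined_rate(ms)
    print(f"{name}: {total_ms:.1f} ms per frame, {fps:.1f} fps")
```

For KCF this gives 41.3 ms per frame, or roughly 24 fps.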


Table 4.4: Robustness
The table gives the robustness score achieved on each sequence in the VOT2015 dataset.
For comparison, Stationary is a benchmark tracker that does not update the position of the
target after initialisation.

Sequence OPTFLOW ASMS KCF Stationary Length [#frames]
bag 0 1 0 3 196
ball1 1 0 0 9 104
ball2 5 2 3 4 40
basketball 2 1 1 12 725
birds1 8.2 3 2 18 339
birds2 0 2 0 3 539
blanket 1 0 0 1 225
bmx 0 0 1 1 76
bolt1 11.8 1 0 8 350
bolt2 2 1 1 9 293
book 5 4 9 11 175
butterfly 3.33 0 2 5 151
car1 3.07 1 0 12 742
car2 0 0 0 3 393
crossing 3 1 1 5 131
dinosaur 0 0 2 6 326
fernando 1 0 2 3 292
fish1 4 2 4 7 366
fish2 4 3 5 5 310
fish3 2 1 0 6 519
fish4 1 1 3 8 682
girl 4 1 1 14 1500
glove 4 2 3 9 120
godfather 0 2 0 3 366
graduate 6.13 4 4 12 844
gymnastics1 3 1 7 4 567
gymnastics2 2.07 0 5 4 240
gymnastics3 2 3 4 3 118
gymnastics4 1 2 2 3 465
hand 6 5 8 15 267
handball1 10 3 7 20 377
handball2 12 3 11 28 402
helicopter 0 2 1 2 708
iceskater1 0.33 0 1 5 661
iceskater2 0 0 3 2 707
leaves 6 0 5 7 63
marching 1.2 1 0 8 201
matrix 8 2 4 4 100
motocross1 1 2 2 4 164
motocross2 1 0 3 0 61
nature 2 3 3 6 999
octopus 1 1 0 1 291
pedestrian1 4 3 10 14 140
pedestrian2 4 1 1 15 713
rabbit 6 4 5 8 158
racing 0 1 0 3 156
road 11.3 4 0 20 558
shaking 1 1 1 2 365
sheep 0 3 0 4 251
singer1 0 1 0 4 351
singer2 2.53 1 1 2 366
singer3 1 1 1 5 131
soccer1 0 10 2 5 392
soccer2 15 3 14 16 129
soldier 1 2 1 2 138
sphere 0 0 0 6 201
tiger 2 2 0 20 365
traffic 1 2 0 0 191
tunnel 3.73 5 0 0 312
wiper 6.07 2 0 5 341


Table 4.5: Tracking with background subtraction. Stars indicate sequences with cluttered
background.

            OPTFLOW   ASMS      KCF
Sequence    Failures  Failures  Failures  #frames  Size
circle1     4         0         0         1309     640×480
circle2*    13        0         1         1860     640×480
circle3     13        0         2         1227     1280×720
circle4*    4         1         0         2774     1280×720
circle5     4         0         1         1658     1280×720
circle6*    3         0         2         3443     1280×720
circle7     2         0         1         918      1280×720
circle8*    5         0         1         2727     1280×720
circle9     3         0         0         1437     1280×720
circle10*   4         0         0         2554     1280×720
circle11*   7         0         1         3784     1280×720
star1       1         0         1         971      640×480
star2*      5         0         0         1757     640×480
star3       2         0         1         1325     1280×720
star4*      3         0         0         1753     1280×720
star5       5         0         0         2064     1280×720
star6*      1         0         0         2348     1280×720
star7       3         0         1         1596     1280×720
star8*      4         0         0         3999     1280×720
star9       4         0         1         1446     1280×720
star10*     1         0         0         3277     1280×720
total       91        1         13        44227

5 Discussion

Using background subtraction should make some algorithms perform better in circumstances
where they otherwise have trouble. For OPTFLOW it can be used as a tool to reject
background features, and for colour-based trackers like ASMS, filtering out the
background can increase the contrast. Correct background subtraction can be hard to
perform, and if done incorrectly it can do more harm than good.

A better use of the information about the camera motion might be to use it to
create a motion model for the tracked object. This would help the tracking by making
it possible to give a better starting guess for locating the object in a new frame.

For the evaluation of tracking with a PTZ camera, test data are needed. To make tests
fair and avoid overfitting to certain scenarios, the data need to be diverse and cover a
wide range of scenarios. For general tracking there are a lot of sequences available for
testing, but data on the camera movement in the sequence are often not available.

Robustness is the primary factor of interest for the PTZ tracking system. As the
camera moves, the only thing that really matters is that the tracked object stays in view.
Accuracy is in a sense related to robustness. If the accuracy is good the tracker should
have a better chance to keep it in view and thus be robust.

5.1 Camera

There are some problems with projecting a 3D scene onto a 2D plane and then stitching
planes from different camera directions. The error from approximating the camera
rotation as a translation is greater at low zoom levels, as the angle of view of the
camera is larger than at higher zoom levels.
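The size of this error can be estimated with an idealised pinhole model in which the camera rotates about its optical centre. The sketch below is an illustration under that assumption and is not the calibration used in this thesis; the function name and the example angles are hypothetical. It compares the displacement of a point at the image edge with the displacement of the point at the image centre, which is the shift a pure translation would apply to every pixel.

```python
import math

def translation_error_px(half_fov_deg, pan_deg, width_px):
    """Error in pixels at the image edge when a pan is modelled as a uniform shift."""
    half_fov = math.radians(half_fov_deg)
    pan = math.radians(pan_deg)
    focal_px = (width_px / 2) / math.tan(half_fov)  # focal length in pixels
    # Displacement of a scene point that ends up on the optical axis after the pan.
    shift_centre = focal_px * math.tan(pan)
    # Displacement of a scene point that starts at the image edge.
    shift_edge = focal_px * (math.tan(half_fov) - math.tan(half_fov - pan))
    return shift_edge - shift_centre

for half_fov_deg in (5, 15, 30):  # hypothetical zoom levels, narrow to wide
    err = translation_error_px(half_fov_deg, pan_deg=1.0, width_px=1280)
    print(f"half angle of view {half_fov_deg:2d} deg: edge error {err:.2f} px")
```

For a fixed pan step the edge error grows with the angle of view, which is consistent with the error being larger at low zoom levels.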

While increasing zoom reduces some problems it amplifies others, e.g. vibrations from
the servos controlling the movement become more visible. When using a real camera it is
hard to make the image sensor perfectly aligned with the rotational axes. Depending on
how the camera and the control unit are constructed, the offset of the rotational axes
from the entrance pupil can be quite large. The camera used is illustrated in figure 3.1
and has offsets of varying size for all axes. As a result, rotations are also
translations, and a change in for example pan position can give an offset in both the x
and y coordinates in the image. If the offset angle α for the image sensor is zero, a
change in pan position should only give a translation along the x axis in the image.

The vibrations proved to be the factor that had the largest impact on the ability to
align frames correctly. Even if the factor relating a change in pan and tilt position to
a change in x/y position of a pixel were known exactly, vibrations would introduce an
error of a couple of pixels. This error had the largest influence during very slow
movement (the object moving a couple of pixels per frame) or when the rotation direction
changed.

5.2 Background subtraction

In theory background subtraction could be a good tool for pre-processing a video stream
to get an input frame that is easier to track in. In practice, performing background
subtraction well is a non-trivial task even if the camera is static. With the added
inaccuracies from a moving camera, it is possible to do more harm than good. A
hypothesis is that background subtraction would help a tracker in sequences where the
tracked object passes close to a similar object that is not moving. Without background
subtraction the tracker may start to track the nearby similar object, but background
subtraction will suppress the static object. It seems to be rare that all these
requirements are fulfilled.

Looking at the literature on state-of-the-art background subtraction, less work appears
to have been done on subtraction with a non-static camera. Even in the static case there
is no method that works well in all scenarios, and adding the requirement of real-time
speed reduces the number of candidate methods even more [30, 31].

Scaling down the input images before performing the background subtraction is one way to
reduce the computational load at the cost of losing details. A pyramidal implementation
performs background subtraction first on a very coarse level and uses the result as a
mask for performing the background subtraction on a higher resolution image. Such an
implementation can reduce the number of pixels that need to be evaluated without losing
too many details.
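The coarse-to-fine idea can be sketched with two pyramid levels. This is only an illustration: the pixel classifier below is a plain absolute difference against a background image, a stand-in for the probabilistic model used in this thesis, and all function names and thresholds are hypothetical.

```python
import numpy as np

def is_foreground(frame, background, threshold):
    """Stand-in pixel classifier: absolute difference against a background image."""
    return np.abs(frame - background) > threshold

def downscale(img, factor):
    """Block-average a 2-D image whose sides are multiples of `factor`."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def pyramid_foreground(frame, background, threshold, coarse_threshold, factor=4):
    """Classify on a coarse level first, then only where the coarse level saw change."""
    coarse = is_foreground(downscale(frame, factor),
                           downscale(background, factor), coarse_threshold)
    # Upsample the coarse mask so it selects candidate pixels at full resolution.
    candidates = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    mask = np.zeros(frame.shape, dtype=bool)
    mask[candidates] = is_foreground(frame[candidates], background[candidates], threshold)
    return mask, candidates.mean()  # foreground mask, fraction of pixels evaluated

background = np.zeros((16, 16))
frame = background.copy()
frame[4:8, 8:12] = 100.0  # synthetic moving object
mask, evaluated = pyramid_foreground(frame, background, threshold=50.0, coarse_threshold=25.0)
print(mask.sum(), evaluated)
```

The coarse threshold is set lower than the fine one because block averaging dilutes small objects; how much lower is a tuning choice.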

When used as pre-processing for tracking, the loss of details that occurs when reducing
the scale can be beneficial as it increases the amount of false foreground (classifying
background as foreground) along the edge of objects. For the tracking, false background
is more harmful than false foreground, so a trade-off in which more false foreground is
accepted at the cost of less true background is a good one.

Another problem arises when the tracked object slows down and is about to stand still.
If this is not handled, the object will become part of the background model. This can be
avoided by keeping track of the speed of the object and suspending updates of the
background model until the object resumes its movement.

We noticed that the algorithm does not work well when a moving object passes over
background that has the same colour. This is a fundamental problem in tracking and in
this algorithm, as it is only designed to care about changes in colour for a pixel. The
related problem that only the edges of moving objects with large single-coloured areas
are detected can be solved by making sure that the object itself is not incorporated in
the background and that the blindly updated model skips a large enough number of frames
between updates. Nothing similar can be done when the background has the same colour as
the moving object. One way to handle it is to notify the user that the object is
entering a background region that has the same colour as the object.

The background subtraction seems to work well when the object explores new areas of the
image. When the object returns to a place where it was stationary for a while earlier in
the sequence, it is reported as background. This is because the object has been saved as
background in the blindly updated history. Restricting the camera to tracking an object
in a small area where it revisits the same spot multiple times might be unfair, as that
scenario is hard to handle well and maybe not so common.

We used the background subtraction mask by setting the background pixels to black and
leaving the foreground pixels unaffected. This new image was then input to the tracker.
This means that the tracking box will contain black pixels that get incorporated into
the description of the tracked object. For example, there will probably be a sharp edge
between the foreground and the background instead of a smoother transition. A large
gradient will be introduced into the object description, affecting the KCF tracker. ASMS
constructs its template in the first frame before background subtraction, and background
ratio weighting should be able to suppress the black pixels incorporated by the
background subtraction.
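The masking step itself is small. A minimal sketch with NumPy, assuming a colour frame of shape (height, width, 3) and a boolean foreground mask of shape (height, width); the function name is hypothetical:

```python
import numpy as np

def mask_background(frame, foreground_mask):
    """Return a copy of `frame` in which all background pixels are set to black."""
    masked = frame.copy()
    masked[~foreground_mask] = 0  # a 2-D boolean mask zeroes all colour channels
    return masked

frame = np.full((2, 2, 3), 7, dtype=np.uint8)
foreground = np.array([[True, False], [False, True]])
masked = mask_background(frame, foreground)
print(masked[:, :, 0])
```

The hard zero at the mask border is what creates the artificial edge described above.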

5.3 Tracking

The recorded test sequences were easy for all trackers except the circular target
sequences, which were hard for OPTFLOW. The reason OPTFLOW did not perform well on those
sequences is that the target had no good features, e.g. corners. The sequences had no
occlusion or change in light, and the target did not change appearance through the
sequence as a consequence of out-of-plane rotation or movement towards the camera.

As all trackers performed acceptably on the sequences without using background
subtraction, the only thing we could show was whether the background subtraction would
make the tracking worse. Running the trackers on the sequences with background
subtraction made the trackers, especially OPTFLOW, more prone to lose the target. This
might be because the constructed scenarios are not sequences where background
subtraction is suitable for helping the trackers, or because of how the information from
the background subtraction is used by the trackers.

Take ASMS for example. It operates in the same feature space as the background
subtraction algorithm (pixel colour values). The scenario where it might be reasonable
to expect the background subtraction to help is when the tracking target moves close to
background features that are at rest and that have a similar colour distribution.
Looking at how ASMS works, this is not a scenario where the tracking is prone to fail.

OPTFLOW on the other hand might have more to gain from using background subtraction as
it discards all colour information. Running the tests showed that it is very sensitive
to false background in the background subtraction. The information from the background
mask is used by looking up the position of each tracked feature in the mask and
discarding the feature if that position is background. This is good for getting rid of
points that belong to the background and should not be tracked. But if tracked points
only appear along the outer edge of the object and the subtraction removes a couple of
pixels too many, there might not be any corners left to track.
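The feature rejection can be sketched as a lookup in the mask. The code below assumes feature positions stored as an (N, 2) array of (x, y) pixel coordinates and a boolean foreground mask; the function name is hypothetical and the sketch is not the implementation used in the tests.

```python
import numpy as np

def keep_foreground_features(points, foreground_mask):
    """Keep only the feature points whose nearest pixel is marked as foreground."""
    h, w = foreground_mask.shape
    cols = np.clip(np.rint(points[:, 0]).astype(int), 0, w - 1)  # x -> column
    rows = np.clip(np.rint(points[:, 1]).astype(int), 0, h - 1)  # y -> row
    return points[foreground_mask[rows, cols]]

mask = np.zeros((10, 10), dtype=bool)
mask[2:5, 2:5] = True  # synthetic foreground region
points = np.array([[3.0, 3.0], [8.0, 8.0], [2.0, 4.0]])
print(keep_foreground_features(points, mask))
```

Because every point is tested against a single pixel, a mask that is eroded by only a few pixels is enough to remove all features that sit on the object outline.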

For ASMS and KCF, the scenarios in which they fail are when there is a lot of occlusion
or when the contrast against the background is low. Distinguishing objects that are
small, with low contrast and moving due to noise is not helped by background
subtraction. Other methods, such as motion prediction models, are probably more helpful
in these scenarios.

Background subtraction might be suitable for detecting tracking failure and occlu-
sion. Other tools are then needed to make use of that information to help the tracking
(predict where the object ought to reappear and try to redetect when the region contains
a moving object again).

Background subtraction could speed up the tracking since some pixels are not taken into
consideration. But all trackers studied are already real-time with a good margin, and
background subtraction itself is slower, so this is not a good motivation for using
background subtraction in this case.

5.3.1 OPTFLOW

OPTFLOW works well when the tracking target has a good amount of salt-and-pepper texture
on it. The optimal case is when it is possible to select a region on the object that
contains many features without selecting any background. If the selected region contains
no such features, the tracker will not work.

Given how naïve the tracker is, there are many possible ways to make OPTFLOW better and
more robust. One problem with the current implementation is how it selects which points
to track. At the moment it operates on the principle that most feature points likely
belong to the target (an assumption that might be wrong, and almost certainly is if the
target does not have many features) and that the few points on the background will get
averaged out and lost when they move out of the image. One idea is to use some
clustering algorithm together with the background subtraction and only keep the cluster
that is in a region with a high amount of foreground pixels. This would make it possible
to get rid of incorrect points and not be as dependent on the correctness of the
background subtraction.

From the optical flow calculation we also get the velocity of the feature points. From
the knowledge of the camera movement it is possible to estimate the velocity of a
background pixel. By comparing with the estimate of the background velocity it might be
possible to filter out points belonging to the background without performing background
subtraction.
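A sketch of such a velocity test is given below. It assumes that the camera motion has already been converted to one expected image displacement for static scene points, uniform over the image, which is itself an approximation. The function name and the tolerance are hypothetical.

```python
import numpy as np

def moving_points(points, flow, background_flow, tol_px=1.0):
    """Keep points whose measured flow differs from the expected background flow."""
    residual = np.linalg.norm(flow - background_flow, axis=1)
    keep = residual > tol_px
    return points[keep], keep

points = np.array([[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]])
flow = np.array([[3.0, 0.0], [3.2, 0.1], [8.0, 1.0]])  # measured displacement per frame
background_flow = np.array([3.0, 0.0])                 # predicted from pan and tilt
kept, keep = moving_points(points, flow, background_flow)
print(kept)
```

Since vibrations introduce alignment errors of a couple of pixels, the tolerance would need to be at least that large, which limits the method for slowly moving objects.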

When a good selection of features has been made it might be desirable to increase the
robustness when tracking the features. One way to do that is to perform backward
validation: take the points found by tracking the features and reverse the time to track
the points backwards to the image where the correct location is known. Points that end
up in the wrong location are assumed to be incorrect and are therefore discarded.
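Backward validation can be sketched independently of the optical flow routine. In the code below the two tracking steps are passed in as functions, which in practice could wrap a pyramidal Lucas-Kanade call; the function names, the threshold and the synthetic trackers are assumptions made for the illustration.

```python
import numpy as np

def forward_backward_filter(points, track_forward, track_backward, max_error_px=1.0):
    """Track points forward, then backward, and keep those that return to their start."""
    forward = track_forward(points)      # positions in the new frame
    returned = track_backward(forward)   # tracked back to the original frame
    error = np.linalg.norm(returned - points, axis=1)
    ok = error <= max_error_px
    return forward[ok], ok

# Synthetic trackers: a uniform shift, where the backward pass drifts on the last point.
shift = np.array([2.0, 0.0])

def track_forward(p):
    return p + shift

def track_backward(p):
    out = p - shift
    out[2] += 5.0
    return out

points = np.array([[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]])
kept, ok = forward_backward_filter(points, track_forward, track_backward)
print(kept)
```

The check doubles the cost of the tracking step, since every frame pair is processed twice.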

The performance with regard to speed depends on the resolution of the sequence and the
number of feature points that are tracked. The detection step (and thus also redetection
in order to add more features during tracking) depends on the size of the tracking
target and not the resolution of the entire image. It is also independent of how many
features are to be found. The reason is that OPTFLOW evaluates the entire region for
possible corner features and orders them in a list according to how good they are. The
number of features used is then simply the number of feature points selected from that
list.

The tracking step depends on the number of features picked in the detection and the
resolution of the entire image. By the use of image pyramids the tracking step is quite
efficient, but it might be possible to increase the speed by only calculating the
optical flow on a smaller region of the image, for example the region covered by the
previous bounding box of the tracked object.

There are a number of parameters in the OPTFLOW tracker that can be set differently. One
choice is how often to redetect new features within the object bounding box. Redetecting
often could improve the robustness if done correctly, but if the bounding box is in the
wrong position when the redetection is done, the tracker might get lost completely.
Redetection should therefore be performed when it is likely that the bounding box covers
the object.

The downside of having many parameters is that more tuning is needed, which can lead to
overfitting to certain scenarios. In general, it is desirable for a tracker to work well
under different conditions.

5.3.2 KCF

KCF is shown to be robust and fast on the sequences we tested it on. According to [17]
the speed of KCF is directly related to the size of the tracked region.

The tracker is local in the sense that it only evaluates samples in the current image
around the previous position of the object. If the object moves very fast in the image,
which may also happen because the camera moves rapidly, no sample contains the object,
so no sample is valid even though the object is still somewhere in the image. This may
be of less concern for PTZ cameras since they often have a smooth movement due to the
pan-tilt restrictions, and therefore the use of KCF could be appropriate for PTZ
tracking.

Information about the movement is available from the system. If KCF is used during a
change of zoom level (not in the scope of this thesis), it must be taken into account
that the tracker does not search the whole image.

The implementation we used tracked on HOGs, a more complex feature than raw pixel
values, but KCF is also capable of working with raw pixel values. This may be better if
the object edges become blurred, since large gradient features represent sharp edges.
This is connected to the camera movement, as objects become more blurred if the camera
moves fast. Since a PTZ camera is relatively stable, there should not be too much blur
from motion, and KCF together with HOGs could be suitable to use in PTZ tracking.

5.3.3 ASMS

Looking at the sequences recorded with the PTZ camera, the tracking ability of ASMS is
hard to fault. The speed on the other hand took a large hit when the bounding box (and
thus the colour histogram representing the target) contained many different colours.
This is probably a problem with the implementation and not the algorithm (bad memory
locality when comparing pixels to the histogram): when the tracker was run on a computer
with a better CPU, the impact of the more complex object was not as severe.

Analysing the results from evaluating the tracker on the VOT2015 dataset it is possi-
ble that ASMS has room for improvement. One change that would be relatively easy to
implement and evaluate is a change of feature space. Instead of using colour histograms
for the backprojection it might be beneficial to use HOGs. KCF was evaluated in [17]
using both raw pixel values and HOG descriptors and was shown to benefit from using
HOGs.

Changing feature space might make the tracking more robust in the presence of objects
with a similar colour distribution but a different form. The histogram that is used as a
template for finding the target is never updated after initialisation, which makes it
hard for the tracker to cope with even gradual changes in how the object looks.

5.4 Conclusions

There are in theory cases where a good background subtraction result helps a tracker in
keeping the correct track, detecting tracking failures and detecting occlusion. Failure
detection and occlusion detection were not in the scope of this thesis, and the
computations spent on modelling the background online to aid tracking seem to be better
used in the tracking routine, for example for fusion of several complementary tracking
algorithms.

If background subtraction is to be done with a moving camera, good alignment of
frames is very important. An idea is to use the knowledge of the movement to get an
initial guess and then use some form of keypoint matching to compensate for noise and
fine-tune the alignment.

5.4.1 Future work

To make it possible to evaluate and get good measures of how well different algorithms
perform, more evaluation data are needed. It would be of interest to generate datasets
of pan-tilt-zoom video sequences where the pan-tilt-zoom configuration for each frame is
attached. It would be especially useful to construct sequences where this information
could help the tracking, such as sequences containing occlusion and sequences where the
moving object is similar to the background. Using the toolset from the VOT Challenge
with a dataset containing frame offset information might be a good way to automate the
testing.


To increase tracking performance in the scenario with a PTZ camera there are some
different things that could be explored. The background subtraction proved to be of
little help to the trackers. What might be better is to use the raw probability map from
the subtraction instead. Depending on its probability of being foreground, a pixel
should contribute more or less to the tracking.
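For a histogram-based tracker such as ASMS, one way to let pixels contribute according to their foreground probability is to use the probability as a weight when the colour histogram is built. The sketch below illustrates that idea on a single channel; it is an assumption about how the weighting could be done, not something that was evaluated, and the function name is hypothetical.

```python
import numpy as np

def weighted_histogram(values, foreground_prob, bins=16, value_range=(0, 256)):
    """Normalised histogram in which each pixel counts by its foreground probability."""
    hist, _ = np.histogram(values.ravel(), bins=bins, range=value_range,
                           weights=foreground_prob.ravel())
    total = hist.sum()
    return hist / total if total > 0 else hist

values = np.array([[10, 10], [200, 200]])    # pixel intensities inside the bounding box
prob = np.array([[1.0, 1.0], [0.0, 0.0]])    # probability of being foreground
hist = weighted_histogram(values, prob)
print(hist)
```

Unlike a binary mask, this keeps the original pixel values, so no artificial black pixels or sharp edges are introduced into the target description.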

Instead of using the information about the robot motion to make it possible to perform
background subtraction, it might be better to use the information to create a motion
model for the tracked object. For example, keep track of how many pixels the object has
moved between two frames and subtract the motion of the camera to get a better estimate
of the true velocity. When guessing the starting position one can take into account both
how much the camera has moved since the last frame and how fast the object is moving (if
v_object is equal to v_camera, do not move the tracking box from the reported location).
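A constant-velocity version of such a motion model can be sketched as follows. The camera motion is represented here by the image shift it induces on the static background between two frames, in pixels; this sign convention, the function name and the numbers are assumptions made for the illustration.

```python
import numpy as np

def predict_start(position, prev_position, background_shift_prev, background_shift_next):
    """Predict where to start the search for the object in the next frame."""
    apparent = position - prev_position                  # motion seen in the image
    object_velocity = apparent - background_shift_prev   # motion relative to the scene
    # Assume the object keeps its velocity and add the next camera-induced shift.
    return position + object_velocity + background_shift_next

# Camera follows the object exactly: the object stays put in the image.
same = predict_start(np.array([100.0, 100.0]), np.array([100.0, 100.0]),
                     np.array([-5.0, 0.0]), np.array([-5.0, 0.0]))
# Static camera: the object continues 10 pixels to the right.
ahead = predict_start(np.array([100.0, 100.0]), np.array([90.0, 100.0]),
                      np.array([0.0, 0.0]), np.array([0.0, 0.0]))
print(same, ahead)
```

In the first case the estimated object velocity and the camera-induced shift cancel, which corresponds to leaving the tracking box at the reported location.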

An obvious way to get better tracking in a known specific scenario is to select an
appropriate tracker. In the current state of the field, no single tracker works perfectly in
all scenarios. But if you know that the camera will only move smoothly and track objects
that move smoothly you could optimise for that. Which scenarios a tracker works well
in depends on the feature space used. The choices of feature space and tracker are in
a sense disconnected. The feature space is how an image is pre-processed before input
to the tracker. Using raw pixels (pixel colour intensities) makes the tracking robust
to problems that erase edges but preserve colour (fog, motion blur etc.) but makes
the tracking vulnerable to rapid changes in colour (blinking lights). One solution to
the colour change problem might be to normalise the images, to give them the same
mean colour. Another is to simply pick another feature space, like HOGs. Using HOGs
the tracking becomes more robust against illumination changes but gets vulnerable to
changes that destroy edges.

To get better tracking the simplest approach might be to pick trackers that are good at
different things and combine them in a tracker fusion. Combining ASMS and KCF for
example should yield a tracker that is robust both to colour changes and to destruction
of edges. The hard part of doing tracker fusion is to know when to rely on which
tracker. The good thing is that many trackers are designed such that it could be
possible to calculate a measure of tracker certainty. ASMS and KCF report a score for
several possible positions in each frame. For the tracker to be certain, the best score
should be much better than the average score; otherwise all samples are equally bad,
which indicates tracking failure. In OPTFLOW the number of tracked features still left
on the object could be a measure of tracker certainty.
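One possible way to turn "the best score should be much better than the average" into a number is to measure how many standard deviations the best score lies above the mean. The measure below is an illustrative choice, not one that was evaluated in this thesis; the function name and the example scores are hypothetical.

```python
import numpy as np

def peak_confidence(scores):
    """Distance of the best score from the mean score, in standard deviations."""
    scores = np.asarray(scores, dtype=float)
    spread = scores.std()
    if spread == 0:
        return 0.0  # all candidate positions are equally good or bad
    return float((scores.max() - scores.mean()) / spread)

peaked = [0.1] * 99 + [1.0]   # one clear best position
ambiguous = [0.4, 0.6] * 50   # many equally plausible positions
print(peak_confidence(peaked), peak_confidence(ambiguous))
```

A fusion scheme could then rely on the tracker with the higher confidence in each frame, with the threshold for declaring failure left as a tuning parameter.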

An extension of tracker fusion in a single video stream is to add more cameras and
different sensors in a fully connected system.

Bibliography

[1] Y. Sheikh, O. Javed, and T. Kanade, “Background subtraction for freely mov-
ing cameras,” in Computer Vision, 2009 IEEE 12th International Conference on,
pp. 1219–1225, IEEE, 2009.

[2] E. Hayman and J.-O. Eklundh, “Statistical background subtraction for a mobile
observer,” in Computer Vision, 2003. Proceedings. Ninth IEEE International Con-
ference on, pp. 67–74, IEEE, 2003.

[3] A. Elqursh and A. Elgammal, “Online moving camera background subtraction,” in
Computer Vision–ECCV 2012, pp. 228–241, Springer, 2012.

[4] K.-I. Kanatani, “Camera rotation invariance of image characteristics,” Computer
vision, graphics, and image processing, vol. 39, no. 3, pp. 328–354, 1987.

[5] D. Murray and A. Basu, “Motion tracking with an active camera,” Pattern Analysis
and Machine Intelligence, IEEE Transactions on, vol. 16, no. 5, pp. 449–459, 1994.

[6] A. Yilmaz, O. Javed, and M. Shah, “Object tracking: A survey,” Acm computing
surveys (CSUR), vol. 38, no. 4, p. 13, 2006.

[7] Z. Kim, “Real time object tracking based on dynamic feature grouping with back-
ground subtraction,” in Computer Vision and Pattern Recognition, 2008. CVPR
2008. IEEE Conference on, pp. 1–8, IEEE, 2008.

[8] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Čehovin, G. Fernandez, T. Vo-
jir, G. Häger, G. Nebehay, et al., “The Visual Object Tracking VOT2015 challenge
results,” Dec 2015.

[9] W. Hu, T. Tan, L. Wang, and S. Maybank, “A survey on visual surveillance of object
motion and behaviors,” Systems, Man, and Cybernetics, Part C: Applications and
Reviews, IEEE Transactions on, vol. 34, no. 3, pp. 334–352, 2004.

[10] H. Yang, L. Shao, F. Zheng, L. Wang, and Z. Song, “Recent advances and trends in
visual tracking: A review,” Neurocomputing, vol. 74, no. 18, pp. 3823–3831, 2011.


[11] J.-Y. Bouguet, “Pyramidal implementation of the affine Lucas Kanade feature tracker
description of the algorithm,” Intel Corporation, vol. 5, no. 1-10, p. 4, 2001.

[12] S. J. Julier and J. K. Uhlmann, “New extension of the Kalman filter to nonlinear
systems,” in AeroSense’97, pp. 182–193, International Society for Optics and
Photonics, 1997.

[13] M. Piccardi, “Background subtraction techniques: a review,” in Systems, man and
cybernetics, 2004 IEEE international conference on, vol. 4, pp. 3099–3104, IEEE,
2004.

[14] Y. Benezeth, P.-M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, “Review
and evaluation of commonly-implemented background subtraction algorithms,” in
Pattern Recognition, 2008. ICPR 2008. 19th International Conference on, pp. 1–4,
IEEE, 2008.

[15] A. Elgammal, D. Harwood, and L. Davis, “Non-parametric model for background
subtraction,” in Computer Vision—ECCV 2000, pp. 751–767, Springer, 2000.

[16] T. Vojir, J. Noskova, and J. Matas, “Robust scale-adaptive mean-shift for tracking,”
Pattern Recognition Letters, vol. 49, pp. 250–258, 2014.

[17] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with
kernelized correlation filters,” Pattern Analysis and Machine Intelligence, IEEE
Transactions on, 2015.

[18] M. Kristan, R. Pflugfelder, A. Leonardis, J. Matas, F. Porikli, L. Čehovin,
G. Nebehay, G. Fernandez, T. Vojir, et al., “The Visual Object Tracking VOT2013
challenge results,” Dec 2013.

[19] J. L. Barron, D. J. Fleet, and S. S. Beauchemin, “Performance of optical flow tech-
niques,” International journal of computer vision, vol. 12, no. 1, pp. 43–77, 1994.

[20] J. Shi and C. Tomasi, “Good features to track,” in Computer Vision and Pattern
Recognition, 1994. Proceedings CVPR’94., 1994 IEEE Computer Society Confer-
ence on, pp. 593–600, IEEE, 1994.

[21] C. Tomasi and T. Kanade, Detection and tracking of point features. School of
Computer Science, Carnegie Mellon Univ. Pittsburgh, 1991.

[22] C. Harris and M. Stephens, “A combined corner and edge detector.,” in Alvey vision
conference, vol. 15, p. 50, Citeseer, 1988.

[23] Y. Cheng, “Mean shift, mode seeking, and clustering,” Pattern Analysis and Ma-
chine Intelligence, IEEE Transactions on, vol. 17, no. 8, pp. 790–799, 1995.

[24] M. J. Swain and D. H. Ballard, “Indexing via color histograms,” in Active Perception
and Robot Vision, pp. 261–273, Springer, 1992.


[25] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “Exploiting the circulant
structure of tracking-by-detection with kernels,” in proceedings of the European Con-
ference on Computer Vision, 2012.

[26] Wikipedia, “Kernel method — Wikipedia, the free encyclopedia,” 2016.
https://en.wikipedia.org/w/index.php?title=Kernel_method&oldid=709375900, [Online;
accessed 12-May-2016].

[27] L. Čehovin, M. Kristan, and A. Leonardis, “Is my new tracker really better than
yours?,” in WACV 2014: IEEE Winter Conference on Applications of Computer
Vision, IEEE, Mar 2014.

[28] M. Kristan, J. Matas, A. Leonardis, T. Vojir, R. Pflugfelder, G. Fernandez, G. Nebe-
hay, F. Porikli, and L. Cehovin, “A novel performance evaluation methodology for
single-target trackers,” arXiv preprint arXiv:1503.01313, 2015.

[29] A. Li, M. Lin, Y. Wu, M.-H. Yang, and S. Yan, “Nus-pro: A new visual tracking
challenge,” 2015.

[30] Y. Benezeth, P.-M. Jodoin, B. Emile, H. Laurent, and C. Rosenberger, “Compar-
ative study of background subtraction algorithms,” Journal of Electronic Imaging,
vol. 19, no. 3, pp. 033003–033003, 2010.

[31] Y. Feng, S. Luo, Y. Tian, S. Deng, and H. Zheng, “Comprehensive analysis and
evaluation of background subtraction algorithms for surveillance video,” Sensors &
Transducers, vol. 177, no. 8, p. 163, 2014.

