Frame rate up-conversion of real-time high-definition remote surveillance video

Andreas Isberg (andreas@isberg.se)
Master of Science in Software Engineering and Technology

Johan Jostell (johan.jostell@gmail.com)
Master of Science in Computer Science: Algorithms, Languages and Logic

Chalmers University of Technology
Department of Computer Science and Engineering
Göteborg, Sweden, April 2012

The Author grants to Chalmers University of Technology and University of Gothenburg the non-exclusive right to publish the Work electronically and in a non-commercial purpose make it accessible on the Internet. The Author warrants that he/she is the author of the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law.

The Author shall, when transferring the rights of the Work to a third party (for example a publisher or a company), acknowledge the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet.

Frame rate up-conversion of real-time high-definition remote surveillance video
© Andreas J. Isberg, April 2012.
© Johan C. Jostell, April 2012.

Examiner: Devdatt Dubhashi
Supervisors: Vinay Jethava, Henrik Engström

Chalmers University of Technology
Department of Computer Science and Engineering
SE-412 96 Göteborg
Sweden
Telephone +46 (0)31-772 1000

Cover: Frame rate up-conversion (see Chapter 3) of the Liseberg video (see Figure 9.2).

Department of Computer Science and Engineering
Göteborg, Sweden, April 2012

Abstract

Many surveillance and monitoring systems capture video using static cameras at relatively low frame rates. Low frame rates have benefits such as lower storage requirements and lower bandwidth use, and these concerns may be even greater for high-definition video capture and remote surveillance. However, a reduced frame rate also makes the video appear less smooth, causing increased strain for the viewer. It is therefore desirable to increase the frame rate without increasing bandwidth use or changing equipment.

This thesis presents the design and implementation of a tool for performing frame rate up-conversion in real-time. The tool makes use of the GPUs and multi-core CPUs found in modern computer systems. Techniques and frameworks such as OpenMP, SSE and OpenCL are utilized to make full use of the system's capabilities.

Frame rate up-conversion is performed in two steps: motion estimation and motion compensation. For motion estimation, a number of block matching algorithms were implemented and evaluated. We present in this thesis our version of an exhaustive bidirectional search algorithm for block matching, called Full Bidirectional Search (FBDS). It uses zero motion prejudgement inspired by Adaptive Rood Pattern Search (ARPS). The tool performs bidirectional motion compensation (BDMC) on the GPU using OpenCL.

Testing was performed on three high-definition videos recorded by us, fitting the remote surveillance use scenario. Results show increased quality compared to simple frame averaging, although with occasional block artefacts. The tool is capable of up-converting the recorded videos by a factor of three within a timespan of less than 50 ms.
Hence, it is viable for use in a high-definition remote surveillance system.

Sammanfattning

Many surveillance systems use static cameras with a relatively low frame rate to record video. A low frame rate has certain benefits, such as lower storage requirements and lower bandwidth use in remote surveillance systems. For high-definition video these benefits may be even more important. Video recorded at a low frame rate can, however, appear jerky, which may cause increased strain for the person watching it. It is therefore desirable to increase the frame rate without increasing bandwidth use or changing equipment.

This thesis presents the design and implementation of a tool for performing frame rate up-conversion in real-time. The tool makes use of the graphics cards and multi-core processors available in modern computers. Techniques and frameworks such as OpenMP, SSE and OpenCL are used to exploit the systems' full capacity.

Frame rate up-conversion is performed in two steps: motion estimation and motion compensation. A number of block matching algorithms were implemented and evaluated for motion estimation. We present in this report our own version of an exhaustive bidirectional search algorithm for block matching, called Full Bidirectional Search (FBDS), with a prejudgement of static image areas inspired by Adaptive Rood Pattern Search (ARPS). Bidirectional motion compensation (BDMC) was performed on the graphics card using OpenCL.

Testing was performed on three self-recorded high-definition sequences that fit the remote surveillance use case. The results show that the video quality increases compared to simpler techniques, although some block artefacts occur. The tool is capable of up-converting the recorded video sequences by a factor of three within a timespan of less than 50 ms. This shows that the tool is suitable for use in high-definition remote surveillance systems.

Acknowledgments

We would like to thank our supervisors Vinay Jethava and Henrik Engström for support and advice during the writing of this thesis. Vinay has been a great help in sifting through the relevant literature and providing insightful comments during the development. Henrik's expertise in x86 assembly has been a large contribution to our optimization efforts. We would also like to thank Erik Levin, who was kind enough to let us borrow his camera for filming test videos. Finally, we would like to thank our friends and families for all their support.

Johan and Andreas

Table of Contents

1 Introduction
  1.1 Frame rate conversion
  1.2 Thesis contributions
  1.3 Disposition
2 Problem description
  2.1 System specifications
    2.1.1 Software-based implementation
    2.1.2 Definition of real-time
    2.1.3 High-definition
    2.1.4 Static scenes
    2.1.5 Remote objects
    2.1.6 High reliability
  2.2 Scope and delimitations
3 Frame rate up-conversion
  3.1 Frame repetition
  3.2 Frame averaging
  3.3 Advanced methods
  3.4 Existing research
4 Motion estimation
  4.1 Block matching algorithms
    4.1.1 Full search
    4.1.2 Three Step Search
    4.1.3 New Three Step Search
    4.1.4 Four Step Search
    4.1.5 Diamond Search
    4.1.6 Adaptive Rood Pattern Search
    4.1.7 Full Bidirectional Search
  4.2 Block similarity functions
    4.2.1 Sum of Absolute Differences
    4.2.2 Mean Absolute Difference
    4.2.3 Sum of Squared Differences
    4.2.4 Mean Squared Error
  4.3 Bitstream motion vectors
5 Motion compensation
  5.1 Direct motion compensation
  5.2 Bidirectional motion compensation
6 OpenCL
  6.1 Processing flow
  6.2 Terminology
7 Evaluation methods
  7.1 Objective quality evaluation
    7.1.1 Peak signal-to-noise ratio
    7.1.2 Structural similarity
    7.1.3 Other methods
  7.2 Subjective quality evaluation
    7.2.1 Double-stimulus impairment scale
    7.2.2 Double-stimulus continuous quality-scale
    7.2.3 Subjective assessment method for video quality
  7.3 Choice of quality evaluation methods
8 Implementation
  8.1 Naive methods
  8.2 Motion estimation
    8.2.1 SAD assembly optimization
    8.2.2 Overlapped block motion estimation
    8.2.3 Zero motion prejudgement
  8.3 Motion compensation
9 Results
  9.1 Test videos
  9.2 Test system specification
  9.3 Subjective quality evaluation
    9.3.1 Recorded videos
    9.3.2 Standard test videos
  9.4 Objective quality evaluation
    9.4.1 Recorded videos
    9.4.2 Standard videos
  9.5 Real-time capability
    9.5.1 Motion estimation time measurements
    9.5.2 Motion compensation time measurements
    9.5.3 Total processing times
10 Conclusions
  10.1 Future work
Bibliography
Glossary
Acronyms

List of Figures

2.1 Video path from source to screen with frame rate up-conversion
3.1 The frame rate up-converter illustrated as a black box
3.2 Up-conversion using frame averaging
3.3 Up-converting a high-motion video using frame averaging
4.1 Block matching
4.2 The three steps of TSS
4.3 New Three Step Search
4.4 The four steps of 4SS
4.5 Diamond Search
4.6 Adaptive Rood Pattern Search
4.7 Full Bidirectional Search
5.1 Direct motion compensation
5.2 Unfilled pixels when using direct motion compensation
5.3 Bidirectional motion compensation
7.1 Full-reference objective quality evaluation
8.1 Up-conversion pseudo code
8.2 Bilinear interpolation
9.1 CIF test video sequences
9.2 Liseberg test video
9.3 FerryClose test video
9.4 FerryFar test video
9.5 Artefacts in 3x up-conversion of FerryFar sequence
9.6 Sharpness comparison, 3x up-conversion of FerryClose
9.7 Sharpness compared to original, 3x up-conversion of FerryClose
9.8 Artefacts in 3x up-conversion of FerryClose
9.9 Artefacts on tram, 3x up-conversion of Liseberg
9.10 Artefacts on tower, 3x up-conversion of Liseberg
9.11 Artefacts on tram, 2x up-conversion of Liseberg
9.12 Artefacts on tower, 2x up-conversion of Liseberg
9.13 Face artefacts, frame 38 of Akiyo
9.14 Face artefacts, frame 76 of Akiyo
9.15 Face artefacts, frame 137 of Akiyo
9.16 6x up-conversion of Container, frame 111
9.17 Bird artefacts, frame 262 of Container
9.18 Face artefacts, frame 13 of Foreman
9.19 Average PSNR of FerryFar
9.20 Per-frame PSNR of FerryFar
9.21 Average PSNR of FerryClose
9.22 Per-frame PSNR of FerryClose
9.23 Average PSNR of Liseberg (3x)
9.24 Per-frame PSNR of Liseberg (3x)
9.25 Average PSNR of Liseberg (2x)
9.26 Per-frame PSNR of Liseberg (2x)
9.27 Average PSNR of Akiyo
9.28 Per-frame PSNR of Akiyo
9.29 Per-frame SSIM of Akiyo
9.30 Average PSNR of Container
9.31 Per-frame PSNR of Container
9.32 Per-frame SSIM of Container
9.33 Average PSNR of Foreman
9.34 Per-frame PSNR of Foreman
9.35 Per-frame SSIM of Foreman
9.36 ME on FerryFar using DESKTOP
9.37 ME on FerryFar using LAPTOP
9.38 ME on FerryClose using DESKTOP
9.39 ME on FerryClose using LAPTOP
9.40 ME on Liseberg using DESKTOP
9.41 ME on Liseberg using LAPTOP
9.42 Total processing time on Akiyo using DESKTOP
9.43 Total processing time on FerryFar using DESKTOP
9.44 Total processing time on FerryFar using LAPTOP
9.45 Total processing time on FerryClose using DESKTOP
9.46 Total processing time on FerryClose using LAPTOP
9.47 Total processing time on Liseberg using DESKTOP
9.48 Total processing time on Liseberg using LAPTOP
9.49 Total processing time on FerryFar (8x8E) using DESKTOP
9.50 Total processing time on FerryClose (8x8E) using DESKTOP
9.51 Total processing time on Liseberg (8x8E) using DESKTOP
9.52 Total processing time on FerryFar (8x8E) using LAPTOP
9.53 Total processing time on FerryClose (8x8E) using LAPTOP
9.54 Total processing time on Liseberg (8x8E) using LAPTOP

List of Tables

4.1 List of notations used in this chapter
9.1 Specifications of test systems
9.2 Average time for different optimizations on ME
9.3 Relative time improvements with optimization
9.4 Average time for ME for the standard videos
9.5 Comparison of CPU- and OpenCL-based BDMC implementations
9.6 Average time for MC on the recorded videos
9.7 Average time for MC on the standard videos

1. Introduction

1.1 Frame rate conversion

Many surveillance and monitoring videos are captured using static cameras with a relatively low frame rate. The low frame rate of the videos is economical both in terms of the monitoring equipment and the storage requirements. Other factors may influence the necessity of keeping a low frame rate, such as limited transmission bandwidth for a remote surveillance solution. Storage space may be an issue for high-definition (HD) resolution video capture. However, a low frame rate poses a challenge in terms of the manual inspection of such video feeds, since low frame rate videos frequently exhibit non-smooth motion artefacts. This can lead to increased strain for the viewer. Therefore, a key challenge towards improving the user experience lies in successfully obtaining a smoother video by increasing the frame rate.

In this thesis we present a solution aimed at removing these issues for remote surveillance HD video feeds, using a technique known as frame rate up-conversion.

1.2 Thesis contributions

The main contributions of this thesis are:

• The design and implementation of a frame rate up-conversion tool (FRUCT), specialized for use with remote surveillance video.
• A comparative study of some well-known, published unidirectional block matching algorithms (BMAs) and their suitability for use with frame rate up-conversion of remote surveillance video.

• Research and data on how the Open Computing Language (OpenCL) can be used for GPU-acceleration of frame rate up-conversion.

• Full Bidirectional Search (FBDS), our version of an exhaustive search BMA, using ideas from Adaptive Rood Pattern Search (ARPS) for early termination of the search in static areas of the video.

1.3 Disposition

The thesis is organized as follows:

• Chapter 2 presents the problem description and goals, as well as the scope and delimitations.
• Chapter 3 describes the theory behind both naive and more advanced ways of performing frame rate up-conversion. A short summary of some of the existing research in the area is also given.
• Chapter 4 introduces a number of motion estimation algorithms used in this thesis.
• Chapter 5 gives details on the motion compensation algorithms used in this thesis.
• Chapter 6 familiarizes the reader with the terminology and processing flow of OpenCL.
• Chapter 7 contains information about the evaluation methods by which the results can be measured.
• Chapter 8 gives a detailed look at our implementation of a frame rate up-converter.
• Chapter 9 presents the results of frame rate up-conversion.
• Chapter 10 sums up the thesis with conclusions and notes on possible future work.

2. Problem description

2.1 System specifications

The focus of this thesis is the design and implementation of a frame rate up-conversion tool (FRUCT). The up-converter will be part of a larger system, as can be seen in Figure 2.1. It will work as a "middle-man" between the decoder and renderer software. Input to the FRUCT is raw 1080p video at 20 frames per second (FPS), corresponding to the intended remote surveillance scenario. Output should be up-converted video in the same format at 60 FPS. In other words, two interpolated frames should be inserted for every original frame. Performance and quality should be high enough for the tool to be successfully applied in real-time.

Figure 2.1: Video path from source to screen with the FRUCT.

The area of application, remote surveillance with the settings described above, adds a number of items to the specification. These may not be as important for a general-purpose implementation of a FRUCT.

2.1.1 Software-based implementation

The FRUCT should be implemented in software, not in hardware. Many hardware-based solutions exist today, especially for use in real-time and for high-definition (HD) video, e.g. by Lee & Nguyen (2010). Software-based solutions are, however, easier to maintain and can more easily interact with other parts of a software-based remote surveillance system.

2.1.2 Definition of real-time

The FRUCT should be capable of up-converting a received video stream from 20 FPS to 60 FPS (an up-conversion factor of 3) in real-time. For true real-time behaviour, each frame constructed by the FRUCT must be completed within 1/60 of a second, roughly 16.7 ms.

The real-time demand in the case of remote surveillance also imposes a restriction on the delay between a video frame being captured by the camera and the frame being displayed on-screen. This delay should ideally be kept as low as possible, and in this specific case should be no more than one (1) second. The variable transmission delay makes it additionally difficult to put a hard limit on the maximum delay the FRUCT may introduce in the system.
The FRUCT may require some initial buffering or setup routines, introducing a delay before the first received frame is written to the output channel. A solution involving a relaxation of the real-time demand is to produce frames to be placed in an output buffer of a predetermined size, typically equal to the up-conversion factor. For a buffer size of n and a frame rate f, the time limit is then equal to n/f, which for n = 3 and f = 60 gives 50 ms. In other words, we have 50 ms to fill the buffer with three frames, where one is an original frame and two are constructed by the FRUCT. Since the original frame can be copied directly to the output buffer, most of the 50 ms can in practice be used for the construction of the two new frames.

2.1.3 High-definition

Video received by the FRUCT will be in HD, 1080p. Video capture at such high resolutions produces large amounts of data to be encoded, transmitted and decoded. A majority of the FRUCTs described in the existing literature have primarily been tested on low-resolution video formats, with less focus on real-time capability.

2.1.4 Static scenes

The intended usage scenario for the FRUCT is stationary cameras and mostly static outdoor scenery. A more general-purpose FRUCT must be able to handle video motion as well as camera motion (panning, tilting, etc.), the latter being known as global motion. Cuts and scene changes never happen for continuously filming stationary cameras. There are, however, events such as sudden changes in cloud coverage or weather which may affect a large portion of the visible area, and these have to be taken into account.

2.1.5 Remote objects

Objects found in video from the intended usage scenario are a sizeable distance from the camera; from tens of meters up to several hundred meters. The distance means that they will appear in a relatively small area of the video frame. This presents both possibilities and problems: the motion of these objects from frame to frame, in terms of pixels on the screen, will be fairly small. However, depending on which method is used for motion estimation (ME), there might be problems finding the true motion of the image areas covered by the objects.

2.1.6 High reliability

For the video to be a reliable source of information, especially vital if it is to be used for security-critical surveillance, the processed video should stay as close to the original as possible. The frames used for interpolation are not to be manipulated in any way, and shall be shown "as-is". Visible artefacts in the video should preferably be kept to a minimum, particularly in regions of interest.

2.2 Scope and delimitations

In order to keep the focus on areas of importance, the following decisions were made:

• The implementation does not have to be a complete product; a working prototype suffices.

• The prototype should be implemented in software, not hardware. Hardware solutions have been proven successful. However, a well implemented software solution should still be able to produce good results, with the benefits of being more easily implemented and maintained.

• The prototype should focus on high performance and quality for small moving objects (far away, thus having small motion), rather than large objects close to the camera. This focus is due to the intended use for remote surveillance (remote in both senses: the video is streamed to a remote location, and the objects under surveillance are a sizeable distance from the cameras).
• The prototype should not handle any encoding, decoding or rendering, except for possibly obtaining motion vectors from the H.264/MPEG-4 AVC (H.264) bitstream.

• The prototype should only be used with video from stationary cameras, as opposed to a solution capable of up-converting any kind of video.

• The prototype should work with 1080p video resolution (or lower), in Y'CbCr 4:2:0 color space (raw, uncompressed video).

3. Frame rate up-conversion

Increasing the frame rate, i.e. up-conversion, is done to enhance the visual experience of video with a low frame rate. Video at 30 frames per second (FPS) is clearly smoother than video at 20 FPS, especially for video with much motion. Up-conversion is performed by adding new frames to a video, constructed from existing ones, as Figure 3.1 illustrates. Much research has been put into doing this in a fast and efficient manner.

Figure 3.1: The frame rate up-converter illustrated as a black box. Input is a source video and output is an up-converted video.

Existing research in the area falls under two broad categories of solutions: hardware-based and software-based. Hardware-based solutions using Field-Programmable Gate Arrays (FPGAs) perform well for real-time 1080p video (Lee & Nguyen 2010). Such solutions are however beyond the scope of this thesis.

3.1 Frame repetition

The simplest way of increasing the frame rate is by repeating every frame F a total of k times, where k is the up-conversion factor. Equation 3.1 below describes frame repetition (FR), where 1 ≤ j ≤ k, j ∈ Z+ and t is a time index.

F_{t-\frac{j}{k}} = F_{t-1}   (3.1)

For example, F_{t-\frac{1}{2}} is the interpolated frame between frames F_{t-1} and F_t. This method is extremely fast but gives a poor visual experience.

3.2 Frame averaging

A slightly better way to increase the frame rate is by interpolating all pixel values of a new frame from the two neighbouring frames. Figure 3.2 below illustrates frame averaging (FA).

Figure 3.2: Up-conversion using FA. Every new frame (non-striped) is averaged from the two neighbouring frames (striped).

Let P_t(x, y) be the value of a pixel with coordinates (x, y) in frame f_t, t being a time index. Then for all (x, y) inside the frame boundaries of f_t, with 1 ≤ j < k and j ∈ Z+, the pixel values of an intermediate interpolated frame f_{t-\frac{j}{k}} are given by Equation 3.2.

P_{t-\frac{j}{k}}(x, y) = \frac{j}{k} P_{t-1}(x, y) + \left(1 - \frac{j}{k}\right) P_t(x, y)   (3.2)

This method is very fast, and gives a smoother visual experience than FR. It is however not suitable for high-motion video, as seen in Figure 3.3. Smoothness is gained at the loss of sharpness.

Figure 3.3: Up-converting a high-motion video using frame averaging. This video is the Soccer standard benchmarking video (standard test videos can be found at http://media.xiph.org/video/derf/).
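As a concrete illustration of Equation 3.2, the sketch below interpolates one frame by frame averaging, assuming 8-bit luma planes stored as contiguous arrays; the function name and data layout are ours, chosen for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Frame averaging (Equation 3.2): interpolate frame f_{t-j/k} from the two
// neighbouring original frames f_{t-1} and f_t. Both inputs are assumed to be
// 8-bit luma planes of identical size, stored row by row.
std::vector<uint8_t> averageFrame(const std::vector<uint8_t>& prev,   // f_{t-1}
                                  const std::vector<uint8_t>& next,   // f_t
                                  int j, int k)                       // 1 <= j < k
{
    std::vector<uint8_t> out(prev.size());
    const double wPrev = static_cast<double>(j) / k;   // weight of f_{t-1}
    const double wNext = 1.0 - wPrev;                  // weight of f_t
    for (std::size_t i = 0; i < prev.size(); ++i)
        out[i] = static_cast<uint8_t>(wPrev * prev[i] + wNext * next[i] + 0.5);
    return out;
}
```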
3.3 Advanced methods

The main challenge in frame rate up-conversion is how to efficiently and correctly perform motion estimation (ME) and produce the intermediate frames using motion compensation (MC). Motion-compensated frame interpolation is a term often used to describe this process.

There are many existing algorithms for motion estimation. Approximating the optical flow, the motion of individual pixels from one frame to another, is one way. However, it has traditionally not been used in video encoding, for two reasons: it is a relatively slow method, and encoding the motion of individual pixels would require too much data to be feasible. Instead, methods based on the movements of rectangular blocks of pixels are often used. Since block matching is the de facto standard family of motion estimation algorithms, fast algorithms have been developed over the years. Some of the block matching algorithms (BMAs), of which a few were selected for implementation in this thesis, are presented in Chapter 4.

Once the video motion has been estimated, other algorithms may be used to utilize the so-called motion vectors of each block to produce a sharp intermediate frame. Algorithms for this are presented in Chapter 5. In the following section we give a brief overview of the research that has been done in the area of frame rate up-conversion.

3.4 Existing research

ME is often done with various BMAs to reduce computational complexity. Some form of BMA is performed in the encoders of all video coding standards, e.g. H.264/MPEG-4 AVC (H.264) (Richardson 2010). The resulting motion vectors can be extracted and decoded from the bitstream at the receiving end to reduce or remove the need for ME when performing frame rate up-conversion (Chen et al. 1998, Sasai et al. 2004). In order to detect and account for abrupt illumination changes, which can cause faulty motion vectors during BMA, histogram-based methods have been developed (Thaipanich et al. 2009). By using an increased temporal window in BMA the accuracy of ME can be increased (Kang et al. 2008), though at the price of processing time.

A very popular post-processing step on the Motion Vector Field (MVF) obtained from ME is median filtering (Zhai et al. 2005, Choi et al. 2006, Gan et al. 2007, Luessi & Katsaggelos 2009). This step can also be performed in hardware (Tasdizen & Hamzaoglu 2010). The filtering "smooths" the MVF, removing outliers. The problem of overlapping motion-compensated blocks is often solved by employing Overlapped Block Motion Compensation (OBMC) techniques (Lee et al. 2003, Zhai et al. 2005, Choi et al. 2006). An alternative method of handling occlusion is, for example, using the divergence of the MVF (Hong 2009).

4. Motion estimation

There are several different ways to estimate motion between frames, and many models which may be suitable to describe this motion. The model most commonly used is that of linear motion with constant acceleration. This is a reasonable assumption provided the frame rate is not too low, since the frame-to-frame coherence is usually high.

Section 4.1 below presents a way to perform motion estimation (ME) by using a block matching algorithm (BMA). Image blocks in temporally adjacent frames are matched, as illustrated in Figure 4.1, using a block similarity function (see Section 4.2). Section 4.3 presents another way to estimate motion, by using pre-calculated motion vectors from encoded bitstreams.

Figure 4.1: Block matching between a block in the current frame and blocks in the reference frame. The best match is found in the lower-right corner. The search area is used to limit the number of blocks matched.

4.1 Block matching algorithms

In block matching algorithms (BMAs) a frame is divided into a set, B, of non-overlapping rectangular image blocks. An individual block b ∈ B is matched against blocks in a temporally adjacent reference frame. The best match found for a block b, using some appropriate similarity function, is used to form a motion vector for the block. The motion vectors are used in motion compensation (MC), described in Chapter 5.
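The outer structure shared by the BMAs in this chapter can be sketched as follows; the names and the callback-based decomposition are ours and are only meant to illustrate how one motion vector is produced per block. Since the blocks are independent (except in ARPS, Section 4.1.6), the outer loops can be parallelized with e.g. OpenMP.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct MotionVector { int x; int y; };

// A block matcher takes the current and reference luma planes, the frame
// dimensions and the top-left corner of one block, and returns the motion
// vector of the best match it found (e.g. TSS, 4SS, ARPS or FBDS).
using BlockMatcher = std::function<MotionVector(const uint8_t* cur, const uint8_t* ref,
                                                int width, int height,
                                                int blockX, int blockY, int blockSize)>;

// Outer loop common to the block matching algorithms: divide the frame into a
// set B of non-overlapping blocks and estimate one motion vector per block.
std::vector<MotionVector> estimateMotion(const uint8_t* cur, const uint8_t* ref,
                                         int width, int height, int blockSize,
                                         const BlockMatcher& match)
{
    std::vector<MotionVector> mvf;                       // motion vector field
    for (int y = 0; y + blockSize <= height; y += blockSize)
        for (int x = 0; x + blockSize <= width; x += blockSize)
            mvf.push_back(match(cur, ref, width, height, x, y, blockSize));
    return mvf;
}
```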
The most common block sizes used are 4x4, 8x8 and 16x16 pixels. This is because they are common divisors of the most well-known resolutions, such as the Common Intermediate Format (CIF) and 720p. The 1080p frame height is however not divisible by 16, and this must be taken into consideration during implementation. Matching blocks can be computationally expensive, and different BMAs try to reduce the number of block matching operations required to find the best match. All BMAs presented below, except for Adaptive Rood Pattern Search (ARPS) (Nie & Ma 2002), are fully parallelizable, as all blocks within a frame are independent of each other.

α: length of the rood arm
b: an image block
b_H: height of a block in pixels
b_W: width of a block in pixels
H: search area height
MV_x: horizontal component of a block motion vector
MV_y: vertical component of a block motion vector
n: total number of blocks matched for one block
s_i: number of block matching operations in step i of an algorithm
t: time index
W: search area width
Z+: positive integers (1, 2, 3, ...)

Table 4.1: List of notations used in this chapter.

4.1.1 Full search

Full Search (FS) involves finding the best match for a reference block against all possible block locations within a specified search window. The number of blocks matched, n, for blocks of width b_W and height b_H, and a search window of width W and height H (where W ≥ b_W and H ≥ b_H), can be calculated using Equation 4.1 below.

n = (W - b_W + 1)(H - b_H + 1)   (4.1)

For a typical setting of b_W = b_H = 16 and W = H = 30, a total of 225 blocks are matched. Matching a block against an entire frame is possible, but not practically feasible, especially for high frame resolutions. For example, matching a block against an entire frame using b_W = b_H = 16, W = 1920 and H = 1080, a total of 2,028,825 block matching operations would be required.

4.1.2 Three Step Search

One of the very first BMAs was the Three Step Search (TSS) (Koga et al. 1981). This algorithm has influenced many other algorithms, such as New Three Step Search (NTSS) and Four Step Search (4SS). TSS reduces the number of search operations, compared to FS, by iteratively locating the best match in three steps.

• Step one: Match all positions, including the center position, at a distance of four pixels away from the center. This step matches nine blocks, i.e. s_1 = 9.

• Step two: Center the search position around the best match found. Match all positions at a distance of two pixels away from the center. This step matches eight blocks, i.e. s_2 = 8.

• Step three: Center the search position around the best match found. Match all positions at a distance of one pixel away from the center. This step matches eight blocks, i.e. s_3 = 8. The best match found in this step is the match returned by the algorithm.

The algorithm always performs 25 block matching operations (Equation 4.2). Figure 4.2 below illustrates an example run of TSS.

n = s_1 + s_2 + s_3 = 25   (4.2)

TSS reduces the number of blocks matched compared to full search. If b_W = b_H = 16 and W = H = 30 are used for full search, TSS requires nine times fewer block matching operations. Barjatya (2004) argues that one disadvantage of TSS is that it can miss small motion due to its search pattern.

Figure 4.2: Illustration of the TSS algorithm. The starting center position is indicated by a surrounding square. The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. TSS always performs 25 block matching operations.
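As an illustration, the following is a minimal sketch of TSS for a single block, using a plain SAD similarity function (Section 4.2.1); frame-border clipping and the optimizations of Chapter 8 are omitted, and the names are ours.

```cpp
#include <cstdint>
#include <cstdlib>

struct MotionVector { int x; int y; };

// SAD (Section 4.2.1) between the block at (bx, by) in the current frame and
// the block displaced by (dx, dy) in the reference frame. width is the row
// stride of both frames; border clipping is omitted for brevity.
static int sad(const uint8_t* cur, const uint8_t* ref, int width,
               int bx, int by, int dx, int dy, int blockSize)
{
    int sum = 0;
    for (int y = 0; y < blockSize; ++y)
        for (int x = 0; x < blockSize; ++x)
            sum += std::abs(int(cur[(by + y) * width + bx + x]) -
                            int(ref[(by + dy + y) * width + bx + dx + x]));
    return sum;
}

// Three Step Search: evaluate a 3x3 pattern at step sizes 4, 2 and 1 pixels,
// re-centering on the best match after each step (25 matches in total).
MotionVector tss(const uint8_t* cur, const uint8_t* ref, int width,
                 int bx, int by, int blockSize)
{
    MotionVector best{0, 0};
    int bestCost = sad(cur, ref, width, bx, by, 0, 0, blockSize);
    for (int step = 4; step >= 1; step /= 2) {
        const MotionVector center = best;                // re-center each step
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0) continue;        // center already evaluated
                const int cost = sad(cur, ref, width, bx, by,
                                     center.x + dx, center.y + dy, blockSize);
                if (cost < bestCost) {
                    bestCost = cost;
                    best = {center.x + dx, center.y + dy};
                }
            }
    }
    return best;
}
```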
4.1.3 New Three Step Search

Li et al. (1994) suggested an improved version of TSS, called New Three Step Search (NTSS). Where TSS can have problems finding small motion, NTSS tackles this using a center-biased search pattern. In contrast to TSS, it has two different early termination conditions, where the search can be stopped early to reduce the total number of blocks matched.

• Step one: Match all positions, including the center position, at a distance of one and four pixels away from the center. This step matches 17 blocks, i.e. s_1 = 17. If the best match is found in the center, return it as the best match.

• Steps two and three: If the best match in step one is found at a distance of:

  - One pixel from the center: center the search position around the best match found. Match all positions not matched in the first step at a distance of one pixel away from the center. This step matches either three or five blocks. The best match found at this point is the best matching block.

  - Four pixels from the center: follow steps two and three from TSS. The best match found in step three is the match returned by NTSS.

Step two matches three, five or eight blocks, i.e. s_2 ∈ {3, 5, 8}. Step three can be skipped, or matches eight blocks, i.e. s_3 ∈ {0, 8}. The number of blocks matched, n, is the sum of all blocks matched in the steps above. This can vary from 17 to 33 matches (Equation 4.3). Figure 4.3 below illustrates an example run of NTSS.

n = s_1 + s_2 + s_3, where n ∈ {17, 20, 22, 30, 32, 33}   (4.3)

Li et al. (1994) and Barjatya (2004) have presented results showing that NTSS performs slightly better than TSS.

Figure 4.3: Illustration of the NTSS algorithm. The third step was skipped as the best match found in the first step was one pixel away from the center. The starting center position is indicated by a surrounding square. The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. A total of 22 blocks were matched in this example.

4.1.4 Four Step Search

The Four Step Search (4SS) algorithm (Po & Ma 1996) is based on TSS and NTSS. Like TSS and NTSS it uses a center-biased search pattern, and it has early termination conditions in the first two steps.

• Step one: Match all positions, including the center position, at a distance of two pixels away from the center. This step matches nine blocks, i.e. s_1 = 9. If the best match is found in the center, skip to step four.

• Step two: Center the search position around the best match found. Match all positions not matched in the first step at a distance of two pixels away from the center. This step is either skipped, or matches three or five blocks, i.e. s_2 ∈ {0, 3, 5}. If the best match is found in the center, skip to step four.

• Step three: This step is exactly the same as step two. It is either skipped, or matches three or five blocks, i.e. s_3 ∈ {0, 3, 5}.

• Step four: Center the search position around the best match found. Match all positions at a distance of one pixel away from the center. This step matches eight blocks, i.e. s_4 = 8. The best match found in this step is the match returned by the algorithm.

The number of blocks matched, n, is the sum of all blocks matched in the four steps. This can vary from 17 to 27 matches (Equation 4.4). Figure 4.4 below illustrates an example run of 4SS where 25 blocks are matched.

n = s_1 + s_2 + s_3 + s_4, where n ∈ {17, 20, 22, 23, 25, 27}   (4.4)

Figure 4.4: Illustration of the 4SS algorithm. The starting center position is indicated by a surrounding square.
The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. A total of 25 blocks were matched in this example.

Po & Ma (1996) and Barjatya (2004) show that 4SS reduces the number of blocks matched, compared to TSS, by 20-30%, depending on the amount of motion in the video. 4SS can in the worst case be 8% slower than TSS.

4.1.5 Diamond Search

Diamond Search (DS) (Zhu & Ma 1997) is influenced by 4SS but uses a diamond-shaped search pattern instead. It has three steps, where the second step can be repeated any number of times.

• Step one: Start by matching the center position, the positions two pixels away in the horizontal and vertical directions, and the positions one pixel away in the diagonal directions. This step matches nine blocks, i.e. s_1 = 9. If the best match is found in the center, skip to step three.

• Step two: Center the search position around the best match found. Match all positions not matched in the first step in a diamond-shaped pattern. This step is either skipped, or matches three or five blocks, i.e. s_2 ∈ {0, 3, 5}. Repeat this step until the best match is found in the center.

• Step three: Center the search position around the best match found. Match all positions one pixel away in the horizontal and vertical directions. This step matches four blocks, i.e. s_3 = 4. The best match found in this step is the match returned by the algorithm.

The number of blocks matched, n, is the sum of all blocks matched in the three steps (Equation 4.5). Figure 4.5 below illustrates an example run of DS where 21 blocks are matched.

n = s_1 + k s_2 + s_3, where n ∈ {13, 16, 18, 19, 21, 22, 23, ...}, k ∈ {0, 1, 2, ...}   (4.5)

Figure 4.5: Illustration of the DS algorithm. The starting center position is indicated by a surrounding square. The circles indicate the best matches found in each step. Notice that the same position was found as the best match for both the first and second run of step two. The double circle indicates the best match found by the algorithm. A total of 21 blocks were matched in this example.

Barjatya (2004) shows that DS slightly reduces the number of blocks matched compared to 4SS.

4.1.6 Adaptive Rood Pattern Search

The Adaptive Rood Pattern Search (ARPS) (Nie & Ma 2002) is named after its rood-like search pattern. Every block uses the motion vector of its left neighbouring block to set the length of the rood arm α. ARPS is thus only parallelizable in the vertical direction, as the blocks are horizontally dependent. This dependency gives the algorithm the potential of accurately finding large motion between frames, since it assumes that motion is in general coherent within a frame. ARPS uses zero-motion prejudgement for early termination on video with static content, which can significantly reduce the total number of blocks matched for a frame. Three steps are used to match blocks, where the last step is repeated until the center position is the best match.

• Step one: Perform zero-motion prejudgement; match the block against the block at the same position in the reference frame. If the difference is below a threshold τ, stop searching and return this match. This step matches one block, i.e. s_1 = 1.

• Step two: If the block is the left-most block, set α = 2. Otherwise use the largest component, MV_x or MV_y, of the motion vector of the left neighbouring block, i.e. α = max(|MV_x|, |MV_y|).
Letting (x, y) denote the current center of the search, the rood-arm positions (x ± α, y) and (x, y ± α), together with the predicted position (x + MV_x, y + MV_y), are matched. This step is either skipped, or matches five blocks, i.e. s_2 ∈ {0, 5}.

• Step three: Center the search position around the best match found. Match all positions, not matched in the first or second step, at a distance of one pixel away from the center in the horizontal and vertical directions. Iterate this step until the best match is the center position. This position is returned as the best match by the algorithm. Each iteration of this step is either skipped, or matches two to four blocks, i.e. s_3 ∈ {0, 2, 3, 4}.

The number of blocks matched, n, is the sum of all blocks matched in the three steps (Equation 4.6). Notice that step three can be repeated k times. Figure 4.6 below illustrates an example run of ARPS where step three is iterated three times; a total of 15 blocks are matched in this example.

n = s_1 + s_2 + k s_3, where n ∈ {1, 8, 9, 10, ...}, k ∈ {0, 1, 2, ...}   (4.6)

Figure 4.6: Illustration of the ARPS algorithm. The starting center position is indicated by a surrounding square. The rood arm α is illustrated by the thin line. The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. A total of 15 blocks were matched in this example.

Nie & Ma (2002) and Barjatya (2004) show that ARPS reduces the number of blocks matched, compared to DS, by a factor of two.

4.1.7 Full Bidirectional Search

Bidirectional search, where bidirectional and bilateral are sometimes used interchangeably, is a family of BMAs which in theory should be better suited for use together with bidirectional motion compensation (BDMC). The searching procedure differs from e.g. 4SS and ARPS in that no position in one frame is kept constant while searching for a match in the other frame. Instead, the positions in both frames are allowed to vary. The idea is not to estimate where one block of the frame might have moved to or from in the reference frame, but which surrounding parts may have moved through the current block. This corresponds well to the behaviour of BDMC, since the motion vectors for the block, as used by BDMC, can be said to originate from an unfilled block in an interpolated intermediate frame. Variations on bidirectional search are used by e.g. Zhai et al. (2005) and Kang et al. (2008).

We present in this thesis Full Bidirectional Search (FBDS), our simple version of an exhaustive bidirectional search algorithm. It uses the zero motion prejudgement found in ARPS for early termination, which should prove highly beneficial for the low to moderate motion found in the remote surveillance use scenario. The search area ranges from -4 to +4 pixels in the horizontal and vertical directions, for a total of 81 positions searched (82 including the early zero motion check). The algorithm rests heavily on the assumption of linear motion: the offset in search position is negated for one of the frames; e.g. a position with an offset of minus one in both the horizontal and vertical directions compared to the block's position is matched against a position with an offset of plus one in the horizontal and vertical directions in the other frame.

The number of positions searched may seem large compared to e.g. 4SS and ARPS. However, zero motion prejudgement, aggressive optimization by the compiler and improved scheduling when using OpenMP reduce this issue.
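The following is a minimal single-block sketch consistent with the FBDS description above; it is an illustration under the stated assumptions (plain SAD, a zero motion threshold τ, no frame-border clipping), not the exact thesis implementation.

```cpp
#include <cstdint>
#include <cstdlib>

struct MotionVector { int x; int y; };

// SAD between the block offset by (dxA, dyA) in frame A (f_{t-1}) and the block
// offset by (dxB, dyB) in frame B (f_t), both relative to the block position
// (bx, by) in the interpolated frame. width is the row stride of both frames.
static int sad2(const uint8_t* frameA, const uint8_t* frameB, int width,
                int bx, int by, int dxA, int dyA, int dxB, int dyB, int blockSize)
{
    int sum = 0;
    for (int y = 0; y < blockSize; ++y)
        for (int x = 0; x < blockSize; ++x)
            sum += std::abs(int(frameA[(by + dyA + y) * width + bx + dxA + x]) -
                            int(frameB[(by + dyB + y) * width + bx + dxB + x]));
    return sum;
}

// Full Bidirectional Search for one block: zero motion prejudgement first,
// then an exhaustive search over mirrored offsets in [-4, +4] in both frames.
MotionVector fbds(const uint8_t* frameA, const uint8_t* frameB, int width,
                  int bx, int by, int blockSize, int zeroMotionThreshold)
{
    MotionVector best{0, 0};
    int bestCost = sad2(frameA, frameB, width, bx, by, 0, 0, 0, 0, blockSize);
    if (bestCost < zeroMotionThreshold)
        return best;                                     // step one: static block

    for (int dy = -4; dy <= 4; ++dy)                     // step two: 81 mirrored positions
        for (int dx = -4; dx <= 4; ++dx) {
            const int cost = sad2(frameA, frameB, width, bx, by,
                                  dx, dy, -dx, -dy, blockSize);
            if (cost < bestCost) { bestCost = cost; best = {dx, dy}; }
        }
    return best;
}
```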
Figure 4.7 shows how the two steps of FBDS work. In step one, the center position in both frames is matched and checked against the zero motion threshold. Step two (s_2) involves 81 matches, starting with the top-left (-4, -4) block in frame A matched against the bottom-right (+4, +4) block in frame B. The last matching operation in this step is between the bottom-right (+4, +4) block in frame A and the top-left (-4, -4) block in frame B.

Figure 4.7: Illustration of the FBDS algorithm. The starting center position is indicated by a surrounding square. Step two involves 81 additional matches, starting at the top-left block in frame A and the bottom-right block in frame B. The last match is between the bottom-right block in frame A and the top-left block in frame B. A total of 82 blocks were matched in this example.

4.2 Block similarity functions

There are a number of different techniques to measure the similarity between two blocks (denoted by b_A and b_B in the following sections). All pixels in the blocks are compared in pairs and a total value indicating the similarity is computed. A low value indicates higher similarity. As the pixels are independent of each other in each frame, the techniques are parallelizable and can be implemented in an efficient manner.

4.2.1 Sum of Absolute Differences

The Sum of Absolute Differences (SAD) sums the absolute differences of the pixel values (Bovik 2009, Marzat & Ducrot 2009), as described in Equation 4.7. SAD is sometimes referred to as the Sum of Absolute Errors (SAE).

SAD(b_A, b_B) = \sum_{i=0}^{b_W - 1} \sum_{j=0}^{b_H - 1} |b_{A_{ij}} - b_{B_{ij}}|   (4.7)

This technique is widely adopted (Cetin & Hamzaoglu 2010, Lee & Nguyen 2009, Luessi & Katsaggelos 2009), as it is computationally cheaper than many other block similarity functions. Streaming SIMD Extensions (SSE) exploit the inherent parallelism by providing special instructions for computing the SAD of up to 16 pixels at a time (see Section 8.2.1).

4.2.2 Mean Absolute Difference

The Mean Absolute Difference (MAD) computes the mean of the SAD (Barjatya 2004), as described in Equation 4.8. MAD is sometimes referred to as the Mean Absolute Error (MAE).

MAD(b_A, b_B) = \frac{1}{b_W b_H} SAD(b_A, b_B)   (4.8)

MAD is computationally more expensive than SAD, and an implementation requires floating point precision.

4.2.3 Sum of Squared Differences

The Sum of Squared Differences (SSD) sums the squared differences of the pixel values (Marzat & Ducrot 2009), as described in Equation 4.9.

SSD(b_A, b_B) = \sum_{i=0}^{b_W - 1} \sum_{j=0}^{b_H - 1} (b_{A_{ij}} - b_{B_{ij}})^2   (4.9)

4.2.4 Mean Squared Error

The Mean Squared Error (MSE) computes the mean of the SSD (Bovik 2009), as described in Equation 4.10.

MSE(b_A, b_B) = \frac{1}{b_W b_H} SSD(b_A, b_B)   (4.10)

MSE is computationally more expensive than SSD, and an implementation requires floating point precision. This function is used to compute the peak signal-to-noise ratio (PSNR) between two frames, which is described in Section 7.1.1.
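As noted in Section 4.2.1, SSE provides a dedicated instruction for the SAD of 16 pixel pairs. The sketch below shows one way this can be used for a 16x16 block via compiler intrinsics; the thesis's own hand-written assembly routine (Section 8.2.1) may differ.

```cpp
#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// SAD between two 16x16 blocks (Equation 4.7), using the PSADBW instruction
// via _mm_sad_epu8 to process 16 pixels per iteration. curStride and refStride
// are the row strides (frame widths) of the two luma planes.
int sad16x16_sse2(const uint8_t* cur, const uint8_t* ref,
                  int curStride, int refStride)
{
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(cur + row * curStride));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ref + row * refStride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));    // two partial 64-bit sums
    }
    // Add the low and high 64-bit halves to obtain the final SAD value.
    __m128i hi = _mm_srli_si128(acc, 8);
    return _mm_cvtsi128_si32(_mm_add_epi64(acc, hi));
}
```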
4.3 Bitstream motion vectors

Some form of block matching is performed in the encoders of all video coding standards. The resulting motion vectors can be extracted and decoded from the bitstream at the receiving end to reduce or remove the need for ME when performing frame rate up-conversion. This technique has successfully been used by e.g. Chen et al. (1998) and Sasai et al. (2004). Depending on the settings and runtime decisions on the encoder side, the motion vectors in the bitstream may or may not describe the true motion in the video. They can in some cases be missing entirely (Huang & Nguyen 2008). If bitstream motion vectors are to be used, some way of judging the quality of the received motion vectors must be incorporated into the algorithm, as done by e.g. Luessi & Katsaggelos (2009).

5. Motion compensation

5.1 Direct motion compensation

The simplest form of motion compensation (MC) is direct motion compensation (DMC), also referred to as unidirectional motion compensation. In a new interpolated frame, blocks are filled with data by moving blocks in the reference frame along their motion vectors (obtained in motion estimation (ME)). This can be done in either the forward or the backward direction. Figure 5.1 below illustrates DMC in the forward direction.

Figure 5.1: Illustration of DMC in the forward direction. Every block in F_{t-1} is moved along its motion vector.

The problem with DMC is that not all pixels may be filled (Huang & Nguyen 2008, Luessi & Katsaggelos 2009). These "holes" are illustrated in Figure 5.2 below. Various methods exist to fill these holes, such as video inpainting (Criminisi et al. 2004, Wexler et al. 2004) and texture synthesis (Ashikhmin 2001).

Figure 5.2: Unfilled pixels when using direct motion compensation. This frame is from the Foreman standard benchmarking video.

5.2 Bidirectional motion compensation

Another way of performing motion compensation is bidirectional motion compensation (BDMC). For every block in a new frame, the motion vector found for the same block in the reference frame is used to interpolate in both the backward and forward directions. Unlike DMC, this method leaves no unfilled pixels, as it iterates over all blocks in the new frame instead of over blocks from the reference frame. Figure 5.3 below illustrates BDMC.

Figure 5.3: BDMC illustrated. For every block in the interpolated frame F_{t-1/2}, the motion vector found for the same block in the reference frame F_{t-1} is used to interpolate in both the backward and forward directions.

Let P_t(x, y) be the value of a pixel with coordinates (x, y) in frame f_t, t being a time index. Then for all (x, y) inside the boundaries of f_t, with 1 ≤ j < k and j ∈ Z+, the pixel values of an intermediate interpolated frame f_{t-\frac{j}{k}} are given by Equation 5.1 below.

P_{t-\frac{j}{k}}(x, y) = \frac{1}{2} P_{t-1}\left(x - \left(1 - \frac{j}{k}\right) MV_x,\; y - \left(1 - \frac{j}{k}\right) MV_y\right) + \frac{1}{2} P_t\left(x + \frac{j}{k} MV_x,\; y + \frac{j}{k} MV_y\right)   (5.1)
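A minimal CPU-side sketch of Equation 5.1 for one interpolated frame is given below, assuming 8-bit luma planes, frame dimensions divisible by the block size, and nearest-pixel sampling with border clamping; the actual implementation performs this step on the GPU with bilinear interpolation (Chapter 6 and Section 8.3).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct MotionVector { int x; int y; };

// Bidirectional motion compensation (Equation 5.1) for the interpolated frame
// f_{t-j/k}. For every pixel, the motion vector of the block it belongs to is
// used to average one sample fetched backwards from f_{t-1} and one fetched
// forwards from f_t. mvf holds one motion vector per block, row by row.
void bdmc(const uint8_t* prev, const uint8_t* next, uint8_t* out,
          int width, int height, int blockSize,
          const std::vector<MotionVector>& mvf, int j, int k)
{
    const double a = 1.0 - static_cast<double>(j) / k;   // backward fraction
    const double b = static_cast<double>(j) / k;         // forward fraction
    const int blocksPerRow = width / blockSize;
    auto clampi = [](int v, int lo, int hi) { return std::max(lo, std::min(v, hi)); };

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const MotionVector mv =
                mvf[(y / blockSize) * blocksPerRow + (x / blockSize)];
            const int px = clampi(x - static_cast<int>(a * mv.x), 0, width - 1);
            const int py = clampi(y - static_cast<int>(a * mv.y), 0, height - 1);
            const int nx = clampi(x + static_cast<int>(b * mv.x), 0, width - 1);
            const int ny = clampi(y + static_cast<int>(b * mv.y), 0, height - 1);
            out[y * width + x] = static_cast<uint8_t>(
                0.5 * prev[py * width + px] + 0.5 * next[ny * width + nx] + 0.5);
        }
}
```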
6. OpenCL

The growing interest in parallel processing solutions has led to the emergence of multi-core CPUs as well as GPGPU, general-purpose computing on graphics processing units. The GPUs of today contain hundreds up to a thousand stream processing cores, a good match for the highly parallelizable algorithms in this thesis. Efforts to utilize this processing power include frameworks and APIs such as CUDA from Nvidia, OpenCL and Microsoft's DirectCompute.

The Open Computing Language (OpenCL) was chosen for the work of this thesis on the merits of having wider industry support: CUDA is only supported by Nvidia, and DirectCompute is part of Microsoft's DirectX frameworks. We made use of OpenCL for GPU-acceleration of the motion compensation part of the code, taking advantage of on-device specialized hardware for bilinear interpolation. For a sufficiently high-end GPU it also proved to be faster than a CPU-based implementation of motion compensation. We used the first version of the standard, OpenCL 1.0. A later version, OpenCL 1.1, is available but does not contain any substantial changes relevant to our needs, and at the time of writing it required beta-version drivers for some platforms.

OpenCL is an open industry standard. It was developed to allow general-purpose code to run on a wide range of processor types, such as CPUs, GPUs or Digital Signal Processing (DSP) units. The specification is defined by the Khronos OpenCL Working Group, whose parent organisation, The Khronos Group, is also responsible for the development of OpenGL (hence the naming of OpenCL). The standard is supported by companies such as Nvidia, AMD, IBM, Apple and others. OpenCL consists of a language, OpenCL C (based on the C99 standard), and a set of APIs for controlling the execution of the code written in this language.

In the following sections we present a brief overview of the terminology and processing flow when working with OpenCL. For a more detailed overview, please refer to the OpenCL specification (Khronos OpenCL Working Group 2008).

6.1 Processing flow

Setting up OpenCL to run code on a GPU typically requires the following steps:

1. Acquire a platform ID and device ID.
2. Create a compute context.
3. Create a command queue.
4. Create a compute program.
5. Build the compute program.
6. Create a compute kernel in the compute program.
7. Allocate buffers on the device (if buffers are required by the kernel).

Once these steps have been completed, some additional steps are required to actually run the code and obtain the results:

8. Set the global and local work sizes. In many cases this only needs to be done once.
9. Transfer data from the host into device buffers (if necessary).
10. Set the kernel arguments. Some arguments may only have to be set once.
11. Execute the kernel on the device.
12. Transfer the result from the device back to the host.

6.2 Terminology

A platform ID identifies the host and a collection of devices. The host is the part of the system from which the calls to the OpenCL API are made. There can be several OpenCL implementations available on the system, for example if implementations from both AMD and Nvidia are installed. A choice would then have to be made as to which of these to use.

Devices consist of one or more compute units, which in turn consist of processing elements. A device can be a CPU, a Graphics Processing Unit (GPU) or another type of device (referred to as "dedicated OpenCL accelerators" in the specification). For a multi-core CPU the corresponding OpenCL representation would be a device with a single compute unit and one processing element per core. If a system contains more than one CPU, these would be available as separate compute units.

A context is the environment within which the code executes. This environment includes a set of devices together with information about the memory properties of these devices. It also contains command queues associated with the devices. The command queue holds commands enqueued for a specific device.

Kernels, functions written in the OpenCL C language, are grouped together in programs, which are either built for specific devices during runtime or loaded from pre-built binaries. To execute a kernel in a successfully built program, the kernel arguments must first be set. These arguments may e.g. be primitive types such as integer values, handles to buffers allocated on the device, or texture sampler objects.
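The steps of Section 6.1 map onto the OpenCL 1.0 C API roughly as in the sketch below. The kernel name and its arguments are hypothetical, and error checking and resource release are omitted for brevity.

```cpp
#include <CL/cl.h>

// Minimal host-side flow (steps 1-12 of Section 6.1). kernelSource is assumed
// to hold an OpenCL C kernel named "interpolate" that reads two input buffers
// and writes one output buffer, one work item per pixel.
void runKernel(const char* kernelSource, const unsigned char* in0,
               const unsigned char* in1, unsigned char* out, size_t nPixels)
{
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);                                   // 1. platform/device
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);          // 2.
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);                    // 3.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernelSource, nullptr, nullptr);      // 4.
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);                               // 5.
    cl_kernel kernel = clCreateKernel(prog, "interpolate", nullptr);                           // 6.
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  nPixels, nullptr, nullptr);           // 7.
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  nPixels, nullptr, nullptr);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, nPixels, nullptr, nullptr);

    size_t global = nPixels;                                                   // 8. work sizes
    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, nPixels, in0, 0, nullptr, nullptr);          // 9.
    clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, nPixels, in1, 0, nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);                          // 10. arguments
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);  // 11.
    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, nPixels, out, 0, nullptr, nullptr);           // 12.
}
```

The blocking flags (CL_TRUE) make the transfers synchronous; in a real-time pipeline the non-blocking variants discussed below would typically be preferred.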
The size and dimensions of the data to be processed must be specified by setting the global and local work sizes. Data representations may be one-, two- or three-dimensional. The global work size determines the total number of work items needed to process all the data, and specifying a local work size gives the option to partition the data. Several instances of the kernel are executed in parallel on the device, each instance being a work item. The work items are grouped into work groups, whose sizes depend on the specified local work sizes. Partitioning the data into smaller chunks e.g. enables sharing of data common to all work items in a work group. Efficient sharing may optimize memory operations, leading to much higher performance.

Each work group executes on a single compute unit, and each work item can be distinguished from the others by its so-called global and local IDs. The global ID is the coordinate in the data being processed, a coordinate which may be one-, two- or three-dimensional depending on the type of data. If e.g. a two-dimensional image is being processed, the global ID of a work item might map to a specific pixel coordinate in the image. The local ID determines the coordinate of the work item within its work group.

If on-device buffers are used by a kernel, these must be filled with data by enqueuing data transfer commands to a command queue associated with the device. Commands may be issued as blocking, where the next statement is not executed before the transfer is complete, or non-blocking, resuming execution while allowing the transfer to happen "in the background". When the arguments and work sizes have been set, and the buffers filled with data, the kernel can be enqueued for execution by adding an execution command to a command queue. This execution command can also be non-blocking.

Output from a kernel often comes in the form of a device buffer filled with modified data. This data can be copied back from the device, also either blocking or non-blocking, by issuing a buffer-reading command to the command queue. The command can be added to the queue right after the kernel has been queued for execution, since the transfer will not initiate before the execution is complete (unless the command queue has been specified to allow out-of-order execution). Once the data has been transferred back from the device, it may be written to file or used in other ways.

7. Evaluation methods

There are two different performance aspects to consider when evaluating a frame rate up-conversion tool (FRUCT): algorithm runtimes and image quality. The runtime depends on how fast the motion estimation (ME) and motion compensation (MC) are performed. This aspect is particularly interesting in a time-constrained setting, e.g. real-time environments. The image quality can be evaluated with both objective and subjective methods; objective methods are based on mathematical formulas, whereas subjective methods are based on how humans perceive the quality. A number of objective and subjective methods are presented below, along with the choice of methods for this thesis.

7.1 Objective quality evaluation

Objective quality evaluation can be categorised into full-reference (a source frame exists), reduced-reference (some part of a source frame exists) and no-reference (no source frame exists) quality estimation. Estimating the quality of a new frame without, or with only a reduced, source frame is a difficult and challenging task.
Nevertheless, there have been successful results; Woodard & Carley-Spencer (2006) use no-reference objective measurement to automate screening of structural magnetic resonance images, and Wang et al. (2002) use no-reference objective measurement to assess the quality of JPEG images. In full-reference evaluation, new frames are compared with source frames to determine the quality of the new frames. Frames are dropped at a factor k from a source video to produce a down-converted video. E.g. a source video at 60 frames per second (FPS) can be down-converted with k = 3 to a new video at 20 FPS. The frame rate up-conversion tool then adds new frames, with the same factor k, to produce a new video at 60 FPS.

Figure 7.1: Full-reference objective quality evaluation illustrated. A source video is first down-converted and then up-converted. The source frames (marked as A) and the up-converted frames (marked as B) are compared pair-wise using an objective quality evaluation method. The output indicates how similar the pairs of frames are.

7.1.1 Peak signal-to-noise ratio

Richardson (2010) describes peak signal-to-noise ratio (PSNR) as a fast and easy full-reference metric to compare the luma between two frames. PSNR gives a logarithmic value in decibels, as shown in Equation 7.1. MSE is the Mean Squared Error (described in Section 4.2.4) and m is the number of bits needed to sample one pixel. If two frames are identical the MSE will be zero and the PSNR_dB undefined.

\mathrm{PSNR}_{\mathrm{dB}} = 10 \log_{10} \frac{(2^m - 1)^2}{\mathrm{MSE}}    (7.1)

PSNR does not always correspond well with subjective quality evaluation (Richardson 2010, Wang & Bovik 2009). PSNR is, despite this, the de facto standard for objective quality evaluation (Oelbaum et al. 2007), mainly because it is very easy to implement and fast compared to more complex methods.

7.1.2 Structural similarity

Structural Similarity (SSIM) (Wang & Bovik 2002, Wang et al. 2004) is another full-reference objective quality evaluation metric. Like PSNR, SSIM compares the luma data of a pair of frames, but it also takes differences in contrast and structure into consideration. SSIM gives as its result a normalized value in the range −1.0 to 1.0 indicating how similar the frames are. A value of 1.0 is returned only if the frames are identical. Equation 7.2 describes SSIM, where x and y are the two frames, µx and µy are the estimated mean luma of x and y, σx and σy are the estimated contrasts (standard deviations) of x and y, and σxy is the covariance of x and y. C1 and C2 are constants used to avoid instability in the function.

\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}    (7.2)

The comparison performed by SSIM corresponds better with subjective quality evaluation, but it is more complex and harder to implement than PSNR.

7.1.3 Other methods

Much research effort has been put into developing objective quality evaluation methods that correspond well with subjective quality evaluation. Here are some of the other methods presented in recent years:

• PSNR+, an extension of PSNR (Oelbaum et al. 2007).
• Predicted Mean Opinion Score (MOSp) (Bhat et al. 2009).
• A metric based on Just Noticeable Difference (JND), proposed by Suthaharan et al. (2005).

These methods have high reported correlation with subjective quality evaluation methods, up to 70-90% (Richardson 2010).

7.2 Subjective quality evaluation

Subjective quality evaluation is based on how humans perceive the quality of images and video.
Richardson (2010) argues that there are several factors that may influence how the quality is perceived, e.g. the tester's state of mind and the test environment. Also, humans tend to be sensitive to spatial and temporal fidelity, i.e. how clearly different parts of the video can be seen and how smooth it is. In order to ensure that the results obtained from a subjective quality evaluation are reliable, a large pool of testers is needed. This is because testers, as they grade an increasing number of videos, often learn to look for artefacts, which may affect the results (Richardson 2010). The results obtained from all testers are normalized to a Mean Opinion Score (MOS) (Bhat et al. 2009), which "indicates the relative quality of the impaired and reference sequences" (Richardson 2010, p.20). Three different subjective quality evaluation methods are presented below.

7.2.1 Double-stimulus impairment scale

The ITU Radiocommunication Sector (ITU-R) describes different methods for assessing the quality of images and videos in BT.500-11 (ITU Radiocommunication Sector 2002). One of these methods is the Double-Stimulus Impairment Scale (DSIS), where a tester is presented with pairs of videos, one after another, knowing that the first video is the reference video and the second one is impaired. The tester then grades the second video, having the reference video in mind, on a five-step scale ranging from very annoying to imperceptible.

7.2.2 Double-stimulus continuous quality-scale

The Double-Stimulus Continuous Quality-Scale (DSCQS) is another method described in BT.500-11. A tester is presented with pairs of videos, where the videos are shown simultaneously. The tester grades both videos, without knowing which is the reference video and which is the impaired one, on a five-step scale ranging from bad to excellent. The order should be randomised for every pair of videos. This method is widely used according to Richardson (2010).

7.2.3 Subjective assessment method for video quality

Kozamernik et al. (2005) have suggested a method for subjective quality assessment called the Subjective Assessment Method for Video Quality (SAMVIQ). The tester is presented with a reference video and a set of impaired videos. The tester can watch the videos in any order, and grades the impaired videos on a scale from 0 to 100. SAMVIQ gives results that are comparable with DSIS and DSCQS (Huynh-Thu et al. 2007, Tominaga et al. 2010), and the authors argue that "SAMVIQ is simpler, faster and more user-friendly than traditional subjective evaluation methods".

7.3 Choice of quality evaluation methods

PSNR and SSIM were chosen to evaluate the quality of the frame rate up-conversion tool. PSNR was chosen for implementation, even though it might not correspond well with subjective methods, because of its simplicity and its wide usage in the field. SSIM was chosen to complement the results from PSNR. The third-party software MSU Video Quality Measurement Tool2 was used for this. None of the subjective quality methods were chosen; instead, the up-converted videos were evaluated by ourselves in an ad-hoc manner.

2http://compression.ru/video/quality_measure/video_measurement_tool_en.html

8. Implementation

The frame rate up-conversion tool (FRUCT) was implemented in C++, with some performance-critical parts hand-coded in assembly. Open Computing Language (OpenCL) was used for the motion compensation (MC) part.
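Before the individual parts are described, the overall flow of the tool can be sketched roughly as follows: motion is estimated between the previous and next frame, and k - 1 intermediate frames are synthesised for an up-conversion factor k. The types and helper functions below are hypothetical, not the tool's real interfaces; the actual procedure is summarised in Figure 8.1.

    #include <utility>

    // Sketch of the overall up-conversion loop for a factor k.
    // Frame, FrameSource, FrameSink, MotionField and the helper functions
    // (readFrame, estimateMotion, compensate, writeFrame) are hypothetical.
    void upconvert(FrameSource &input, FrameSink &output, int k)
    {
        Frame prev, next;
        if (!readFrame(input, prev))
            return;                          // nothing to do
        writeFrame(output, prev);            // the first frame is passed through unchanged
        while (readFrame(input, next)) {
            MotionField mv = estimateMotion(prev, next);           // block-matching ME
            for (int i = 1; i < k; ++i) {
                float w = static_cast<float>(i) / k;               // temporal position of new frame
                writeFrame(output, compensate(prev, next, mv, w)); // bidirectional MC
            }
            writeFrame(output, next);        // then the original frame itself
            std::swap(prev, next);           // buffer swap: next becomes the new previous frame
        }
    }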
OpenMP compiler pragmas were used for all major parallelizable loops in the code to ensure that all available CPU cores could be utilized. Frame repetition (FR) and frame averaging (FA) were both implemented to serve as a basis for comparison against the more advanced algorithms. The general up-conversion procedure can be described with the pseudo-code in Figure 8.1.

Figure 8.1: Up-conversion pseudo code

8.1 Naive methods

The FR implementation is trivial; it simply reads one frame at a time from the source file and writes it to the output file k times, where k is the up-conversion factor. FA was implemented in two versions: one CPU-based and one OpenCL-based. The OpenCL-based FA was an exploration of how Graphics Processing Unit (GPU) acceleration could be used for the motion estimation (ME) and MC algorithms in the FRUCT.

8.2 Motion estimation

ME was performed on luma from the raw video data. No extraction of motion vectors from the bitstream was made, for the reasons outlined in Section 4.3. By not using bitstream motion vectors the FRUCT is also not tied to a specific video codec and can be used together with both contemporary and future codecs. A drawback is however the loss of sub-pixel motion vector precision; in H.264/MPEG-4 AVC (H.264) motion vectors have up to quarter-pixel precision, which means motion vectors are not limited to the integer values produced by standard implementations of most ME algorithms described in this thesis. It would be possible to add sub-pixel accuracy to these algorithms, but at the cost of additional processing time, which would probably put the goal of real-time capability at risk. Of the block matching algorithms (BMAs) mentioned in Section 4.1, Full Search (FS), Four Step Search (4SS), Adaptive Rood Pattern Search (ARPS) and Full Bidirectional Search (FBDS) were implemented. The choice of which algorithms to use was based on reported performance, both in terms of quality and runtime (Barjatya 2004). For 4SS the check for redundant search positions was not implemented. In order for the block matching to be feasible for real-time operation, two main parts of any BMA should be the primary focus of optimization efforts. First, the number of potential positions searched should be kept as low as possible without degrading quality too much (a necessary trade-off). Second, the operation of matching two blocks against each other should be as efficient as possible, since this part of the code will be run intensively. Details of the techniques implemented for increased performance and quality are given in the following sections.

8.2.1 SAD assembly optimization

By choosing an efficient algorithm the number of matching operations can be minimized, at a reasonable cost of small ME errors and lost accuracy. However, an optimized similarity function (see Section 4.2) can also greatly reduce the time spent. The Sum of Absolute Differences (SAD) function, due to its wide use in video encoding, has the benefit of having specialized CPU instructions available. In the work of this thesis the Streaming SIMD Extensions (SSE) instruction PSADBW (Intel Corporation 2011) was used, where PSAD stands for Packed Sum of Absolute Differences. This allowed calculation of the absolute differences of 8 pairs of 8-bit values in a single instruction. In other words: if the block size is set to 8x8, an entire block row of 8 values can be processed for each PSADBW instruction.
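A minimal sketch of such a row-at-a-time SAD is shown below, written here with the compiler intrinsic for PSADBW (_mm_sad_epu8) rather than the hand-coded assembly actually used in the tool; the block size and pointer layout are illustrative.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // SAD between two 8x8 luma blocks located in frames with the given row stride.
    static inline int sad_8x8(const uint8_t *a, const uint8_t *b, int stride)
    {
        __m128i acc = _mm_setzero_si128();
        for (int row = 0; row < 8; ++row) {
            // Load one 8-pixel block row from each frame into the low 64 bits.
            __m128i ra = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(a + row * stride));
            __m128i rb = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(b + row * stride));
            // PSADBW: sums the absolute differences of the eight byte pairs.
            acc = _mm_add_epi64(acc, _mm_sad_epu8(ra, rb));
        }
        return _mm_cvtsi128_si32(acc);   // total SAD for the block
    }

The same routine, evaluated with no offset between the two blocks, is also what the zero motion prejudgement described in Section 8.2.3 compares against its threshold.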
This means that the total number of calls to PSADBW needed to compute the SAD for an 8x8 block was 8, and for a 4x4 block only 2. Note that only 64-bit registers were used for PSADBW, not 128-bit, since we were unaware during implementation that this instruction can also operate on 128-bit registers. Using 128-bit instead of 64-bit registers could speed up the SAD calculation for a block by a factor of up to two.

8.2.2 Overlapped block motion estimation

Using a smaller block size means there is less data to use when matching blocks; 4x4 blocks, for example, contain one sixteenth of the data of 16x16 blocks. This can lead to increased sensitivity to image noise and, in effect, faulty motion vectors. To reduce this issue when the block size is smaller than 16x16 we used enlarged blocks in the BMAs, a technique used by e.g. Zhai et al. (2005). Each 4x4 or 8x8 block is temporarily enlarged to 16x16 by including surrounding pixels in each direction, and then matched against a similarly enlarged block in the reference frame.

8.2.3 Zero motion prejudgement

In many types of video, especially the kind of mostly static video this FRUCT implementation is intended for, there is a large frame-to-frame coherence. A high level of coherence can be detected by checking the SAD of a macroblock at position (i, j) in both frames (i.e. no offset in search position) against a threshold value. If the SAD is below the threshold value the block is said to have no motion and no further ME is performed for this block. This technique is taken from ARPS (see Section 4.1.6). We have incorporated it into all ME algorithms implemented for this thesis, except for FS. The threshold value was set to 1024 for a 16x16 block.

8.3 Motion compensation

Direct motion compensation (DMC) was the first MC algorithm to be implemented for the FRUCT. Due to its simplicity it proved to be just as fast as predicted. However, since a potentially time-consuming post-processing method is required (see Section 5.1), this track of development was not explored further. Efforts were instead focused on bidirectional motion compensation (BDMC). As with FA, two versions of the BDMC algorithm were implemented: one CPU-based and one OpenCL-based. The OpenCL-based solution keeps a copy of the previous and next frames in two 2D image buffers (textures). By using a buffer-swapping procedure similar to the pseudo-code in Figure 8.1, only the newly read frame needs to be sent to the device, as the previous frame was already transferred in an earlier call. The motion vectors for all macroblocks are copied to a buffer on the device, and the weighting factor is set as a kernel argument. The global work-item size is set to the frame width and height, and the local work-group size (see Chapter 6) is set to the specified constant macroblock width and height. In other words, every work item in a work group belongs to the same macroblock, and there is a one-to-one correspondence between pixels and work items. During kernel execution each work item uses its global position (i.e. pixel position), together with an offset calculated from the weighting factor and the block motion vector, for look-ups (sampling) in the image buffers to produce an output value. The sampling can be done by nearest-neighbour, rounding non-integer sampling locations to the closest integer one (thus giving the same result as the CPU-based solution).
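A rough outline of such a kernel is shown below. The argument names are hypothetical, the frames are assumed to be stored as single-channel normalised 8-bit images, and the bidirectional weighting shown is one common formulation rather than necessarily the exact one used in the tool.

    // Hypothetical BDMC luma kernel: one work-item per output pixel.
    __kernel void bdmc_luma(__read_only image2d_t prevFrame,
                            __read_only image2d_t nextFrame,
                            __global const float2 *blockMotion,  // one vector per macroblock
                            float w,                             // temporal weight of the new frame
                            int blocksPerRow,
                            int width,
                            __global uchar *out)
    {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
        int x = get_global_id(0);
        int y = get_global_id(1);
        // All work-items in a work-group belong to the same macroblock.
        int block = (y / (int)get_local_size(1)) * blocksPerRow + x / (int)get_local_size(0);
        float2 mv = blockMotion[block];
        float2 p  = (float2)(x + 0.5f, y + 0.5f);   // sample at pixel centres
        // Look backwards along the vector in the previous frame, forwards in the next.
        float a = read_imagef(prevFrame, smp, p - w * mv).x;
        float b = read_imagef(nextFrame, smp, p + (1.0f - w) * mv).x;
        out[y * width + x] = convert_uchar_sat(((1.0f - w) * a + w * b) * 255.0f);
    }

With a sampler declared this way, switching the filter mode from CLK_FILTER_NEAREST to CLK_FILTER_LINEAR is essentially all that is required for the alternative described next.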
Alternatively, sampling can be done by bilinear interpolation, where the values of the four closest integer sample points are weighted together (Figure 8.2). The use of bilinear interpolation trades slightly decreased sharpness for increased smoothness in video motion. Since there is special hardware for this kind of interpolation on modern graphics cards there is no major time penalty compared to nearest-neighbour sampling. This would not be the case for a CPU-based implementation.

Figure 8.2: The non-integer sampling location (the filled circle) is rounded to the top left sample location. In bilinear interpolation (right side), the non-integer sampling location is interpolated from the four closest integer sample points.

Three ways of producing chroma for the interpolated frames were tested: MC (with a separate kernel), averaging, and copying from the previous frame. MC was found to give results that were indistinguishable from chroma averaging in most cases, so the cost of the increased MC time could not be justified. Copying chroma from the previous frame was found to give bad results for videos with moderate motion. For these reasons only averaging of chroma was used, performed on the CPU for flexibility, since the other methods were still available as a configuration option.

9. Results

9.1 Test videos

The videos used for experimental verification were selected from standard test videos3 used in existing literature. The selection was based on similarity to the actual use case of remote surveillance, meaning that videos with a static camera and low to moderate motion were favoured. The Akiyo test video sequence (Common Intermediate Format (CIF) format) shows a female news anchor presenting the news. It features a static camera and background, with low motion as she moves her head while talking (see Figure 9.1a for a sample image from the video). The most challenging part of the video, from a frame rate up-conversion perspective, is to correctly interpolate movements of lips and eyes, especially blinking. The Container test video sequence (CIF format) also includes a static camera and mostly low motion. It features two slowly moving ships, a flag waving in the wind and some water movement. Also, at the end of the sequence two birds fly by close to the camera at high velocity (Figure 9.1b). This video mainly tests the ability to handle low, predictable motion (the ships), some unpredictable motion (flag and water) and extremely high motion (the birds). Foreman (CIF format) is one of the most used test sequences for frame rate up-conversion (Figure 9.1c). It features a close-up of a foreman at a construction site, where the camera is static in the first half of the video and quickly panning in the second half. It contains heavy motion, including motion blur, where the motion of the foreman's face is especially hard to estimate and compensate for. It is nowhere near the remote surveillance use scenario, but is included due to its popularity in the field.

3Standard videos can be found at http://media.xiph.org/video/derf/

(a) Akiyo (b) Container (c) Foreman
Figure 9.1: CIF test video sequences.

Due to the lack of standard test videos with 1080p resolution fitting the criteria of a static camera and low motion, we decided to record a few of our own4: Liseberg, FerryClose and FerryFar. These videos were recorded with a Nikon D5100 DSLR camera at 30 frames per second (FPS)5 in H.264-encoded format. They were later converted to raw uncompressed Y'CbCr 4:2:0 format using MEncoder6.
The Liseberg video (Figure 9.2) is shot outside an amusement park. It features fast vertical motion of an attraction, moderate horizontal motion of a tram, and low motion of pedestrians in the background. Also, a fast-moving bird introduces high-motion noise. FerryClose and FerryFar, shown in Figure 9.3 and Figure 9.4, capture an approaching commuter ferry at different distances. Both videos have fairly high motion in the water, with the water covering a third or more of the frame area.

4The recorded videos are available on http://andreas.isberg.se/edu/masters_thesis/
5The videos were recorded in 29.97 FPS, or 30000/1001 to be even more precise. 30 is however the number specified when choosing frame rate and resolution in the camera.
6MEncoder can be found on http://www.mplayerhq.hu/

Figure 9.2: Liseberg test video (1080p). The tower in the background is an amusement park attraction.
Figure 9.3: FerryClose test video (1080p).
Figure 9.4: FerryFar test video (1080p).

9.2 Test system specification

The specifications of the systems used for testing can be seen in Table 9.1 below. Two different classes of test systems, LAPTOP and DESKTOP, were used when collecting timing results.

PART              LAPTOP           DESKTOP
Processor         Core i7-2720QM   Core i5-750
Physical cores    4                4
Virtual cores     8                4
RAM amount        4 GB             4 GB
RAM specs         1.33 GHz DDR3    1.6 GHz DDR3
Graphics card     Radeon 6750M     GeForce GTX560 Ti
VRAM              1024 MB          1024 MB
Graphics driver   8.812.0.0        275.33
OS                Win7 Pro         Win7 Pro

Table 9.1: Specifications of test systems.

Both systems used processors from Intel, with different numbers of physical and logical processor cores. The DESKTOP system had four physical and four virtual cores, while the processor in LAPTOP also had four physical cores but a total of eight virtual cores thanks to Hyper-Threading (HT). The number of Open Multi-Processing (OpenMP) threads was based on the number of virtual cores. The graphics card in DESKTOP was an Nvidia card which, at the time of writing, could be placed in the high-end category. The laptop had a mid-range graphics chip from AMD. Microsoft Visual Studio 2010 was used to compile the code. The compiler was set to full speed optimization, with link-time code generation and SSE2 code generation turned on, as well as whole-program optimization.

9.3 Subjective quality evaluation

9.3.1 Recorded videos

Looking at a 3x up-conversion, from 10 to 30 FPS (again, this is a rounding of 29.97), of the FerryFar sequence, the different motion estimation algorithms show some clear differences. The Four Step Search (4SS) algorithm produces the best result for this video. A few artefacts appear by the flag in the lower left corner (Figure 9.5d), and some in the water. The non-uniform motion in the water is well handled, and no artefacts are large or persistent enough to be considered particularly disturbing for the viewer. Adaptive Rood Pattern Search (ARPS) produces a few artefacts on the ferry and around the flag. However, the main difference compared to the other algorithms is the degree of artefacts in the water, which is extremely high for ARPS in this video. This can be explained by the "wandering" behaviour of the algorithm. Artefacts in the water produced by the Full Bidirectional Search (FBDS) algorithm are worse than with 4SS but significantly better than with ARPS. Brief artefacts appear on the ferry, and some on a logo in the background (Figure 9.5e).
Full Search (FS) gives similar results to FBDS, with the same kind of artefacts near the logo in the background.

(a) Original (b) Original (c) Original (d) 4SS (e) FBDS (f) ARPS
Figure 9.5: Artefacts in 3x up-conversion from 10 to 30 FPS of the FerryFar sequence, 8x8 extended block size. Note the unintended "shadow" of the flag in (d) compared to the original (a), and the misaligned waves in the logo (e) compared to (b). The water in (f) contains several rows of miscoloured blocks.

The water motion in the FerryClose sequence, when up-converted from 10 to 30 FPS, is also best handled by 4SS and worst by ARPS. Disturbing artefacts, primarily around the ferry's windows, are produced by 4SS (Figure 9.8f). This is not the case for FBDS, where only a few artefacts can be found on the side and front of the ferry, on the thin antenna on top and on the heads of the people on the upper deck (Figure 9.8d). FBDS also shows problems with a bird flying by fast, which appears to flicker (Figure 9.8e). Despite this, FBDS gives the overall best subjective result for this up-conversion factor, as illustrated in the sharpness comparisons in Figure 9.6. The benefits of up-conversion with motion estimation (ME) and motion compensation (MC), compared to simple frame averaging (FA), are shown in Figure 9.7.

(a) FBDS (b) 4SS (c) FA
Figure 9.6: Sharpness comparison, 3x up-conversion from 10 to 30 FPS of the FerryClose sequence. With FBDS the text is sharp and readable, while 4SS gives less text shadow but also less readable text compared to FA.

(a) Original (b) FBDS (c) 4SS (d) FA
Figure 9.7: Sharpness compared to the original frame, 3x up-conversion from 10 to 30 FPS of FerryClose, 8x8 extended block size (see Section 8.2.2). Note the slightly displaced antenna in (b) and (c), and the block artefact in (c). FA is outclassed.

(a) Original (b) Original (c) Original (d) FBDS (e) FBDS (f) 4SS
Figure 9.8: Artefacts in 3x up-conversion from 10 to 30 FPS of the FerryClose sequence, 8x8 extended block size. The head of the person to the left in (a) is mostly gone in (d); the same goes for the bird in (b) when compared to (e). Artefacts produced by 4SS on the side of the ferry can be seen between the windows in (f).

When converting FerryClose from 30 to 60 FPS, however, doubling the original frame rate, ARPS produces far fewer artefacts and 4SS few or none. Artefacts produced by FBDS are limited to single blocks on the side and front of the ferry. For the Liseberg sequence an up-conversion from 10 to 30 FPS gives video quality problems for all ME algorithms. Artefacts produced by ARPS around the tram are particularly heavy (Figure 9.9c), but slightly lighter around the amusement park tower compared to the other algorithms (Figure 9.10c). The limited range of 4SS and FBDS is not enough to find the true motion of the tram (Figures 9.9b and 9.9d). FBDS again cannot handle a fast-moving bird, which appears to flicker.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.9: Artefacts on the front of the tram in 3x up-conversion from 10 to 30 FPS of the Liseberg sequence, 4x4 extended block size.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.10: Artefacts on a high-motion part (the "ring" is falling fast towards the ground) of an amusement park tower, in 3x up-conversion from 10 to 30 FPS of the Liseberg sequence, 4x4 extended block size. The limited range of 4SS and FBDS is probably the reason for the severe artefacts in (b) and (d).
When up-converting from 15 to 30 FPS all algorithms give better results on the Liseberg sequence (Figure 9.11), though ARPS still produces sporadic artefacts. FBDS gives few to no artefacts on the tram and also rivals ARPS for the best results on the vertical motion (Figure 9.12), which at its maximum speed is still too fast for the range of FBDS. When going from 30 to 60 FPS, only ARPS has some limited trouble with the tram motion.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.11: Artefact comparison on the tram, 2x up-conversion from 15 to 30 FPS of the Liseberg sequence, 4x4 extended block size. In this frame ARPS gave fewer artefacts than 4SS, but over the entire sequence the opposite is true.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.12: Artefacts on the tower in the Liseberg sequence, 2x up-conversion from 15 to 30 FPS.

9.3.2 Standard test videos

The motion in the news anchor's face in a 3x up-conversion of the Akiyo test sequence, going from 10 to 30 FPS, is mostly handled well by 4SS, ARPS and FBDS. Problematic ranges of frames often involve facial regions being covered or uncovered as an effect of quick movements of eyes and lips. Figure 9.13 and Figure 9.15 show such events. As indicated by the semi-visible eyes in Figure 9.13e, the news anchor is about to open her eyes. This motion was correctly interpolated by ARPS and FBDS, and with a single block artefact by 4SS. Figure 9.14 is in the middle of vertical face motion. ARPS and FBDS both produce acceptable results, whereas 4SS gives an extra pair of eyebrows. 4SS fares better on frame 137, where FBDS is the worst of the four algorithms compared (Figure 9.15d), making Akiyo one of the few test videos where ARPS gives the best result.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.13: Excerpt of frame 38, 3x up-conversion of the Akiyo sequence, from 10 to 30 FPS.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.14: Excerpt of frame 76, 3x up-conversion of the Akiyo sequence, from 10 to 30 FPS.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.15: Excerpt of frame 137, 3x up-conversion of the Akiyo sequence, from 10 to 30 FPS.

The first part of the Container video sequence, before two birds enter the scene, contains motion so slow that even a 6x up-conversion can give highly acceptable results with 4SS, ARPS, FBDS or even FA (Figure 9.16). Figure 9.17, showing one of the birds, gives an example of how badly such events are handled by the tested algorithms.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.16: Excerpt of frame 111, 6x up-conversion of the Container sequence. 4SS, ARPS and FBDS all give a result with acceptably sharp contours of the ships. FA gives a little less sharpness compared to the others.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.17: Excerpt of frame 262, 3x up-conversion of the Container sequence.

Most parts of the Foreman sequence contain motion too large or complicated for the chosen algorithms to tackle, as illustrated in Figure 9.18. As stated earlier, good results on this video have not been a focus of our efforts, and the video is included primarily for reference due to its high popularity in this field of research.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.18: Excerpt of frame 13, 3x up-conversion of the Foreman sequence.

With the subjective evaluations of all test videos in mind, the smaller block size of 4x4 produces roughly the same amount of artefacts as 8x8.
However, as the artefacts are also smaller, they appear to be less disturbing for the viewer.

9.4 Objective quality evaluation

Most of the peak signal-to-noise ratio (PSNR) results obtained from up-converting the videos indicate that FA performs very well, even though the subjective quality evaluation sometimes suggests otherwise. PSNR uses the pixel differences between two frames, and this difference is small for videos with low motion and low color contrast. This is a major problem with PSNR, and the results should only be treated as a rough estimate. The diagrams below contain average results for each algorithm combined with four different block sizes. They also contain per-frame results for each algorithm using 4x4 extended block size, as this was found to be the best choice in the subjective quality evaluation.

9.4.1 Recorded videos

Structural Similarity (SSIM) was not computed for the 1080p videos because of limitations in the program used for testing. Figures 9.19 to 9.22 show that 4SS performs best of the motion estimation algorithms on FerryFar and FerryClose, similar to the results from the subjective quality evaluation. The PSNR is quite constant, as the videos contain only small amounts of motion.

Figure 9.19: Average PSNR of FerryFar after up-converting it from 10 to 30 FPS.
Figure 9.20: Per-frame PSNR of FerryFar after up-converting it from 10 to 30 FPS.
Figure 9.21: Average PSNR of FerryClose after up-converting it from 10 to 30 FPS.
Figure 9.22: Per-frame PSNR of FerryClose after up-converting it from 10 to 30 FPS. 4SS performs slightly better than the other algorithms.

Up-converting Liseberg from 10 to 30 FPS gives reasonably good objective results for all algorithms, as shown in Figures 9.23 and 9.24. These results are however an effect of the mostly static scene. Comparing these figures to the subjective quality evaluation (Section 9.3.1) confirms that high PSNR values do not necessarily correspond to high quality in the few parts of the video frame where there is motion of interest to a human observer.

Figure 9.23: Average PSNR of Liseberg after up-converting it from 10 to 30 FPS.
Figure 9.24: Per-frame PSNR of Liseberg after up-converting it from 10 to 30 FPS.

All algorithms perform slightly better (by a few decibels) when up-converting Liseberg from 15 to 30 FPS. FBDS performs excellently, hitting over 40 decibels on average.

Figure 9.25: Average PSNR of Liseberg after up-converting it from 15 to 30 FPS.
Figure 9.26: Per-frame PSNR of Liseberg after up-converting it from 15 to 30 FPS.

9.4.2 Standard videos

All algorithms perform excellently on Akiyo, hitting over 40 decibels on average, as shown in Figure 9.27.

Figure 9.27: Average PSNR of Akiyo after up-converting it from 10 to 30 FPS.

Figures 9.28 and 9.29 show how the result of FA drops low for many frames. This is likely because of the color contrast between the light skin and the dark areas such as the hair, eyebrows and eyes.

Figure 9.28: Per-frame PSNR of Akiyo after up-converting it from 10 to 30 FPS. Notice how FA drops very low for certain frames.
Figure 9.29: Per-frame SSIM of Akiyo after up-converting it from 10 to 30 FPS. Notice how FA drops very low for certain frames.

The algorithms also perform excellently on Container, as shown in Figure 9.30.

Figure 9.30: Average PSNR of Container after up-converting it from 10 to 30 FPS.

Figures 9.31 and 9.32 show how the results of all algorithms drop low between frames 245 and 276.
This is due to the fact that none of the algorithms can properly handle the high motion of the two birds.

Figure 9.31: Per-frame PSNR of Container after up-converting it from 10 to 30 FPS. FBDS performs excellently, hitting well over 40 decibels on average, on most frames except when the two birds fly by.
Figure 9.32: Per-frame SSIM of Container after up-converting it from 10 to 30 FPS.

As expected, none of the algorithms perform well on Foreman, as Figures 9.33 to 9.35 show.

Figure 9.33: Average PSNR of Foreman after up-converting it from 10 to 30 FPS.
Figure 9.34: Per-frame PSNR of Foreman after up-converting it from 10 to 30 FPS.
Figure 9.35: Per-frame SSIM of Foreman after up-converting it from 10 to 30 FPS. Note that the values on the y-axis range from 0.3 to 1.0 instead of 0.960 to 1.0, which was the case for the other standard videos. FA is the worst algorithm for this video.

9.5 Real-time capability

The diagrams below contain per-frame timing results for 4SS, ARPS and FBDS, with 4x4 extended block size, as these were found to give the best quality.

9.5.1 Motion estimation time measurements

Enabling optimizations could in theory give huge improvements. OpenMP could increase the performance by a factor equal to the number of virtual cores in the system, and the Sum of Absolute Differences (SAD) assembly optimization could increase the performance up to 16 times (see Section 8.2.1). Table 9.2 below shows how they actually affect the ME runtime. The optimizations are neither algorithm- nor video-type-dependent, so the table shows the average per-frame time measurements for 4SS using 4x4 extended block size on the recorded 1080p videos. Enabling OpenMP on LAPTOP (8 virtual cores) gives a speed increase factor of 3.3, while the same optimization on DESKTOP (4 virtual cores) only increases the speed 2.4 times. Enabling the SAD assembly optimization gives a speed increase factor of 11.5 on LAPTOP, and 10.8 on DESKTOP (Table 9.3).

                   LAPTOP     DESKTOP
No optimization    680.6 ms   721.3 ms
OpenMP only        205.6 ms   301.3 ms
SSE only           59.3 ms    66.6 ms
OpenMP and SSE     37.3 ms    43.3 ms

Table 9.2: Average time for different optimizations on ME using 4SS with 4x4 extended block size.

                   LAPTOP   DESKTOP
No optimization    1.0x     1.0x
OpenMP only        3.3x     2.4x
SSE only           11.5x    10.8x
OpenMP and SSE     18.2x    16.7x

Table 9.3: Relative time improvements with optimizations, from the data in Table 9.2.

All algorithms, with 4x4 extended block size and all optimizations enabled, are extremely fast on the standard CIF videos, as Table 9.4 shows.

        LAPTOP   DESKTOP
4SS     1.2 ms   1.2 ms
ARPS    1.1 ms   1.1 ms
FBDS    3 ms     2.7 ms

Table 9.4: Average time for ME for the standard CIF videos using 4x4 extended block size and optimizations.

Figures 9.36 to 9.41 show the performance of the ME algorithms, using 4x4 extended block size and optimizations enabled, on the recorded 1080p videos. Some of the charts show spikes where processing times increase by 100 ms or more for some frames. This is not caused by instantaneous changes in the videos. A likely explanation is the process scheduling of the underlying operating system giving priority to other processes. Setting a higher priority on the main frame rate up-converter thread should remedy this. The issue does however highlight the inherent difficulty of guaranteeing that a maximum processing time limit is not exceeded.

Figure 9.36: ME on FerryFar using DESKTOP after up-converting it from 10 to 30 FPS.
Figure 9.37: ME on FerryFar using LAPTOP after up-converting it from 10 to 30 FPS.
Figure 9.38: ME on FerryClose using DESKTOP after up-converting it from 10 to 30 FPS.
Figure 9.39: ME on FerryClose using LAPTOP after up-converting it from 10 to 30 FPS.
Figure 9.40: ME on Liseberg using DESKTOP after up-converting it from 10 to 30 FPS.
Figure 9.41: ME on Liseberg using LAPTOP after up-converting it from 10 to 30 FPS.

9.5.2 Motion compensation time measurements

Table 9.5 below shows the MC processing times for the CPU- and OpenCL-based BDMC implementations. The test was performed on Liseberg using 4SS and 4x4 extended block size. The results show a big performance difference between the CPUs and GPUs in each test system. Looking only at processing times, the CPU-based implementation is a better choice for the LAPTOP test system. However, the GPU-based implementation gives increased quality thanks to the bilinear interpolation. The DESKTOP test system benefits hugely from its high-end graphics card.

               LAPTOP    DESKTOP
CPU            19.7 ms   27.5 ms
OpenCL (GPU)   25.5 ms   14.9 ms

Table 9.5: Comparison of CPU- and OpenCL-based BDMC implementations. The test was performed on Liseberg using 4SS and 4x4 extended block size.

Based on the results above, the systems benefit most from the OpenCL-based BDMC implementation. Tables 9.6 and 9.7 below show how OpenMP affects the OpenCL-based BDMC implementation on the two test systems. Differences in motion between videos give only small variations in processing times. The DESKTOP system combined with OpenMP is fast enough for a true real-time implementation (see Section 2.1.2), where each processing step needs to be below 16.7 ms. The effect of OpenMP on the otherwise GPU-based algorithm comes from the fact that chroma is still processed on the CPU (see Section 8.3).

                  LAPTOP    DESKTOP
With OpenMP       25.8 ms   13.1 ms
Without OpenMP    37.6 ms   25.2 ms

Table 9.6: Average time for MC on the recorded 1080p videos.

                  LAPTOP   DESKTOP
With OpenMP       3 ms     0.9 ms
Without OpenMP    6.7 ms   1.7 ms

Table 9.7: Average time for MC on the standard CIF videos.

9.5.3 Total processing times

The total processing time, taking both ME and MC into account, can be used to determine if the real-time goals set in Section 2.1.2 can be accomplished. All videos used in testing were first down-converted by a factor of three and then up-converted by the same factor. This factor means the total processing time must not exceed 50 milliseconds, marked as LIMIT_HIGH in the charts. The total processing time is low when the tool is used on CIF videos. Figure 9.42 shows that all algorithms take just a few milliseconds on Akiyo. The result is similar for the other CIF videos.

Figure 9.42: Total processing time (ME and MC) on Akiyo using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.

On 1080p videos, the total processing time is much higher. Figures 9.43 to 9.46 show that none of the algorithms fall below the mark for FerryFar and FerryClose.

Figure 9.43: Total processing time (ME and MC) on FerryFar using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.44: Total processing time (ME and MC) on FerryFar using LAPTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.45: Total processing time (ME and MC) on FerryClose using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.46: Total processing time (ME and MC) on FerryClose using LAPTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.

Both test systems using 4SS or ARPS in combination with BDMC fall below the mark on Liseberg, as Figures 9.47 and 9.48 show.

Figure 9.47: Total processing time (ME and MC) on Liseberg using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.48: Total processing time (ME and MC) on Liseberg using LAPTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.

FBDS falls below the 50 ms mark, on all recorded 1080p videos, if the block size is changed from 4x4 extended to 8x8 extended on the DESKTOP test system. This is shown in Figures 9.49 to 9.51.

Figure 9.49: Total processing time (ME and MC) on FerryFar using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC. The block size used is 8x8 extended.
Figure 9.50: Total processing time (ME and MC) on FerryClose using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC. The block size used is 8x8 extended.
Figure 9.51: Total processing time (ME and MC) on Liseberg using DESKTOP after up-converting it from 10 to 30 FPS. Th