Frame rate up-conversion of real-time high-definition remote surveillance video

Andreas Isberg (andreas@isberg.se)
Master of Science in Software Engineering and Technology

Johan Jostell (johan.jostell@gmail.com)
Master of Science in Computer Science: Algorithms, Languages and Logic

Chalmers University of Technology
Department of Computer Science and Engineering
Göteborg, Sweden, April 2012

The Author grants to Chalmers University of Technology and University of Gothenburg the non-exclusive right to publish the Work electronically and in a non-commercial purpose make it accessible on the Internet. The Author warrants that he/she is the author of the Work, and warrants that the Work does not contain text, pictures or other material that violates copyright law.

The Author shall, when transferring the rights of the Work to a third party (for example a publisher or a company), acknowledge the third party about this agreement. If the Author has signed a copyright agreement with a third party regarding the Work, the Author warrants hereby that he/she has obtained any necessary permission from this third party to let Chalmers University of Technology and University of Gothenburg store the Work electronically and make it accessible on the Internet.

Frame rate up-conversion of real-time high-definition remote surveillance video
© Andreas J. Isberg, April 2012.
© Johan C. Jostell, April 2012.

Examiner: Devdatt Dubhashi
Supervisors: Vinay Jethava, Henrik Engström

Chalmers University of Technology
Department of Computer Science and Engineering
SE-412 96 Göteborg
Sweden
Telephone +46 (0)31-772 1000

Cover: Frame rate up-conversion (see Chapter 3) of the Liseberg video (see Figure 9.2).

Department of Computer Science and Engineering
Göteborg, Sweden, April 2012

Abstract

Many surveillance and monitoring systems capture video using static cameras at relatively low frame rates. Low frame rates have benefits such as lower storage requirements and lower bandwidth use, and these concerns may be even greater for high-definition video capture and remote surveillance. However, a reduced frame rate also makes the video appear less smooth, causing increased strain for the viewer. It is therefore desirable to increase the frame rate without increasing bandwidth use or changing equipment.

This thesis presents the design and implementation of a tool for performing frame rate up-conversion in real-time. The tool makes use of the GPUs and multi-core CPUs found in modern computer systems. Techniques and frameworks such as OpenMP, SSE and OpenCL are utilized to make full use of the system's capabilities.

Frame rate up-conversion is performed in two steps: motion estimation and motion compensation. For motion estimation, a number of block matching algorithms were implemented and evaluated. We present in this thesis our version of an exhaustive bidirectional search algorithm for block matching, called Full Bidirectional Search (FBDS). It uses zero motion prejudgement inspired by Adaptive Rood Pattern Search (ARPS). The tool performs bidirectional motion compensation (BDMC) on the GPU using OpenCL.

Testing was performed on three high-definition videos recorded by us, fitting the remote surveillance use scenario. Results show increased quality compared to simple frame averaging, although with occasional block artefacts. The tool is capable of up-converting the recorded videos by a factor of three within a timespan of less than 50 ms.
Hence, it is viable for use in a high-definition remote surveillance system.

Sammanfattning

Many surveillance systems use static cameras with a relatively low frame rate to record video. A low frame rate has certain benefits, such as lower storage requirements and lower bandwidth use in remote surveillance systems. For high-definition video these benefits may be even more important. Video recorded at a low frame rate can, however, appear jerky, which may cause increased strain for the person watching it. It is therefore desirable to increase the frame rate without increasing bandwidth use or changing equipment.

This thesis presents the design and implementation of a tool for performing frame rate up-conversion in real-time. The tool makes use of the graphics cards and multi-core processors available in modern computers. Techniques and frameworks such as OpenMP, SSE and OpenCL are used to exploit the systems' full capacity.

Frame rate up-conversion is performed in two steps: motion estimation and motion compensation. A number of block matching algorithms were implemented and evaluated for motion estimation. We present in this report our own version of an exhaustive bidirectional search algorithm for block matching, called Full Bidirectional Search (FBDS), with a prejudgement of static image areas inspired by Adaptive Rood Pattern Search (ARPS). Bidirectional motion compensation (BDMC) was performed on the graphics card using OpenCL.

Testing was performed on three self-recorded high-definition sequences that fit the remote surveillance use case. The results show that the video quality increases compared to simpler techniques, although some block artefacts occur. The tool is capable of up-converting the recorded video sequences by a factor of three within a timespan of less than 50 ms. This shows that the tool is suitable for use in high-definition remote surveillance systems.

Acknowledgments

We would like to thank our supervisors Vinay Jethava and Henrik Engström for support and advice during the writing of this thesis. Vinay has been a great help in sifting through the relevant literature and providing insightful comments during the development. Henrik's expertise in x86 assembly has been a large contribution to our optimization efforts. We would also like to thank Erik Levin, who was kind enough to let us borrow his camera for filming test videos. Finally, we would like to thank our friends and families for all their support.

Johan and Andreas

Table of Contents

1 Introduction
  1.1 Frame rate conversion
  1.2 Thesis contributions
  1.3 Disposition
2 Problem description
  2.1 System specifications
    2.1.1 Software-based implementation
    2.1.2 Definition of real-time
    2.1.3 High-definition
    2.1.4 Static scenes
    2.1.5 Remote objects
    2.1.6 High reliability
  2.2 Scope and delimitations
3 Frame rate up-conversion
  3.1 Frame repetition
  3.2 Frame averaging
  3.3 Advanced methods
  3.4 Existing research
4 Motion estimation
  4.1 Block matching algorithms
    4.1.1 Full search
    4.1.2 Three Step Search
    4.1.3 New Three Step Search
    4.1.4 Four Step Search
    4.1.5 Diamond Search
    4.1.6 Adaptive Rood Pattern Search
    4.1.7 Full Bidirectional Search
  4.2 Block similarity functions
    4.2.1 Sum of Absolute Differences
    4.2.2 Mean Absolute Difference
    4.2.3 Sum of Squared Differences
    4.2.4 Mean Squared Error
  4.3 Bitstream motion vectors
5 Motion compensation
  5.1 Direct motion compensation
  5.2 Bidirectional motion compensation
6 OpenCL
  6.1 Processing flow
  6.2 Terminology
7 Evaluation methods
  7.1 Objective quality evaluation
    7.1.1 Peak signal-to-noise ratio
    7.1.2 Structural similarity
    7.1.3 Other methods
  7.2 Subjective quality evaluation
    7.2.1 Double-stimulus impairment scale
    7.2.2 Double-stimulus continuous quality-scale
    7.2.3 Subjective assessment method for video quality
  7.3 Choice of quality evaluation methods
8 Implementation
  8.1 Naive methods
  8.2 Motion estimation
    8.2.1 SAD assembly optimization
    8.2.2 Overlapped block motion estimation
    8.2.3 Zero motion prejudgement
  8.3 Motion compensation
9 Results
  9.1 Test videos
  9.2 Test system specification
  9.3 Subjective quality evaluation
    9.3.1 Recorded videos
    9.3.2 Standard test videos
  9.4 Objective quality evaluation
    9.4.1 Recorded videos
    9.4.2 Standard videos
  9.5 Real-time capability
    9.5.1 Motion estimation time measurements
    9.5.2 Motion compensation time measurements
    9.5.3 Total processing times
10 Conclusions
  10.1 Future work
Bibliography
Glossary
Acronyms

List of Figures

2.1 Video path from source to screen with frame rate up-conversion
3.1 The frame rate up-converter illustrated as a black box
3.2 Up-conversion using frame averaging
3.3 Up-converting a high-motion video using frame averaging
4.1 Block matching
4.2 The three steps of TSS
4.3 New Three Step Search
4.4 The four steps of 4SS
4.5 Diamond Search
4.6 Adaptive Rood Pattern Search
4.7 Full Bidirectional Search
5.1 Direct motion compensation
5.2 Unfilled pixels when using direct motion compensation
5.3 Bidirectional motion compensation
7.1 Full-reference objective quality evaluation
8.1 Up-conversion pseudo code
8.2 Bilinear interpolation
9.1 CIF test video sequences
9.2 Liseberg test video
9.3 FerryClose test video
9.4 FerryFar test video
9.5 Artefacts in 3x up-conversion of FerryFar sequence
9.6 Sharpness comparison, 3x up-conversion of FerryClose
9.7 Sharpness compared to original, 3x up-conversion of FerryClose
9.8 Artefacts in 3x up-conversion of FerryClose
9.9 Artefacts on tram, 3x up-conversion of Liseberg
9.10 Artefacts on tower, 3x up-conversion of Liseberg
9.11 Artefacts on tram, 2x up-conversion of Liseberg
9.12 Artefacts on tower, 2x up-conversion of Liseberg
9.13 Face artefacts, frame 38 of Akiyo
9.14 Face artefacts, frame 76 of Akiyo
9.15 Face artefacts, frame 137 of Akiyo
9.16 6x up-conversion of Container, frame 111
9.17 Bird artefacts, frame 262 of Container
9.18 Face artefacts, frame 13 of Foreman
9.19 Average PSNR of FerryFar
9.20 Per-frame PSNR of FerryFar
9.21 Average PSNR of FerryClose
9.22 Per-frame PSNR of FerryClose
9.23 Average PSNR of Liseberg (3x)
9.24 Per-frame PSNR of Liseberg (3x)
9.25 Average PSNR of Liseberg (2x)
9.26 Per-frame PSNR of Liseberg (2x)
9.27 Average PSNR of Akiyo
9.28 Per-frame PSNR of Akiyo
9.29 Per-frame SSIM of Akiyo
9.30 Average PSNR of Container
9.31 Per-frame PSNR of Container
9.32 Per-frame SSIM of Container
9.33 Average PSNR of Foreman
9.34 Per-frame PSNR of Foreman
9.35 Per-frame SSIM of Foreman
9.36 ME on FerryFar using DESKTOP
9.37 ME on FerryFar using LAPTOP
9.38 ME on FerryClose using DESKTOP
9.39 ME on FerryClose using LAPTOP
9.40 ME on Liseberg using DESKTOP
9.41 ME on Liseberg using LAPTOP
9.42 Total processing time on Akiyo using DESKTOP
9.43 Total processing time on FerryFar using DESKTOP
9.44 Total processing time on FerryFar using LAPTOP
9.45 Total processing time on FerryClose using DESKTOP
9.46 Total processing time on FerryClose using LAPTOP
9.47 Total processing time on Liseberg using DESKTOP
9.48 Total processing time on Liseberg using LAPTOP
9.49 Total processing time on FerryFar (8x8E) using DESKTOP
9.50 Total processing time on FerryClose (8x8E) using DESKTOP
9.51 Total processing time on Liseberg (8x8E) using DESKTOP
9.52 Total processing time on FerryFar (8x8E) using LAPTOP
9.53 Total processing time on FerryClose (8x8E) using LAPTOP
9.54 Total processing time on Liseberg (8x8E) using LAPTOP

List of Tables

4.1 List of notations used in this chapter
9.1 Specifications of test systems
9.2 Average time for different optimizations on ME
9.3 Relative time improvements with optimization
9.4 Average time for ME for the standard videos
9.5 Comparison of CPU- and OpenCL-based BDMC implementations
9.6 Average time for MC on the recorded videos
9.7 Average time for MC on the standard videos

1. Introduction

1.1 Frame rate conversion

Many surveillance and monitoring videos are captured using static cameras with a relatively low frame rate. The low frame rate of the videos is economical both in terms of the monitoring equipment and the storage requirements. Other factors may influence the necessity of keeping a low frame rate, such as limited transmission bandwidth for a remote surveillance solution. Storage space may be an issue for high-definition (HD) resolution video capture. However, a low frame rate poses a challenge in terms of the manual inspection of such video feeds, since low frame rate videos frequently exhibit non-smooth motion artefacts. This can lead to increased strain for the viewer. Therefore, a key challenge towards improving the user experience lies in successfully obtaining a smoother video by increasing the frame rate.

In this thesis we present a solution aimed at removing these issues for remote surveillance HD video feeds, using a technique known as frame rate up-conversion.

1.2 Thesis contributions

The main contributions of this thesis are:

• The design and implementation of a frame rate up-conversion tool (FRUCT), specialized for use with remote surveillance video.
• A comparative study of some well-known, published unidirectional block matching algorithms (BMAs) and their suitability for use with frame rate up-conversion of remote surveillance video.

• Research and data on how the Open Computing Language (OpenCL) can be used for GPU-acceleration of frame rate up-conversion.

• Full Bidirectional Search (FBDS), our version of an exhaustive search BMA, using ideas from Adaptive Rood Pattern Search (ARPS) for early termination of the search in static areas of the video.

1.3 Disposition

The thesis is organized as follows:

• Chapter 2 presents the problem description and goals, as well as the scope and delimitations.
• Chapter 3 describes the theory behind both naive and more advanced ways of performing frame rate up-conversion. A short summary of some of the existing research in the area is also given.
• Chapter 4 introduces a number of motion estimation algorithms used in this thesis.
• Chapter 5 gives details on the motion compensation algorithms used in this thesis.
• Chapter 6 familiarizes the reader with the terminology and processing flow of OpenCL.
• Chapter 7 contains information about the evaluation methods by which the results can be measured.
• Chapter 8 gives a detailed look at our implementation of a frame rate up-converter.
• Chapter 9 presents the results of frame rate up-conversion.
• Chapter 10 sums up the thesis with conclusions and notes on possible future work.

2. Problem description

2.1 System specifications

The focus of this thesis is the design and implementation of a frame rate up-conversion tool (FRUCT). The up-converter will be part of a larger system, as can be seen in Figure 2.1. It will work as a "middle-man" between the decoder and renderer software. Input to the FRUCT is raw 1080p video at 20 frames per second (FPS), corresponding to the intended remote surveillance scenario. Output should be up-converted video in the same format at 60 FPS. In other words, two interpolated frames should be inserted for every original frame. Performance and quality should be high enough for the tool to be successfully applied in real-time.

Figure 2.1: Video path from source to screen with the FRUCT.

The area of application, remote surveillance with the settings described above, adds a number of items to the specification. These may not be as important for a general-purpose implementation of a FRUCT.

2.1.1 Software-based implementation

The FRUCT should be implemented in software, not in hardware. Many hardware-based solutions exist today, especially for use in real-time and for high-definition (HD) video, e.g. by Lee & Nguyen (2010). Software-based solutions are, however, easier to maintain and can more easily interact with other parts of a software-based remote surveillance system.

2.1.2 Definition of real-time

The FRUCT should be capable of up-converting a received video stream from 20 FPS to 60 FPS (an up-conversion factor of 3) in real-time. For true real-time behaviour, each frame constructed by the FRUCT must be completed within 1/60 of a second, roughly 16.7 ms.

The real-time demand in the case of remote surveillance also imposes a restriction on the delay between a video frame being captured by the camera and the frame being displayed on-screen. This delay should ideally be kept as low as possible, and in this specific case should be no more than one (1) second. The variable transmission delay makes it additionally difficult to put a hard limit on the maximum delay the FRUCT may introduce in the system.
The FRUCT may require some initial buffering or setup routines, introducing a delay before the first received frame is written to the output channel. A solution involving a relaxation of the real-time demand is to produce frames to be placed in an output buffer of a predetermined size, typically equal to the up-conversion factor. For a buffer size of n and a frame rate f, the time limit is then equal to n/f, which for n = 3 and f = 60 gives 50 ms. In other words, we have 50 ms to fill the buffer with three frames, where one is an original frame and two are constructed by the FRUCT. Since the original frame can be copied directly to the output buffer, most of the 50 ms can in practice be used for the construction of the two new frames.

2.1.3 High-definition

Video received by the FRUCT will be in HD, 1080p. Video capture at such high resolutions produces large amounts of data to be encoded, transmitted and decoded. A majority of the FRUCTs described in the existing literature have primarily been tested on low-resolution video formats, with less focus on real-time capability.

2.1.4 Static scenes

The intended usage scenario for the FRUCT is stationary cameras and mostly static outdoor scenery. A more general-purpose FRUCT must be able to handle video motion as well as camera motion (panning, tilting, etc.), the latter being known as global motion. Cuts and scene changes never happen for continuously filming stationary cameras. There are, however, events such as sudden changes in cloud coverage or weather which may affect a large portion of the visible area, and these have to be taken into account.

2.1.5 Remote objects

Objects found in video from the intended usage scenario are a sizeable distance from the camera; from tens of meters up to several hundred meters. The distance means that they will appear in a relatively small area of the video frame. This presents both possibilities and problems: the motion of these objects from frame to frame, in terms of pixels on the screen, will be fairly small. However, depending on which method is used for motion estimation (ME), there might be problems finding the true motion of the image areas covered by the objects.

2.1.6 High reliability

For the video to be a reliable source of information, especially vital if it is to be used for security-critical surveillance, the processed video should stay as close to the original as possible. The frames used for interpolation are not to be manipulated in any way, and shall be shown "as-is". Visible artefacts in the video should preferably be kept to a minimum, particularly in regions of interest.

2.2 Scope and delimitations

In order to keep the focus on areas of importance, the following decisions were made:

• The implementation does not have to be a complete product; a working prototype suffices.

• The prototype should be implemented in software, not hardware. Hardware solutions have been proven successful. However, a well implemented software solution should still be able to produce good results, with the benefits of being more easily implemented and maintained.

• The prototype should focus on high performance and quality for small moving objects (far away, thus having small motion), rather than large objects close to the camera. This focus is due to the intended use for remote surveillance (remote in both senses: the video is streamed to a remote location, and the objects under surveillance are a sizeable distance from the cameras).
• The prototype should not handle any encoding, decoding or rendering, except for possibly obtaining motion vectors from the H.264/MPEG-4 AVC (H.264) bitstream.

• The prototype should only be used with video from stationary cameras, as opposed to a solution capable of up-converting any kind of video.

• The prototype should work with 1080p video resolution (or lower), in Y'CbCr 4:2:0 color space (raw, uncompressed video).

3. Frame rate up-conversion

Increasing the frame rate, i.e. up-conversion, is done to enhance the visual experience of video with a low frame rate. Video at 30 frames per second (FPS) is clearly smoother than video at 20 FPS, especially for video with much motion. Up-conversion is performed by adding new frames to a video, constructed from existing ones, as Figure 3.1 illustrates. Much research has been put into doing this in a fast and efficient manner.

Figure 3.1: The frame rate up-converter illustrated as a black box. Input is a source video and output is an up-converted video.

Existing research in the area falls under two broad categories of solutions: hardware-based and software-based. Hardware-based solutions using Field-Programmable Gate Arrays (FPGAs) perform well for real-time 1080p video (Lee & Nguyen 2010). Such solutions are however beyond the scope of this thesis.

3.1 Frame repetition

The simplest way of increasing the frame rate is by repeating every frame F a total of k times, where k is the up-conversion factor. Equation 3.1 below describes frame repetition (FR), where 1 ≤ j ≤ k, j ∈ Z+ and t is a time index.

F_{t-\frac{j}{k}} = F_{t-1}   (3.1)

For example, F_{t-\frac{1}{2}} is the interpolated frame between frames F_{t-1} and F_t. This method is extremely fast but gives a poor visual experience.

3.2 Frame averaging

A slightly better way to increase the frame rate is by interpolating all pixel values of a new frame from the two neighbouring frames. Figure 3.2 below illustrates frame averaging (FA).

Figure 3.2: Up-conversion using FA. Every new frame (non-striped) is averaged from the two neighbouring frames (striped).

Let P_t(x, y) be the value of a pixel with coordinates (x, y) in frame f_t, t being a time index. Then for all (x, y) inside the frame boundaries of f_t, with 1 ≤ j < k and j ∈ Z+, the pixel values of an intermediate interpolated frame f_{t-\frac{j}{k}} are given by Equation 3.2.

P_{t-\frac{j}{k}}(x, y) = \frac{j}{k} P_{t-1}(x, y) + \left(1 - \frac{j}{k}\right) P_t(x, y)   (3.2)

This method is very fast, and gives a smoother visual experience than FR. It is however not suitable for high-motion video, as seen in Figure 3.3. Smoothness is gained at the loss of sharpness.

Figure 3.3: Up-converting a high-motion video using frame averaging. This video is the Soccer standard benchmarking video (standard test videos can be found at http://media.xiph.org/video/derf/).
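As a concrete illustration of Equation 3.2, the sketch below interpolates one frame by frame averaging, assuming 8-bit luma planes stored as contiguous arrays; the function name and data layout are ours, chosen for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Frame averaging (Equation 3.2): interpolate frame f_{t-j/k} from the two
// neighbouring original frames f_{t-1} and f_t. Both inputs are assumed to be
// 8-bit luma planes of identical size, stored row by row.
std::vector<uint8_t> averageFrame(const std::vector<uint8_t>& prev,   // f_{t-1}
                                  const std::vector<uint8_t>& next,   // f_t
                                  int j, int k)                       // 1 <= j < k
{
    std::vector<uint8_t> out(prev.size());
    const double wPrev = static_cast<double>(j) / k;   // weight of f_{t-1}
    const double wNext = 1.0 - wPrev;                  // weight of f_t
    for (std::size_t i = 0; i < prev.size(); ++i)
        out[i] = static_cast<uint8_t>(wPrev * prev[i] + wNext * next[i] + 0.5);
    return out;
}
```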
3.3 Advanced methods

The main challenge in frame rate up-conversion is how to efficiently and correctly perform motion estimation (ME) and produce the intermediate frames using motion compensation (MC). Motion-compensated frame interpolation is a term often used to describe this process.

There are many existing algorithms for motion estimation. Approximating the optical flow, the motion of individual pixels from one frame to another, is one way. However, it has traditionally not been used in video encoding, for two reasons: it is a relatively slow method, and encoding the motion of individual pixels would require too much data to be feasible. Instead, methods based on the movements of rectangular blocks of pixels are often used. Since block matching is the de facto standard family of motion estimation algorithms, fast algorithms have been developed over the years. Some of the block matching algorithms (BMAs), of which a few were selected for implementation in this thesis, are presented in Chapter 4.

Once the video motion has been estimated, other algorithms may be used to utilize the so-called motion vectors of each block to produce a sharp intermediate frame. Algorithms for this are presented in Chapter 5. In the following section we give a brief overview of the research that has been done in the area of frame rate up-conversion.

3.4 Existing research

ME is often done with various BMAs to reduce computational complexity. Some form of BMA is performed in the encoders of all video coding standards, e.g. H.264/MPEG-4 AVC (H.264) (Richardson 2010). The resulting motion vectors can be extracted and decoded from the bitstream at the receiving end to reduce or remove the need for ME when performing frame rate up-conversion (Chen et al. 1998, Sasai et al. 2004). In order to detect and account for abrupt illumination changes, which can cause faulty motion vectors during BMA, histogram-based methods have been developed (Thaipanich et al. 2009). By using an increased temporal window in BMA the accuracy of ME can be increased (Kang et al. 2008), though at the price of processing time.

A very popular post-processing step on the Motion Vector Field (MVF) obtained from ME is median filtering (Zhai et al. 2005, Choi et al. 2006, Gan et al. 2007, Luessi & Katsaggelos 2009). This step can also be performed in hardware (Tasdizen & Hamzaoglu 2010). The filtering "smooths" the MVF, removing outliers. The problem of overlapping motion-compensated blocks is often solved by employing Overlapped Block Motion Compensation (OBMC) techniques (Lee et al. 2003, Zhai et al. 2005, Choi et al. 2006). An alternative method of handling occlusion is, for example, using the divergence of the MVF (Hong 2009).

4. Motion estimation

There are several different ways to estimate motion between frames, and many models which may be suitable to describe this motion. The model most commonly used is that of linear motion with constant acceleration. This is a reasonable assumption provided the frame rate is not too low, since the frame-to-frame coherence is usually high.

Section 4.1 below presents a way to perform motion estimation (ME) by using a block matching algorithm (BMA). Image blocks in temporally adjacent frames are matched, as illustrated in Figure 4.1, using a block similarity function (see Section 4.2). Section 4.3 presents another way to estimate motion, by using pre-calculated motion vectors from encoded bitstreams.

Figure 4.1: Block matching between a block in the current frame and blocks in the reference frame. The best match is found in the lower-right corner. The search area is used to limit the number of blocks matched.

4.1 Block matching algorithms

In block matching algorithms (BMAs) a frame is divided into a set, B, of non-overlapping rectangular image blocks. An individual block b ∈ B is matched against blocks in a temporally adjacent reference frame. The best match found for a block b, using some appropriate similarity function, is used to form a motion vector for the block. The motion vectors are used in motion compensation (MC), described in Chapter 5.
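The outer structure shared by the BMAs in this chapter can be sketched as follows; the names and the callback-based decomposition are ours and are only meant to illustrate how one motion vector is produced per block. Since the blocks are independent (except in ARPS, Section 4.1.6), the outer loops can be parallelized with e.g. OpenMP.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

struct MotionVector { int x; int y; };

// A block matcher takes the current and reference luma planes, the frame
// dimensions and the top-left corner of one block, and returns the motion
// vector of the best match it found (e.g. TSS, 4SS, ARPS or FBDS).
using BlockMatcher = std::function<MotionVector(const uint8_t* cur, const uint8_t* ref,
                                                int width, int height,
                                                int blockX, int blockY, int blockSize)>;

// Outer loop common to the block matching algorithms: divide the frame into a
// set B of non-overlapping blocks and estimate one motion vector per block.
std::vector<MotionVector> estimateMotion(const uint8_t* cur, const uint8_t* ref,
                                         int width, int height, int blockSize,
                                         const BlockMatcher& match)
{
    std::vector<MotionVector> mvf;                       // motion vector field
    for (int y = 0; y + blockSize <= height; y += blockSize)
        for (int x = 0; x + blockSize <= width; x += blockSize)
            mvf.push_back(match(cur, ref, width, height, x, y, blockSize));
    return mvf;
}
```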
The most common block sizes used are 4x4, 8x8 and 16x16 pixels. This is because they are common divisors of the most well-known resolutions, such as the Common Intermediate Format (CIF) and 720p. The 1080p frame height is however not divisible by 16, and this must be taken into consideration during implementation. Matching blocks can be computationally expensive, and different BMAs try to reduce the number of block matching operations required to find the best match. All BMAs presented below, except for Adaptive Rood Pattern Search (ARPS) (Nie & Ma 2002), are fully parallelizable, as all blocks within a frame are independent of each other.

α: length of the rood arm
b: an image block
b_H: height of a block in pixels
b_W: width of a block in pixels
H: search area height
MV_x: horizontal component of a block motion vector
MV_y: vertical component of a block motion vector
n: total number of blocks matched for one block
s_i: number of block matching operations in step i of an algorithm
t: time index
W: search area width
Z+: positive integers (1, 2, 3, ...)

Table 4.1: List of notations used in this chapter.

4.1.1 Full search

Full Search (FS) involves finding the best match for a reference block against all possible block locations within a specified search window. The number of blocks matched, n, for blocks of width b_W and height b_H, and a search window of width W and height H (where W ≥ b_W and H ≥ b_H), can be calculated using Equation 4.1 below.

n = (W - b_W + 1)(H - b_H + 1)   (4.1)

For a typical setting of b_W = b_H = 16 and W = H = 30, a total of 225 blocks are matched. Matching a block against an entire frame is possible, but not practically feasible, especially for high frame resolutions. For example, matching a block against an entire frame using b_W = b_H = 16, W = 1920 and H = 1080, a total of 2,028,825 block matching operations would be required.

4.1.2 Three Step Search

One of the very first BMAs was the Three Step Search (TSS) (Koga et al. 1981). This algorithm has influenced many other algorithms, such as New Three Step Search (NTSS) and Four Step Search (4SS). TSS reduces the number of search operations, compared to FS, by iteratively locating the best match in three steps.

• Step one: Match all positions, including the center position, at a distance of four pixels away from the center. This step matches nine blocks, i.e. s_1 = 9.

• Step two: Center the search position around the best match found. Match all positions at a distance of two pixels away from the center. This step matches eight blocks, i.e. s_2 = 8.

• Step three: Center the search position around the best match found. Match all positions at a distance of one pixel away from the center. This step matches eight blocks, i.e. s_3 = 8. The best match found in this step is the match returned by the algorithm.

The algorithm always performs 25 block matching operations (Equation 4.2). Figure 4.2 below illustrates an example run of TSS.

n = s_1 + s_2 + s_3 = 25   (4.2)

TSS reduces the number of blocks matched compared to full search. If b_W = b_H = 16 and W = H = 30 are used for full search, TSS requires nine times fewer block matching operations. Barjatya (2004) argues that one disadvantage of TSS is that it can miss small motion due to its search pattern.

Figure 4.2: Illustration of the TSS algorithm. The starting center position is indicated by a surrounding square. The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. TSS always performs 25 block matching operations.
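As an illustration, the following is a minimal sketch of TSS for a single block, using a plain SAD similarity function (Section 4.2.1); frame-border clipping and the optimizations of Chapter 8 are omitted, and the names are ours.

```cpp
#include <cstdint>
#include <cstdlib>

struct MotionVector { int x; int y; };

// SAD (Section 4.2.1) between the block at (bx, by) in the current frame and
// the block displaced by (dx, dy) in the reference frame. width is the row
// stride of both frames; border clipping is omitted for brevity.
static int sad(const uint8_t* cur, const uint8_t* ref, int width,
               int bx, int by, int dx, int dy, int blockSize)
{
    int sum = 0;
    for (int y = 0; y < blockSize; ++y)
        for (int x = 0; x < blockSize; ++x)
            sum += std::abs(int(cur[(by + y) * width + bx + x]) -
                            int(ref[(by + dy + y) * width + bx + dx + x]));
    return sum;
}

// Three Step Search: evaluate a 3x3 pattern at step sizes 4, 2 and 1 pixels,
// re-centering on the best match after each step (25 matches in total).
MotionVector tss(const uint8_t* cur, const uint8_t* ref, int width,
                 int bx, int by, int blockSize)
{
    MotionVector best{0, 0};
    int bestCost = sad(cur, ref, width, bx, by, 0, 0, blockSize);
    for (int step = 4; step >= 1; step /= 2) {
        const MotionVector center = best;                // re-center each step
        for (int dy = -step; dy <= step; dy += step)
            for (int dx = -step; dx <= step; dx += step) {
                if (dx == 0 && dy == 0) continue;        // center already evaluated
                const int cost = sad(cur, ref, width, bx, by,
                                     center.x + dx, center.y + dy, blockSize);
                if (cost < bestCost) {
                    bestCost = cost;
                    best = {center.x + dx, center.y + dy};
                }
            }
    }
    return best;
}
```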
4.1.3 New Three Step Search

Li et al. (1994) suggested an improved version of TSS, called New Three Step Search (NTSS). Where TSS can have problems finding small motion, NTSS tackles this using a center-biased search pattern. In contrast to TSS, it has two different early termination conditions, where the search can be stopped early to reduce the total number of blocks matched.

• Step one: Match all positions, including the center position, at a distance of one and four pixels away from the center. This step matches 17 blocks, i.e. s_1 = 17. If the best match is found in the center, return it as the best match.

• Steps two and three: If the best match in step one is found at a distance of:

  - One pixel from the center: center the search position around the best match found. Match all positions not matched in the first step at a distance of one pixel away from the center. This step matches either three or five blocks. The best match found at this point is the best matching block.

  - Four pixels from the center: follow steps two and three from TSS. The best match found in step three is the match returned by NTSS.

Step two matches three, five or eight blocks, i.e. s_2 ∈ {3, 5, 8}. Step three can be skipped, or matches eight blocks, i.e. s_3 ∈ {0, 8}. The number of blocks matched, n, is the sum of all blocks matched in the steps above. This can vary from 17 to 33 matches (Equation 4.3). Figure 4.3 below illustrates an example run of NTSS.

n = s_1 + s_2 + s_3, where n ∈ {17, 20, 22, 30, 32, 33}   (4.3)

Li et al. (1994) and Barjatya (2004) have presented results showing that NTSS performs slightly better than TSS.

Figure 4.3: Illustration of the NTSS algorithm. The third step was skipped as the best match found in the first step was one pixel away from the center. The starting center position is indicated by a surrounding square. The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. A total of 22 blocks were matched in this example.

4.1.4 Four Step Search

The Four Step Search (4SS) algorithm (Po & Ma 1996) is based on TSS and NTSS. Like TSS and NTSS it uses a center-biased search pattern, and it has early termination conditions in the first two steps.

• Step one: Match all positions, including the center position, at a distance of two pixels away from the center. This step matches nine blocks, i.e. s_1 = 9. If the best match is found in the center, skip to step four.

• Step two: Center the search position around the best match found. Match all positions not matched in the first step at a distance of two pixels away from the center. This step is either skipped, or matches three or five blocks, i.e. s_2 ∈ {0, 3, 5}. If the best match is found in the center, skip to step four.

• Step three: This step is exactly the same as step two. It is either skipped, or matches three or five blocks, i.e. s_3 ∈ {0, 3, 5}.

• Step four: Center the search position around the best match found. Match all positions at a distance of one pixel away from the center. This step matches eight blocks, i.e. s_4 = 8. The best match found in this step is the match returned by the algorithm.

The number of blocks matched, n, is the sum of all blocks matched in the four steps. This can vary from 17 to 27 matches (Equation 4.4). Figure 4.4 below illustrates an example run of 4SS where 25 blocks are matched.

n = s_1 + s_2 + s_3 + s_4, where n ∈ {17, 20, 22, 23, 25, 27}   (4.4)

Figure 4.4: Illustration of the 4SS algorithm. The starting center position is indicated by a surrounding square.
The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. A total of 25 blocks were matched in this example.

Po & Ma (1996) and Barjatya (2004) show that 4SS reduces the number of blocks matched, compared to TSS, by 20-30%, depending on the amount of motion in the video. 4SS can in the worst case be 8% slower than TSS.

4.1.5 Diamond Search

Diamond Search (DS) (Zhu & Ma 1997) is influenced by 4SS but uses a diamond-shaped search pattern instead. It has three steps, where the second step can be repeated any number of times.

• Step one: Start by matching the center position, the positions two pixels away in the horizontal and vertical directions, and the positions one pixel away in the diagonal directions. This step matches nine blocks, i.e. s_1 = 9. If the best match is found in the center, skip to step three.

• Step two: Center the search position around the best match found. Match all positions not matched in the first step in a diamond-shaped pattern. This step is either skipped, or matches three or five blocks, i.e. s_2 ∈ {0, 3, 5}. Repeat this step until the best match is found in the center.

• Step three: Center the search position around the best match found. Match all positions one pixel away in the horizontal and vertical directions. This step matches four blocks, i.e. s_3 = 4. The best match found in this step is the match returned by the algorithm.

The number of blocks matched, n, is the sum of all blocks matched in the three steps (Equation 4.5). Figure 4.5 below illustrates an example run of DS where 21 blocks are matched.

n = s_1 + k s_2 + s_3, where n ∈ {13, 16, 18, 19, 21, 22, 23, ...}, k ∈ {0, 1, 2, ...}   (4.5)

Figure 4.5: Illustration of the DS algorithm. The starting center position is indicated by a surrounding square. The circles indicate the best matches found in each step. Notice that the same position was found as the best match for both the first and second run of step two. The double circle indicates the best match found by the algorithm. A total of 21 blocks were matched in this example.

Barjatya (2004) shows that DS slightly reduces the number of blocks matched compared to 4SS.

4.1.6 Adaptive Rood Pattern Search

The Adaptive Rood Pattern Search (ARPS) (Nie & Ma 2002) is named after its rood-like search pattern. Every block uses the motion vector of its left neighbouring block to set the length of the rood arm α. ARPS is thus only parallelizable in the vertical direction, as the blocks are horizontally dependent. This dependency gives the algorithm the potential of accurately finding large motion between frames, since it assumes that motion is in general coherent within a frame. ARPS uses zero-motion prejudgement for early termination on video with static content, which can significantly reduce the total number of blocks matched for a frame. Three steps are used to match blocks, where the last step is repeated until the center position is the best match.

• Step one: Perform zero-motion prejudgement; match the block against the block at the same position in the reference frame. If the difference is below a threshold τ, stop searching and return this match. This step matches one block, i.e. s_1 = 1.

• Step two: If the block is the left-most block, set α = 2. Otherwise use the largest component, MV_x or MV_y, of the motion vector of the left neighbouring block, i.e. α = max(|MV_x|, |MV_y|).
Letting (x, y) denote the current center of the search, the rood-arm positions (x ± α, y) and (x, y ± α), together with the predicted position (x + MV_x, y + MV_y), are matched. This step is either skipped, or matches five blocks, i.e. s_2 ∈ {0, 5}.

• Step three: Center the search position around the best match found. Match all positions, not matched in the first or second step, at a distance of one pixel away from the center in the horizontal and vertical directions. Iterate this step until the best match is the center position. This position is returned as the best match by the algorithm. Each iteration of this step is either skipped, or matches two to four blocks, i.e. s_3 ∈ {0, 2, 3, 4}.

The number of blocks matched, n, is the sum of all blocks matched in the three steps (Equation 4.6). Notice that step three can be repeated k times. Figure 4.6 below illustrates an example run of ARPS where step three is iterated three times; a total of 15 blocks are matched in this example.

n = s_1 + s_2 + k s_3, where n ∈ {1, 8, 9, 10, ...}, k ∈ {0, 1, 2, ...}   (4.6)

Figure 4.6: Illustration of the ARPS algorithm. The starting center position is indicated by a surrounding square. The rood arm α is illustrated by the thin line. The circles indicate the best matches found in each step. The double circle indicates the best match found by the algorithm. A total of 15 blocks were matched in this example.

Nie & Ma (2002) and Barjatya (2004) show that ARPS reduces the number of blocks matched, compared to DS, by a factor of two.

4.1.7 Full Bidirectional Search

Bidirectional search, where bidirectional and bilateral are sometimes used interchangeably, is a family of BMAs which in theory should be better suited for use together with bidirectional motion compensation (BDMC). The searching procedure differs from e.g. 4SS and ARPS in that no position in one frame is kept constant while searching for a match in the other frame. Instead, the positions in both frames are allowed to vary. The idea is not to estimate where one block of the frame might have moved to or from in the reference frame, but which surrounding parts may have moved through the current block. This corresponds well to the behaviour of BDMC, since the motion vectors for the block, as used by BDMC, can be said to originate from an unfilled block in an interpolated intermediate frame. Variations on bidirectional search are used by e.g. Zhai et al. (2005) and Kang et al. (2008).

We present in this thesis Full Bidirectional Search (FBDS), our simple version of an exhaustive bidirectional search algorithm. It uses the zero motion prejudgement found in ARPS for early termination, which should prove highly beneficial for the low to moderate motion found in the remote surveillance use scenario. The search area ranges from -4 to +4 pixels in the horizontal and vertical directions, for a total of 81 positions searched (82 including the early zero motion check). The algorithm rests heavily on the assumption of linear motion: the offset in search position is negated for one of the frames; e.g. a position with an offset of minus one in both the horizontal and vertical directions compared to the block's position is matched against a position with an offset of plus one in the horizontal and vertical directions in the other frame.

The number of positions searched may seem large compared to e.g. 4SS and ARPS. However, zero motion prejudgement, aggressive optimization by the compiler and improved scheduling when using OpenMP reduce this issue.
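The following is a minimal single-block sketch consistent with the FBDS description above; it is an illustration under the stated assumptions (plain SAD, a zero motion threshold τ, no frame-border clipping), not the exact thesis implementation.

```cpp
#include <cstdint>
#include <cstdlib>

struct MotionVector { int x; int y; };

// SAD between the block offset by (dxA, dyA) in frame A (f_{t-1}) and the block
// offset by (dxB, dyB) in frame B (f_t), both relative to the block position
// (bx, by) in the interpolated frame. width is the row stride of both frames.
static int sad2(const uint8_t* frameA, const uint8_t* frameB, int width,
                int bx, int by, int dxA, int dyA, int dxB, int dyB, int blockSize)
{
    int sum = 0;
    for (int y = 0; y < blockSize; ++y)
        for (int x = 0; x < blockSize; ++x)
            sum += std::abs(int(frameA[(by + dyA + y) * width + bx + dxA + x]) -
                            int(frameB[(by + dyB + y) * width + bx + dxB + x]));
    return sum;
}

// Full Bidirectional Search for one block: zero motion prejudgement first,
// then an exhaustive search over mirrored offsets in [-4, +4] in both frames.
MotionVector fbds(const uint8_t* frameA, const uint8_t* frameB, int width,
                  int bx, int by, int blockSize, int zeroMotionThreshold)
{
    MotionVector best{0, 0};
    int bestCost = sad2(frameA, frameB, width, bx, by, 0, 0, 0, 0, blockSize);
    if (bestCost < zeroMotionThreshold)
        return best;                                     // step one: static block

    for (int dy = -4; dy <= 4; ++dy)                     // step two: 81 mirrored positions
        for (int dx = -4; dx <= 4; ++dx) {
            const int cost = sad2(frameA, frameB, width, bx, by,
                                  dx, dy, -dx, -dy, blockSize);
            if (cost < bestCost) { bestCost = cost; best = {dx, dy}; }
        }
    return best;
}
```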
Figure 4.7 shows how the two steps of FBDS work. In step one, the center position in both frames is matched and checked against the zero motion threshold. Step two (s_2) involves 81 matches, starting with the top-left (-4, -4) block in frame A matched against the bottom-right (+4, +4) block in frame B. The last matching operation in this step is between the bottom-right (+4, +4) block in frame A and the top-left (-4, -4) block in frame B.

Figure 4.7: Illustration of the FBDS algorithm. The starting center position is indicated by a surrounding square. Step two involves 81 additional matches, starting at the top-left block in frame A and the bottom-right block in frame B. The last match is between the bottom-right block in frame A and the top-left block in frame B. A total of 82 blocks were matched in this example.

4.2 Block similarity functions

There are a number of different techniques to measure the similarity between two blocks (denoted by b_A and b_B in the following sections). All pixels in the blocks are compared in pairs and a total value indicating the similarity is computed. A low value indicates higher similarity. As the pixels are independent of each other in each frame, the techniques are parallelizable and can be implemented in an efficient manner.

4.2.1 Sum of Absolute Differences

The Sum of Absolute Differences (SAD) sums the absolute differences of the pixel values (Bovik 2009, Marzat & Ducrot 2009), as described in Equation 4.7. SAD is sometimes referred to as the Sum of Absolute Errors (SAE).

SAD(b_A, b_B) = \sum_{i=0}^{b_W - 1} \sum_{j=0}^{b_H - 1} |b_{A_{ij}} - b_{B_{ij}}|   (4.7)

This technique is widely adopted (Cetin & Hamzaoglu 2010, Lee & Nguyen 2009, Luessi & Katsaggelos 2009), as it is computationally cheaper than many other block similarity functions. Streaming SIMD Extensions (SSE) exploit the inherent parallelism by providing special instructions for computing the SAD of up to 16 pixels at a time (see Section 8.2.1).

4.2.2 Mean Absolute Difference

The Mean Absolute Difference (MAD) computes the mean of the SAD (Barjatya 2004), as described in Equation 4.8. MAD is sometimes referred to as the Mean Absolute Error (MAE).

MAD(b_A, b_B) = \frac{1}{b_W b_H} SAD(b_A, b_B)   (4.8)

MAD is computationally more expensive than SAD, and an implementation requires floating point precision.

4.2.3 Sum of Squared Differences

The Sum of Squared Differences (SSD) sums the squared differences of the pixel values (Marzat & Ducrot 2009), as described in Equation 4.9.

SSD(b_A, b_B) = \sum_{i=0}^{b_W - 1} \sum_{j=0}^{b_H - 1} (b_{A_{ij}} - b_{B_{ij}})^2   (4.9)

4.2.4 Mean Squared Error

The Mean Squared Error (MSE) computes the mean of the SSD (Bovik 2009), as described in Equation 4.10.

MSE(b_A, b_B) = \frac{1}{b_W b_H} SSD(b_A, b_B)   (4.10)

MSE is computationally more expensive than SSD, and an implementation requires floating point precision. This function is used to compute the peak signal-to-noise ratio (PSNR) between two frames, which is described in Section 7.1.1.
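As noted in Section 4.2.1, SSE provides a dedicated instruction for the SAD of 16 pixel pairs. The sketch below shows one way this can be used for a 16x16 block via compiler intrinsics; the thesis's own hand-written assembly routine (Section 8.2.1) may differ.

```cpp
#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// SAD between two 16x16 blocks (Equation 4.7), using the PSADBW instruction
// via _mm_sad_epu8 to process 16 pixels per iteration. curStride and refStride
// are the row strides (frame widths) of the two luma planes.
int sad16x16_sse2(const uint8_t* cur, const uint8_t* ref,
                  int curStride, int refStride)
{
    __m128i acc = _mm_setzero_si128();
    for (int row = 0; row < 16; ++row) {
        __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(cur + row * curStride));
        __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ref + row * refStride));
        acc = _mm_add_epi64(acc, _mm_sad_epu8(a, b));    // two partial 64-bit sums
    }
    // Add the low and high 64-bit halves to obtain the final SAD value.
    __m128i hi = _mm_srli_si128(acc, 8);
    return _mm_cvtsi128_si32(_mm_add_epi64(acc, hi));
}
```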
4.3 Bitstream motion vectors

Some form of block matching is performed in the encoders of all video coding standards. The resulting motion vectors can be extracted and decoded from the bitstream at the receiving end to reduce or remove the need for ME when performing frame rate up-conversion. This technique has successfully been used by e.g. Chen et al. (1998) and Sasai et al. (2004). Depending on the settings and runtime decisions on the encoder side, the motion vectors in the bitstream may or may not describe the true motion in the video. They can in some cases be missing entirely (Huang & Nguyen 2008). If bitstream motion vectors are to be used, some way of judging the quality of the received motion vectors must be incorporated into the algorithm, as done by e.g. Luessi & Katsaggelos (2009).

5. Motion compensation

5.1 Direct motion compensation

The simplest form of motion compensation (MC) is direct motion compensation (DMC), also referred to as unidirectional motion compensation. In a new interpolated frame, blocks are filled with data by moving blocks in the reference frame along their motion vectors (obtained in motion estimation (ME)). This can be done in either the forward or the backward direction. Figure 5.1 below illustrates DMC in the forward direction.

Figure 5.1: Illustration of DMC in the forward direction. Every block in F_{t-1} is moved along its motion vector.

The problem with DMC is that not all pixels may be filled (Huang & Nguyen 2008, Luessi & Katsaggelos 2009). These "holes" are illustrated in Figure 5.2 below. Various methods exist to fill these holes, such as video inpainting (Criminisi et al. 2004, Wexler et al. 2004) and texture synthesis (Ashikhmin 2001).

Figure 5.2: Unfilled pixels when using direct motion compensation. This frame is from the Foreman standard benchmarking video.

5.2 Bidirectional motion compensation

Another way of performing motion compensation is bidirectional motion compensation (BDMC). For every block in a new frame, the motion vector found for the same block in the reference frame is used to interpolate in both the backward and forward directions. Unlike DMC, this method leaves no unfilled pixels, as it iterates over all blocks in the new frame instead of over blocks from the reference frame. Figure 5.3 below illustrates BDMC.

Figure 5.3: BDMC illustrated. For every block in the interpolated frame F_{t-1/2}, the motion vector found for the same block in the reference frame F_{t-1} is used to interpolate in both the backward and forward directions.

Let P_t(x, y) be the value of a pixel with coordinates (x, y) in frame f_t, t being a time index. Then for all (x, y) inside the boundaries of f_t, with 1 ≤ j < k and j ∈ Z+, the pixel values of an intermediate interpolated frame f_{t-\frac{j}{k}} are given by Equation 5.1 below.

P_{t-\frac{j}{k}}(x, y) = \frac{1}{2} P_{t-1}\left(x - \left(1 - \frac{j}{k}\right) MV_x,\; y - \left(1 - \frac{j}{k}\right) MV_y\right) + \frac{1}{2} P_t\left(x + \frac{j}{k} MV_x,\; y + \frac{j}{k} MV_y\right)   (5.1)
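A minimal CPU-side sketch of Equation 5.1 for one interpolated frame is given below, assuming 8-bit luma planes, frame dimensions divisible by the block size, and nearest-pixel sampling with border clamping; the actual implementation performs this step on the GPU with bilinear interpolation (Chapter 6 and Section 8.3).

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct MotionVector { int x; int y; };

// Bidirectional motion compensation (Equation 5.1) for the interpolated frame
// f_{t-j/k}. For every pixel, the motion vector of the block it belongs to is
// used to average one sample fetched backwards from f_{t-1} and one fetched
// forwards from f_t. mvf holds one motion vector per block, row by row.
void bdmc(const uint8_t* prev, const uint8_t* next, uint8_t* out,
          int width, int height, int blockSize,
          const std::vector<MotionVector>& mvf, int j, int k)
{
    const double a = 1.0 - static_cast<double>(j) / k;   // backward fraction
    const double b = static_cast<double>(j) / k;         // forward fraction
    const int blocksPerRow = width / blockSize;
    auto clampi = [](int v, int lo, int hi) { return std::max(lo, std::min(v, hi)); };

    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x) {
            const MotionVector mv =
                mvf[(y / blockSize) * blocksPerRow + (x / blockSize)];
            const int px = clampi(x - static_cast<int>(a * mv.x), 0, width - 1);
            const int py = clampi(y - static_cast<int>(a * mv.y), 0, height - 1);
            const int nx = clampi(x + static_cast<int>(b * mv.x), 0, width - 1);
            const int ny = clampi(y + static_cast<int>(b * mv.y), 0, height - 1);
            out[y * width + x] = static_cast<uint8_t>(
                0.5 * prev[py * width + px] + 0.5 * next[ny * width + nx] + 0.5);
        }
}
```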
6. OpenCL

The growing interest in parallel processing solutions has led to the emergence of multi-core CPUs as well as GPGPU, general-purpose computing on graphics processing units. The GPUs of today contain hundreds up to a thousand stream processing cores, a good match for the highly parallelizable algorithms in this thesis. Efforts to utilize this processing power include frameworks and APIs such as CUDA from Nvidia, OpenCL and Microsoft's DirectCompute.

The Open Computing Language (OpenCL) was chosen for the work of this thesis on the merits of having wider industry support: CUDA is only supported by Nvidia, and DirectCompute is part of Microsoft's DirectX frameworks. We made use of OpenCL for GPU-acceleration of the motion compensation part of the code, taking advantage of on-device specialized hardware for bilinear interpolation. For a sufficiently high-end GPU it also proved to be faster than a CPU-based implementation of motion compensation. We used the first version of the standard, OpenCL 1.0. A later version, OpenCL 1.1, is available but does not contain any substantial changes relevant to our needs, and at the time of writing it required beta-version drivers for some platforms.

OpenCL is an open industry standard. It was developed to allow general-purpose code to run on a wide range of processor types, such as CPUs, GPUs or Digital Signal Processing (DSP) units. The specification is defined by the Khronos OpenCL Working Group, whose parent organisation, The Khronos Group, is also responsible for the development of OpenGL (hence the naming of OpenCL). The standard is supported by companies such as Nvidia, AMD, IBM, Apple and others. OpenCL consists of a language, OpenCL C (based on the C99 standard), and a set of APIs for controlling the execution of the code written in this language.

In the following sections we present a brief overview of the terminology and processing flow when working with OpenCL. For a more detailed overview, please refer to the OpenCL specification (Khronos OpenCL Working Group 2008).

6.1 Processing flow

Setting up OpenCL to run code on a GPU typically requires the following steps:

1. Acquire a platform ID and device ID.
2. Create a compute context.
3. Create a command queue.
4. Create a compute program.
5. Build the compute program.
6. Create a compute kernel in the compute program.
7. Allocate buffers on the device (if buffers are required by the kernel).

Once these steps have been completed, some additional steps are required to actually run the code and obtain the results:

8. Set the global and local work sizes. In many cases this only needs to be done once.
9. Transfer data from the host into device buffers (if necessary).
10. Set the kernel arguments. Some arguments may only have to be set once.
11. Execute the kernel on the device.
12. Transfer the result from the device back to the host.

6.2 Terminology

A platform ID identifies the host and a collection of devices. The host is the part of the system from which the calls to the OpenCL API are made. There can be several OpenCL implementations available on the system, for example if implementations from both AMD and Nvidia are installed. A choice would then have to be made as to which of these to use.

Devices consist of one or more compute units, which in turn consist of processing elements. A device can be a CPU, a Graphics Processing Unit (GPU) or another type of device (referred to as "dedicated OpenCL accelerators" in the specification). For a multi-core CPU the corresponding OpenCL representation would be a device with a single compute unit and one processing element per core. If a system contains more than one CPU, these would be available as separate compute units.

A context is the environment within which the code executes. This environment includes a set of devices together with information about the memory properties of these devices. It also contains command queues associated with the devices. The command queue holds commands enqueued for a specific device.

Kernels, functions written in the OpenCL C language, are grouped together in programs, which are either built for specific devices during runtime or loaded from pre-built binaries. To execute a kernel in a successfully built program, the kernel arguments must first be set. These arguments may e.g. be primitive types such as integer values, handles to buffers allocated on the device, or texture sampler objects.
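The steps of Section 6.1 map onto the OpenCL 1.0 C API roughly as in the sketch below. The kernel name and its arguments are hypothetical, and error checking and resource release are omitted for brevity.

```cpp
#include <CL/cl.h>

// Minimal host-side flow (steps 1-12 of Section 6.1). kernelSource is assumed
// to hold an OpenCL C kernel named "interpolate" that reads two input buffers
// and writes one output buffer, one work item per pixel.
void runKernel(const char* kernelSource, const unsigned char* in0,
               const unsigned char* in1, unsigned char* out, size_t nPixels)
{
    cl_platform_id platform; cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);                                   // 1. platform/device
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);          // 2.
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, nullptr);                    // 3.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernelSource, nullptr, nullptr);      // 4.
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);                               // 5.
    cl_kernel kernel = clCreateKernel(prog, "interpolate", nullptr);                           // 6.
    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  nPixels, nullptr, nullptr);           // 7.
    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  nPixels, nullptr, nullptr);
    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, nPixels, nullptr, nullptr);

    size_t global = nPixels;                                                   // 8. work sizes
    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, nPixels, in0, 0, nullptr, nullptr);          // 9.
    clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, nPixels, in1, 0, nullptr, nullptr);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);                          // 10. arguments
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);  // 11.
    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, nPixels, out, 0, nullptr, nullptr);           // 12.
}
```

The blocking flags (CL_TRUE) make the transfers synchronous; in a real-time pipeline the non-blocking variants discussed below would typically be preferred.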
The size and dimensions of the data to be processed must be specified by setting the global and local work sizes. Data representations may be one-, two- or three-dimensional. The global work size determines the total number of work items needed to process all the data, and specifying a local work size gives the option to partition the data. Several instances of the kernel are executed in parallel on the device, each instance being a work item. The work items are grouped into work groups, whose sizes depend on the specified local work sizes. Partitioning the data into smaller chunks e.g. enables sharing of data common to all work items in a work group. Efficient sharing may optimize memory operations, leading to much higher performance.

Each work group executes on a single compute unit, and each work item can be distinguished from the others by its so-called global and local IDs. The global ID is the coordinate in the data being processed, a coordinate which may be one-, two- or three-dimensional depending on the type of data. If e.g. a two-dimensional image is being processed, the global ID of a work item might map to a specific pixel coordinate in the image. The local ID determines the coordinate of the work item within its work group.

If on-device buffers are used by a kernel, these must be filled with data by enqueuing data transfer commands to a command queue associated with the device. Commands may be issued as blocking, where the next statement is not executed before the transfer is complete, or non-blocking, resuming execution while allowing the transfer to happen "in the background". When the arguments and work sizes have been set, and the buffers filled with data, the kernel can be enqueued for execution by adding an execution command to a command queue. This execution command can also be non-blocking.

Output from a kernel often comes in the form of a device buffer filled with modified data. This data can be copied back from the device, also either blocking or non-blocking, by issuing a buffer-reading command to the command queue. The command can be added to the queue right after the kernel has been queued for execution, since the transfer will not initiate before the execution is complete (unless the command queue has been specified to allow out-of-order execution). Once the data has been transferred back from the device, it may be written to file or used in other ways.

7. Evaluation methods

There are two different performance aspects to consider when evaluating a frame rate up-conversion tool (FRUCT): algorithm runtimes and image quality. The runtime depends on how fast the motion estimation (ME) and motion compensation (MC) are performed. This aspect is particularly interesting in a time-constrained setting, e.g. real-time environments. The image quality can be evaluated with both objective and subjective methods; objective methods are based on mathematical formulas, whereas subjective methods are based on how humans perceive the quality. A number of objective and subjective methods are presented below, along with the choice of methods for this thesis.

7.1 Objective quality evaluation

Objective quality evaluation can be categorised into full-reference (a source frame exists), reduced-reference (some part of a source frame exists) and no-reference (no source frame exists) quality estimation. Estimating the quality of a new frame without, or with only a reduced, source frame is a difficult and challenging task.
Nevertheless, there have been successful results; Woodard & Carley-Spencer (2006) use no-reference objective measurement to automate screening of structural magnetic resonance images, and Wang et al. (2002) use no-reference objective measurement to assess the quality of JPEG images. In full-reference evaluation, new frames are compared with source frames to determine the quality of the new frames. Frames are dropped at a factor k from a source video to produce a down-converted video. E.g. a source video at 60 frames per second (FPS) can be down-converted with k = 3 to a new video at 20 FPS. The frame rate up-conversion tool then adds new frames, with the same factor k, to produce a new video at 60 FPS.

Figure 7.1: Full-reference objective quality evaluation illustrated. A source video is first down-converted and then up-converted. The source frames (marked as A) and the up-converted frames (marked as B) are compared pair-wise using an objective quality evaluation method. The output indicates how similar the pairs of frames are.

7.1.1 Peak signal-to-noise ratio

Richardson (2010) describes peak signal-to-noise ratio (PSNR) as a fast and easy full-reference metric to compare the luma between two frames. PSNR gives a logarithmic value in decibels, as shown in Equation 7.1. MSE is the Mean Squared Error (described in Section 4.2.4) and m is the number of bits needed to sample one pixel. If two frames are identical the MSE will be zero and the PSNR_dB undefined.

\mathrm{PSNR}_{\mathrm{dB}} = 10 \log_{10} \frac{(2^m - 1)^2}{\mathrm{MSE}}    (7.1)

PSNR does not always correspond well with subjective quality evaluation (Richardson 2010, Wang & Bovik 2009). PSNR is, despite this, the de facto standard for objective quality evaluation (Oelbaum et al. 2007), mainly because it is very easy to implement and fast compared to more complex methods.

7.1.2 Structural similarity

Structural Similarity (SSIM) (Wang & Bovik 2002, Wang et al. 2004) is another full-reference objective quality evaluation metric. Like PSNR, SSIM compares the luma data of a pair of frames, but it also takes differences in contrast and structure into consideration. SSIM gives as its result a normalized value in the range −1.0 to 1.0 indicating how similar the frames are. A value of 1.0 is returned only if the frames are identical. Equation 7.2 describes SSIM, where x and y are the two frames, µx and µy are the estimated mean luma of x and y, σx and σy are the estimated contrasts (standard deviations) of x and y, and σxy is the covariance of x and y. C1 and C2 are constants used to avoid instability in the function.

\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}    (7.2)

The comparison performed by SSIM corresponds better with subjective quality evaluation, but it is more complex and harder to implement than PSNR.

7.1.3 Other methods

Much research effort has been put into developing objective quality evaluation methods that correspond well with subjective quality evaluation. Here are some of the other methods presented in recent years:

• PSNR+, an extension of PSNR (Oelbaum et al. 2007).
• Predicted Mean Opinion Score (MOSp) (Bhat et al. 2009).
• A metric based on Just Noticeable Difference (JND), proposed by Suthaharan et al. (2005).

These methods have high reported correlation with subjective quality evaluation methods, up to 70-90% (Richardson 2010).

7.2 Subjective quality evaluation

Subjective quality evaluation is based on how humans perceive the quality of images and video.
Richardson (2010) argues that there are several factors that may influence how the quality is perceived, e.g. the tester's state of mind and the test environment. Also, humans tend to be sensitive to spatial and temporal fidelity, i.e. how clearly different parts of the video can be seen and how smooth it is. In order to ensure that the results obtained from a subjective quality evaluation are reliable, a large pool of testers is needed. This is because testers, as they grade an increasing number of videos, often learn to look for artefacts, which may affect the results (Richardson 2010). The results obtained from all testers are normalized to a Mean Opinion Score (MOS) (Bhat et al. 2009), which "indicates the relative quality of the impaired and reference sequences" (Richardson 2010, p.20). Three different subjective quality evaluation methods are presented below.

7.2.1 Double-stimulus impairment scale

The ITU Radiocommunication Sector (ITU-R) describes different methods for assessing the quality of images and videos in BT.500-11 (ITU Radiocommunication Sector 2002). One of these methods is the Double-Stimulus Impairment Scale (DSIS), where a tester is presented with pairs of videos, one after another, knowing that the first video is the reference video and the second one is impaired. The tester then grades the second video, having the reference video in mind, on a five-step scale ranging from very annoying to imperceptible.

7.2.2 Double-stimulus continuous quality-scale

The Double-Stimulus Continuous Quality-Scale (DSCQS) is another method described in BT.500-11. A tester is presented with pairs of videos, where the videos are shown simultaneously. The tester grades both videos, without knowing which is the reference video and which is the impaired one, on a five-step scale ranging from bad to excellent. The order should be randomised for every pair of videos. This method is widely used according to Richardson (2010).

7.2.3 Subjective assessment method for video quality

Kozamernik et al. (2005) have suggested a method for subjective quality assessment called the Subjective Assessment Method for Video Quality (SAMVIQ). The tester is presented with a reference video and a set of impaired videos. The tester can watch the videos in any order, and grades the impaired videos on a scale from 0 to 100. SAMVIQ gives results that are comparable with DSIS and DSCQS (Huynh-Thu et al. 2007, Tominaga et al. 2010), and the authors argue that "SAMVIQ is simpler, faster and more user-friendly than traditional subjective evaluation methods".

7.3 Choice of quality evaluation methods

PSNR and SSIM were chosen to evaluate the quality of the frame rate up-conversion tool. PSNR was chosen for implementation, even though it might not correspond well with subjective methods, because of its simplicity and its wide usage in the field. SSIM was chosen to complement the results from PSNR. The third-party software MSU Video Quality Measurement Tool2 was used for this. None of the subjective quality methods were chosen; instead, the up-converted videos were evaluated by ourselves in an ad-hoc manner.

2http://compression.ru/video/quality_measure/video_measurement_tool_en.html

8. Implementation

The frame rate up-conversion tool (FRUCT) was implemented in C++, with some performance-critical parts hand-coded in assembly. Open Computing Language (OpenCL) was used for the motion compensation (MC) part.
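Before the individual parts are described, the overall flow of the tool can be sketched roughly as follows: motion is estimated between the previous and next frame, and k - 1 intermediate frames are synthesised for an up-conversion factor k. The types and helper functions below are hypothetical, not the tool's real interfaces; the actual procedure is summarised in Figure 8.1.

    #include <utility>

    // Sketch of the overall up-conversion loop for a factor k.
    // Frame, FrameSource, FrameSink, MotionField and the helper functions
    // (readFrame, estimateMotion, compensate, writeFrame) are hypothetical.
    void upconvert(FrameSource &input, FrameSink &output, int k)
    {
        Frame prev, next;
        if (!readFrame(input, prev))
            return;                          // nothing to do
        writeFrame(output, prev);            // the first frame is passed through unchanged
        while (readFrame(input, next)) {
            MotionField mv = estimateMotion(prev, next);           // block-matching ME
            for (int i = 1; i < k; ++i) {
                float w = static_cast<float>(i) / k;               // temporal position of new frame
                writeFrame(output, compensate(prev, next, mv, w)); // bidirectional MC
            }
            writeFrame(output, next);        // then the original frame itself
            std::swap(prev, next);           // buffer swap: next becomes the new previous frame
        }
    }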
OpenMP compiler pragmas were used for all major parallelizable loops in the code to ensure that all available CPU cores could be utilized. Frame repetition (FR) and frame averaging (FA) were both implemented to serve as a basis for comparison against the more advanced algorithms. The general up-conversion procedure can be described with the pseudo-code in Figure 8.1.

Figure 8.1: Up-conversion pseudo code

8.1 Naive methods

The FR implementation is trivial; it simply reads one frame at a time from the source file and writes it to the output file k times, where k is the up-conversion factor. FA was implemented in two versions: one CPU-based and one OpenCL-based. The OpenCL-based FA was an exploration of how Graphics Processing Unit (GPU) acceleration could be used for the motion estimation (ME) and MC algorithms in the FRUCT.

8.2 Motion estimation

ME was performed on luma from the raw video data. No extraction of motion vectors from the bitstream was made, for the reasons outlined in Section 4.3. By not using bitstream motion vectors the FRUCT is also not tied to a specific video codec and can be used together with both contemporary and future codecs. A drawback is however the loss of sub-pixel motion vector precision; in H.264/MPEG-4 AVC (H.264) motion vectors have up to quarter-pixel precision, which means motion vectors are not limited to the integer values produced by standard implementations of most ME algorithms described in this thesis. It would be possible to add sub-pixel accuracy to these algorithms, but at the cost of additional processing time, which would probably put the goal of real-time capability at risk. Of the block matching algorithms (BMAs) mentioned in Section 4.1, Full Search (FS), Four Step Search (4SS), Adaptive Rood Pattern Search (ARPS) and Full Bidirectional Search (FBDS) were implemented. The choice of which algorithms to use was based on reported performance, both in terms of quality and runtime (Barjatya 2004). For 4SS the check for redundant search positions was not implemented. In order for the block matching to be feasible for real-time operation, two main parts of any BMA should be the primary focus of optimization efforts. First, the number of potential positions searched should be kept as low as possible without degrading quality too much (a necessary trade-off). Second, the operation of matching two blocks against each other should be as efficient as possible, since this part of the code will be run intensively. Details of the techniques implemented for increased performance and quality are given in the following sections.

8.2.1 SAD assembly optimization

By choosing an efficient algorithm the number of matching operations can be minimized, at a reasonable cost of small ME errors and lost accuracy. However, an optimized similarity function (see Section 4.2) can also greatly reduce the time spent. The Sum of Absolute Differences (SAD) function, due to its wide use in video encoding, has the benefit of having specialized CPU instructions available. In the work of this thesis the Streaming SIMD Extensions (SSE) instruction PSADBW (Intel Corporation 2011) was used, where PSAD stands for Packed Sum of Absolute Differences. This allowed calculation of the absolute differences of 8 pairs of 8-bit values in a single instruction. In other words: if the block size is set to 8x8, an entire block row of 8 values can be processed for each PSADBW instruction.
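A minimal sketch of such a row-at-a-time SAD is shown below, written here with the compiler intrinsic for PSADBW (_mm_sad_epu8) rather than the hand-coded assembly actually used in the tool; the block size and pointer layout are illustrative.

    #include <emmintrin.h>   // SSE2 intrinsics
    #include <cstdint>

    // SAD between two 8x8 luma blocks located in frames with the given row stride.
    static inline int sad_8x8(const uint8_t *a, const uint8_t *b, int stride)
    {
        __m128i acc = _mm_setzero_si128();
        for (int row = 0; row < 8; ++row) {
            // Load one 8-pixel block row from each frame into the low 64 bits.
            __m128i ra = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(a + row * stride));
            __m128i rb = _mm_loadl_epi64(reinterpret_cast<const __m128i *>(b + row * stride));
            // PSADBW: sums the absolute differences of the eight byte pairs.
            acc = _mm_add_epi64(acc, _mm_sad_epu8(ra, rb));
        }
        return _mm_cvtsi128_si32(acc);   // total SAD for the block
    }

The same routine, evaluated with no offset between the two blocks, is also what the zero motion prejudgement described in Section 8.2.3 compares against its threshold.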
This means that the total number of calls to PSADBW needed to compute the SAD for an 8x8 block was 8, and for a 4x4 block only 2. Note that only 64-bit registers were used for PSADBW, not 128-bit, since we were unaware during implementation that this instruction can also operate on 128-bit registers. Using 128-bit instead of 64-bit registers could speed up the SAD calculation for a block by a factor of up to two.

8.2.2 Overlapped block motion estimation

Using a smaller block size means there is less data to use when matching blocks; 4x4 blocks, for example, contain one sixteenth of the data of 16x16 blocks. This can lead to increased sensitivity to image noise and, in effect, faulty motion vectors. To reduce this issue when the block size is smaller than 16x16 we used enlarged blocks in the BMAs, a technique used by e.g. Zhai et al. (2005). Each 4x4 or 8x8 block is temporarily enlarged to 16x16 by including surrounding pixels in each direction, and then matched against a similarly enlarged block in the reference frame.

8.2.3 Zero motion prejudgement

In many types of video, especially the kind of mostly static video this FRUCT implementation is intended for, there is a large frame-to-frame coherence. A high level of coherence can be detected by checking the SAD of a macroblock at position (i, j) in both frames (i.e. no offset in search position) against a threshold value. If the SAD is below the threshold value the block is said to have no motion and no further ME is performed for this block. This technique is taken from ARPS (see Section 4.1.6). We have incorporated it into all ME algorithms implemented for this thesis, except for FS. The threshold value was set to 1024 for a 16x16 block.

8.3 Motion compensation

Direct motion compensation (DMC) was the first MC algorithm to be implemented for the FRUCT. Due to its simplicity it proved to be just as fast as predicted. However, since a potentially time-consuming post-processing method is required (see Section 5.1), this track of development was not explored further. Efforts were instead focused on bidirectional motion compensation (BDMC). As with FA, two versions of the BDMC algorithm were implemented: one CPU-based and one OpenCL-based. The OpenCL-based solution keeps a copy of the previous and next frames in two 2D image buffers (textures). By using a buffer-swapping procedure similar to the pseudo-code in Figure 8.1, only the newly read frame needs to be sent to the device, as the previous frame was already transferred in an earlier call. The motion vectors for all macroblocks are copied to a buffer on the device, and the weighting factor is set as a kernel argument. The global work-item size is set to the frame width and height, and the local work-group size (see Chapter 6) is set to the specified constant macroblock width and height. In other words, every work item in a work group belongs to the same macroblock, and there is a one-to-one correspondence between pixels and work items. During kernel execution each work item uses its global position (i.e. pixel position), together with an offset calculated from the weighting factor and the block motion vector, for look-ups (sampling) in the image buffers to produce an output value. The sampling can be done by nearest-neighbour, rounding non-integer sampling locations to the closest integer one (thus giving the same result as the CPU-based solution).
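A rough outline of such a kernel is shown below. The argument names are hypothetical, the frames are assumed to be stored as single-channel normalised 8-bit images, and the bidirectional weighting shown is one common formulation rather than necessarily the exact one used in the tool.

    // Hypothetical BDMC luma kernel: one work-item per output pixel.
    __kernel void bdmc_luma(__read_only image2d_t prevFrame,
                            __read_only image2d_t nextFrame,
                            __global const float2 *blockMotion,  // one vector per macroblock
                            float w,                             // temporal weight of the new frame
                            int blocksPerRow,
                            int width,
                            __global uchar *out)
    {
        const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;
        int x = get_global_id(0);
        int y = get_global_id(1);
        // All work-items in a work-group belong to the same macroblock.
        int block = (y / (int)get_local_size(1)) * blocksPerRow + x / (int)get_local_size(0);
        float2 mv = blockMotion[block];
        float2 p  = (float2)(x + 0.5f, y + 0.5f);   // sample at pixel centres
        // Look backwards along the vector in the previous frame, forwards in the next.
        float a = read_imagef(prevFrame, smp, p - w * mv).x;
        float b = read_imagef(nextFrame, smp, p + (1.0f - w) * mv).x;
        out[y * width + x] = convert_uchar_sat(((1.0f - w) * a + w * b) * 255.0f);
    }

With a sampler declared this way, switching the filter mode from CLK_FILTER_NEAREST to CLK_FILTER_LINEAR is essentially all that is required for the alternative described next.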
Alternatively, sampling can be done by bilinear interpolation, where the values of the four closest integer sample points are weighted together (Figure 8.2). The use of bilinear interpolation trades slightly decreased sharpness for increased smoothness in video motion. Since there is special hardware for this kind of interpolation on modern graphics cards there is no major time penalty compared to nearest-neighbour sampling. This would not be the case for a CPU-based implementation.

Figure 8.2: The non-integer sampling location (the filled circle) is rounded to the top left sample location. In bilinear interpolation (right side), the non-integer sampling location is interpolated from the four closest integer sample points.

Three ways of producing chroma for the interpolated frames were tested: MC (with a separate kernel), averaging, and copying from the previous frame. MC was found to give results that were indistinguishable from chroma averaging in most cases, so the cost of the increased MC time could not be justified. Copying chroma from the previous frame was found to give bad results for videos with moderate motion. For these reasons only averaging of chroma was used, performed on the CPU for flexibility, since the other methods were still available as a configuration option.

9. Results

9.1 Test videos

The videos used for experimental verification were selected from standard test videos3 used in existing literature. The selection was based on similarity to the actual use case of remote surveillance, meaning that videos with a static camera and low to moderate motion were favoured. The Akiyo test video sequence (Common Intermediate Format (CIF) format) shows a female news anchor presenting the news. It features a static camera and background, with low motion as she moves her head while talking (see Figure 9.1a for a sample image from the video). The most challenging part of the video, from a frame rate up-conversion perspective, is to correctly interpolate movements of lips and eyes, especially blinking. The Container test video sequence (CIF format) also includes a static camera and mostly low motion. It features two slowly moving ships, a flag waving in the wind and some water movement. Also, at the end of the sequence two birds fly by close to the camera at high velocity (Figure 9.1b). This video mainly tests the ability to handle low, predictable motion (the ships), some unpredictable motion (flag and water) and extremely high motion (the birds). Foreman (CIF format) is one of the most used test sequences for frame rate up-conversion (Figure 9.1c). It features a close-up of a foreman at a construction site, where the camera is static in the first half of the video and quickly panning in the second half. It contains heavy motion, including motion blur, where the motion of the foreman's face is especially hard to estimate and compensate for. It is nowhere near the remote surveillance use scenario, but is included due to its popularity in the field.

3Standard videos can be found at http://media.xiph.org/video/derf/

(a) Akiyo (b) Container (c) Foreman
Figure 9.1: CIF test video sequences.

Due to the lack of standard test videos with 1080p resolution fitting the criteria of a static camera and low motion, we decided to record a few of our own4: Liseberg, FerryClose and FerryFar. These videos were recorded with a Nikon D5100 DSLR camera at 30 frames per second (FPS)5 in H.264-encoded format. They were later converted to raw uncompressed Y'CbCr 4:2:0 format using MEncoder6.
The Liseberg video (Figure 9.2) is shot outside an amusement park. It features fast vertical motion of an attraction, moderate horizontal motion of a tram, and low motion of pedestrians in the background. Also, a fast-moving bird introduces high-motion noise. FerryClose and FerryFar, shown in Figure 9.3 and Figure 9.4, capture an approaching commuter ferry at different distances. Both videos have fairly high motion in the water, with the water covering a third or more of the frame area.

4The recorded videos are available on http://andreas.isberg.se/edu/masters_thesis/
5The videos were recorded in 29.97 FPS, or 30000/1001 to be even more precise. 30 is however the number specified when choosing frame rate and resolution in the camera.
6MEncoder can be found on http://www.mplayerhq.hu/

Figure 9.2: Liseberg test video (1080p). The tower in the background is an amusement park attraction.
Figure 9.3: FerryClose test video (1080p).
Figure 9.4: FerryFar test video (1080p).

9.2 Test system specification

The specifications of the systems used for testing can be seen in Table 9.1 below. Two different classes of test systems, LAPTOP and DESKTOP, were used when collecting timing results.

PART              LAPTOP           DESKTOP
Processor         Core i7-2720QM   Core i5-750
Physical cores    4                4
Virtual cores     8                4
RAM amount        4 GB             4 GB
RAM specs         1.33 GHz DDR3    1.6 GHz DDR3
Graphics card     Radeon 6750M     GeForce GTX560 Ti
VRAM              1024 MB          1024 MB
Graphics driver   8.812.0.0        275.33
OS                Win7 Pro         Win7 Pro

Table 9.1: Specifications of test systems.

Both systems used processors from Intel, with different numbers of physical and logical processor cores. The DESKTOP system had four physical and four virtual cores, while the processor in LAPTOP also had four physical cores but a total of eight virtual cores thanks to Hyper-Threading (HT). The number of Open Multi-Processing (OpenMP) threads was based on the number of virtual cores. The graphics card in DESKTOP was an Nvidia card which, at the time of writing, could be placed in the high-end category. The laptop had a mid-range graphics chip from AMD. Microsoft Visual Studio 2010 was used to compile the code. The compiler was set to full speed optimization, with link-time code generation and SSE2 code generation turned on, as well as whole-program optimization.

9.3 Subjective quality evaluation

9.3.1 Recorded videos

Looking at a 3x up-conversion, from 10 to 30 FPS (again, this is a rounding of 29.97), of the FerryFar sequence, the different motion estimation algorithms show some clear differences. The Four Step Search (4SS) algorithm produces the best result for this video. A few artefacts appear by the flag in the lower left corner (Figure 9.5d), and some in the water. The non-uniform motion in the water is well handled, and no artefacts are large or persistent enough to be considered particularly disturbing for the viewer. Adaptive Rood Pattern Search (ARPS) produces a few artefacts on the ferry and around the flag. However, the main difference compared to the other algorithms is the degree of artefacts in the water, which is extremely high for ARPS in this video. This can be explained by the "wandering" behaviour of the algorithm. Artefacts in the water produced by the Full Bidirectional Search (FBDS) algorithm are worse than with 4SS but significantly better than with ARPS. Brief artefacts appear on the ferry, and some on a logo in the background (Figure 9.5e).
Full Search (FS) gives similar results to FBDS, with the same kind of artefacts near the logo in the background.

(a) Original (b) Original (c) Original (d) 4SS (e) FBDS (f) ARPS
Figure 9.5: Artefacts in 3x up-conversion from 10 to 30 FPS of the FerryFar sequence, 8x8 extended block size. Note the unintended "shadow" of the flag in (d) compared to the original (a), and the misaligned waves in the logo (e) compared to (b). The water in (f) contains several rows of miscoloured blocks.

The water motion in the FerryClose sequence, when up-converted from 10 to 30 FPS, is also best handled by 4SS and worst by ARPS. Disturbing artefacts, primarily around the ferry's windows, are produced by 4SS (Figure 9.8f). This is not the case for FBDS, where only a few artefacts can be found on the side and front of the ferry, on the thin antenna on top and on the heads of the people on the upper deck (Figure 9.8d). FBDS also shows problems with a bird flying by fast, which appears to flicker (Figure 9.8e). Despite this, FBDS gives the overall best subjective result for this up-conversion factor, as illustrated in the sharpness comparisons in Figure 9.6. The benefits of up-conversion with motion estimation (ME) and motion compensation (MC), compared to simple frame averaging (FA), are shown in Figure 9.7.

(a) FBDS (b) 4SS (c) FA
Figure 9.6: Sharpness comparison, 3x up-conversion from 10 to 30 FPS of the FerryClose sequence. With FBDS the text is sharp and readable, while 4SS gives less text shadow but also less readable text compared to FA.

(a) Original (b) FBDS (c) 4SS (d) FA
Figure 9.7: Sharpness compared to the original frame, 3x up-conversion from 10 to 30 FPS of FerryClose, 8x8 extended block size (see Section 8.2.2). Note the slightly displaced antenna in (b) and (c), and the block artefact in (c). FA is outclassed.

(a) Original (b) Original (c) Original (d) FBDS (e) FBDS (f) 4SS
Figure 9.8: Artefacts in 3x up-conversion from 10 to 30 FPS of the FerryClose sequence, 8x8 extended block size. The head of the person to the left in (a) is mostly gone in (d); the same goes for the bird in (b) when compared to (e). Artefacts produced by 4SS on the side of the ferry can be seen between the windows in (f).

When converting FerryClose from 30 to 60 FPS, however, doubling the original frame rate, ARPS produces far fewer artefacts and 4SS few or none. Artefacts produced by FBDS are limited to single blocks on the side and front of the ferry. For the Liseberg sequence an up-conversion from 10 to 30 FPS gives video quality problems for all ME algorithms. Artefacts produced by ARPS around the tram are particularly heavy (Figure 9.9c), but slightly lighter around the amusement park tower compared to the other algorithms (Figure 9.10c). The limited range of 4SS and FBDS is not enough to find the true motion of the tram (Figures 9.9b and 9.9d). FBDS again cannot handle a fast-moving bird, which appears to flicker.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.9: Artefacts on the front of the tram in 3x up-conversion from 10 to 30 FPS of the Liseberg sequence, 4x4 extended block size.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.10: Artefacts on a high-motion part (the "ring" is falling fast towards the ground) of an amusement park tower, in 3x up-conversion from 10 to 30 FPS of the Liseberg sequence, 4x4 extended block size. The limited range of 4SS and FBDS is probably the reason for the severe artefacts in (b) and (d).
When up-converting from 15 to 30 FPS all algorithms give better results on the Liseberg sequence (Figure 9.11), though ARPS still produces sporadic artefacts. FBDS gives few to no artefacts on the tram and also rivals ARPS for the best results on the vertical motion (Figure 9.12), which at its maximum speed is still too fast for the range of FBDS. When going from 30 to 60 FPS, only ARPS has some limited trouble with the tram motion.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.11: Artefact comparison on the tram, 2x up-conversion from 15 to 30 FPS of the Liseberg sequence, 4x4 extended block size. In this frame ARPS gave fewer artefacts than 4SS, but over the entire sequence the opposite is true.

(a) Original (b) 4SS (c) ARPS (d) FBDS
Figure 9.12: Artefacts on the tower in the Liseberg sequence, 2x up-conversion from 15 to 30 FPS.

9.3.2 Standard test videos

The motion in the news anchor's face in a 3x up-conversion of the Akiyo test sequence, going from 10 to 30 FPS, is mostly handled well by 4SS, ARPS and FBDS. Problematic ranges of frames often involve facial regions being covered or uncovered as an effect of quick movements of eyes and lips. Figure 9.13 and Figure 9.15 show such events. As indicated by the semi-visible eyes in Figure 9.13e, the news anchor is about to open her eyes. This motion was correctly interpolated by ARPS and FBDS, and with a single block artefact by 4SS. Figure 9.14 is in the middle of vertical face motion. ARPS and FBDS both produce acceptable results, whereas 4SS gives an extra pair of eyebrows. 4SS fares better on frame 137, where FBDS is the worst of the four algorithms compared (Figure 9.15d), making Akiyo one of the few test videos where ARPS gives the best result.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.13: Excerpt of frame 38, 3x up-conversion of the Akiyo sequence, from 10 to 30 FPS.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.14: Excerpt of frame 76, 3x up-conversion of the Akiyo sequence, from 10 to 30 FPS.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.15: Excerpt of frame 137, 3x up-conversion of the Akiyo sequence, from 10 to 30 FPS.

The first part of the Container video sequence, before two birds enter the scene, contains motion so slow that even a 6x up-conversion can give highly acceptable results with 4SS, ARPS, FBDS or even FA (Figure 9.16). Figure 9.17, showing one of the birds, gives an example of how badly such events are handled by the tested algorithms.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.16: Excerpt of frame 111, 6x up-conversion of the Container sequence. 4SS, ARPS and FBDS all give a result with acceptably sharp contours of the ships. FA gives a little less sharpness compared to the others.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.17: Excerpt of frame 262, 3x up-conversion of the Container sequence.

Most parts of the Foreman sequence contain motion too large or complicated for the chosen algorithms to tackle, as illustrated in Figure 9.18. As stated earlier, good results on this video have not been a focus of our efforts, and the video is included primarily for reference due to its high popularity in this field of research.

(a) Original (b) 4SS (c) ARPS (d) FBDS (e) FA
Figure 9.18: Excerpt of frame 13, 3x up-conversion of the Foreman sequence.

With the subjective evaluations of all test videos in mind, the smaller block size of 4x4 produces roughly the same amount of artefacts as 8x8.
However, as the artefacts are also smaller, they appear to be less disturbing for the viewer.

9.4 Objective quality evaluation

Most of the peak signal-to-noise ratio (PSNR) results obtained from up-converting the videos indicate that FA performs very well, even though the subjective quality evaluation sometimes suggests otherwise. PSNR uses the pixel differences between two frames, and this difference is small for videos with low motion and low color contrast. This is a major problem with PSNR, and the results should only be treated as a rough estimate. The diagrams below contain average results for each algorithm combined with four different block sizes. They also contain per-frame results for each algorithm using 4x4 extended block size, as this was found to be the best choice in the subjective quality evaluation.

9.4.1 Recorded videos

Structural Similarity (SSIM) was not computed for the 1080p videos because of limitations in the program used for testing. Figures 9.19 to 9.22 show that 4SS performs best of the motion estimation algorithms on FerryFar and FerryClose, similar to the results from the subjective quality evaluation. The PSNR is quite constant, as the videos contain only small amounts of motion.

Figure 9.19: Average PSNR of FerryFar after up-converting it from 10 to 30 FPS.
Figure 9.20: Per-frame PSNR of FerryFar after up-converting it from 10 to 30 FPS.
Figure 9.21: Average PSNR of FerryClose after up-converting it from 10 to 30 FPS.
Figure 9.22: Per-frame PSNR of FerryClose after up-converting it from 10 to 30 FPS. 4SS performs slightly better than the other algorithms.

Up-converting Liseberg from 10 to 30 FPS gives reasonably good objective results for all algorithms, as shown in Figures 9.23 and 9.24. These results are however an effect of the mostly static scene. Comparing these figures to the subjective quality evaluation (Section 9.3.1) confirms that high PSNR values do not necessarily correspond to high quality in the few parts of the video frame where there is motion of interest to a human observer.

Figure 9.23: Average PSNR of Liseberg after up-converting it from 10 to 30 FPS.
Figure 9.24: Per-frame PSNR of Liseberg after up-converting it from 10 to 30 FPS.

All algorithms perform slightly better (by a few decibels) when up-converting Liseberg from 15 to 30 FPS. FBDS performs excellently, hitting over 40 decibels on average.

Figure 9.25: Average PSNR of Liseberg after up-converting it from 15 to 30 FPS.
Figure 9.26: Per-frame PSNR of Liseberg after up-converting it from 15 to 30 FPS.

9.4.2 Standard videos

All algorithms perform excellently on Akiyo, hitting over 40 decibels on average, as shown in Figure 9.27.

Figure 9.27: Average PSNR of Akiyo after up-converting it from 10 to 30 FPS.

Figures 9.28 and 9.29 show how the result of FA drops low for many frames. This is likely because of the color contrast between the light skin and the dark areas such as the hair, eyebrows and eyes.

Figure 9.28: Per-frame PSNR of Akiyo after up-converting it from 10 to 30 FPS. Notice how FA drops very low for certain frames.
Figure 9.29: Per-frame SSIM of Akiyo after up-converting it from 10 to 30 FPS. Notice how FA drops very low for certain frames.

The algorithms also perform excellently on Container, as shown in Figure 9.30.

Figure 9.30: Average PSNR of Container after up-converting it from 10 to 30 FPS.

Figures 9.31 and 9.32 show how the results of all algorithms drop low between frames 245 and 276.
This is due to the fact that none of the algorithms can properly handle the high motion of the two birds.

Figure 9.31: Per-frame PSNR of Container after up-converting it from 10 to 30 FPS. FBDS performs excellently, hitting well over 40 decibels on average, on most frames except when the two birds fly by.
Figure 9.32: Per-frame SSIM of Container after up-converting it from 10 to 30 FPS.

As expected, none of the algorithms perform well on Foreman, as Figures 9.33 to 9.35 show.

Figure 9.33: Average PSNR of Foreman after up-converting it from 10 to 30 FPS.
Figure 9.34: Per-frame PSNR of Foreman after up-converting it from 10 to 30 FPS.
Figure 9.35: Per-frame SSIM of Foreman after up-converting it from 10 to 30 FPS. Note that the values on the y-axis range from 0.3 to 1.0 instead of 0.960 to 1.0, which was the case for the other standard videos. FA is the worst algorithm for this video.

9.5 Real-time capability

The diagrams below contain per-frame timing results for 4SS, ARPS and FBDS, with 4x4 extended block size, as these were found to give the best quality.

9.5.1 Motion estimation time measurements

Enabling optimizations could in theory give huge improvements. OpenMP could increase the performance by a factor equal to the number of virtual cores in the system, and the Sum of Absolute Differences (SAD) assembly optimization could increase the performance up to 16 times (see Section 8.2.1). Table 9.2 below shows how they actually affect the ME runtime. The optimizations are neither algorithm- nor video-type-dependent, so the table shows the average per-frame time measurements for 4SS using 4x4 extended block size on the recorded 1080p videos. Enabling OpenMP on LAPTOP (8 virtual cores) gives a speed increase factor of 3.3, while the same optimization on DESKTOP (4 virtual cores) only increases the speed 2.4 times. Enabling the SAD assembly optimization gives a speed increase factor of 11.5 on LAPTOP, and 10.8 on DESKTOP (Table 9.3).

                   LAPTOP     DESKTOP
No optimization    680.6 ms   721.3 ms
OpenMP only        205.6 ms   301.3 ms
SSE only           59.3 ms    66.6 ms
OpenMP and SSE     37.3 ms    43.3 ms

Table 9.2: Average time for different optimizations on ME using 4SS with 4x4 extended block size.

                   LAPTOP   DESKTOP
No optimization    1.0x     1.0x
OpenMP only        3.3x     2.4x
SSE only           11.5x    10.8x
OpenMP and SSE     18.2x    16.7x

Table 9.3: Relative time improvements with optimizations, from the data in Table 9.2.

All algorithms, with 4x4 extended block size and all optimizations enabled, are extremely fast on the standard CIF videos, as Table 9.4 shows.

        LAPTOP   DESKTOP
4SS     1.2 ms   1.2 ms
ARPS    1.1 ms   1.1 ms
FBDS    3 ms     2.7 ms

Table 9.4: Average time for ME for the standard CIF videos using 4x4 extended block size and optimizations.

Figures 9.36 to 9.41 show the performance of the ME algorithms, using 4x4 extended block size and optimizations enabled, on the recorded 1080p videos. Some of the charts show spikes where processing times increase by 100 ms or more for some frames. This is not caused by instantaneous changes in the videos. A likely explanation is the process scheduling of the underlying operating system giving priority to other processes. Setting a higher priority on the main frame rate up-converter thread should remedy this. The issue does however highlight the inherent difficulty of guaranteeing that a maximum processing time limit is not exceeded.

Figure 9.36: ME on FerryFar using DESKTOP after up-converting it from 10 to 30 FPS.
Figure 9.37: ME on FerryFar using LAPTOP after up-converting it from 10 to 30 FPS.
Figure 9.38: ME on FerryClose using DESKTOP after up-converting it from 10 to 30 FPS.
Figure 9.39: ME on FerryClose using LAPTOP after up-converting it from 10 to 30 FPS.
Figure 9.40: ME on Liseberg using DESKTOP after up-converting it from 10 to 30 FPS.
Figure 9.41: ME on Liseberg using LAPTOP after up-converting it from 10 to 30 FPS.

9.5.2 Motion compensation time measurements

Table 9.5 below shows the MC processing times for the CPU- and OpenCL-based BDMC implementations. The test was performed on Liseberg using 4SS and 4x4 extended block size. The results show a big performance difference between the CPUs and GPUs in each test system. Looking only at processing times, the CPU-based implementation is a better choice for the LAPTOP test system. However, the GPU-based implementation gives increased quality thanks to the bilinear interpolation. The DESKTOP test system benefits hugely from its high-end graphics card.

               LAPTOP    DESKTOP
CPU            19.7 ms   27.5 ms
OpenCL (GPU)   25.5 ms   14.9 ms

Table 9.5: Comparison of CPU- and OpenCL-based BDMC implementations. The test was performed on Liseberg using 4SS and 4x4 extended block size.

Based on the results above, the systems benefit most from the OpenCL-based BDMC implementation. Tables 9.6 and 9.7 below show how OpenMP affects the OpenCL-based BDMC implementation on the two test systems. Differences in motion between videos give only small variations in processing times. The DESKTOP system combined with OpenMP is fast enough for a true real-time implementation (see Section 2.1.2), where each processing step needs to be below 16.7 ms. The effect of OpenMP on the otherwise GPU-based algorithm comes from the fact that chroma is still processed on the CPU (see Section 8.3).

                  LAPTOP    DESKTOP
With OpenMP       25.8 ms   13.1 ms
Without OpenMP    37.6 ms   25.2 ms

Table 9.6: Average time for MC on the recorded 1080p videos.

                  LAPTOP   DESKTOP
With OpenMP       3 ms     0.9 ms
Without OpenMP    6.7 ms   1.7 ms

Table 9.7: Average time for MC on the standard CIF videos.

9.5.3 Total processing times

The total processing time, taking both ME and MC into account, can be used to determine if the real-time goals set in Section 2.1.2 can be accomplished. All videos used in testing were first down-converted by a factor of three and then up-converted by the same factor. This factor means the total processing time must not exceed 50 milliseconds, marked as LIMIT_HIGH in the charts. The total processing time is low when the tool is used on CIF videos. Figure 9.42 shows that all algorithms take just a few milliseconds on Akiyo. The result is similar for the other CIF videos.

Figure 9.42: Total processing time (ME and MC) on Akiyo using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.

On 1080p videos, the total processing time is much higher. Figures 9.43 to 9.46 show that none of the algorithms fall below the mark for FerryFar and FerryClose.

Figure 9.43: Total processing time (ME and MC) on FerryFar using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.44: Total processing time (ME and MC) on FerryFar using LAPTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.45: Total processing time (ME and MC) on FerryClose using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.46: Total processing time (ME and MC) on FerryClose using LAPTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.

Both test systems using 4SS or ARPS in combination with BDMC fall below the mark on Liseberg, as Figures 9.47 and 9.48 show.

Figure 9.47: Total processing time (ME and MC) on Liseberg using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.
Figure 9.48: Total processing time (ME and MC) on Liseberg using LAPTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC.

FBDS falls below the 50 ms mark, on all recorded 1080p videos, if the block size is changed from 4x4 extended to 8x8 extended on the DESKTOP test system. This is shown in Figures 9.49 to 9.51.

Figure 9.49: Total processing time (ME and MC) on FerryFar using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC. The block size used is 8x8 extended.
Figure 9.50: Total processing time (ME and MC) on FerryClose using DESKTOP after up-converting it from 10 to 30 FPS. The MC method used is BDMC. The block size used is 8x8 extended.
Figure 9.51: Total processing time (ME and MC) on Liseberg using DESKTOP after up-converting it from 10 to 30 FPS. Th