Learning Continuous Video Representation from Event Cameras A Local Implicit Function approach for video reconstruction and spatiotemporal superresolution Master’s thesis in Complex Adaptive Systems DAVID TONDERSKI DEPARTMENT OF PHYSICS CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2024 www.chalmers.se www.chalmers.se Master’s thesis 2024 Learning Continuous Video Representation from Event Cameras A Local Implicit Function approach for video reconstruction and spatiotemporal superresolution David Tonderski Department of Physics Chalmers University of Technology Gothenburg, Sweden 2024 Learning Continuous Video Representation from Event Cameras A Local Implicit Function approach for video reconstruction and spatiotemporal superresolution David Tonderski © David Tonderski, 2024. Supervisor: Dr. Valery Vishnevskiy, ETH Zurich Examiner: Prof. Giovanni Volpe, Department of Physics, University of Gothenburg Master’s Thesis 2024 Department of Physics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Overview of the proposed method. Event data is encoded into a continuous video representation, which can then be queried to generate intensity frames at any continuous timestamp and spatial resolution. Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2024 iv Learning Continuous Video Representation from Event Cameras A Local Implicit Function approach for video reconstruction and spatiotemporal superresolution David Tonderski Department of Physics Chalmers University of Technology Abstract Event cameras are biologically inspired sensors that operate differently from con- ventional cameras. Rather than measuring pixel intensities at fixed intervals, event cameras detect per-pixel intensity changes, offering high dynamic range, low latency, high temporal resolution, minimal motion blur, and low power consumption. How- ever, traditional computer vision algorithms cannot be applied to event data due to the radically different operating paradigm. One approach to bridge this gap is to reconstruct conventional images from event data. While this approach retains the high dynamic range and minimal motion blur, it does not fully capture the high temporal resolution of event cameras. In this thesis, we utilize Local Implicit Functions for spatiotemporal video recon- struction, aiming to preserve the high temporal resolution of event data as well as allow for the generation of videos with an arbitrary spatial resolution. We show that our method reaches reconstruction quality similar to comparable state-of-the- art approaches, and significantly outperforms simple baselines for spatial upscaling up to 3x. Our analysis also suggests that our representation retains the high tempo- ral resolution of event data. Additionally, our approach offers per-pixel uncertainty estimations, which have the potential to enhance the performance of downstream computer vision applications. Keywords: event cameras, reconstruction, superresolution, uncertainty quantifica- tion. v Acknowledgements I would like to thank my supervisor Valery Vishnevskiy for invaluable advice and great discussions throughout the project. Also, thank you to Sony Stuttgart Labo- ratory 1 for providing crucial resources. 
David Tonderski, Gothenburg, June 2024 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: APS Active-Pixel Sensor CNN Convolutional Neural Network ConvGRU Convolutional Gated Recurrent Unit ConvLSTM Convolutional Long Short-Term Memory ELU Exponential Linear Unit EPF Event per Pixel per Frame EVS Event-based Vision Sensor LL Log-Likelihood LPIPS Learned Perceptual Image Patch Similarity LPIPS_CORR LPIPS and Standard Deviation Correlation Index MLP Multi-Layer Perceptron MSE Mean Squared Error MVE Mean Variance Estimation PICP Prediction Interval Coverage Probability RCNN Recurrent Convolutional Neural Network ReLU Rectified Linear Unit SSIM Structured Similarity Index Measure ix Nomenclature Below is the nomenclature of indices, sets, parameters, and variables that have been used throughout this thesis. Indices i Index for sequence τ Index for event frame b Index for event bin Sets ψ Parameters of decoder ϕ Parameters of encoder ϵ Set of events S Subset of events Parameters M Number of images passed into simulator Nmax Maximum number of transformation control points B Number of temporal bins λMVE Weight of MVE loss γ Learning rate Niters Number of training iterations Variables xi x, y Pixel coordinates t Timestamp p Event polarity L Log intensity C Contrast threshold h Neuron state µ True mean σ True standard deviation µ̂ Predicted mean σ̂ Predicted standard deviation z Quantile of probability distribution ρ Spatial resolution A Affine matrix Nj Number of control points for transformation j α Rotation transformation values sx, sy Scaling transformation values dx, dy Displacement transformation values Cs, Cα Scaling and rotation normalization values p Control point value Lmin, Lmax Illuminance range k Image exponentiation multiplier E Event tensor z Latent code v Coordinate of latent code s Predicted signal x 3D spatiotemporal coordinate M (i) Feature map νϕ Encoder Cψ Decoder L Loss function xii Contents List of Acronyms ix Nomenclature xi List of Figures xv List of Tables xvii 1 Introduction 1 1.1 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Background 3 2.1 Event cameras . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . 5 2.2.2 Recurrent Neural Networks . . . . . . . . . . . . . . . . . . . 5 2.2.3 Implicit neural representations . . . . . . . . . . . . . . . . . . 5 2.2.4 Mean-Variance Estimation Networks . . . . . . . . . . . . . . 6 2.3 Related literature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Image reconstruction from event data . . . . . . . . . . . . . . 7 2.3.2 Spatiotemporal Video Superresolution . . . . . . . . . . . . . 7 3 Methods 9 3.1 Training data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 3.1.1 Ground truth generation . . . . . . . . . . . . . . . . . . . . . 10 3.1.2 Event generation . . . . . . . . . . . . . . . . . . . . . . . . . 12 3.2 Event Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3.3 Continuous Video Representation . . . . . . . . . . . . . . . . . . . . 14 3.4 Network Architecture and Training . . . . . . . . . . . . . . . . . . . 16 3.4.1 Loss . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.4.2 Training procedure . . . . . . . . . . . . . . . . . . . . . . . . 17 3.5 Evaluation . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . 18 4 Results 21 4.1 Simulated data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.1.1 Prediction without superresolution . . . . . . . . . . . . . . . 21 4.1.1.1 Intensity predictions . . . . . . . . . . . . . . . . . . 22 xiii Contents 4.1.1.2 Uncertainty predictions . . . . . . . . . . . . . . . . 25 4.1.2 Spatiotemporal superresolution . . . . . . . . . . . . . . . . . 30 4.1.2.1 Intensity prediction in spatial superresolution . . . . 30 4.1.2.2 Intensity prediction in temporal superresolution . . . 31 4.1.2.3 Uncertainty quantification in spatiotemporal super- resolution . . . . . . . . . . . . . . . . . . . . . . . . 32 4.2 Real data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.2.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.2.2 Sim-to-real gap . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5 Conclusion 41 Bibliography 43 xiv List of Figures 1.1 Overview of our method. Events are discretized and encoded by a RCNN into latent codes. These latent codes are then used by a MLP decoder to predict intensities and uncertainty estimates at spatiotem- porally continuous pixel coordinates. . . . . . . . . . . . . . . . . . . 2 3.1 Comparison of cubic and linear interpolation of control point values. . 12 3.2 Following LIIF [4], a video is represented as a 3D feature map and a function Cψ. The final output is predicted by trilinearly interpolating neighboring predictions. . . . . . . . . . . . . . . . . . . . . . . . . . 15 3.3 Visualization of the architecture we use for νϕ. We use kernel size 3 and hidden size 24 for all convolutional layers. Conv refers to a con- volutional layer with stride 1 and Conv↓ is a convolutional layer with stride 2. The empty circle represents concatenation. The ResBlock is visualized in figure 3.4. . . . . . . . . . . . . . . . . . . . . . . . . . . 16 3.4 Visualization of the residual block (ResBlock) we use in our architec- ture. The circle with a plus represents addition. . . . . . . . . . . . . 16 4.1 Examples of predictions generated by our networks with LPIPS (Alex) far above the average, with short explanations for the bad performance. 22 4.2 Examples of predictions generated by our networks with LPIPS (Alex) close to the average. . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3 2D heatmap of LPIPS versus EPF over all frames in the test dataset, with predictions generated without superresolution or interpolation. Data points are aggregated into 50 bins per dimension, where each bin represents a specific range of LPIPS and EPF values. Note that LPIPS is clipped to [0, 0.3] and EPF to [0, 5]. Additionally, statistical lines are plotted using 1D binning along the EPF axis. For each EPF bin, we show the mean (blue line) and standard deviation (dashed red line) of the LPIPS values of patches within that bin. . . . . . . . 24 4.4 Distribution of the LPIPS_CORR and PICP80 values of the example predictions shown in figure 4.5. . . . . . . . . . . . . . . . . . . . . . 25 4.5 Examples showcasing the complementary roles of the novel LPIPS_CORR and traditional LL and PICP80 uncertainty quantification metrics. The value ranges in the figure titles represent colors from black to white. For PICP80 plots, white pixels indicate µ is within the pre- diction interval. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
28 xv List of Figures 4.6 2D heatmap of LPIPS versus Median STD, visualizing the distribu- tion of patch LPIPS values against the median predicted STD over 16 by 16 patches over the test dataset. Data points are aggregated into 50 bins per dimension. Statistical lines were calculated as in figure 4.3. 29 4.7 Prediction with examples of patches with given median predicted standard deviations highlighted. . . . . . . . . . . . . . . . . . . . . . 29 4.8 Relation between LPIPS_CORR and Patch size. The figure was generated by calculating the LPIPS_CORR over the test dataset for patch sizes between 162 and 962, with a step of 4. The minimum evaluated patch size is due to LPIPS (AlexNet) needing inputs of size at least 162, whereas the maximum is chosen such that it is smaller than the smallest image in the test dataset. . . . . . . . . . . . . . . 30 4.9 LPIPS vs spatial superresolution scale, showing means and standard deviations over all test sequences. . . . . . . . . . . . . . . . . . . . . 31 4.10 LPIPS vs temporal superresolution scale, showing means and stan- dard deviations of LPIPS over test sequences. . . . . . . . . . . . . . 32 4.11 Spatiotemporal superresolution example. The x vs t plots have y = Y/2, and the x vs y plots have t = T/2. The green lines denote x = X/2. The red rectangle in the full image is the showcased area. . 33 4.12 Comparison of our method to FireNet+ and E2VID+ on sample scenes from the HQF and ECD datasets. . . . . . . . . . . . . . . . . 35 4.13 Spatiotemporal superresolution example on the desk scene from the HQF dataset. Note that ground truth intensities are not available for scales larger than 1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.14 Example prediction from the slow_hand scene from the HQF dataset. 37 4.15 The inference time required to generate one second of video output at various frame rates using E2VID, FireNET, and our method. The red dashed line indicates the threshold for real-time inference. . . . . 39 4.16 Example of failure when there are no input events, taken from the slow_hand scene in HQF. The timestamps of the three frames are shown by the red dashed lines in the Y-T panel. Lack of motion for only 6 event frames makes the encoder "forget" the scene. . . . . . . . 40 xvi List of Tables 3.1 Data generation parameters used for the training and test datasets. These values were chosen empirically by qualitatively comparing the generated sequences to the E2VID dataset. To exemplify these values, α = 2π represents a full 360◦ rotation of the image, sx = 2 represents a 2x horizontal "zoom-in", and tx = 1 represents translating the image center to its right edge. The test dataset contains slightly longer and more challenging sequences. . . . . . . . . . . . . . . . . . . . . . . . 12 4.1 Results on the test dataset using prediction without superresolution or temporal interpolation. . . . . . . . . . . . . . . . . . . . . . . . . 21 4.2 Means and standard deviations of key uncertainty metrics over the test dataset calculated by generating predictions at given scales, in- terpolating them to 4x upscaling, and evaluating against the ground truth. Arrows indicate the direction of better performance for each metric. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.3 Comparison of our method with several state-of-the-art RCNN recon- struction methods, as reported by Ercan et al. in [25]. LPIPS refers to the AlexNet version here. The best and second best scores are given in bold and underlined. . . . . . 
. . . . . . . . . . . . . . . . 34
4.4 Uncertainty performance on real data . . . . . . . . . . . . . . . . . 37
4.5 Inference time for one event frame on GPU and CPU at the spatial resolutions reported in [23]. Enc and Dec refer to the encoder νϕ and the decoder Cψ . . . . . . . . . . . . . . . . . . . . . . . 38

1 Introduction

Event cameras, also called Event-based Vision Sensors (EVS), are biologically inspired sensors with a radically different operating paradigm compared to conventional Active-Pixel Sensor (APS) cameras. Instead of measuring pixel intensities at a fixed rate, they measure per-pixel intensity changes independently and asynchronously. Their advantages include a high dynamic range (140 dB vs 60 dB of APS cameras), low latency and high temporal resolution (on the order of microseconds), low motion blur, and low power consumption (∼ 10 mW) [1]. However, the unique data acquisition method of event cameras also means that traditional Computer Vision techniques cannot be directly applied to them, and novel methods must be developed. Furthermore, event cameras usually have a limited spatial resolution. While high-resolution event cameras exist, Gehrig and Scaramuzza showed that they suffer from increased noise, which can cause lower performance on many vision tasks [2].

In the seminal paper E2VID [3], Rebecq et al. present a method for bridging the gap between traditional computer vision and event cameras by reconstructing intensity frames from event data using Recurrent Convolutional Neural Networks (RCNN). They show that their method is able to capture the advantages of event cameras by synthesizing videos of high-speed phenomena as well as providing high dynamic range reconstructions in challenging light conditions. They also show that traditional Computer Vision techniques can be directly applied to their reconstructions, outperforming methods specialized for event data.

Because methods like E2VID output reconstructions at a fixed framerate, they do not retain the high temporal resolution of event cameras1. Furthermore, such methods can cause variability in the performance of downstream tasks which cannot be accounted for by the tasks themselves. For example, if the reconstruction algorithm fails to reconstruct a face, then a face detection algorithm applied to the reconstruction would have a high confidence that a face is not present in the data.

1Although very high framerates can be achieved with E2VID, this is done by re-running the method multiple times.

In this work, we propose a method to reconstruct videos of arbitrary spatiotemporal resolution from event data. To achieve this, we draw inspiration from Chen et al. [4], who propose a method to learn a continuous image representation. First, a Convolutional Neural Network encodes the input image into a set of spatially distributed latent codes. A Multi-Layer Perceptron (MLP) then predicts the pixel intensity at any continuous image coordinate based on these latent codes, allowing image generation at arbitrary resolutions. Similarly to their follow-up method VideoINR [5], we extend this concept to the temporal dimension. We use a RCNN to encode event data into spatiotemporally distributed latent codes. Then, we use an MLP to predict image intensities and their variances at continuous spatiotemporal coordinates, enabling the reconstruction of videos at any spatiotemporal resolution.
Furthermore, we quantify the uncertainty inherent to our reconstructions by using a Mean Variance Estimation (MVE) network [6]. The MVE network is designed to predict the mean and variance of the target distribution at each pixel, assuming the errors follow a normal distribution. In practice, this means that we get a pixel-by- pixel measure of the network uncertainty. We show an overview of our method in figure 1.1. Figure 1.1: Overview of our method. Events are discretized and encoded by a RCNN into latent codes. These latent codes are then used by a MLP decoder to predict intensities and uncertainty estimates at spatiotemporally continuous pixel coordinates. 1.1 Limitations Although intensity reconstruction from event data is frequently considered in the context of applying computer vision algorithms to the reconstructions, we do not investigate this. We also do not investigate how the predicted uncertainties could be used to improve the performance of such algorithms. When evaluating spatiotem- poral superresolution, we do not provide quantitative results on real data, as we did not find a suitable dataset with high FPS and high resolution intensity images. We also do not compare the superresolution performance of our method to other methods. 2 2 Background This section introduces event cameras and deep learning, and provides an overview of related literature. 2.1 Event cameras In contrast to standard cameras, which output the brightness of all pixels at a fixed rate, event cameras respond to brightness changes asynchronously and indepen- dently for each pixel [1]. They output a stream of events, with each event consisting of a discrete pixel coordinate (x, y), a continuous timestamp t, and a binary polarity p. Each event represents a change of brightness of a pixel at a specific time, and the polarity determines whether the brightness increased or decreased. In their survey, Gallego et al. identify four key advantages of event cameras [1]: 1. Temporal resolution - events are read with a clock with a frequency on the order of 1 MHz [1], so events are captured with a temporal resolution on the order of microseconds. This lets event cameras capture very fast motion without motion blur. 2. Low latency - since each pixel transmits events asynchronously and indepen- dently, event cameras have latencies as low as 15 µs [7], making them especially suitable for real-time systems. 3. Low power - event cameras do not process redundant brightness data, and they do not consume power unless the brightness changes. This leads to a drastically lower power consumption compared to traditional cameras (on the order of milliwatts instead of watts [2]. 4. High dynamic range - because the pixel photoreceptors operate in a logarithmic scale, event cameras have a drastically higher dynamic range than standard cameras (120 dB compared to 60 dB). However, because the data they generate is fundamentally different than regular cameras, traditional computer vision algorithms are not directly applicable to event 3 2. Background cameras. Furthermore, event cameras are inherently noisier than traditional cameras [1]. Therefore, novel methods are required to extract information from event data. Due to these trade-offs, event cameras are typically used in situations requiring low power consumption, low latency, and robustness to varying lighting conditions. Examples include wearable electronics [1], robotics [8]–[10], tactile sensing [11], [12], and high-speed control [13], [14]. 
Each pixel in an event camera responds to changes in log intensity L = log(I). In an ideal, noise-free scenario, an event ek = (xk, yk, tk, pk) is emitted at pixel (xk, yk) and time tk when the change in log intensity ∆L(xk, yk, tk) = L(xk, yk, tk)− L(xk, yk, tk −∆tk) (2.1) surpasses a contrast threshold C, i.e. ∆L(xk, yk, tk) = pkC, (2.2) where C > 0, and pk ∈ {−1,+1} indicates the direction of the brightness change. The contrast threshold C can be controlled by the user. In practice, the effective contrast thresholds vary from pixel to pixel due to circuit noise and sensor non- idealities, and follow a distribution with an average value of C [1], [15]. 2.2 Deep Learning Deep Learning is a computational methodology that allows algorithms to learn from observational data. It employs structures known as Artificial Neural Networks (ANNs), which mimic the interconnected neurons in the brain. The fundamental unit of these networks is the neuron, which processes incoming data and passes it on to subsequent neurons based on the strength of connections, determined by the weights of the ANN. These neurons are organized into layers, with each layer’s neu- rons receiving inputs only from the preceding layer. This organization enables each layer to transform its input data into increasingly abstract representations; for in- stance, initial layers in a computer vision model might detect simple features such as edges or lines, while deeper layers might recognize complex objects. Crucially, dur- ing the training process, the network learns to extract these features autonomously. The weights are initially set randomly and are subsequently updated to minimize a specific objective function, typically through gradient descent. Deep learning has made a significant impact across various domains, ranging from healthcare to autonomous driving. In healthcare, it can enhance medical diagnostics by analyzing medical images, such as detecting tumors from MRIs. In autonomous driving, it processes real-time sensor data, helping vehicles navigate safely by recog- nizing objects and predicting other road users’ behaviors. In language processing, it enables applications like ChatGPT to understand and generate human language. 4 2. Background 2.2.1 Convolutional Neural Networks Convolutional Neural Networks (CNNs) are specialized in processing data with grid- like topologies, such as images. They differ significantly from traditional fully con- nected neural network architectures. Unlike fully connected layers, where every neuron is connected to every neuron in the previous layer, convolutional layers are structured such that each neuron is only connected to a small region of the previous layer. This design is inspired by the biological processes in the human visual cor- tex. By focusing on local regions, convolutional layers can more effectively capture spatial hierarchies. Additionally, the weights are shared across all neurons in each layer, improving both learning and computational efficiency. This shared weight approach also provides translation invariance, enabling the network to recognize features regardless of their position in the image. These characteristics make CNNs particularly powerful for tasks like image and video analysis. 2.2.2 Recurrent Neural Networks Recurrent Neural Networks (RNNs) are neural networks designed to handle sequen- tial data by maintaining a form of memory. 
Unlike traditional neural networks, RNNs have connections that form directed cycles, allowing information to persist across time steps. This architecture is particularly well-suited for tasks such as language modeling, time series prediction, and speech recognition. In an RNN, the state of neurons is given by hl,t = f(hl,t−1, hl−1,t), (2.3) where hl,t denotes the state of neurons at layer l and timestamp t. The function f depends on the network architecture. In this thesis, we use Gated Recurrent Units (GRUs). GRUs are a type of RNN architecture designed to address some of the limitations of standard RNNs, particularly the vanishing gradient problem, which makes it difficult for the network to learn long-term dependencies in sequential data. GRUs introduce gating mechanisms that control the flow of information from previous time steps, enabling the network to maintain memory more effectively. GRUs use two gates: the update gate and the reset gate. The update gate determines how much of the past information needs to be passed along to the future, while the reset gate decides how much of the past information to forget. These gates allow the GRU to keep relevant information from previous time steps and discard irrelevant information, making it easier to capture long-term dependencies. 2.2.3 Implicit neural representations Implicit Neural Representations (INRs), also known as coordinate-based representa- tions, present a novel method for parameterizing continuous signals [16]. Standard signal representations of signals are usually discrete - for example pixel grids in im- ages, amplitude samples in audio, or voxel grids and meshes in 3D models. INRs, by 5 2. Background contrast, parametrize these signals as continuous functions from the signal domain (for example, a 2D pixel coordinate) to its value (e.g. RGB color). Since these continuous functions are not analytically tractable, they are instead approximated by a neural network. INRs are having an impact across various domains due to their unique advantages and applications. One of their most notable benefits is memory efficiency, allowing them to represent complex signals without the large memory footprint required by traditional methods like pixel or voxel grids. Additionally, INRs are not constrained by fixed resolutions; they can generate data at any desired resolution. A key appli- cation of INRs is in the creation of Neural Radiance Fields (NeRFs), which are used for reconstructing and rendering intricate 3D scenes from sparse and irregularly sam- pled images. NeRFs employ a neural network to map spatial and directional inputs to color and density, enabling photorealistic renderings from novel viewpoints. 2.2.4 Mean-Variance Estimation Networks A popular way of quantifying the uncertainty of neural networks is through Predic- tion Intervals (PI), as they are one of the most understandable uncertainty prediction mechanisms [17]. Nix and Weighed proposed [6] to construct these intervals through Mean-Variance Estimation networks. They build on the assumption that y|x ∼ N (µ, σ2), (2.4) where µ and σ2 are the mean and variance of the conditional distribution of target variable y given an input x. In a traditional regression task, we would then train a neural network fθ to output estimates µ̂(x) of the true mean µ(x). In MVE networks, the network additionally outputs an estimate σ̂ estimating the true variance σ of that distribution. 
Then, the 100(1 − α)% prediction interval can be formed as

ŷ(x) ± z_{1−α/2} · σ̂(x), (2.5)

where z_{1−α/2} is the (1 − α/2) quantile of the N(0, 1) probability distribution. These networks are usually trained through Maximum Likelihood Estimation [17]. We begin by expressing the log likelihood of the targets given the inputs and a network fθ as

ln p(µi | xi, fθ,µ, fθ,σ) = ln[ (1 / √(2πσ̂²(xi))) · exp(−(µi − µ̂(xi))² / (2σ̂²(xi))) ]
                         = −(1/2) ln(2π) − (1/2) ln(σ̂²(xi)) − (µi − µ̂(xi))² / (2σ̂²(xi)). (2.6)

Then, ignoring constants, we can define the loss function to be minimized as

LMVE(µ̂, σ̂, µ) = Σi [ (1/2) ln(σ̂i²) + (µi − µ̂i)² / (2σ̂i²) ], (2.7)

which is precisely the loss function used in MVE networks [6].

2.3 Related literature

2.3.1 Image reconstruction from event data

Events can be interpreted as providing information about the derivative of the log intensity. Assuming small ∆tk, we can use equation 2.2 to approximate the derivative as

∂L/∂t (xk, yk, tk) ≈ pkC / ∆tk. (2.8)

This interpretation has been used to reconstruct intensity images in algorithms such as [18]–[21]. However, these approaches can suffer from loss of detail, visual artifacts, and reliance on hand-crafted priors [3].

In recent years, the state of the art has shifted to reconstructing intensity images using deep learning. Rebecq et al. [3] first proposed the E2VID model, which uses a RCNN to achieve a significant performance improvement compared to hand-crafted methods. Scheerlinck et al. [22] then significantly reduced the network complexity with only a minor drop in prediction quality through the FireNET model. Then, Stoffregen et al. [23] showed that the performance of these methods is limited by the quality of the simulated data used to train them. By increasing the realism of the data, they provided the E2VID+ and FireNET+ models with significantly improved performance compared to the base models. Wang et al. [24] shifted from a convolutional to an attention-based architecture, improving reconstruction quality at the cost of inference time and computational complexity. Finally, Ercan et al. [25] provided a comprehensive framework to evaluate and compare these methods.

2.3.2 Spatiotemporal Video Superresolution

Spatiotemporal Video Superresolution (STSVR) aims to simultaneously increase the spatial and temporal resolution of videos. This is related to our method even though the input is different (events instead of videos), as we aim to provide a video representation of arbitrary spatiotemporal resolution. While researchers have tackled this problem with traditional algorithms [26], [27], they have lately started to employ deep learning-based solutions. In Zooming Slow-Mo [28], Xiang et al. proposed a method for temporal interpolation of the missing frame features, followed by a ConvLSTM for aligning and aggregating information before frame reconstruction. Haris et al. [29] leverage the relationship between space and time by integrating optical flow into their method. Chen et al. [5] leveraged INRs to construct a continuous video representation that can be queried at any spatial and temporal resolution. This is similar to our work, although they leverage optical flow and only consider information within consecutive frames.

3 Methods

3.1 Training data

Our method requires training data in the form of pairs of event sequences and corresponding ground-truth image sequences. While such datasets do exist (e.g.
[23], [30]), we also require the ability to generate ground truth images at continuous resolutions and timestamps. Because of this we train the network using synthetically generated data, and later show that it generalizes to real data. Our approach is inspired by E2VID [3], where they map MS-COCO images to a plane in a 3D space and simulate events by moving a camera randomly in this 3D space using the ESIM event simulator [31]. However, the ESIM simulator does not realistically model sensor noise and non-idealities [31], which hinders generalization to real data. Instead, we use a proprietary, physically accurate event simulator provided by Sony Semiconductor Solutions. It accurately models the sensor circuitry and accounts for non-idealities such as transistor mismatch, background activity, circuit block frequency responses, refractory periods, temperature dependence, and read-out delays [15]. However, this simulator has a different operation principle than the ESIM simulator. Instead of simulating events directly from a 3D scene, it converts sequences of grayscale images into events. To train our network, we require event sequences ϵ(i) = {e(i) 1 , ..., e (i) K } generated at some resolution ρmin ∈ N2 as well as a method to generate corresponding ground truth intensities µ(i)(t, x, y) at continuous timestamps t and spatial coordinates x, y. To achieve this, we first collect a set of grayscale images I(i) of some resolutions ρ(i) max. Specifically, we draw 550 random images (500 for the training set, 50 for the test set) from the MS-COCO dataset [32], convert them to grayscale, and normalize them to [0, 1]. Then, for each image, we define temporally continuous affine transforms f (i) affine(t), thus yielding y(i)(t, ρ(i) max) = f (i) affine(t), (3.1) where y(i)(t, ρ(i) max) is an image of resolution ρ(i) max ∈ R2 containing the ground truth intensities. To get a pixel intensity at a continuous spatiotemporal coordinate (t∗, x∗, y∗), we can simply cubically interpolate the intensities in y(i) max(t∗, ρ(i) max) at coordinates (x∗, y∗). This gives us images at any resolution, so why do we need generate events at a lower resolution ρmin? If we didn’t, our spatial superresolu- 9 3. Methods tion quality would be limited to at best cubic interpolation. Thus, we must choose some maximum scaling factor, and then generate sequences ϵ(i) at a lower resolution ρ (i) min = ρ (i) max Smax : ϵ(i) ← Simul(G(i)(y(i)(t1, ρ(i) min)), ..., G(i)(y(i)(tM , ρ(i) min))), (3.2) where M denotes the total number of images passed into the simulator, G(i) is a stochastic pre-processing function, and y(i)(tm, ρ(i) min) is generated by cubic down- sampling. 3.1.1 Ground truth generation First, we must define f (i) affine(t) for each sequence i. We do this by constructing temporally continuous affine matrices A(i)(t) as follows: 1. For each sequence i, we define transformation control points for rotation N (i) α , displacement N (i) d , and scale N (i) s . First, we sample the number of control points from a uniform integer distribution N (i) j ∼ U{2, Nmax}, where j represents each type of transformation. 2. At each control point k ∈ {1, ..., N (i) j }, transformation values are sampled as follows: s (i) x,k, s (i) y,k ∼ U [0, smax], α (i) k ∼ U [0, αmax], d (i) x,k, d (i) y,k ∼ U [−dmax, dmax], 3. 
These values are then further scaled and normalized sequence-wise:

ŝ(i)x,k = 1 + C(i)s (s(i)x,k − s(i)x,1),
ŝ(i)y,k = 1 + C(i)s (s(i)y,k − s(i)y,1),
α̂(i)k = C(i)α (α(i)k − α(i)1),
d̂(i)x,k = d(i)x,k − d(i)x,1,
d̂(i)y,k = d(i)y,k − d(i)y,1,

where C(i)s ∼ U[0, 1] and C(i)α ∼ U[0, 1]. The additional scaling was derived empirically by qualitatively comparing the motion in the generated sequences to the motion in the E2VID dataset. Intuitively, it makes smaller rotation and scaling values more likely, while keeping the possible value range intact. The purpose of the subtractions and additions is to make sure that the first image of the sequence is exactly the original image I(i).

4. To generate the transformation values for a timestamp t, we first associate the control points for transformation type j with the timestamps

t(i)j,k = (k − 1) / (N(i)j − 1). (3.3)

Then, we interpolate the control point values using either linear or cubic interpolation, with a 1/2 probability of either interpolation type, as we want to learn both smooth and sharp motion changes, see figure 3.1. For linear interpolation of a control value p(i)j at some time t between control points k∗ and k∗ + 1, the interpolated value p(i)j(t) is given by

p(i)j(t) = p(i)j,k∗ + [(t − t(i)j,k∗) / (t(i)j,k∗+1 − t(i)j,k∗)] · (p(i)j,k∗+1 − p(i)j,k∗). (3.4)

For cubic interpolation, the formulas are more complex and will not be shown here. We use the interp1d function from the scipy library [33].

5. Next, we generate the affine matrices for each sequence i, timestamp t, and transformation type j. Note that we compute the transformed images using the affine_grid and grid_sample methods from the Pytorch library [34]. These methods expect the affine matrices to define a mapping from the output image to the input image, so the transformation matrices are inverted compared to their standard form:

S(i)(t) = [ 1/ŝ(i)x(t)   0            0
            0            1/ŝ(i)y(t)   0
            0            0            1 ],  (3.5)

D(i)(t) = [ 1   0   −d̂(i)x(t)
            0   1   −d̂(i)y(t)
            0   0   1 ],  (3.6)

R(i)(t) = [ cos(α̂(i)(t))    sin(α̂(i)(t))   0
            −sin(α̂(i)(t))   cos(α̂(i)(t))   0
            0                0               1 ].  (3.7)

The affine transformation matrix is then given by A(i)(t) = S(i)(t) · D(i)(t) · R(i)(t).

The matrices A(i)(t) can then be used to generate y(i)(t, ρ(i)max). First, we use affine_grid to calculate a 2D sampling grid c(i)(t) ∈ R^(X×Y×2), which maps each output pixel to a continuous input coordinate. Note that [−1, 1] denotes the image edges. Symmetric extrapolation is implemented by setting c(i)sym(t) = ((c(i)(t) + 1) mod 2) − 1. We then generate the output frame y(i)(t, ρ(i)max) by feeding I(i) and c(i)sym(t) into grid_sample with cubic interpolation. Note that for efficiency, one can generate µ(i)(t, x, y) by interpolating c(i)sym(t) directly, and then feeding the resultant value into grid_sample.

Figure 3.1: Comparison of cubic and linear interpolation of control point values.

3.1.2 Event generation

As stated earlier, the input to the event simulator is in the form of image sequences, see equation 3.2. The last two steps consist of defining the timestamps tm and the pre-processing function G(i), whose purpose is to help with generalization across different sensors and sensor parameters.
We define it to consist of randomly exponentiating the image, then normalizing it, and finally simulating illuminance ranges by scaling it as follows:

G(i)(x) = L(i)min + norm[0,1](exp(k(i) · x)) · (L(i)max − L(i)min), (3.8)

where Lmin ∼ U[450, 850], Lmax ∼ U[10 000, 15 000], and k ∼ U[0.7, 2.5]. We also define tm, m ∈ {1, ..., M} as

tm = (m − 1) / (M − 1). (3.10)

Finally, we use the parameters in table 3.1 to generate the training and test datasets.

Table 3.1: Data generation parameters used for the training and test datasets. These values were chosen empirically by qualitatively comparing the generated sequences to the E2VID dataset. To exemplify these values, α = 2π represents a full 360◦ rotation of the image, sx = 2 represents a 2x horizontal "zoom-in", and tx = 1 represents translating the image center to its right edge. The test dataset contains slightly longer and more challenging sequences.

        M     Nmax  smax  αmax  dmax  Smax
Train   1000  10    0.7   1.7   0.6   4
Test    1500  15    1     2     0.75  6

3.2 Event Representation

Before events ϵ can be processed by a RCNN, they have to be converted to a fixed-size tensor E. We use the standard approach of encoding events into a spatiotemporal voxel grid [3]. The input events are split into T temporal frames, each corresponding to some timespan (tτ−1, tτ). Note that this timespan can be of different length for each frame τ. For example, E2VID proposes creating frames such that the number of events in each frame is constant. In real datasets where APS frames are available, the event frames are usually fixed such that their edges coincide with the APS frames. In our approach, we use frames of a fixed temporal width, such that tτ − tτ−1 is constant. We use Ttrain = 100 and Ttest = 150 for the training and test datasets.

Each frame is then further split into B temporal bins of width

δtτ,b = (tτ − tτ−1) / B, (3.11)

such that each bin b ∈ {1, ..., B} corresponds to the time interval

Iτ,b = [tτ−1 + (b − 1)δtτ,b, tτ−1 + b·δtτ,b]. (3.12)

This allows us to form voxel grids of dimensionality R^(T×B×X×Y), where each voxel index (τ, b, x, y) corresponds to the events

Sτ,b,x,y = {ei ∈ ϵ | xi = x, yi = y, ti ∈ Iτ,b}. (3.13)

Then, we form four voxel grids as follows:

Epolarity(τ, b, x, y) = Σ_{i: ei ∈ Sτ,b,x,y} pi, (3.14)

Emean(τ, b, x, y) = (1 / |Sτ,b,x,y|) Σ_{i: ei ∈ Sτ,b,x,y} f̂τ,b(ti), (3.15)

Estd(τ, b, x, y) = sqrt( (1 / |Sτ,b,x,y|) Σ_{i: ei ∈ Sτ,b,x,y} (f̂τ,b(ti) − Emean(τ, b, x, y))² ), (3.16)

Ecount(τ, b, x, y) = |Sτ,b,x,y|, (3.17)

where

f̂τ,b(t) = (t − (tτ−1 + (b − 1)δtτ,b)) / δtτ,b (3.18)

normalizes the timestamp to [0, 1] within a given frame τ and bin b. Note that if Sτ,b,x,y is empty, we set the corresponding values to 0. Semantically, Emean is the average normalized timestamp of events in a voxel, Estd is the standard deviation, and Ecount is the number of such events. We then form the final tensor E ∈ R^(T×4B×X×Y) by concatenating these voxel grids along the second dimension.

The standard approach (e.g. [3], [24]) is to only include information about the polarity within a bin, similar to our Epolarity. However, in these methods, the neural network only outputs the intensity at the end of every frame. In our approach, we can query the intensity at any timestamp t, so we benefit from the additional temporal information provided by Emean and Estd. Our experiments confirm this, and also show that including Ecount is beneficial. Note that the discretization step is much faster than running the neural network, so we lose little efficiency by including this extra information.
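As a concrete illustration of this discretization, the sketch below builds the four voxel grids of equations (3.14)–(3.17) with NumPy. The thesis implementation is not published, so the flat-array input format, the (H, W) spatial ordering, and the boundary handling are our assumptions.

```python
import numpy as np

def build_event_tensor(x, y, t, p, T, B, H, W, t_start, t_end):
    """Build the T x 4B x H x W event tensor of Section 3.2.

    x, y, t, p are 1D arrays of event coordinates, timestamps and polarities
    (p in {-1, +1}); frames have a fixed temporal width (t_end - t_start) / T.
    """
    frame_width = (t_end - t_start) / T
    bin_width = frame_width / B

    # Frame index, bin index, and normalized within-bin timestamp per event.
    rel = t - t_start
    tau = np.clip((rel // frame_width).astype(int), 0, T - 1)
    b = np.clip(((rel - tau * frame_width) // bin_width).astype(int), 0, B - 1)
    f_hat = (rel - tau * frame_width - b * bin_width) / bin_width          # Eq. (3.18)

    flat = np.ravel_multi_index((tau, b, y.astype(int), x.astype(int)), (T, B, H, W))
    n = T * B * H * W
    shape = (T, B, H, W)

    count = np.bincount(flat, minlength=n).reshape(shape).astype(float)   # Eq. (3.17)
    polarity = np.bincount(flat, weights=p, minlength=n).reshape(shape)   # Eq. (3.14)
    t_sum = np.bincount(flat, weights=f_hat, minlength=n).reshape(shape)
    t_sq = np.bincount(flat, weights=f_hat ** 2, minlength=n).reshape(shape)

    safe = np.maximum(count, 1.0)
    mean = t_sum / safe                                                    # Eq. (3.15)
    std = np.sqrt(np.maximum(t_sq / safe - mean ** 2, 0.0))                # Eq. (3.16)
    mean[count == 0] = 0.0   # empty voxels are set to 0, as in the text
    std[count == 0] = 0.0

    # Concatenate the four grids along the bin dimension: T x 4B x H x W.
    return np.concatenate([polarity, mean, std, count], axis=1)
```

The scatter-add via np.bincount keeps the discretization cheap relative to the network forward pass, which is the efficiency argument made above.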
3.3 Continuous Video Representation As opposed to the approach presented in VideoINR [5], our representation treats the temporal dimension similarly to the spatial dimensions. It can be seen as a straight- forward extension of LIIF [4] from the 2D spatial case to the 3D spatiotemporal case. Therefore, we attempt to closely follow their notation and derivation here. We represent each continuous video V (i) as a 3D feature map M (i) ∈ RT×X×Y×D. All videos share a decoding function Cψ which takes the form s = Cψ(z,x), (3.19) where z ∈ RD, x ∈ R3 is a 3D continuous video coordinate, and s ∈ S is the predicted signal. In our case, s consists of two scalars - the intensity and the variance, so S = R2, and the range of x is [0, T ], [0, X − 1], and [0, Y − 1]. Fixing Cψ, a vector z can then be seen as representing a function Cψ(z, ·) : R3 → S. Next, we assign evenly distributed coordinates v to the T ×X × Y latent codes in M (i) as follows: vt,x,y = t− 1 x− 1 y − 1  . (3.20) Then, the predicted values at continuous coordinate xq are defined by V (i)(xq) = Cψ(z∗,xq − v∗), (3.21) where z∗ is the latent code nearest to xq, and v∗ is the coordinate of z∗. For example, in figure 3.2, z∗ = z∗ 100, and v∗ is its coordinate. Thus, each latent code z in M (i) represents a volume of the continuous video surrounding its coordinates, and is responsible for predicting the signal for coordinates in that volume. 14 3. Methods Figure 3.2: Following LIIF [4], a video is represented as a 3D feature map and a function Cψ. The final output is predicted by trilinearly interpolating neighboring predictions. Continuing to follow LIIF, we also employ feature unfolding and prediction interpo- lation (which they call local ensembles). Feature unfolding refers to the concatenation of the 3 × 3 × 3 neighboring latent codes in M (i) to form M̂ (i): M̂ (i) t,h,w = Concat({Mt+i,h+j,w+k}i,j,k∈{−1,0,1}), (3.22) where M (i) is zero-padded. Following feature unfolding, M̂ (i) replaces M (i) for all computations. This is similar to a 3 × 3 convolutional kernel, and the idea is that the additional spatial information is helpful when decoding the latent code. Prediction interpolation addresses a problem in equation 3.21, namely the possibil- ity of discontinuities in the predicted signal. Specifically, there exist planes where the selection of the nearest latent code switches, and the predictions for two in- finitesimally close coordinates can be different as long as M (i) or Cψ are not perfect. In practice, this effect is especially noticeable in the temporal dimension, as the predicted videos show a discontinuous jump when the latent code switches. To circumvent this, we predict the signal at xq using the surrounding eight latent codes z∗ thw, where t, h, w ∈ {0, 1}, see figure 3.2. Then, we assign each prediction the coordinates of the corresponding latent code, and use trilinear interpolation to compute the final prediction for xq. This method makes the transition between nearest latent codes z∗ smooth, and the predictions continuous. For example, the 15 3. Methods prediction for a coordinate exactly in the middle between two latent codes is the average of the two predictions generated by them. Note that with the latent vector coordinates defined in 3.20 and the given range of x, all coordinates x are spatially surrounded by latent codes. However, the temporal range of x is [0, T ], whereas the largest latent code temporal coordinate is T − 1. 
Therefore, for x such that x0 ≥ T − 1, we only use the four latent codes z∗0hw, and use bilinear interpolation to get the final prediction.

3.4 Network Architecture and Training

Our method employs two neural networks: an RCNN νϕ, which encodes event tensors E ∈ R^(T×4B×X×Y) into feature maps M ∈ R^(T×X×Y×D), and an MLP Cψ, which decodes unfolded latent codes z ∈ R^(27D) and relative coordinates xrel ∈ [−1, 1]³ into predicted intensities and variances ŷ ∈ R². The MLP is a small model with 4 hidden layers and a hidden size of 128.

The architecture of νϕ is inspired by the two most popular event-to-image reconstruction architectures, E2VID [3] and FireNet [22]. The exact setup is visualized in figure 3.3. Like FireNet, we replace the ConvLSTM layers in the E2VID architecture with the more efficient ConvGRU [35] layers. We also use instance normalization [36] instead of batch normalization and Exponential Linear Units (ELU) [37] instead of ReLUs.

Figure 3.3: Visualization of the architecture we use for νϕ. We use kernel size 3 and hidden size 24 for all convolutional layers. Conv refers to a convolutional layer with stride 1 and Conv↓ is a convolutional layer with stride 2. The empty circle represents concatenation. The ResBlock is visualized in figure 3.4.

Figure 3.4: Visualization of the residual block (ResBlock) we use in our architecture. The circle with a plus represents addition.

3.4.1 Loss

Simply using LMVE as shown in equation 2.7 leads to low quality reconstructions. This is because with a fixed σ̂, the function LMVE ∼ (µi − µ̂i)², which is simply the per-pixel mean squared error (MSE). Johnson et al. showed that replacing MSE with a perceptual loss function leads to significantly better performance in image transformation tasks [38]. Instead of minimizing the per-pixel difference between the prediction and the target, perceptual losses aim to minimize the difference in features extracted by passing the prediction and target images through pre-trained CNNs. We define our loss function as

L(µ̂, σ̂, µ) = λMVE · LMVE(µ̂, σ̂) + LLPIPS_VGG(µ̂), (3.23)

where LLPIPS_VGG is the Learned Perceptual Image Patch Similarity metric based on the VGG network [39]. By comparing the value ranges of the two loss functions, we find that λMVE = 0.2 works well.

Furthermore, we do not back-propagate the gradients of LMVE through µ̂. The rationale behind this is two-fold:

1. MVE networks are notorious for convergence difficulties due to the coupling of the mean and variance [40]. Our approach lets us naturally decouple the two, leading to stable training.

2. Without this gradient manipulation, our loss function could be seen as a weighted sum of LPIPS and MSE, whereas [38] showed that purely perceptual losses lead to better results.

Finally, to avoid issues with predicted negative standard deviations, we define the network output as σ̂′ = ln σ̂. Thus, the final loss function becomes

Lfinal(µ̂, σ̂′, µ) = λMVE · LMVE(stop_grad[µ̂], exp[σ̂′]) + LLPIPS_VGG(µ̂). (3.24)
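To make the gradient handling concrete, the sketch below shows one way equation (3.24) could be implemented in PyTorch with the lpips package. The tensor shapes, the [0, 1] value range, and the grayscale-to-RGB conversion for LPIPS are our assumptions; the thesis does not publish its implementation.

```python
import torch
import lpips  # pip install lpips

lpips_vgg = lpips.LPIPS(net='vgg')  # perceptual term of Eq. (3.23)
LAMBDA_MVE = 0.2

def final_loss(mu_hat, sigma_log, mu):
    """Sketch of Eq. (3.24).

    mu_hat, mu: predicted and ground-truth intensities in [0, 1], shape (N, 1, H, W).
    sigma_log: predicted log standard deviation sigma' = ln(sigma), same shape.
    """
    # MVE term (Eq. 2.7 with sigma = exp(sigma')). The mean prediction is detached
    # so that the MVE gradients only train the variance output, as described above.
    mu_detached = mu_hat.detach()
    var = torch.exp(2.0 * sigma_log)
    l_mve = (sigma_log + (mu - mu_detached) ** 2 / (2.0 * var)).mean()

    # Perceptual term: LPIPS expects 3-channel inputs scaled to [-1, 1].
    to_lpips = lambda img: img.repeat(1, 3, 1, 1) * 2.0 - 1.0
    l_perc = lpips_vgg(to_lpips(mu_hat), to_lpips(mu)).mean()

    return LAMBDA_MVE * l_mve + l_perc
```

Because the MVE term only receives gradients through sigma_log, this sketch reproduces the decoupling of mean and variance discussed in the two points above.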
3.4.2 Training procedure

Our optimization problem can be formulated as

min_{ϕ,ψ} Σi Σ_{τ≤T} Lfinal( Cψ(νϕ(E(i)), τ + δtτ − 1, ρ(i)min · s), y(i)((τ + δtτ − 1)/T, ρ(i)min · s) ), (3.25)

where δtτ ∼ U[0, 1] and s ∼ U[1, 4]. Note that the term (τ + δtτ − 1)/T is a remapping from the domain of τ + δtτ ([1, T + 1]) to the domain used to generate ground truths ([0, 1]). By νϕ(E(i)), we denote the feature map M(i) ∈ R^(T×X×Y×D) generated by encoding the event tensor E(i) with the encoder νϕ. By Cψ(M(i), t, ρ), we denote the image generated from the feature map M(i) at resolution ρ and timestamp t ∈ [0, T].

The exact training procedure is described in algorithm 1. For simplicity, we omit batches in the algorithm, but the networks are trained with batch size 2. Note that the loop over the pixels in yτ,crop is vectorized, which significantly improves training efficiency at the cost of GPU memory usage. We use Niters = 3 · 10^5, and optimize the network weights using the Adam optimizer [41] with learning rate γ = 0.0005 and other settings left at their Pytorch defaults. Training takes around 72 hours on one NVIDIA RTX 6000 ADA 48 GB. The batch and crop sizes were chosen such that they fit on the GPU.

Note that even though our representation treats the spatial dimensions very similarly to the temporal dimension, our training algorithm treats them completely differently. This is due to the way we generate ground truths. We first generate the transformation matrix A(t), which is then used to generate a grid mapping each output pixel to an input coordinate. This is computationally expensive, so in our training algorithm, we only do it once per event frame. While one could generate a set of fully independent spatiotemporal coordinates, it would be significantly less efficient, as well as harder to implement, visualize, and debug. However, we suspect that constraining the temporal coordinate in this way might be contributing to our very long training times, and further investigation is needed.

Algorithm 1 Training algorithm
1: Input: Event tensors E(i), ground-truth generating functions y(i), RCNN νϕ, and MLP Cψ
2: Output: Trained network weights ψ and ϕ
3: for iter = 1 to Niters do
4:   sample i ∼ U{1, 500}
5:   sample a frame index τstart ∼ U{1, T − 15}
6:   sample a spatiotemporal crop E ∈ R^(16×4B×56×56) from E(i), with the first frame corresponding to τstart
7:   sample prediction resolution ρ ∼ U{ρ(i)min,1, ρ(i)max,1} × U{ρ(i)min,2, ρ(i)max,2}
8:   encode E using νϕ to get Miter ∈ R^(16×D×56×56)
9:   loss = 0
10:  for τ = 1 to 16 do
11:    sample prediction timestamp tτ = (τstart − 2 + τ + δt)/T, where δt ∼ U[0, 1]
12:    generate yτ = y(i)(tτ, ρ(i)max)
13:    interpolate yτ at spatial coordinates corresponding to the crop in E at resolution ρ, generating yτ,crop
14:    for each pixel x, y in yτ,crop do
15:      predict µ̂x,y and σ̂′x,y from Miter using Cψ, see section 3.3
16:      loss += Lfinal(µ̂x,y, σ̂′x,y, yτ,crop[x, y])
17:    end for
18:  end for
19:  backpropagate loss and update weights ψ, ϕ using the Adam optimizer
20: end for

3.5 Evaluation

Following academic standards in image reconstruction from events [25], we use three metrics for evaluation of our reconstructions: per-pixel Mean Squared Error (MSE), LPIPS (AlexNet), and Structured Similarity Index Measure (SSIM). For evaluating uncertainty estimations, we begin by employing two standard metrics [42]:

1. 80% Prediction Interval Coverage Probability (PICP80) - the fraction of pixel intensities µ in the test set that fall within the 80% prediction interval, which given our distribution N(µ̂, exp(σ̂′)²) is equal to [43]:

PI80%(µ̂, σ̂′) = [µ̂ − 1.28 exp(σ̂′), µ̂ + 1.28 exp(σ̂′)]. (3.26)

2. Log-likelihood (LL) - the logarithmic likelihood of the test data, which in our case can be calculated as

LL(µ̂, σ̂′) = ln[ (1 / √(2π exp(σ̂′)²)) · exp(−(1/2) ((µ − µ̂) / exp(σ̂′))²) ]
           = −ln(2π)/2 − σ̂′ − (1/2) ((µ − µ̂) / exp(σ̂′))².
(3.27) These two methods have their drawbacks - PICP can be misleading, as simply always predicting very large variances would lead to very good scores, whereas LL values can be difficult to interpret. Furthermore, these values measure the difference between µ and µ̂ simply as (µ − µ̂)2, ignoring perceptual context. Therefore, we propose a third metric: 3. LPIPS and Standard Deviation Correlation (LPIPS_CORR) - per-patch cor- relation between LPIPS and median predicted standard deviation over the test data. This metric is meant to describe how well the network predicts the per- ceptual difference between the ground truth and the predicted reconstruction. Its utility is explained in section 4.1.1.2. 19 3. Methods 20 4 Results In this chapter, we present quantitative and qualitative results. We begin by evalu- ating our method on the simulated test dataset, showing performance on prediction, spatial superresolution, frame interpolation, and uncertainty quantification. Then, we move on to real data, applying our method to the Event Camera Dataset (ECD) [30] and the High Quality Frames (HQF) dataset [23]. We compare our intensity pre- dictions to several established RCNN-based algorithms such as E2VID [3], FireNET [22], FireNET+, and E2VID+ [23]. Then, we evaluate the uncertainty predictions, and show qualitative results on superresolution and frame interpolation. 4.1 Simulated data In this section, we show results on our simulated test dataset. 4.1.1 Prediction without superresolution We begin by evaluating predictions without any spatiotemporal superresolution, i.e. given an event tensor E ∈ RT×4B×X×Y , we predict an output ŷ ∈ RT×X×Y×2. Evaluated on our test dataset, we get the results shown in table 4.1. Table 4.1: Results on the test dataset using prediction without superresolution or temporal interpolation. MSE LPIPS (Alex) SSIM PICP80 LL LPIPS_CORR 0.018± 0.012 0.13± 0.05 0.61± 0.13 0.79± 0.12 0.76± 0.35 0.19± 0.2 While one could easily apply other SOTA intensity prediction methods to our test dataset, we elect not to show these results. Because our method is trained on data that is highly similar to the test data, no conclusions could be drawn from such a comparison. Instead, we compare our method to others only on real data, and the values above mainly serve as a baseline. 21 4. Results 4.1.1.1 Intensity predictions To begin with, we analyze our intensity predictions µ̂, beginning by showing exam- ples. First, we highlight some low performance cases with LPIPS (Alex) far worse than the average in figure 4.1. Then, we show examples of predictions with LPIPS (Alex) close to average in figure 4.2. (a) LPIPS (Alex) = 0.292. With too many events, the method fails to reconstruct fine details, such as the giraffe coat patterns. (b) LPIPS (Alex) = 0.386. Due to the inherent sparsity of event data, the method can fail to correctly reconstruct the brightness of the image. (c) LPIPS (Alex) = 0.237. Example of failure due to a lack of events. Figure 4.1: Examples of predictions generated by our networks with LPIPS (Alex) far above the average, with short explanations for the bad performance. 22 4. Results (a) LPIPS (Alex) = 0.13. (b) LPIPS (Alex) = 0.14. Figure 4.2: Examples of predictions generated by our networks with LPIPS (Alex) close to the average. The representative examples from figure 4.2 show that our method generates high- quality reconstructions, generally capturing both fine details and overall image struc- ture. 
However, the examples in figures 4.1c and 4.1a indicate that intensity predic- tion performance can be sensitive to the number of events in the event frames. To gain a deeper understanding of this relationship, we plot a heatmap of LPIPS and event counts in figure 4.3. We normalize the event counts by the image resolution, thus forming the EPF (Event per Pixel per Frame) metric. 23 4. Results Figure 4.3: 2D heatmap of LPIPS versus EPF over all frames in the test dataset, with predictions generated without superresolution or interpolation. Data points are aggregated into 50 bins per dimension, where each bin represents a specific range of LPIPS and EPF values. Note that LPIPS is clipped to [0, 0.3] and EPF to [0, 5]. Additionally, statistical lines are plotted using 1D binning along the EPF axis. For each EPF bin, we show the mean (blue line) and standard deviation (dashed red line) of the LPIPS values of patches within that bin. While figure 4.3 shows the expected increase in LPIPS values for very low and very high EPF, the general tendency is that LPIPS increases with EPF. This is surprising; intuitively, it seems that more events would give the networks more information, and lead to more accurate representations. However, note that EPF depends on three main factors: 1. sensor sensitivity S, determined by the sensor parameters and sensor noise, and further augmented by our preprocessing function G, 2. scene characteristics C, such as brightness (lower brightness leads to more events), contrast, or level of detail, 3. magnitude of motion M , determined by the affine matrix generating function A(t). While higher S increases the number of events without impacting the inherent dif- ficulty of intensity prediction, C influences both the number of events and the diffi- culty. Additionally, increasing M generates more events at the cost of increasing the difficulty of intensity prediction, and figure 4.3 indicates that this is the dominating factor. While it would be informative to analyze the relationship between S and LPIPS, S cannot readily be measured, making such analysis difficult. 24 4. Results 4.1.1.2 Uncertainty predictions Next, we analyze the uncertainty predictions σ̂. We begin by showcasing the com- plementary roles of the three metrics (LL, PICP80, and LPIPS_CORR) used to evaluate the uncertainty predictions. In figure 4.5, we show examples of predic- tions with combinations of high and low LPIPS_CORR, PICP80, and LL values. High LL and PICP80 values suggest that the predicted σ̂′ is proportional to the MSE, whereas high LPIPS_CORR indicates that it corresponds to the perceptual difference between the predicted image and the ground truth. Note that the net- work is not trained to estimate the perceptual difference, so examples with high LPIPS_CORR and low LL or PICP80 are rare. Figure 4.4: Distribution of the LPIPS_CORR and PICP80 values of the example predictions shown in figure 4.5. 25 4. Results (a) High LL = 1.2 and PICP80 = 0.89 with low LPIPS_CORR = −0.31. The predicted σ̂′ correctly identifies areas with high MSE, leading to high LL and PICP80 values. However, note that the face is barely reconstructed, and therefore has high LPIPS values despite low MSE values. On the other hand, the plate has high σ̂ and high MSE, even though it is perceptually reconstructed well. These two areas explain the negative LPIPS_CORR. 26 4. Results (b) Low LL = 0.66 and PICP80 = 0.67 with high LPIPS_CORR = 0.42. 
The network attributes a high uncertainty to the facial area, which has a low MSE, resulting in a low LL. However, the intensity predictions in this area have low perceptual quality, and this can be seen in the patch LPIPS values, resulting in a high LPIPS_CORR. Furthermore, the high confidences assigned to the incorrectly predicted shirt and background colors result in a low PICP80, despite these areas being perceptually well-reconstructed.

(c) Low LL = 0.36, PICP80 = 0.65, and LPIPS_CORR = −0.47. The predicted σ̂ does not identify erroneous areas such as the lamp or the top part of the image.

(d) High LL = 1.14, PICP80 = 0.94, and LPIPS_CORR = 0.69. The predicted σ̂ identifies both well-reconstructed areas (image center) and erroneous areas (bottom center, center right).

Figure 4.5: Examples showcasing the complementary roles of the novel LPIPS_CORR and traditional LL and PICP80 uncertainty quantification metrics. The value ranges in the figure titles represent colors from black to white. For PICP80 plots, white pixels indicate µ is within the prediction interval.

The examples from figure 4.5 suggest that if images are reconstructed in the context of downstream computer vision applications, the novel LPIPS_CORR score can arguably be more important than LL or PICP80. For example, consider applying a face detection algorithm to our intensity predictions. In figure 4.5a, the face is barely reconstructed at all, and yet the predicted σ̂ in this area is very low, resulting in a negative LPIPS_CORR value. This would be fatal for the face detection algorithm, as it would likely have a high confidence that there is no face in the image, and our predicted uncertainties would further confirm this. In contrast, in figure 4.5b, our predicted uncertainties correctly identify that the face is perceptually not reconstructed well, which might be reflected in the performance of the downstream algorithm, even though the intensity error is low in that area.

In order to analyze the relationship between predicted STD and LPIPS beyond just the LPIPS_CORR score, we visualize their distributions in figure 4.6. We identify three critical points, at median(σ̂) ≈ 0.07, median(σ̂) ≈ 0.13 and median(σ̂) ≈ 0.2. For the interval [0.07, 0.13], the curve is relatively flat. At 0.13, we identify a sharp increase, followed by flatness until 0.2. At 0.2 we have an elbow point, followed by a clear positive correlation between median(σ̂) and LPIPS. In figure 4.7, we show examples of patches with median(σ̂) close to each critical point. This relationship could be used for a downstream computer vision application. For example, one might apply a simple heuristic, stating that σ̂ < 0.13 does not affect the downstream task, σ̂ ∈ [0.13, 0.2] lowers the downstream confidences by a constant factor, and for σ̂ > 0.2, this factor becomes proportional to the predicted σ̂.

Figure 4.6: 2D heatmap of LPIPS versus median STD, visualizing the distribution of patch LPIPS values against the median predicted STD over 16 by 16 patches over the test dataset. Data points are aggregated into 50 bins per dimension. Statistical lines were calculated as in figure 4.3.

Figure 4.7: Prediction with examples of patches with given median predicted standard deviations highlighted.

The LPIPS_CORR metric has one hyperparameter - the patch size. In figure 4.8, we investigate the influence of the selected patch size on the LPIPS_CORR value over the test dataset.
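As a concrete illustration, the following sketch computes LPIPS_CORR for a single reconstruction at a given patch size. It is a simplified stand-in for the evaluation code of this thesis: it assumes non-overlapping patches, a Pearson correlation, and grayscale images in [0, 1] replicated to three channels for the lpips package, whereas in the actual evaluation the correlation is taken over patches pooled from the whole test set.

import numpy as np
import torch
import lpips

lpips_alex = lpips.LPIPS(net='alex')

def to_lpips_input(img):
    # grayscale (h, w) array in [0, 1] -> 1 x 3 x h x w tensor in [-1, 1]
    t = torch.from_numpy(img).float()[None, None]
    return t.repeat(1, 3, 1, 1) * 2.0 - 1.0

def lpips_corr(mu_hat, sigma_hat, gt, patch=16):
    # Correlate per-patch LPIPS with the per-patch median predicted standard deviation.
    H, W = gt.shape
    lpips_vals, med_stds = [], []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            pred_p = mu_hat[y:y + patch, x:x + patch]
            gt_p = gt[y:y + patch, x:x + patch]
            with torch.no_grad():
                lpips_vals.append(lpips_alex(to_lpips_input(pred_p), to_lpips_input(gt_p)).item())
            med_stds.append(np.median(sigma_hat[y:y + patch, x:x + patch]))
    return float(np.corrcoef(lpips_vals, med_stds)[0, 1])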
The LPIPS_CORR generally decreases up until a patch size of 32, after which it increases. For our evaluation, we use the smallest patch size of 16, as it leads to the smallest variance in LPIPS_CORR.

Figure 4.8: Relation between LPIPS_CORR and patch size. The figure was generated by calculating the LPIPS_CORR over the test dataset for patch sizes between 16² and 96², with a step of 4. The minimum evaluated patch size is due to LPIPS (AlexNet) needing inputs of size at least 16², whereas the maximum is chosen such that it is smaller than the smallest image in the test dataset.

4.1.2 Spatiotemporal superresolution

In the next experiment, we evaluate the spatiotemporal superresolution of our method. We begin by analyzing how the perceptual intensity prediction quality changes with temporal and spatial superresolution scale. Then, we do the same for the uncertainty predictions. We conclude by showing examples of predictions generated at different spatiotemporal superresolution scales in figure 4.11.

4.1.2.1 Intensity prediction in spatial superresolution

We begin by showing LPIPS as a function of spatial superresolution scale in blue in figure 4.9a. As a baseline, we also show cubic interpolation to the given scale from predictions without superresolution in red. Figure 4.9a shows that we outperform our baseline, with an LPIPS increase of roughly 200% at 6x superresolution, compared to ∼400% for cubic interpolation. However, the LPIPS gap between cubic interpolation and our method remains fairly constant for upscaling factors larger than 3. This might suggest that our method only outperforms cubic interpolation up to a cut-off at 3x spatial upscaling, and that for larger scales the additional benefit is negligible. To confirm this finding, we perform a second experiment, where we first use our method for superresolution by initial spatial scales s, followed by cubic interpolation to 6x spatial upscaling. We calculate LPIPS at 6x upscaling, and plot the values against s in blue in figure 4.9b.

(a) Predictions are generated for spatial scales between 1 and 6 and evaluated against the ground truth. As a baseline, we generate predictions without upscaling and upscale them spatially using cubic interpolation. (b) Predictions and ground truths are generated for spatial scales between 1 and 6, upscaled to 6x using cubic interpolation, and evaluated against the ground truth.

Figure 4.9: LPIPS vs spatial superresolution scale, showing means and standard deviations over all test sequences.

We see an initial decrease in LPIPS followed by a flattening out for scales larger than 3, seemingly confirming our hypothesis. However, this cut-off point does not necessarily have to be intrinsic to our representation, and could be caused by the data itself. For example, consider images of varying resolutions representing a fixed scene. Then, one can choose resolutions large enough that the true scene can be approximated arbitrarily well by cubic interpolation. To test if our data has this problem, we generate the ground truth at initial spatial scales s, and then cubically interpolate to 6x spatial upscaling. If the cut-off is intrinsic to the data, we would expect this curve to have a similar flattening for spatial scales larger than 3. If such a flattening does not exist, we can say that the cut-off is intrinsic to the representation. We plot the curve in green in figure 4.9b.
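A sketch of this data-side check is given below. It is hedged: in the thesis the lower-scale ground truth is produced by the rendering pipeline itself, whereas here, purely for illustration, it is approximated by bicubic downsampling of the 6x ground truth.

import torch
import torch.nn.functional as F
import lpips

lpips_alex = lpips.LPIPS(net='alex')

def data_cutoff_curve(gt_6x, scales=(1, 2, 3, 4, 5, 6)):
    # gt_6x: ground truth at 6x resolution, tensor of shape (1, 1, H, W), values in [0, 1]
    H, W = gt_6x.shape[-2:]
    ref = gt_6x.repeat(1, 3, 1, 1) * 2 - 1
    curve = {}
    for s in scales:
        # ground truth at initial scale s (approximated by bicubic downsampling here)
        low = F.interpolate(gt_6x, size=(H * s // 6, W * s // 6),
                            mode='bicubic', align_corners=False).clamp(0, 1)
        # cubic interpolation back to 6x, then LPIPS against the 6x reference
        up = F.interpolate(low, size=(H, W), mode='bicubic',
                           align_corners=False).clamp(0, 1)
        with torch.no_grad():
            curve[s] = lpips_alex(up.repeat(1, 3, 1, 1) * 2 - 1, ref).item()
    return curve  # expected to decrease towards 0 as s approaches 6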
The green curve decreases smoothly to 0, indicating that the cut-off scale of 3 is intrinsic to our representation, and not to the data.

4.1.2.2 Intensity prediction in temporal superresolution

We first show LPIPS as a function of temporal superresolution scale in blue in figure 4.10a. In green, we also show cubic interpolation from predictions without superresolution as a baseline. As opposed to spatial superresolution, where LPIPS changed significantly as a function of scale, the intensity prediction quality is almost invariant to the temporal superresolution scale. While it is tempting to attribute this to the characteristics of event data, which is spatially discrete and temporally continuous, one must note that the cubic interpolation quality also remains constant for scales larger than 2. This suggests that there is a data cut-off at 2x upscaling after which cubic temporal interpolation approximates true intensities very well. To confirm this, we generate figure 4.10b, the temporal counterpart of figure 4.9b.

While the expected flattening exists, it happens around 3x upscaling, not 2x as expected. However, the prediction curve in figure 4.10b does not have a clear elbow point, and no definitive conclusions about the temporal resolution of our representation can be drawn. To find this value, one would have to generate data with more motion, reducing the temporal consistency of the ground truth. We expect that this would cause the temporal relationships to be more similar to the spatial ones, with more pronounced critical points.

(a) Predictions are generated for temporal scales between 1 and 6 and evaluated against the ground truth. As a baseline, we generate predictions without upscaling and upscale them temporally using cubic interpolation. (b) Predictions and ground truths are generated for temporal scales between 1 and 6, upscaled to 6x using cubic interpolation, and evaluated against the ground truth.

Figure 4.10: LPIPS vs temporal superresolution scale, showing means and standard deviations of LPIPS over test sequences.

4.1.2.3 Uncertainty quantification in spatiotemporal superresolution

We continue by analyzing the performance of uncertainty quantification as the scale changes. Similarly to figures 4.9b and 4.10b, we generate predictions at spatiotemporal scales between 1 and 4, cubically interpolate them to 4x upscaling, and calculate metrics over the test set, but now we upscale spatially and temporally at the same time. The relationship, shown in table 4.2, is similar to what we saw between LPIPS and temporal scale, with a significant increase in performance between scales 1 and 2, followed by a fairly flat relationship.

We do not show performance evaluated against ground truths directly, since the evaluations then use different ground truths and are not comparable. This is especially problematic for LPIPS_CORR, where the perceptual context of a patch is highly dependent on the portion of the scene that it covers. Note that this also explains why the values of LPIPS_CORR shown in table 4.2 are significantly higher than in table 4.1. Generally, it seems that our uncertainty prediction performance is fairly invariant to the upscaling factor. This suggests that while the intensity prediction performance degrades with the scaling factor, the model is able to adapt its uncertainties accordingly.

Figure 4.11: Spatiotemporal superresolution example. The x vs t plots have y = Y/2, and the x vs y plots have t = T/2. The green lines denote x = X/2.
The red rectangle in the full image is the showcased area.

Table 4.2: Means and standard deviations of key uncertainty metrics over the test dataset, calculated by generating predictions at given scales, interpolating them to 4x upscaling, and evaluating against the ground truth. Arrows indicate the direction of better performance for each metric.
Scale   Mean σ̂        MSE ↓           PICP80 ↑      LPIPS_CORR ↑
1       0.11 ± 0.18   0.024 ± 0.012   0.75 ± 0.11   0.37 ± 0.18
2       0.11 ± 0.03   0.017 ± 0.009   0.80 ± 0.11   0.41 ± 0.17
3       0.11 ± 0.03   0.016 ± 0.009   0.81 ± 0.11   0.40 ± 0.18
4       0.11 ± 0.03   0.016 ± 0.009   0.81 ± 0.11   0.40 ± 0.17

4.2 Real data

Finally, we evaluate our method on two real datasets - the Event Camera Dataset (ECD) [30], and the High Quality Frames (HQF) dataset [23]. These datasets contain events and ground-truth intensity images from a DAVIS240C event camera. Note that these intensity images do not necessarily have a fixed temporal frequency. In order to synchronize our reconstructions with the ground truth frames, we form discretized event frames by combining the events between consecutive intensity frames instead of using a fixed temporal width. We begin by providing quantitative results on image reconstruction without any superresolution, and compare them to several other SOTA RCNN-based methods in table 4.3.

Table 4.3: Comparison of our method with several state-of-the-art RCNN reconstruction methods, as reported by Ercan et al. in [25]. LPIPS refers to the AlexNet version here. The best and second best scores are given in bold and underlined.
Method         ECD [30]: MSE ↓ / SSIM ↑ / LPIPS ↓    HQF [23]: MSE ↓ / SSIM ↑ / LPIPS ↓
E2VID [3]      0.179 / 0.450 / 0.322                 0.099 / 0.463 / 0.388
FireNET [22]   0.133 / 0.459 / 0.321                 0.100 / 0.422 / 0.463
E2VID+ [23]    0.070 / 0.503 / 0.236                 0.036 / 0.536 / 0.255
FireNET+ [23]  0.062 / 0.452 / 0.337                 0.045 / 0.472 / 0.323
Ours           0.055 / 0.520 / 0.306                 0.063 / 0.479 / 0.366

In terms of MSE and SSIM, our method surpasses the other methods on ECD, but lags behind both E2VID+ and FireNET+ on HQF. In terms of LPIPS, E2VID+ is better on both datasets, and FireNET+ is better on HQF. Notably, our method was trained on VGG LPIPS, in contrast to the other methods, which were trained on AlexNet LPIPS. This might skew these results in favour of the other methods, as we evaluate on AlexNet LPIPS. In general, we reach performance comparable to E2VID+ and FireNet+, and in figure 4.12, we show examples reinforcing that claim. In order to avoid any bias, we show the same sample scenes as Ercan et al. [25] and Stoffregen et al. [23].

Figure 4.12: Comparison of our method to FireNet+ and E2VID+ on sample scenes from the HQF and ECD datasets.

In the shapes scene from the ECD, our method generates fewer artifacts than E2VID+ and FireNET+. However, in the desk and reflective_materials scenes from the HQF, the intensity of many objects (background, water bottle in desk, and overall scene brightness in reflective_materials) is reconstructed significantly worse than by E2VID+. This likely suggests that our simulated data is more similar to the ECD dataset than the HQF dataset, and we discuss this sim-to-real gap in more detail in section 4.2.2.

Evaluating spatiotemporal superresolution quantitatively is difficult, as we do not have access to any superresolved ground truths. While one could evaluate temporal superresolution by merging neighboring event frames and using the skipped intensity frames as ground truths within the event frame, this would be difficult to interpret, as the task would inherently become harder due to the increase in motion within the frame.
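For reference, the synchronization with the ground-truth frames described above can be implemented by slicing the event stream at the frame timestamps. The sketch below assumes events stored as sorted, equally indexed arrays (t, x, y, p); building the B-bin event tensor of chapter 3 from each slice is omitted.

import numpy as np

def slice_events_between_frames(events, frame_ts):
    # events: dict of sorted arrays with keys 't', 'x', 'y', 'p'
    # frame_ts: timestamps of the ground-truth intensity frames
    edges = np.searchsorted(events['t'], frame_ts)
    return [{k: events[k][edges[i]:edges[i + 1]] for k in ('t', 'x', 'y', 'p')}
            for i in range(len(frame_ts) - 1)]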
Instead, we only evaluate superresolution qualitatively. An example is visualized in figure 4.13, with the patterns on the zebra clearly showing that the superresolution generalizes to real data. However, it remains to be investigated whether this relationship follows the analysis on the simulated data.

Figure 4.13: Spatiotemporal superresolution example on the desk scene from the HQF dataset. Note that ground truth intensities are not available for scales larger than 1.

Finally, we show the uncertainty metrics on the two datasets in table 4.4. LL and PICP80 are both drastically lower than on the simulated dataset, suggesting that our uncertainty prediction fails to generalize to real data. However, the average LPIPS_CORR value over the two datasets is surprisingly similar to the simulated data. This indicates that while the predicted uncertainties are a bad estimator of the intensity error on real data, they can still be used to estimate the perceptual error, and could therefore be used for downstream vision applications.

Table 4.4: Uncertainty performance on real data.
Dataset   LL      LPIPS_CORR   PICP80
ECD       -0.73   0.29         0.19
HQF       -0.72   0.10         0.23

In many of the analyzed predictions, we have noticed a strong correlation between the brightness of a region and its predicted uncertainties. This is especially obvious in the checkerboard patterns shown in figure 4.14. We have identified three factors that might be contributing to this:

1. Regions with low brightness generate more events, as smaller absolute brightness changes are enough to trigger the Contrast Threshold. While this could enable better predictions, it is important to note that these events are also more noisy, and it is unclear whether the overall effect is positive or negative.
2. Low-frequency brightness information is difficult to reconstruct accurately from event data due to its sparsity. For example, in figure 4.14, the brightness of the white cells in the main checkerboard is significantly lower in the bottom right than in the other cells, and the network fails to capture this. This low-resolution information does not affect the dark cells to the same degree, so the network learns to assign higher uncertainties to bright cells.
3. In the real data, the scene brightness is not static, as movement can cause effects such as reflections and shadows. This is another example of low-frequency information which does not affect dark parts of the image as much as bright parts, which can lead to higher uncertainties in bright regions.

Figure 4.14: Example prediction from the slow_hand scene from the HQF dataset.

4.2.1 Performance

Table 4.5 compares the computational performance of our method to FireNET and E2VID. An NVIDIA GeForce RTX 4060 Laptop GPU and an Intel Core i5-12500H were used for all experiments. For FireNET and E2VID, we used the implementation from Scheerlinck et al. [22]. We analyze the performance of the encoder and decoder separately, as the encoder is usually run at a fixed framerate dependent on the input data, while the decoder determines the output spatiotemporal resolution. On the GPU, our encoder takes roughly 75% more time than FireNET to process an event frame, whereas our decoder takes roughly 40% less time. In figure 4.15, we plot the time it takes to output one second of 240 × 180 video against the output framerate, assuming a fixed event frame duration of 50 ms and fully sequential reconstruction.
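One plausible reading of this sequential-inference model is sketched below. The event-frame duration and per-frame timings are those reported above and in table 4.5; the assumption that one encoder pass is needed per event frame and one decoder pass per output frame is ours, made only to illustrate the shape of the curves in figure 4.15.

def compute_time_per_second_of_video(fps, t_enc_ms, t_dec_ms, event_frame_ms=50.0):
    # One second of input -> 1000 / event_frame_ms encoder passes;
    # an output at `fps` frames per second -> fps decoder passes.
    n_event_frames = 1000.0 / event_frame_ms
    return (n_event_frames * t_enc_ms + fps * t_dec_ms) / 1000.0  # seconds of compute

# Example with the 240 x 180 GPU timings from table 4.5 (Enc 3.7 ms, Dec 1.1 ms):
# compute_time_per_second_of_video(1000, 3.7, 1.1) = (20 * 3.7 + 1000 * 1.1) / 1000 ≈ 1.17 s,
# i.e. just above the real-time threshold at 1000 FPS under this model.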
It should be noted that this comparison relies on unrealistic assumptions. Mainly, it assumes that the methods run sequentially, whereas in practice, all three are parallelizable, and multiple intensity frames can be decoded simultaneously. Our approach to high-FPS video reconstruction is also very different from the E2VID/FireNET approaches. In their methods, the high temporal resolution is achieved by temporally shifting the discretized event frames, whereas in our method, it is intrinsic to the representation. These factors make it hard to draw any conclusion from this comparison, and it should not be interpreted as a claim that our method is faster than FireNET or E2VID.

Table 4.5: Inference time for one event frame on GPU and CPU at the spatial resolutions reported in [23]. Enc and Dec refer to the encoder νϕ and the decoder Cψ.
Resolution    GPU (ms): E2VID / FireNET / Enc / Dec    CPU (ms): E2VID / FireNET / Enc / Dec
240 × 180     7.3 / 2.0 / 3.7 / 1.1                    120 / 16 / 31 / 11
346 × 260     18 / 4.3 / 7.2 / 2.4                     185 / 37 / 79 / 29
640 × 480     44 / 17 / 31 / 8.9                       640 / 160 / 290 / 94
1280 × 720    220 / 60 / 100 / 41                      2000 / 640 / 1200 / 420

Figure 4.15: The inference time required to generate one second of video output at various frame rates using E2VID, FireNET, and our method. The red dashed line indicates the threshold for real-time inference.

4.2.2 Sim-to-real gap

We have tried several strategies to improve prediction quality on HQF and ECD, including increasing encoder size, replacing our encoder architecture with the E2VID architecture, increasing decoder size, adding data normalization, and training on longer sequences, but none of these attempts led to a significant improvement. Therefore, we believe that the key to improving our performance is reducing the sim-to-real gap in the training data. This is also supported by the quantitative gap between performance on real and simulated data. We have identified a number of issues in our training data, and propose strategies to resolve them:

1. Lack of sequences with periods of no motion - in the HQF frames, there are periods with little-to-no motion, and therefore no events. As shown in figure 4.16, our encoder does not retain the scene intensities in these situations. This is likely because these situations do not exist in the training data. To solve this, we can employ the same pause augmentation strategy as E2VID [3], and occasionally set the input events to zero, using the previous ground truth to compute the loss.

Figure 4.16: Example of failure when there are no input events, taken from the slow_hand scene in HQF. The timestamps of the three frames are shown by the red dashed lines in the Y-T panel. Lack of motion for only 6 event frames makes the encoder "forget" the scene.

2. Continuous optical flow - in our training data, we use continuous affine transformations applied to still images to generate scenes. However, this leads to a spatially continuous optical flow, and cannot, for example, model multiple objects moving independently against a still background. We believe that this unrealistic motion might introduce biases that widen the sim-to-real gap. To solve this, we can generate affine matrices for multiple still images, and then overlap them to generate the scene.

3. Contrast threshold - Stoffregen et al. have shown that the contrast threshold (CT) is a "key simulation parameter that impacts performance of supervised CNNs" [23]. In our method, we use a fixed CT of 0.35, as the simulator intrinsically introduces significant noise to this parameter.
We reasoned that this noise in combination with our stochastic pre-processing function would let the method generalize to real data. However, preliminary analysis showed that when running the HQF intensity frames through the simulator, the number of events we generated was roughly a third of the real number of events. To achieve a similar number of events, we had to set the CT to 0.18. This suggests that employing a similar strategy to [23] and generating events for a wide range of CTs will likely help bridge the sim-to-real gap.

5 Conclusion

In this thesis, we presented a novel Local Implicit Function model for spatiotemporally continuous video representation from event data. We explored its application in video reconstruction, spatiotemporal superresolution, and uncertainty quantification.

On video reconstruction, we achieved performance similar to comparable state-of-the-art methods on real-world datasets. We identified the quality of simulated training data as a potential limitation and suggested several strategies to bridge the gap between simulated and real data.

We showed that for spatial superresolution, our method significantly outperforms a simple baseline up to 3x upscaling on the simulated data. For temporal superresolution, similar conclusions are difficult to draw due to limitations in the data, and we suggested a method to evaluate this more thoroughly. We also presented examples from real-world datasets, showing that our method generalizes to real event data.

We also proposed a novel metric for evaluating uncertainty predictions, and argued that it can be more informative than the usual metrics in the context of downstream computer vision applications. We analyzed how our predicted uncertainties correlate with perceptual reconstruction errors and suggested a simple heuristic to potentially enhance downstream vision algorithms. Despite a degradation in traditional uncertainty metrics when our method is applied to real data, the performance on our novel metric remained similar to that on simulated data, indicating that the method might be applicable to real-world scenarios.

Looking forward, we identified several avenues for continuing the work presented in this thesis. First, addressing the sim-to-real gap is essential, as it could substantially enhance the performance of our model. Benchmarking our spatial superresolution approach against state-of-the-art methods would help contextualize our performance. Additionally, a more in-depth analysis of temporal superresolution could show how well our method captures the high temporal resolution of event cameras. Lastly, further exploration into potential applications of our uncertainty predictions on downstream vision algorithms is needed.

Bibliography

[1] G. Gallego et al., “Event-based Vision: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 154–180, Jan. 2022, arXiv:1904.08405 [cs], issn: 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2020.3008413. [Online]. Available: http://arxiv.org/abs/1904.08405 (visited on 04/16/2024).
[2] D. Gehrig and D. Scaramuzza, “Are High-Resolution Event Cameras Really Needed?” en.
[3] H. Rebecq et al., “High Speed and High Dynamic Range Video with an Event Camera,” en, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 6, pp. 1964–1980, Jun. 2021, issn: 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2019.2963386. [Online]. Available: https://ieeexplore.ieee.org/document/8946715/ (visited on 04/16/2024).
[4] Y. Chen, S. Liu, and X. Wang, Learning Continuous Image Representation with Local Implicit Image Function, arXiv:2012.09161 [cs], Apr. 2021. [Online]. Available: http://arxiv.org/abs/2012.09161 (visited on 04/16/2024).
[5] Z. Chen et al., VideoINR: Learning Video Implicit Neural Representation for Continuous Space-Time Super-Resolution, arXiv:2206.04647 [cs, eess], Jun. 2022. [Online]. Available: http://arxiv.org/abs/2206.04647 (visited on 04/25/2024).
[6] D. Nix and A. Weigend, “Estimating the mean and variance of the target probability distribution,” en, in Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94), Orlando, FL, USA: IEEE, 1994, pp. 55–60, vol. 1, isbn: 978-0-7803-1901-1. doi: 10.1109/ICNN.1994.374138. [Online]. Available: http://ieeexplore.ieee.org/document/374138/ (visited on 04/17/2024).
[7] P. Lichtsteiner, C. Posch, and T. Delbruck, “A 128×128 120 dB 15 µs Latency Asynchronous Temporal Contrast Vision Sensor,” en, IEEE Journal of Solid-State Circuits, vol. 43, no. 2, pp. 566–576, 2008, issn: 0018-9200. doi: 10.1109/JSSC.2007.914337. [Online]. Available: http://ieeexplore.ieee.org/document/4444573/ (visited on 05/19/2024).
[8] A. Glover and C. Bartolozzi, “Robust visual tracking with a freely-moving event camera,” en, in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Vancouver, BC: IEEE, Sep. 2017, pp. 3769–3776, isbn: 978-1-5386-2682-5. doi: 10.1109/IROS.2017.8206226. [Online]. Available: http://ieeexplore.ieee.org/document/8206226/ (visited on 05/19/2024).
[9] J. J. Hagenaars et al., “Evolved Neuromorphic Control for High Speed Divergence-Based Landings of MAVs,” en, IEEE Robotics and Automation Letters, vol. 5, no. 4, pp. 6239–6246, Oct. 2020, issn: 2377-3766, 2377-3774. doi: 10.1109/LRA.2020.3012129. [Online]. Available: https://ieeexplore.ieee.org/document/9149674/ (visited on 05/19/2024).
[10] D. Falanga, K. Kleber, and D. Scaramuzza, “Dynamic obstacle avoidance for quadrotors with event cameras,” Science Robotics, vol. 5, no. 40, eaaz9712, Mar. 2020, Publisher: American Association for the Advancement of Science. doi: 10.1126/scirobotics.aaz9712. [Online]. Available: https://www.science.org/doi/10.1126/scirobotics.aaz9712 (visited on 05/19/2024).
[11] T. Taunyazov et al., “Event-Driven Visual-Tactile Sensing and Learning for Robots,” en, in Robotics: Science and Systems XVI, Robotics: Science and Systems Foundation, Jul. 2020, isbn: 978-0-9923747-6-1. doi: 10.15607/RSS.2020.XVI.020. [Online]. Available: http://www.roboticsproceedings.org/rss16/p020.pdf (visited on 05/19/2024).
[12] F. Baghaei Naeini et al., “A Novel Dynamic-Vision-Based Approach for Tactile Sensing Applications,” en, IEEE Transactions on Instrumentation and Measurement, vol. 69, no. 5, pp. 1881–1893, May 2020, issn: 0018-9456, 1557-9662. doi: 10.1109/TIM.2019.2919354. [Online]. Available: https://ieeexplore.ieee.org/document/8723387/ (visited on 05/19/2024).
[13] A. Vitale et al., “Event-driven Vision and Control for UAVs on a Neuromorphic Chip,” en, in 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China: IEEE, May 2021, pp. 103–109, isbn: 978-1-72819-077-8. doi: 10.1109/ICRA48506.2021.9560881. [Online]. Available: https://ieeexplore.ieee.org/document/9560881/ (visited on 05/19/2024).
[14] R. S. Dimitrova et al., “Towards Low-Latency High-Bandwidth Control of Quadrotors using Event Cameras,” en, in 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France: IEEE, May 2020, pp. 4294–4300, isbn: 978-1-72817-395-5. doi: 10.1109/ICRA40945.2020.9197530. [Online]. Available: https://ieeexplore.ieee.org/document/9197530/ (visited on 05/19/2024).
[15] V. Vishnevskiy et al., Optimal OnTheFly Feedback Control of Event Sensors, en (visited on 11/22/2023).
[16] V. Sitzmann, Awesome Implicit Representations - A curated list of resources on implicit neural representations. [Online]. Available: https://github.com/vsitzmann/awesome-implicit-representations.
[17] H. M. D. Kabir et al., “Neural Network-Based Uncertainty Quantification: A Survey of Methodologies and Applications,” IEEE Access, vol. PP, Jun. 2018. doi: 10.1109/access.2018.2836917.
[18] C. Scheerlinck, N. Barnes, and R. Mahony, “Continuous-Time Intensity Estimation Using Event Cameras,” en, in Computer Vision – ACCV 2018, C. Jawahar et al., Eds., Cham: Springer International Publishing, 2019, pp. 308–324, isbn: 978-3-030-20873-8. doi: 10.1007/978-3-030-20873-8_20.
[19] C. Brandli, L. Muller, and T. Delbruck, “Real-time, high-speed video decompression using a frame- and event-based DAVIS sensor,” en, in 2014 IEEE International Symposium on Circuits and Systems (ISCAS), Melbourne VIC, Australia: IEEE, Jun. 2014, pp. 686–689, isbn: 978-1-4799-3432-4. doi: 10.1109/ISCAS.2014.6865228. [Online]. Available: http://ieeexplore.ieee.org/document/6865228/ (visited on 05/20/2024).
[20] P. Bardow, A. J. Davison, and S. Leutenegger, “Simultaneous Optical Flow and Intensity Estimation from an Event Camera,” en, in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA: IEEE, Jun. 2016, pp. 884–892, isbn: 978-1-4673-8851-1. doi: 10.1109/CVPR.2016.102. [Online]. Available: http://ieeexplore.ieee.org/document/7780471/ (visited on 05/20/2024).
[21] C. Reinbacher, G. Graber, and T. Pock, Real-Time Intensity-Image Reconstruction for Event Cameras Using Manifold Regularisation, en, arXiv:1607.06283 [cs], Aug. 2016. [Online]. Available: http://arxiv.org/abs/1607.06283 (visited on 05/20/2024).
[22] C. Scheerlinck et al., “Fast Image Reconstruction with an Event Camera,” en, in 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass Village, CO, USA: IEEE, Mar. 2020, pp. 156–163, isbn: 978-1-72816-553-0. doi: 10.1109/WACV45572.2020.9093366. [Online]. Available: https://ieeexplore.ieee.org/document/9093366/ (visited on 04/25/2024).
[23] T. Stoffregen et al., Reducing the Sim-to-Real Gap for Event Cameras, en, arXiv:2003.09078 [cs], Aug. 2020. [Online]. Available: http://arxiv.org/abs/2003.09078 (visited on 05/02/2024).
[24] L. Wang et al., “Event-Based High Dynamic Range Image and Very High Frame Rate Video Generation Using Conditional Generative Adversarial Networks,” en, in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA: IEEE, Jun. 2019, pp. 10073–10082, isbn: 978-1-72813-293-8. doi: 10.1109/CVPR.2019.01032. [Online]. Available: https://ieeexplore.ieee.org/document/8954323/ (visited on 04/23/2024).
[25] B. Ercan et al., “EVREAL: Towards a Comprehensive Benchmark and Analysis Suite for Event-based Video Reconstruction,” en, in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), arXiv:2305.00434 [cs], Jun. 2023, pp. 3943–3952. doi: 10.1109/CVPRW59228.2023.00410. [Online]. Available: http://arxiv.org/abs/2305.00434 (visited on 05/02/2024).
[26] E. Shechtman, Y. Caspi, and M. Irani, “Increasing Space-Time Resolution in Video,” in Computer Vision — ECCV 2002, A. Heyden et al., Eds., Berlin, Heidelberg: Springer Berlin Heidelberg, 2002, pp. 753–768, isbn: 978-3-540-47969-7.
[27] U. Mudenagudi, S. Banerjee, and P. K. Kalra, “Space-time super-resolution using graph-cut optimization,” eng, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 995–1008, May 2011, issn: 1939-3539. doi: 10.1109/TPAMI.2010.167.
[28] X. Xiang et al., “Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution,” English, ISSN: 1063-6919, 2020, pp. 3367–3376. doi: 10.1109/CVPR42600.2020.00343.
[29] M. Haris, G. Shakhnarovich, and N. Ukita, “Space-Time-Aware Multi-Resolution Video Enhancement,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA: IEEE, Jun. 2020, pp. 2856–2865, isbn: 978-1-72817-168-5. doi: 10.1109/CVPR42600.2020.00293. [Online]. Available: https://ieeexplore.ieee.org/document/9156388/ (visited on 05/21/2024).
[30] E. Mueggler et al., “The Event-Camera Dataset and Simulator: Event-based Data for Pose Estimation, Visual Odometry, and SLAM,” en, The International Journal of Robotics Research, vol. 36, no. 2, pp. 142–149, Feb. 2017, arXiv:1610.08336 [cs], issn: 0278-3649, 1741-3176. doi: 10.1177/0278364917691115. [Online]. Available: http://arxiv.org/abs/1610.08336 (visited on 05/02/2024).
[31] H. Rebecq, D. Gehrig, and D. Scaramuzza, “ESIM: An Open Event Camera Simulator,” en, 2018.
[32] T.-Y. Lin et al., Microsoft COCO: Common Objects in Context, arXiv:1405.0312 [cs], Feb. 2015. [Online]. Available: http://arxiv.org/abs/1405.0312 (visited on 04/18/2024).
[33] P. Virtanen et al., “SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python,” Nature Methods, vol. 17, no. 3, pp. 261–272, Mar. 2020, arXiv:1907.10121 [physics], issn: 1548-7091, 1548-7105. doi: 10.1038/s41592-019-0686-2. [Online]. Available: http://arxiv.org/abs/1907.10121 (visited on 04/17/2024).
[34] A. Paszke et al., PyTorch: An Imperative Style, High-Performance Deep Learning Library, arXiv:1912.01703 [cs, stat], Dec. 2019. [Online]. Available: http://arxiv.org/abs/1912.01703 (visited on 04/17/2024).
[35] N. Ballas et al., Delving Deeper into Convolutional Networks for Learning Video Representations, arXiv:1511.06432 [cs], Mar. 2016. [Online]. Available: http://arxiv.org/abs/1511.06432 (visited on 04/25/2024).
[36] D. Ulyanov, A. Vedaldi, and V. Lempitsky, Instance Normalization: The Missing Ingredient for Fast Stylization, en, arXiv:1607.08022 [cs], Nov. 2017. [Online]. Available: http://arxiv.org/abs/1607.08022 (visited on 04/25/2024).
[37] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs), arXiv:1511.07289 [cs] version: 5, Feb. 2016. [Online]. Available: http://arxiv.org/abs/1511.07289 (visited on 04/25/2024).
[38] J. Johnson, A. Alahi, and L. Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, en, arXiv:1603.08155 [cs], Mar. 2016. [Online]. Available: http://arxiv.org/abs/1603.08155 (visited on 04/25/2024).
[39] R. Zhang et al., The Unreasonable Effectiveness of Deep Features as a Perceptual Metric, arXiv:1801.03924 [cs], Apr. 2018. [Online]. Available: http://arxiv.org/abs/1801.03924 (visited on 04/25/2024).
[40] L. Sluijterman, E. Cator, and T. Heskes, Optimal Training of Mean Variance Estimation Neural Networks, en, arXiv:2302.08875 [cs, stat], Aug. 2023. [Online]. Available: http://arxiv.org/abs/2302.08875 (visited on 04/25/2024).
[41] D. P. Kingma and J. Ba, Adam: A Method for Stochastic Optimization, en, arXiv:1412.6980 [cs], Jan. 2017. [Online]. Available: http://arxiv.org/abs/1412.6980 (visited on 05/01/2024).
[42] L. Sluijterman, E. Cator, and T. Heskes, How to Evaluate Uncertainty Estimates in Machine Learning for Regression? en, arXiv:2106.03395 [cs, stat], Aug. 2023. [Online]. Available: http://arxiv.org/abs/2106.03395 (visited on 05/02/2024).
[43] R. J. Hyndman and G. Athanasopoulos, “Prediction intervals,” English, in Forecasting: Principles and Practice, Australia: OTexts, 2018. [Online]. Available: https://otexts.com/fpp2/prediction-intervals.html (visited on 05/03/2024).