Realistic Radio Propagation Modeling for a Digital Twin
Improvements with Enrichment of 3D Scenarios
Master's thesis in Physics
OSKAR MORE ARVIDSSON
Department of Electrical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2023
www.chalmers.se

Master's Thesis 2023
© OSKAR MORE ARVIDSSON, 2023.
Supervisors: Martin Johansson, Ericsson Research; Gerhard Steinböck, Ericsson Research
Examiner: Thomas Rylander, Chalmers University of Technology
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Visualization of a radio wave propagating in an urban environment.
Typeset in LaTeX
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023

Abstract
Radio waves at the higher frequencies used in current and future generations of radio networks are more sensitive to details in the environment as they propagate. Previous measurements have shown that street furniture such as poles and trees can have a non-negligible effect on the characteristics of the radio propagation. In around-the-corner situations, street poles can contribute significantly to the path gain, in particular at higher frequencies. Furthermore, the scattering contributes to the richness of the channel. For site-specific modeling, ray tracing simulations are needed. By including street poles in the ray tracing simulations in this work, more realistic simulation results were obtained, seen through changes in Doppler frequency and path gain. This thesis focuses on positioning the poles in the site-specific model, enabling their inclusion in the ray tracing simulations. To begin with, street view panorama images covering the poles in the area of interest were extracted. With deep learning algorithms, object detection was performed through panoptic segmentation and range detection through monocular depth estimation on the extracted images. Given the direction and distance output from images extracted at multiple camera positions, street poles along a road are positioned through triangulation and clustering with a positioning error of 3.5 meters, which is comparable to related approaches in the field. The errors are mostly due to limited GPS accuracy for the camera positions and to difficulties in detecting distant poles.

Keywords: radio propagation, ray tracing simulation, scattering models, poles, street view images, object detection, monocular depth estimation, geolocation

Acknowledgements
First, I wish to thank my supervisors Martin Johansson and Gerhard Steinböck at Ericsson for their engagement and excellent guidance. In addition, I wish to thank my examiner Thomas Rylander at Chalmers for his support and for, together with my manager Henrik Sahlin at Ericsson, giving me the opportunity to perform this thesis.
Further, thanks to Remco Heijs at Ericsson for his support with the simulations, thanks to Lars Hammarstrand at Chalmers for his support with the positioning, and thanks to Georgios Spaias, Vasilis Naserentin and Anders Logg at DTCC for sharing their related work on environment recreation. Finally, thanks to my family, friends, and colleagues for their moral support and advice along the way. Oskar More Arvidsson, Gothenburg, June 2023 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: ANN Artificial Neural Network API Application Programming Interface BS Base Station CNN Convolutional Neural Network CRS Coordinate Reference System CTF Channel Transfer Function EHF Extremely High Frequency EPSG European Petroleum Survey Group FOV Field Of View GSV Google Street View GPS Global Positioning System GPU Graphics Processing Unit IoU Intersection over Union LOS Line Of Sight MAE Mean Absolute Error MDE Monocular Depth Estimation MIMO Multiple Input Multiple Output mIoU mean Intersection over Union NLP Natural Language Processing PEC Perfect Electric Conductor RCS Radar Cross Section ReLU Rectified Linear Unit RGB Red, Green, Blue RMSE Root Mean Square Error PQ Panoptic Quality SHF Super High Frequency UE User Equipment USD Universal Scene Description UTD Uniform Theory of Diffraction ix Nomenclature Below is the nomenclature of indices, sets, parameters, and variables that have been used throughout this thesis. Sets TP True positives TN True negatives FP False positives FN False negatives D Predicted depth values Variables w Network weights b Network bias pred Predicted depth value gt Ground truth depth value Np Number of predictions Nm Number of matched predictions P Power G Gain PG Path gain d Distance f Frequency t Time λ Wavelength τ Delay xi ν Doppler frequency shift g Field pattern Ω Path direction α Gain coefficient D(ν) Doppler power spectrum h(τ, t) Impulse response H(f, t) Frequency response R Reflection coefficient σ Radar cross section N Number of paths xii Contents List of Acronyms ix Nomenclature xi List of Figures xv List of Tables xvii 1 Introduction 1 1.1 Problem formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.4 Ethics and sustainability . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.5 Report structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Theory 5 2.1 Radio propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Radio network . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.2 Channel modeling . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.3 Interactions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Image analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.1 Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.2 Convolutional neural networks . . . . . . . . . . . . . . . . . . 12 2.2.3 Panoptic segmentation . . . . . . . . . . . . . . . . . . . . . . 13 2.2.4 Depth estimation . . . . . . . . . . . . . . . . . . . . . . . . . 16 3 Methods 19 3.1 Creating the 3D environment . . . . . . . . . . . . . . . . . . . . . . 19 3.2 Enriching the 3D environment . . . . . . . . . . . . . . . . . . . . . . 19 3.2.1 Image extraction . . . . . . . . . . . . . . . . . . . . . . . . . 
20 3.2.2 Object detection . . . . . . . . . . . . . . . . . . . . . . . . . 21 3.2.2.1 Mask processing . . . . . . . . . . . . . . . . . . . . 22 3.2.2.2 Direction estimate . . . . . . . . . . . . . . . . . . . 22 3.2.3 Depth estimation . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2.3.1 Distance estimate . . . . . . . . . . . . . . . . . . . . 23 3.2.3.2 Calibration . . . . . . . . . . . . . . . . . . . . . . . 23 3.2.4 Positioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 xiii Contents 3.2.4.1 Separate predictions . . . . . . . . . . . . . . . . . . 24 3.2.4.2 Triangulation . . . . . . . . . . . . . . . . . . . . . . 25 3.2.4.3 Alternative approaches . . . . . . . . . . . . . . . . . 26 3.2.5 Error metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.6 Sources of error . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.7 Synthetic environment . . . . . . . . . . . . . . . . . . . . . . 28 4 Results 29 4.1 Image extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4.2 Pole detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.3 Positioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.4 Synthetic environment . . . . . . . . . . . . . . . . . . . . . . . . . . 35 5 Simulation 39 5.1 Ray tracing tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.2 Simulation scenario . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 5.3 Signal processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.4 Impact on channel characteristics . . . . . . . . . . . . . . . . . . . . 42 5.4.1 Doppler shift . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.4.2 Path gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 6 Conclusion 47 6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 6.2 Improvements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 6.3 Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 Bibliography 51 xiv List of Figures 2.1 Visualization of the propagation of a radio wave modeled as a ray from the BS mounted on a building to the UE on the street. . . . . . 6 2.2 Spherical coordinate system with the angles φ and θ describing the path direction marked. . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 A ray propagating from a transmitter in black to a receiver in white, it interacts with the obstacle in grey through diffraction in Figure 2.3a and diffuse scattering in Figure 2.3b. . . . . . . . . . . . . . . . . . . 10 2.4 Outline of a basic ANN with neurons in input, hidden and output layers between which the input signals x are processed with weights w to output signals y. . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.5 Outline of a CNN with an input image, a convolutional layer, a pool- ing layer, two fully connected layers and a flattened output with re- lated classification labels. The stacked rectangles indicate different kernels and the small rectangle the flow for a set of pixels. . . . . . . 12 2.6 Visualization of the difference between object detection in the form of object localization, instance segmentation, semantic segmentation and panoptic segmentation. . . . . . . . . . . . . . . . . . . . . . . . 14 2.7 Visualization of a distance scale in Figure 2.7b from monocular depth estimation applied on the image in Figure 2.7a. Brighter color indi- cates longer distance. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
16 3.1 An overview of the process for enriching the 3D environment. It includes image extraction, object detection and positioning. . . . . . 20 3.2 The positioning process where separate predictions in green are clus- tered. They are positioned from estimates of direction and distance to poles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 3.3 The positioning process with triangulation where accepted intersec- tions in blue are clustered. Those have two separate predictions in green close to the associated intersection. . . . . . . . . . . . . . . . . 25 4.1 Extracted GSV camera positions shown in red overlaid on a satel- lite image of the area, the green marker showing the position of the panorama in Figure 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.2 An example panorama image extracted in Kista at the position marked with a green marker in Figure 4.1. . . . . . . . . . . . . . . . . . . . . 30 xv List of Figures 4.3 Panoptic segmentation output from the panorama image in Figure 4.2 shown as segmentation masks in Figure 4.3a and blue pole direction estimates in Figure 4.3b where red is North. . . . . . . . . . . . . . . 31 4.4 Results from the monocular depth estimation applied on the panorama image in Figure 4.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.5 Examples of object detection difficulties. These include poles con- fused with the facade in Figure 4.5a and Figure 4.5b, occluded by cars or trees in Figure 4.5c and affected by distortions from the im- age extraction in Figure 4.5e and Figure 4.5f. Also, Figure 4.5d shows how a tree trunk is detected as a pole. . . . . . . . . . . . . . . . . . 33 4.6 Predicted pole positions through triangulation from GSV panoramas shown as magenta crosses compared to the true positions in black along the street in Kista. . . . . . . . . . . . . . . . . . . . . . . . . . 34 4.7 Absolute error distribution for the pole predictions from triangulation compared to the true positions along the street in Kista. . . . . . . . 35 4.8 Corresponding panorama images from GSV in Figure 4.8a and the synthetic Omniverse environment in Figure 4.8b with poles at the true positions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.9 Predicted pole positions from Omniverse panoramas of an environ- ment without trees as shown as magenta crosses compared to the true positions in black along the street in Kista. . . . . . . . . . . . . . . 37 5.1 The traces of receiver indices in Kista from the selected measure- ments. Section S1 with trace index between 1200 and 1300 is along Torshamnsgatan and the orthogonal section S2 from 1360 to 1420 is along Kistagången. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 5.2 A couple of rays with multiple interactions traced between transmitter in red on top of a building and receiver in blue placed at the streets of Kista, Stockholm. . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.3 Normalized power spectrum showing the Doppler frequency for re- ceiver index 1200 to 1300. From left to right, the plots show results for measurements, simulations without poles, simulations with poles, and simulations with only poles. . . . . . . . . . . . . . . . . . . . . 44 5.4 Normalized power spectrum showing the Doppler frequency for re- ceiver index 1200 to 1300. In Figure 5.4a the true pole positions in the section have been used, in Figure 5.4b the detected ones. . . . . . 
45 5.5 Path gain for the section of receiver indices in the orthogonal street, showing an increase due to scattering from the poles around the cor- ner. True pole positions in Figure 5.5a and detected positions in Figure 5.5b. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 xvi List of Tables 4.1 The accepted pole detections from the generated direction output shown in Figure 4.3b and depth output shown in Figure 4.4. Direction given in angles from North, pole position in image along the x-axis in pixels and distance in meters from the camera. . . . . . . . . . . . 32 4.2 Quantified performance of the positioning with single predictions and triangulation applied on GSV panoramas. . . . . . . . . . . . . . . . 35 4.3 Quantified performance of the positioning algorithm applied on Om- niverse panoramas showing environments with varying levels of de- tails. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5.1 The most important parameters for the ray tracing simulation. These include the center frequency and bandwidth, sampling and interaction specifications as well as pole dimensions. . . . . . . . . . . . . . . . . 41 xvii List of Tables xviii 1 Introduction To develop future radio networks, realistic modeling of the networks is important to reach optimal design and performance. The digital modeling is performed by tracing radio waves between devices in the network, investigating how the rays interact with the environment. One possible way to make the simulations more realistic could be to enrich the simulation environments by including more details. Going beyond using outlines of buildings and streets, simulations in an urban environment could include street furniture like trees, lamp posts and cars from which the radio waves can bounce off. 1.1 Problem formulation The purpose of this thesis is to investigate how enrichment of the 3D environment could make the ray tracing simulation of a radio network more realistic. This is studied by investigating how the electromagnetic characteristics of the radio network in the simulation may change with the inclusion of street furniture compared to the previous model. In order to obtain a sufficiently detailed environment, this thesis aims to detect, classify and position the street furniture from street view images. Using implemented electromagnetic models for the objects, the simulation results for the radio channel characteristics will be analyzed for indications of a more realistic digital twin. 1.2 Background For new generations of radio networks, such as 5G and 6G, the requirements and design have become more complex [1]. An increasing population with an increasing number of connected devices demands wireless connection in an increasing area. With the development of ray tracing simulations and computational power, the pos- sibility of simulating networks and designing them to achieve the best performance has improved [2]. By simulation and recreation of the physical product, one can analyze the performance more efficiently. This is the concept of a digital twin, and it has become a successful concept in topics ranging from product cycles to recre- ating city centers [3]. For the digital twin of a radio network, the electromagnetic characteristics should be as similar to the real network as possible. This of course 1 1. Introduction has to be with consideration to how many details actually are relevant seen to the overall performance. 
One way in which this digital twin of the radio network could be improved to be more realistic is through enrichment of 3D scenarios. With enrichment of the sce- narios, it is here meant to include street furniture [4] in addition to buildings and street outlines already included. Street furniture can be objects such as trees, lamp posts, street signs and cars to name a few. Relevant work done by Ericsson [5] and others [6, 7, 8] indicates that including street furniture in the ray tracing simula- tion environment can actually have a non-negligible effect on the radio propagation characteristics. More specifically, deviations between measurement and simulation are observed in path gain and in frequency shift through the spread in Doppler due to the moving receiver in their scenario. This may be due to the street furniture having an effect on the real radio channel characteristics of the measurements, but its effect is lacking in the ray tracing simulation. To create a 3D environment as realistic as possible, it is important to recreate real scenarios including street furniture. Recreating specific scenarios is one approach, and using statistical rules applied to approximately include the correct amount of positioned street furniture is another. To make the specific recreation faster and easier, or at least improve the statistical rules to be more specific to a city or area, an automated workflow to detect street furniture can be applied. To find and position street furniture, street view images from urban environments can be used. Object detection and positioning algorithms can be performed [9, 10] with these images. A neural network can be trained on large sets of annotated image data including different object classes to handle the detection task. One example is the network in [11] that has been trained on the extensive Mapillary Vistas dataset including street furniture such as vegetation, lamp posts and cars as object classes [12]. The extracted positioning information about the details in the urban environment can then be merged into the scene description for an analysis of the impact on the ray tracing simulations. 1.3 Limitations As stated in section 1.1, the aim of the thesis is to detect and position street furniture from images to enable enrichment of the 3D environment for ray tracing simulations. Especially, the work focuses on one kind of street furniture, namely poles of street signs and street lights. This is because poles are commonly encountered in urban scenarios, and the conductive material has a relatively clear influence on the radio waves compared to other materials such as wood. The clear vertical shape of the poles also gives the opportunity for more accurate detection and positioning, as well as more accurate electromagnetic modeling, as compared to trees. Compared to other street furniture such as cars, the poles are also static and not commonly changed. This is more suitable for recreating specific scenarios. Furthermore, it is especially the inclusion of poles that indicates non-negligible effects in [7, 8] and in measurements performed by Ericsson. The measurements are performed in Kista, 2 1. Introduction Stockholm, which for that reason will be the focused area in this thesis. Despite the focus on poles, the aim is still to keep the positioning process as generic as possible in order to be useful for other street furniture as well and also suitable for other locations. 
The contribution of this thesis will be to combine and integrate the methods of automatic image analysis and radio propagation simulations. This can hopefully contribute to how digital twin generation could be more realistic by enrichment of 3D scenarios in a more general and automated way. 1.4 Ethics and sustainability Regarding the street view images, consideration for privacy and integrity has been taken since the images used have blurred areas covering for instance faces and reg- istration plates. The part of the thesis work including the image analysis and deep learning is based on existing imagery and open-source implementations. The choice of pre-trained models is a sustainable choice since training deep neural networks is energy consuming. Similarly, it is sustainable to use already existing measurement data for the radio propagation study. 1.5 Report structure Following this thesis introduction, the theory chapter will introduce the topics of radio propagation and image analysis with machine learning on a basic level needed to follow the main part of the thesis. The upcoming method chapter describes the workflow including the creation of the enriched environment through image extraction, object detection and object positioning. With a similar order the result chapter presents and visualizes the step by step achievements. This is followed by a description of the ray tracing simulation and the results obtained with this enriched environment. Finally, the conclusion chapter sums up the main discussion points and contributions of this thesis. 3 1. Introduction 4 2 Theory This chapter presents the underlying theory behind the methods implemented in this thesis work. First, there is an introduction to the basic theory of radio propagation to set the scene for the digital twin of the radio network. Secondly, the general theory of neural networks for image analysis including object detection is covered which is the base for the creation of the enriched 3D environment. 2.1 Radio propagation The concept of radio propagation modeling is to analyze how radio waves propagate or travel between points in space. Similar to light waves, radio waves are elec- tromagnetic waves that can be characterized by frequency, amplitude, phase and polarization. These waves are attenuated or weakened with the traveling distance and can also undergo reflection, scattering and diffraction as they propagate. An- alyzing the propagation of radio waves through measurements and simulations is important for the design of radio networks [13]. 2.1.1 Radio network In a radio network, the electromagnetic signals of the wireless communication are sent from central base stations (BS) strategically placed to reach a majority of the user equipment (UE). The BS usually has an antenna system mounted on a mast or building that communicates with a large number of UEs, such as mobile phones or other connected apparatus. The sending device is referred to as the transmitter and the target device as the receiver. The development of wireless telecommunication opened the door for vast opportunities and that a connection to a 4G network today can be reached in most areas. The increasing amount of connected devices and the increase in data traffic require better communications. In recent years, the development and implementation of the next generation, 5G, has spread in larger cities, and research on 6G networks proceeds. The 5G network can handle an increased number of devices in crowded areas better than previous generations. 
One contribution to this is that new parts of the fre- quency spectrum are used for communication. The frequency of the signals in the 5G network can cover parts of the spectrum from just below 1 GHz, similar to 4G 5 2. Theory networks, and up to around 50 GHz. The range of around 400-900 MHz is referred to as the low-band, 1.7-4.7 GHz as the mid-band and 24-47 GHz as the high-band. The mid-band is the most widely used, with around 3 GHz being the frequency im- plemented in most big city 5G networks. The frequency range 3-30 GHz is referred to as the super high frequency (SHF) or centimeter wave band associated with the wavelength, whereas 30-300 GHz is referred to as extremely high frequency (EHF) or millimeter wave (mmWave) [14]. Radio waves with a frequency higher than a few MHz follow the so-called line of sight (LOS) propagation [15]. This means the rays travel in direct paths. They may be diffracted, reflected or scattered on the way between the transmitter and receiver, but they will not bend and follow the contour of the Earth around the horizon like lower frequency waves. Between the transmitter, which in an example could be a base station, and a receiver, which could be a mobile phone, there are so-called Fresnel zones. The Fresnel zones include possible LOS propagation paths. The first Fresnel zone includes the strongest signal. The radio wave propagation through space is in ray tracing modeled as a ray following a straight line from BS to UE as in Figure 2.1. The radio wave has additional characteristics such as amplitude and polarization that are taken into account. Figure 2.1: Visualization of the propagation of a radio wave modeled as a ray from the BS mounted on a building to the UE on the street. For 5G radio networks, the base station antennas are phased array antennas. In combination with an antenna array for the user they form a multi-input multi- output (MIMO) antenna system [16]. The definition of a phased array antenna is that it is a combination of smaller regular antenna components, whose output phase can be shifted in order to form a beam of the signal to the desired direction and power through interference. This design enables a better directivity of the antenna, which means that more of the antenna output power is going to the desired direction towards the receiver. Thus, there is not only the BS position but also several parameters such as the directivity, the gain and the transmitted power that can be varied to optimize the network. The power of the propagating radio wave between a transmitter and a receiver can be described with the radio equation, also known as Friis equation [15]. This equation is valid for free space and line of sight propagation for one radio wave, which is 6 2. Theory assumed in an initial stage. The equation expresses the received power Pr as Pr = PtGtGr ( λ0 4πd )2 (2.1) where Pt is the power fed to the transmitting antenna. Gr and Gt are the antenna gains of the receiving and transmitting antenna systems, respectively. The antenna gain describes the directivity and radiation efficiency of an antenna. An antenna gain of one would mean a theoretical isotropic antenna with identical characteristics in all directions. However, real antennas are not isotropic but instead more sensitive or output more power in a specific direction, and hence have a different gain. The distance between the antennas is denoted by d and the wavelength of the radio wave by λ. 
The wavelength is calculated as λ0 = c0/f, where c0 is the speed of light in vacuum and f the frequency. Together, they compose the term (λ0/(4πd))², which describes the so-called free-space path loss [15]. Dividing the received power by the transmitted power gives the path gain

PG = P_r / P_t .    (2.2)

The path gain is often used as a measure of the performance of a radio network, since it relates the power available to a receiver to the transmitted power. Similarly, the path loss, expressed as 1/PG, can be used. Calculating the PG for several receiver positions yields a coverage map indicating how good a connection a UE would have at different positions in an environment.

2.1.2 Channel modeling

Characterizing a full radio channel is more complex than characterizing one radio wave in LOS propagation, and channel modeling is needed for networks with antenna arrays. A radio channel comprising N radio wave paths can be characterized by an impulse response [13] of the form

h(\tau, t) = \sum_{n=1}^{N} g_{Tx}(\Omega_{Tx,n}) \, g_{Rx}(\Omega_{Rx,n}) \, \alpha_n \, e^{-j 2\pi f_c \tau_n} \, \delta(\tau - \tau_n) \, e^{j 2\pi \nu_{Rx,n} t} .    (2.3)

Here, g_{Tx} and g_{Rx} are the field patterns describing the directivity of the transmitting antenna Tx and the receiving antenna Rx, respectively. These field patterns are functions of the respective path directions Ω_{Tx,n} = [φ_{Tx,n}, θ_{Tx,n}] and Ω_{Rx,n} = [φ_{Rx,n}, θ_{Rx,n}] of each path n. Given a spherical and horizontal coordinate system, φ is the azimuth angle and θ the elevation angle from the vertical axis, as shown in Figure 2.2.

Figure 2.2: Spherical coordinate system with the angles φ and θ describing the path direction.

The individual gain of each path is denoted α_n and depends on the type of interactions, such as line of sight, reflection, diffraction and scattering, that the path undergoes. The expression for α_n is further described in subsection 2.1.3 below. Each path also exhibits a certain phase rotation, e^{-j 2π f_c τ_n}, dependent on the path's delay τ_n and the carrier frequency f_c. The delay is due to the distance traveled by the radio wave and causes the shift represented in Equation 2.3 by the Dirac delta function δ(τ − τ_n). Finally, there is another phase shift due to the mobility of the receiver. This is due to the Doppler effect [17], which results in a Doppler frequency shift

\nu_{Rx,n} = \frac{\mathbf{r}^{T}_{Rx,n}(\Omega_{Rx,n}) \, \mathbf{v}}{\lambda_0} .    (2.4)

In this expression, the velocity vector of the receiver depends on the angles of movement as \mathbf{v} = v [\sin\theta_v \cos\phi_v, \sin\theta_v \sin\phi_v, \cos\theta_v], where v is the absolute velocity in the direction of movement. The path's directional vector is \mathbf{r}_{Rx,n} = [\sin\theta_{Rx,n} \cos\phi_{Rx,n}, \sin\theta_{Rx,n} \sin\phi_{Rx,n}, \cos\theta_{Rx,n}], and \mathbf{r}^{T}_{Rx,n} in Equation 2.4 is its transpose. These vectors are given in Cartesian coordinates. The Doppler effect for electromagnetic radio waves works similarly to that for sound waves: with a receiver moving relative to the transmitter, the frequency decreases for a receding receiver and increases for an approaching receiver. The effect is further described in [17].

2.1.3 Interactions

Radio waves of the new generation networks in the SHF and EHF ranges are particularly sensitive to interactions with objects due to their short wavelength. With the advancement of computational performance through graphics processing units (GPUs), the possibility to model these detailed interactions has improved, which is why ray tracing simulations are increasingly used as an essential part of analyzing radio networks.
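To make the quantities above concrete, the short Python sketch below evaluates the free-space path gain of Equation 2.2 and the Doppler shift of Equation 2.4 for a single LOS path. It is only an illustration; the carrier frequency, distance and receiver velocity are example values and are not taken from the measurements or simulations in this thesis.

```python
import numpy as np

def free_space_path_gain(distance_m, frequency_hz, gain_tx=1.0, gain_rx=1.0):
    """Path gain PG = Pr/Pt for free-space LOS propagation (Equations 2.1-2.2)."""
    wavelength = 3.0e8 / frequency_hz
    return gain_tx * gain_rx * (wavelength / (4.0 * np.pi * distance_m)) ** 2

def doppler_shift(path_direction, velocity, frequency_hz):
    """Doppler frequency shift of Equation 2.4 for a path arriving from the unit
    vector path_direction at a receiver moving with velocity (m/s, Cartesian)."""
    wavelength = 3.0e8 / frequency_hz
    return np.dot(path_direction, velocity) / wavelength

# Example values: a 3.5 GHz mid-band carrier, a 100 m LOS path, and a receiver
# moving at 10 m/s straight towards the transmitter.
f_c = 3.5e9
pg = free_space_path_gain(distance_m=100.0, frequency_hz=f_c)
nu = doppler_shift(np.array([1.0, 0.0, 0.0]), np.array([10.0, 0.0, 0.0]), f_c)
print(f"Path gain: {10 * np.log10(pg):.1f} dB, Doppler shift: {nu:.1f} Hz")
```

For these example values the path gain is about -83 dB and the Doppler shift about 117 Hz, i.e. the maximum shift v/λ0 obtained when the receiver moves directly along the path direction.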
With this higher sensitivity at SHF and EHF frequencies, it also becomes more complex to model and optimize the network characteristics through the placement of the antennas and the choice of frequency [15].

For a ray tracing simulation in an urban environment with many obstacles, the interactions of the rays with buildings, streets and street furniture have to be modeled. Without interactions, i.e. with only line of sight propagation as described by Equation 2.1, the gain of each path is

\alpha_{LOS} = \frac{\lambda_0}{4\pi d} ,    (2.5)

in other words the free-space path loss. For a plane wave incident on a planar material interface, Snell's laws describe how the angle of incidence is related to the angles of reflection and transmission, where the angle of transmission depends on the materials. Furthermore, the Fresnel coefficients describe the magnitude and phase of the reflected and transmitted waves in relation to the incident wave, where these coefficients depend on both the material and the polarization [18]. On a smooth surface of a perfect electric conductor (PEC), the radio wave undergoes specular reflection and the reflection coefficient is R = −1. The gain of the path is then described by the coefficient

\alpha_{R} = R \, \frac{\lambda_0}{4\pi (d_1 + d_2)} ,    (2.6)

where d_1 is the distance from the transmitting antenna to the object and d_2 is the distance from the object to the receiver. However, for real-world objects such as buildings this is seldom the case, since building facades are rough and the material is not always PEC. For non-PEC materials |R| < 1, which can be obtained for various material types, thicknesses and incoming angles according to [18]. For rough materials, one common model is the Lambertian diffuse scattering model [19]. The model approximates the surface roughness and the backscattering due to small-scale geometric variations over the surfaces. The diffuse scattering spreads out the rays in a hemisphere, which is visualized in Figure 2.3b. Diffraction, the interaction of rays on and around edges and corners of buildings, is also visualized in Figure 2.3. A common model of diffraction is the uniform theory of diffraction (UTD) [20].

Figure 2.3: A ray propagating from a transmitter in black to a receiver in white interacts with the obstacle in grey through (a) diffraction and (b) diffuse scattering.

In the work related to this thesis, the interactions with street furniture are treated in a special way. To model these interactions, two line of sight paths, before and after the obstacle, are considered. The influence of the obstacle is added through a calculated radar cross section (RCS) σ, where the calculation follows the radar equation [15]. The individual gain of such a path is obtained as

\alpha_{Scat} = \frac{\lambda_0 \sqrt{\sigma}}{(4\pi)^{3/2} d_1 d_2} ,    (2.7)

where d_1 and d_2 once again are the respective distances from the transmitter to the obstacle and from the obstacle to the receiver. The radar cross section depends on the incoming and outgoing angles of the paths [8]. The street furniture in focus for this thesis is poles of street lights and signs. This is in part because recent work such as [7, 8] has shown that poles can contribute significantly to the path gain (PG). For scattering on a pole, the radar cross section can be modeled analytically, which has been used as a base for the current implementation. In [21] the radar cross section is provided for the far field.
Far field means that at a great distance from the pole, the electromagnetic field is decreasing inversely proportional to the increasing distance. This is given a signal that is not incoming from the top of the pole. Both polarizations are handled separately. To fulfill this assumption also on shorter distances for typical frequencies, the pole has been divided into smaller sections. As the specific electromagnetic modeling is not part of this thesis, the reader is referred to the provided references [7, 8] for more details. Other types of street furniture can also be included in radio propagation model- ing. Trees and vegetation are interesting scattering objects. In a simplified model, they can be seen as a medium other than free space through which there is line of sight propagation. Then another term on top of the free space loss is multiplied to Equation 2.1 to account for the stronger attenuation through the vegetation. More complex scattering models for trees could also be analyzed and possibly implemented similar to as described in [22], but the complexity of the tree shape and medium makes it even more difficult and computationally costly than for poles. 2.2 Image analysis In this section, the theory of analyzing image material to find the street furniture interesting for the ray tracing simulations is covered. Extracting information from images may be seen as a fairly simple problem, since a person from looking at a picture quite easily can get a perception of what it depicts and where in the scene the features are. With the rise of computing power and the field of deep learning, the analysis can now be automated. In this way, detecting objects and recreating scenes from imagery can be possible with less manual work, but the problem of getting there is not as simple. The theory of general neural networks is covered, followed by the more complex networks for object detection implemented in this thesis. Finally, an introduction to neural networks for distance estimation in images 10 2. Theory will be presented. 2.2.1 Neural networks Inspired by the function of human brains, artificial neural networks (ANN) or simply neural networks are built to recreate how the brain can learn from multiple inputs throughout life to recognize patterns, situations and objects. ANNs are usually built with several layers of neurons between which signals, represented by numbers, can be processed. The signals are multiplied with individual weights depending on between which neurons they pass. Figure 2.4 shows the most simple ANN architecture, with neurons divided into input and output layers with a hidden layer between them. From each neuron, the signal is processed with a weight contributing to the value of the neurons in the next layer. In supervised learning, the network is trained with a large amount of input data for which the classification is known. In the training process, the ANN adjusts the weights between neurons so that it can classify new test data based on what has been learned previously [23]. Input layer Hidden layer Output layer Figure 2.4: Outline of a basic ANN with neurons in input, hidden and output layers between which the input signals x are processed with weights w to output signals y. For the most simple neural network architecture shown in Figure 2.4, the forward propagation of the signals through the network is shown in Equation 2.8. There, the output signal yl m for the output layer numbered l and neuron number m is calculated. 
The expression includes a sum of the values from all neurons xl−1 n in the layer before, where each value has been multiplied with an individual weight wl n,m between neuron n in layer l − 1 and neuron m in layer l. Additionally, a bias term bl m is added. yl m = ∑ n xl−1 n wl n,m + bl m (2.8) To set the scene, one commonly used example of the application of basic ANNs is the recognition and classification of handwritten digits in the MNIST dataset [24]. Then the input values to the neural networks input layer are the values of the different pixels, projected on a one dimensional vector, and the output after processing through the network is an integer. 11 2. Theory 2.2.2 Convolutional neural networks Convolutional neural networks (CNN) are a type of ANNs that were developed for image analysis. In comparison to the most basic artificial neural networks, CNNs can better handle multidimensional input such as the different pixel values of an image. For the MNIST dataset, CNNs significantly improve the classification ac- curacy compared to basic ANNs [25]. In scaling the application to more complex images and detection tasks a basic CNN will be necessary. The special design of the CNNs makes it possible to reduce the overall number of weights or parameters in the network, hence reducing both computational time and the risk of overfitting. Overfitting means that there are so many parameters to adjust that the overall result will fail to solve the general task of classifying test data. The general architecture of a CNN is sketched in Figure 2.5. The input image is split into different depth layers, such as the three color layers red, blue and green (RGB) for images with color. For a grayscale image, one depth layer would be enough. The image is processed in convolutional layers, pooling layers and fully connected layers, respectively, before the signal is flattened and outputted as a label of what the image is picturing [23]. The smaller square visualizes the flow of a couple of pixels through the network. car bike … tree … … Input image Convolutional layer Pooling layer Fully connected layer Flattening Output Figure 2.5: Outline of a CNN with an input image, a convolutional layer, a pooling layer, two fully connected layers and a flattened output with related classification labels. The stacked rectangles indicate different kernels and the small rectangle the flow for a set of pixels. In the convolutional layers different convolutional kernels, as sketched in the depth dimension, are applied to parts of the input matrices. These are typically of di- mension 3x3 or 5x5 including weights and are applied with regular scalar product, preserving the height and width dimension. The scalar multiplication is followed by an elementwise activation function in the form of the rectified linear unit (ReLu), returning max(0, x) where x is the element. This activation function helps reduce overfitting by setting negative values to zero. There are different kernels learned to classify low level details such as edges or general shapes in the image. The pooling layer then reduces the dimensionality with downsampling, where neigh- boring pixel values are associated together resulting in one common value. The best performance achieved is with maximum pooling, extracting the maximum value of the region, since it suppresses noise from earlier values set to zero by the ReLu. An- 12 2. 
Theory other option could be average pooling taking the average of included values, which as described would be more sensitive to noise. Between the convolutional layer and the pooling layer in Figure 2.5, or after the pooling layer, another convolutional layer can be applied to repeat the process. In the new layer, the kernels can be trained to recognize more high level and feature specific characteristics such as the wheel of a bike or the wing of a bird. The inclusion of additional convolutional layers and even pooling layers can be repeated multiple times. After the last pooling layer, what would be classified as enough details of the image have been detected. In addition, the dimensionality of the neurons has decreased to a manageable size such that two fully connected layers, with functionality similar to the most basic ANNs, can be applied. Between these another ReLu activation function can be applied to further increase the performance and reduce the noise of a blurry picture for instance. Finally, by flattening the signals, one-dimensional output referring to the classification label of the image can be delivered. For instance, the image in Figure 2.5 can be labeled to include a car or a bike. For a neural network to be ready to classify images as explained, it first has to be trained on a vast amount of data. The increasing size and variety of datasets have been one major reason for the development of the image analysis field. The datasets include annotated images, which means that the images are associated with a label of what it depicts. The variety of annotation labels, or object classes, determines what objects the network can predict in the test images. Two of the most widely used datasets are Microsofts Common Objects in Context (COCO) [26] and ImageNet [27], both with hundreds of object classes. There is a variety of options to construct CNNs with different numbers of layers and different kernels. One of the best performing and most widely used architectures is ResNet, short for residual net, developed by Microsoft Research and setting records on both the image classification datasets COCO and ImageNet by the time of de- velopment [28]. The inclusion of residual learning, where the signal paths include skips over layers, reduces the impact of the vanishing gradient problem [23]. This problem means that occasionally small changes in the network parameters can make the optimization less accurate. The ResNet50 for instance is comprised of 50 layers, including five convolutional layers with combinations of 7x7, 3x3 and 1x1 convolu- tional kernels, and is commonly used as a backbone, where other algorithms have used this one as a first step and baseline to build upon. 2.2.3 Panoptic segmentation The convolutional neural networks covered in subsection 2.2.2 can manage the image classification task. This means that given an image, the output will be the network’s idea of what it represents. For some applications this is enough, but there are further steps in the development of neural networks for image analysis. A first step would be to not just classify, but also localize the object detected by creating a bounding box around it. This could be approximated by the ResNet from the resulting labels [28], but to include it in the training process the regional convolutional network 13 2. Theory (R-CNN) [29] was developed. 
The R-CNN iteratively increases the analyzed region of the image to better define what part of the image that is relevant to analyze for the object in question, called the region of interest (RoI). (a) Object localization. (b) Instance segmentation. (c) Semantic segmentation. (d) Panoptic segmentation. Figure 2.6: Visualization of the difference between object detection in the form of object localization, instance segmentation, semantic segmentation and panoptic segmentation. Similar to investigating the regions of interest, the Mask R-CNN predicts a segmen- tation mask for the object in addition to the bounding box [30]. The segmentation mask is a coupled group of pixels that are all part of the object. The segmentation masks can be formed in different ways, where separating specified individual objects with separate masks is called instance segmentation. If individual objects of the same class are grouped together in the same mask, which is common with object classes that can be difficult to differentiate, it is called semantic segmentation. The difference between these segmentation methods is visualized in Figure 2.6. This version of semantic segmentation also includes classification of background pixels in the image. Combining classification of all pixels in the image with the separation of as many object masks as possible, similar to instance segmentation, is referred to as panoptic segmentation. In panoptic segmentation, the object classes are often divided into ”things” and ”stuff”. Things like a person or car are objects that distinctly can be separated into individuals and that are countable. Stuff such as buildings, vegetation and roads are classified as regions and assumed uncountable. One of the most accu- rate and qualitative panoptic segmentation algorithms is developed by FaceBook AI 14 2. Theory Research and named Mask2Former [11]. As with the previous methods, it builds upon a backbone pre-trained on the ImageNet dataset. It can be chosen to be a standard ResNet backbone, but a so-called Swin transformer backbone can enhance the performance even more [31]. Swin stands for shifted windows, which means it can adjust the size and placement of regional windows where the detection is performed. The transformer concept is inspired from natural language processing (NLP) where the language is broken down and analyzed from learned features. In a similar manner, the transformer for image analysis performs the detection with help from learned queries and features for the detection of object classes. In this way, the search is based on the known connection between queries and features rather than upsampling from pixel level to something similar to the object class. The Mask2Former algorithm then combines feature pyramid extraction in a pixel decoder and a transformer decoder for the most accurate pixel classification through- out the picture. The pixel and transformer decoder are both specialized in detecting high resolution and feature specific details. They can in this architecture work to- gether by complementing each other for a better result. The pixel decoder gradually covers more detailed areas and objects with a multi-scale approach on different res- olutions. The transformer decoder is applying masked attention, which can take the multi-scale detection and complement with detection from learned features such that the extraction of the high resolution features is improved. For panoptic segmentation, the most common evaluation metric is the panoptic quality (PQ) [32]. 
It can be seen as a metric of how well the predicted segmentation masks match the ground truth masks of the image. It is calculated as

PQ = \frac{\sum_{(p,q) \in TP} \mathrm{IoU}(p, q)}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|} ,    (2.9)

where TP is the collection of prediction masks p and ground truth masks q that match, FP is the collection of predicted masks without a match, and FN is the collection of the remaining unmatched ground truth masks. Intersection over union (IoU) is a metric of the overlap between a matching prediction and ground truth mask, formulated as

\mathrm{IoU}(p, q) = \frac{|p \cap q|}{|p \cup q|} .    (2.10)

As an additional measure of the quality of the segmentation, the mean IoU (mIoU) can be used as well, which, as the name describes, is the mean of the IoU values over all masks in the set TP, i.e. the predicted masks with matching ground truth masks.

A widely used dataset with segmentation masks is the Mapillary Vistas dataset [12]. It includes street view images and is widely used for training algorithms for autonomous driving. The dataset contains 25 000 high resolution images from cities covering 6 continents, with a variety of weather and camera settings. Furthermore, the images have been manually annotated with 124 semantic object categories, of which 100 are instance specific. The use of the Mapillary dataset for training, and thus the inclusion of street poles as an object class, is the main reason for the choice of the Mask2Former algorithm. In addition, it has a competitive panoptic quality of 45.5 and a mIoU of 60.8. Furthermore, it builds on the previously developed Facebook AI library detectron2 [33], implemented with PyTorch, which makes it relatively handy to use.

2.2.4 Depth estimation

Associated with the field of object detection is range detection. For an autonomous car, for instance, it is not enough to know that there is another car in the vicinity and in which direction it is located; it is also crucial to know how far away it is. For range estimation there are three different types of sensors that can be used. Besides image-based sensing, these include radio detection and ranging (radar) and light detection and ranging (lidar). Radar is based on reflection of radio waves and lidar on reflection of light waves. However, the most accessible sensor is the camera. The signals from all these types of sensors can also be used together [34] to improve performance. Using only camera input could lead to estimates with relatively big errors and narrower applications compared to the other sensors. However, the range accuracy using images has improved. Predicting the range or depth in a single image, called monocular depth estimation (MDE), is a growing field with the development of deep learning [35]. Neural networks can be trained on the increasing number of image datasets to predict relative depth in test images. This can be done in a self-supervised manner, learning depth estimates from stereo images taken from different directions and even complemented with video material, or with supervised learning and ground truth data. The trained algorithm can predict a depth map including values for each pixel in the image, commonly visualized on a colored map as in Figure 2.7b, where brighter color indicates a longer distance.

Figure 2.7: Visualization of a distance scale in Figure 2.7b from monocular depth estimation applied on the image in Figure 2.7a: (a) original image, (b) monocular depth estimation. Brighter color indicates longer distance.
The colored scale in Figure 2.7b enables a perception of the relative distances in an image. For instance, it can be seen that the ground covered in snow and the persons standing on it are closer to the camera than the sky. In order to relate the predicted depth scale to metric absolute distances between the camera and object, one has to take into account the characteristics of the camera such as focal length and the size of the image [36]. 16 2. Theory One of the best performing algorithms is named Monodepth2. The training dataset includes imagery captured with cameras of different focal length and thus field of view (FOV) leading to an algorithm with a general application. The general appli- cation of the Monodepth2 algorithm makes it possible to use for other image and camera types that are not necessarily included in the training dataset. As proposed in [36] and also implemented further in [9], the depth map can be scaled for appli- cation on a specific image type. This approach calibrates the scale from knowledge of the true data in a test set of images for a later general application. Monodepth2 is a combination of a depth and pose decoder [36]. The depth decoder is a type of CNN called U-Net, which is an optimized version of a fully connected CNN to analyze the most detailed features of an image. The pose decoder is a slightly modified ResNet backbone as described in subsection 2.2.2, pre-trained on the ImageNet dataset. It is modified such that it from two different image frames can predict a relative pose of an object. Predictions from the depth decoder are in training compared to the pose prediction in a loss function to be minimized. This is performed on pixel level followed by upsampling to bigger regions for consistency over the image. The algorithm training is self-supervised. There is no prior an- notated knowledge of the ground truth data, instead the self-supervised approach uses an automatic approach to approximate the truth from the image poses. The dataset used consists of stereo imagery, with different views on the same object, and monocular video material. The images depict street view scenarios. The performance of depth prediction algorithms for MDE is typically evaluated on the KITTI dataset [37]. It is an image and distance dataset in which true distance values are provided from the collection with lidar sensors simultaneously with the image capturing. Different evaluation metrics can be used, where one of the most common ones is the root mean square error (RMSE) which is evaluated as RMSE = √√√√ 1 |D| ∑ pred∈D ||gt − pred||2 (2.11) where D is the set of all predicted depth values pred for a single image, each of them compared to the ground truth gt. Monodepth2 receives one of the best root mean square errors (RMSE) of 4.63 meters [36]. 17 2. Theory 18 3 Methods In this chapter, the methods for the main track of the thesis workflow are described. First, it will be described how the 3D environment is created. The focus is then on the enrichment of the 3D environment. This includes the image extraction and the image analysis for which the theory has been covered, followed by the positioning of the objects in the environment. 3.1 Creating the 3D environment In order to provide a digital twin of the radio network that is as realistic as possible, the 3D environment in which the ray tracing simulations are performed also has to be realistic. 
One of the areas in which measurements and simulations are performed by Ericsson is in Kista, Stockholm as introduced in section 1.3. To build up the 3D environment, terrain data is first fetched for the desired area of approximately 2000 times 1800 meters. This includes a height over sea level profile with a resolution of approximately one meter. In addition, so-called shapefiles including the outlines of buildings and street networks in vector format are fetched for the area. The terrain information and the shape outlines are then merged so that they overlap in a tool called CityEngine used for building 3D environments. From the CityEngine tool, the model is exported as a universal scene description (USD) developed by Pixar. This USD format saves computer graphics data in three dimensions and is optimal for including different characteristics of an environment. USD files are commonly used for representing 3D environments in ray tracing tools. USD file data can be visualized in a variety of tools, such as Omniverse Create from NVIDIA. Omniverse Create can both build and use 3D environments for visual effects in an extensive way, where the visual effects of for instance windows can be included and different weather conditions can be applied. Also, the resulting rays from the ray tracing simulations, once performed, can be included for visualization in the environment of Omniverse Create. 3.2 Enriching the 3D environment When enriching the 3D environment, proper image material of the area including the street furniture has to be extracted. This process is described in this section, 19 3. Methods followed by how the images are processed with object detection to position the street furniture in the environment. An overview of the enrichment process can be seen in the flowchart in Figure 3.1. The flowchart includes the image extraction, the algorithms used with some of the processing steps implemented in this thesis as well as the two positioning methods with belonging clustering. This is an initial overview, where more detailed descriptions follow in this section. Furthermore, the poles are merged into the synthetic 3D environment where panoramas also can be extracted for an analysis of different error sources that are not present in a synthetic environment. Panoptic segmentation Separate predictions Triangulation Direction output Distance output Panoramas Clustering Depth estimation Clustering Calibration Processing Matching Image extraction Object detection & Depth estimation Positioning Figure 3.1: An overview of the process for enriching the 3D environment. It includes image extraction, object detection and positioning. 3.2.1 Image extraction To begin with when enriching the environment, suitable image material is extracted from the area of interest. In this thesis, street view images from Google Street View (GSV) have been used since they cover many cities and also can deliver great image quality [38]. Other providers of street view images such as Mapillary [12] offered a limiting quality since they are created by private contributors to a greater extent. As a compliment, aerial or satellite images are extracted from Google Earth for an overview of the area [39]. For as good coverage of the environment as possible, panorama images with a field of view (FOV) of 360 degrees are extracted. The GSV application programming interface (API) can provide the available camera positions. Smaller tiles of the view are extracted and stitched together to form the full panorama. 
In total, 16 tiles in the horizontal direction and 9 tiles in the vertical direction are stitched, with deliberate consideration such that the distortion in the panoramas is minimized [40]. After stitching, the extracted images are scaled down from a resolution of 6656 x 3328 to 1024 x 320. This is performed with the interpolation method INTER_AREA from the Python package OpenCV [41], which interpolates the pixel values depending on the area relation. The low resolution is chosen because the depth estimation algorithm Monodepth2 requires specific input resolutions.

In this thesis, several panorama images along a street are extracted to populate a whole street scenario with street furniture. These are extracted with an approximate distance of 10 meters between them, which is the shortest distance available and is chosen to maximize the number of images featuring each object. The GSV car is supposed to have a maximum speed of 45 km/h and capture at least one image every third meter. Filtering by Google based on image quality has then led to the publication of images approximately every 10 meters [42].

The Google API works such that it provides the closest available camera location given an approximate position. To extract the camera positions and panoramas in a certain area for this thesis, the extraction is started with an approximate location, in latitude and longitude coordinates, at the beginning of a street segment. In addition, the approximate heading of the street is given. The heading is given in degrees in the range [0, 360) similar to a compass, where 0 is North, 90 East, 180 South and 270 West. The first camera position is extracted from the API and is the one closest to the initial guess. The guess for the second camera location is placed at a distance of 10 meters in the approximate heading of the street, and the camera location closest to this second guess is also saved. For the upcoming camera locations, the heading of the guess is updated so that it follows the direction of the two previously saved camera locations.

In order to make calculations like these with the camera locations, the usual GPS coordinates, given as spherical coordinates with latitude and longitude according to the World Geodetic System 1984 (WGS84) standard, are projected to a coordinate reference system (CRS) with Cartesian coordinates. There are many different CRSs, each with a different coverage depending on the region on Earth. Regions can overlap and sometimes different CRSs can be used in the same region, which makes it important to use the same transformation for all coordinates included in the same project. For the Stockholm example above, the CRS with European Petroleum Survey Group (EPSG) code 32633 is used. This was chosen since a shapefile with this code was already provided for the area. A shapefile is a file format saving geospatial vector data, such as building polygons or road lines, according to one of these reference systems. Given this reference system, a point close to the middle of the area of interest was chosen as the new center for the coordinates for easier visualization.

3.2.2 Object detection

Once the images containing the street furniture have been extracted along the road, the details are extracted with the object detection algorithm. A state-of-the-art neural network for panoptic segmentation, described in subsection 2.2.3, is used.
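As an illustration of how such a segmentation network can be applied to a panorama, a minimal sketch using the Hugging Face transformers interface is given below. The checkpoint name and the pole label id are assumptions for illustration and do not necessarily match the implementation used in this thesis.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Assumed checkpoint name for a Mask2Former model pre-trained on Mapillary Vistas.
CHECKPOINT = "facebook/mask2former-swin-large-mapillary-vistas-panoptic"

processor = AutoImageProcessor.from_pretrained(CHECKPOINT)
model = Mask2FormerForUniversalSegmentation.from_pretrained(CHECKPOINT).eval()

panorama = Image.open("panorama.png").convert("RGB")
inputs = processor(images=panorama, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Merge the predicted masks into one panoptic map at the panorama resolution.
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[panorama.size[::-1]]
)[0]
segmentation = result["segmentation"]    # (H, W) map of segment ids
segments_info = result["segments_info"]  # label id and segment id for every mask

# Keep only the segments whose label corresponds to a pole class.
POLE_LABEL_IDS = {45}  # placeholder id; depends on the label mapping of the checkpoint
pole_segments = [s for s in segments_info if s["label_id"] in POLE_LABEL_IDS]
```

The per-pixel segment map and the per-segment label information returned here correspond to the mask output discussed in the mask processing step below.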
The Mask2Former network used has been pre-trained on a large dataset from Mapillary [12], which includes many street view images from various parts of the world with various labels. Among the labels for things, i.e. objects that can be detected with separate masks in the image, are the poles of street lights and street signs that are of interest for this thesis. The inclusion of poles as an object class in the dataset the algorithm has been pre-trained on, in combination with its competitive performance, are the main reasons for the choice of Mask2Former. The evaluation of the model is performed on an external NVIDIA GeForce RTX 3090.

3.2.2.1 Mask processing

The output of the object detection algorithm is a matrix with the same dimensions as the pixel width and height of the input panorama image, in this case 1024 x 320. Every pixel detected as part of an object class is assigned a category identification number, relating to each specific car or, in this case most importantly, each pole. In a few cases, two poles are detected and classified with the same number as if they were one common pole. Because of the vertical nature of the poles, and the clear horizontal separation between them, it is possible to sort out which pixels are part of which pole. To limit the effect of false detections and poles too far away, a minimum of 30 pixels is required for a separated mask to be classified as a pole. Visual inspection indicated that detection masks with fewer pixels than around this threshold were mostly false detections, poles far away, or even small parts of the same pole where, for instance, the top of the pole is bent to hold up a lamp.

3.2.2.2 Direction estimate

To position the street furniture from the object detection, a horizontal direction from the camera position to the object is estimated. Since the field of view of the image is 360 degrees, the position of the pole mask relative to the width of the matrix can give the approximate angle to the pole [9]. Because of the vertical nature and relatively uniform thickness of the poles, and also in order to account for partial occlusion of the pole, the most vertical mask position in the image is used. By this it is meant that the number of pixels in the mask is projected onto the horizontal axis, and the position with the highest number of pixels is chosen.

For this angle to be useful for positioning, it has to be related to a common reference. This reference is chosen as North, and in each panorama it is approximated from the knowledge that the exact middle of the panorama corresponds to the driving direction of the car. The driving direction is approximated from the direction to the next camera location of the panorama extractions, as visualized in Figure 4.1. Since the CRS is in x- and y-coordinates, where the y-axis points toward North, this can be related to the direction of North in the image. An alternative method for deciding the direction of North in an image would be to use image matching through OpenCV [41] with a smaller image, also extracted from GSV but with a previously known heading [40]. Despite promising applications to panoramas in a few test cases, this method proved not to be accurate enough for the application in this work.

3.2.3 Depth estimation

The approach applied in this thesis for positioning the street furniture from object detection is combined with range estimation.
This is done with MDE as described in subsection 2.2.4, more specifically with the deep learning algorithm Monodepth2 [36]. This algorithm provides an estimate of the metric distance to each pixel of an image.

3.2.3.1 Distance estimate

A distance estimate for a pole can then be calculated from the depth values of the pixels associated with each pole mask from the panoptic segmentation. In the similar positioning approach with segmentation masks and depth estimations in [9], where trees were detected and positioned, a single distance estimate for each mask was calculated as a trimmed mean of all the mask pixels. There, it was motivated to exclude the top and bottom 10 % of the pixel depth values due to the difficulty of the algorithm to predict depth around the edges of the object. From an analysis of the general distribution of depth values for the pixels of a pole mask in our implementation, there was a tendency to predict too long a distance for many of the pixels. This was especially the case for the tops of the poles, probably since they are relatively thin. The final distance estimate for each pole was therefore calculated as the mean of the closest half of the pixel values. As stated, this was done to account for the tops of poles, which in this work were often not detected as clearly as the bottom part, and for the unclear edges of the poles, which caused predictions farther away.

Combining the known direction and distance estimates of a detected pole, a position can be predicted. Given the camera position where the panorama was extracted, the position prediction is placed at the estimated distance in the estimated direction. To position poles along a road segment, predictions from multiple panoramas are combined. With several camera positions as prediction sources, there might be multiple predictions associated with the same pole but detected from different views. In subsection 3.2.4 the algorithms to combine these predicted positions are discussed.

3.2.3.2 Calibration

Note that the multiple predictions are used in a calibration procedure as in [9]. All the predictions are used to adjust the depth scale with a factor to match the focal length of the camera for the panoramic images from GSV. This factor was obtained as 0.25 for GSV images after comparison with true positions of poles gathered along a road segment in Vancouver, where this ground truth data was available [43]. The calibration of the depth estimation output was performed with 50 images covering approximately 500 meters of road. Along this segment, a total of 20 street poles with known locations from the database were placed. Different scaling factors were tested to find the optimal one minimizing the mean absolute error of the predictions.

3.2.4 Positioning

As previously stated, there can be multiple predictions of the same pole that originate from different camera positions. In order to obtain a final set of positions for the poles along a road segment, these predictions have to be combined or clustered. In this thesis, two possible solutions have been implemented. In the most straightforward approach, here referred to as "separate predictions", all the predictions from all the camera positions are clustered. In the second approach, "triangulation", pairwise matching of detections is applied in order to associate predictions for better performance.
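To make the combination of the direction, distance and calibration steps concrete, a minimal sketch of how a single detection could be turned into a position prediction is given below. The function and variable names are hypothetical; the panorama geometry, the use of the closest half of the depth values and the calibration factor of 0.25 follow the description above.

```python
import numpy as np

def predict_pole_position(mask, depth, cam_xy, cam_heading_deg, scale=0.25):
    """Turn one pole mask into a predicted (x, y) position in the projected CRS.

    mask:            (H, W) boolean array for one pole from the panoptic segmentation.
    depth:           (H, W) array of uncalibrated Monodepth2 depth values.
    cam_xy:          camera position (x, y) in metres, y-axis pointing North.
    cam_heading_deg: driving direction of the car (image centre) relative to North.
    scale:           calibration factor relating predicted depth to metres.
    """
    H, W = mask.shape

    # Direction estimate: the image column containing the most mask pixels.
    column = int(np.argmax(mask.sum(axis=0)))
    # A 360-degree panorama maps image columns linearly to azimuth offsets
    # from the driving direction at the image centre.
    bearing = (cam_heading_deg + (column - W / 2) / W * 360.0) % 360.0

    # Distance estimate: mean of the closest half of the mask's depth values.
    depths = np.sort(depth[mask])
    distance = scale * float(np.mean(depths[: max(1, depths.size // 2)]))

    # Place the prediction at the estimated distance in the estimated direction.
    theta = np.deg2rad(bearing)
    return cam_xy[0] + distance * np.sin(theta), cam_xy[1] + distance * np.cos(theta)
```

The resulting (x, y) predictions from all panoramas would then be fed to the clustering and triangulation methods described next.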
3.2.4.1 Separate predictions

To begin with, all separate predictions are clustered together to handle multiple detections of the same object. Predictions for a showcase example are shown in Figure 3.2.

Figure 3.2: The positioning process where separate predictions in green are clustered. They are positioned from estimates of direction and distance to poles.

First, one type of clustering is performed on the separate predictions. Predictions within a specific radius are assumed to be associated and are hence clustered. This is done so as not to include any prior knowledge of the desired number of predictions. This clustering is performed with the mean shift implementation in sklearn [44]. The mean shift is an iterative method, finding the cluster centers that maximize a density function [45] given all the separate predictions. In this implementation, a flat kernel is used, with no prior knowledge of the positioning included. A clustering radius of 4 m was chosen to obtain an approximation of the appropriate number of predictions. This radius was chosen based on the true data in Kista, where poles in most observed cases are separated by more than 4 meters. Lowering the radius would allow poles closer to each other to be detected, if such cases were present, but it would also increase the risk of keeping multiple predictions of the same pole.

The mean shift clustering gives an approximation of the suitable number of clusters given the cluster size specified. Furthermore, it gives a set of cluster centers that are optimal for the separate predictions. However, another clustering method was found to outperform the mean shift when it comes to finding the cluster centers. Given the number of clusters from the mean shift, the separate predictions were then instead clustered through the k-means method [46]. This method iteratively chooses the cluster configuration that, from a global perspective, gives the least variance between the predictions within the same cluster. The variance is here defined as the sum of squared distances between all predictions in the clusters. The sklearn package uses Lloyd's algorithm to iteratively find the most probable configuration [44]. From this clustering of all the predictions from all camera positions, a final set of predicted pole positions is obtained through the separate predictions method including clustering.

3.2.4.2 Triangulation

To further enhance the performance, a method for associating the separate predictions has been implemented. The approach used in this thesis is inspired by [9], where possible pole locations are approximated through triangulation. This means that each position where two lines directed from different camera locations intersect is a possible pole position. This positioning process is shown in Figure 3.3.

Figure 3.3: The positioning process with triangulation, where accepted intersections in blue are clustered. These have two separate predictions in green close to the associated intersection.

With multiple detections from each camera position, and multiple camera positions in proximity, the number of intersections rapidly increases. The problem then becomes choosing the most suitable intersections. The straightforward approach used for placement of trees in a similar case [9] accepts an intersection if the two depth predictions associated with the two intersecting detection lines are close enough.
This threshold has been chosen as 4.5 meters. This margin can roughly account for the given RMSE of the depth estimation. The triangulation approach is visualized with an example in Figure 3.3.

Furthermore, this positioning method can cause predictions in the vicinity of each other, similar to the separate predictions approach. Therefore, clustering methods are applied as described for the "separate predictions" approach. However, this clustering operates on the accepted intersections and not on all the separate predictions. It is initially a mean shift clustering, where it is assumed that there should be no poles closer than 4 meters to each other. Following this, the resulting number of clusters is used further in the k-means clustering method. From this, a final set of predicted pole positions is obtained through the triangulation method.

3.2.4.3 Alternative approaches

In related work, there have been approaches using only the image coordinates of the detected objects to position them. These use the approximate camera height together with assumptions of a clear line of sight to the object and a flat ground surface [47]. However, these assumptions do not always apply, and this approach did not provide successful results during testing.

Furthermore, other approaches have used only direction estimates. Rules deciding how far from the road street poles are generally placed can then be used for an approximate placement [48]. In a similar manner, positioning based on the placement of buildings has been implemented in [49]. Due to the difficulty of handling poles not following these rules and of handling multiple detections from different camera positions, this method was not tested in this work. Similar rules applied to the height of the objects, in order to label them in different classes, could possibly also be implemented. This was only tested to a limited extent in this thesis and could be further investigated in similar work.

In addition, other methods could complement or replace the depth estimations when accepting intersections in the triangulation method. For instance, scale-invariant feature matching could be applied to match features of objects [50]. However, it is difficult to implement in this case since the poles look similar to each other. Moreover, the background changes depending on the viewing direction toward the pole. As another complement to depth estimation on images, sensor input from radar could be included as well [51].

The problem of associating the predictions from different camera positions with each other has also been handled with alternative methods in related work. In [52] a probabilistic approach for data association is applied to position predictions from radar sensors. The processing of these predictions could be similar to that of predictions from object detection. From the predictions, an estimated number of final predictions is decided, and the initial predictions are then associated with a final one. Due to the difficulty of associating the predictions to a specific estimate with good accuracy, and due to the complexity of the model, this was not applied in this thesis.

Finally, there have also been approaches using Markov random fields to cluster predictions as an alternative probabilistic approach [10, 53]. The concept of the random field is to iteratively test configurations of the multiple predictions to minimize a global energy function, where for instance single predictions are penalized.
The set of possible positions is then discrete, rather than continuous as for the clustering algorithms applied in this thesis, which was regarded as a limitation for the purposes of this work. A few test cases were performed with a random field similar to [10] without any clear improvement of the positioning.

3.2.5 Error metric

Now that the complete method for positioning the poles through detection of approximate horizontal direction and distance has been described, its precision is evaluated. For quantification of the results, true pole positions for the area have been gathered through visual inspection of aerial and street view imagery. Each predicted pole position (x_p, y_p) can then be compared to the closest true pole position (x_t, y_t). This gives a total mean absolute error (MAE) calculated as

\[
\mathrm{MAE} = \frac{1}{N_p} \sum_{p=1}^{N_p} \sqrt{(x_p - x_t)^2 + (y_p - y_t)^2} \tag{3.1}
\]

where N_p is the total number of predictions. The mean absolute positioning error is used as an error metric in similar work [9, 10]. In addition, the ratio of matched predictions is presented as N_m/N_p, where N_m is the number of predictions matched with a true position. As in [9], an upper threshold for a prediction to be regarded as a match is used, here set to 8 meters. There might be cases where a couple of predictions have the same closest true pole, but these cases are assumed not to affect the results notably.

3.2.6 Sources of error

There are a number of parameters that may affect the performance of the positioning algorithm. To begin with, there is an estimated general GPS accuracy of 1-5 m in 95 % of the cases [54]. This will affect the camera locations that the Google API provides, causing possible differences between the actual camera location and the given camera position. The use of further equipment in the Google cars, in the form of an inertial measurement unit (IMU), should improve the accuracy to 2.5 meters according to Google [42]. The IMU combines the output of accelerometers and gyroscopes to improve the positioning of the car. The camera locations along a road follow the road in relatively straight lines that may be offset from the exact line where the car is driving and may also deviate along the direction of the road. However, at least the sideways errors appear to be systematic, causing a translation of the pole predictions. Further post-processing could be applied to either manually or automatically correct the camera positioning to better handle these errors, but this has not been pursued to a greater extent within the scope of this thesis.

The positioning error of the supposed true pole positions may also affect the result. The visual inspection, which combines aerial images with shadows and low-resolution top views of the poles with estimation of the position relative to the road in the street view images, may cause an error, especially since the aerial and street view images are in some cases collected at different times. Visual inspection of the positions of the street light poles in the database for Vancouver [43], used for tuning the depth estimation algorithm, shows that they might also deviate. In Kista, the assumed true positions are from visual inspection due to limited data sources. These true positions may deviate with a mean of 0.98 m, based on comparisons with assumed true positions for about 41 of the true pole positions that are also available as true lighting sources in a Stockholm database [55].
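As an illustration, the error metric in Eq. (3.1) and the matched-prediction ratio could be computed along the following lines. This is a minimal sketch with hypothetical function and array names; the 8-meter matching threshold is the one stated above, and the MAE is here averaged over the matched predictions, in line with how the results are reported later.

```python
import numpy as np

MATCH_THRESHOLD_M = 8.0  # upper limit for a prediction to be regarded as a match

def evaluate_predictions(predicted, true):
    """Compare predicted pole positions with assumed true positions.

    predicted, true: arrays of shape (N_p, 2) and (N_t, 2) in metres (projected CRS).
    Returns (MAE over matched predictions, matched ratio N_m / N_p).
    """
    # Distance from every prediction to every true pole position.
    distances = np.linalg.norm(predicted[:, None, :] - true[None, :, :], axis=-1)
    # Each prediction is compared to its closest true pole.
    closest = distances.min(axis=1)
    matched = closest <= MATCH_THRESHOLD_M
    mae = float(closest[matched].mean()) if matched.any() else float("nan")
    return mae, float(matched.mean())
```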
Furthermore, there are possible error contributions from the positioning algorithm itself. There may be small deviations in the predicted directions in the images, both because the driving direction of the car does not point exactly toward the next camera position and because it is not exactly in the middle of the image. However, from inspection of the images, the North prediction seems reliable. Furthermore, a few of the pole detections may have small horizontal deviations due to the limited resolution of the panoramas. In total, these angle errors are assumed to cause negligible positioning errors compared to the errors caused by the GPS inaccuracy, since a shift of the camera position causes a greater deviation in the position prediction.

3.2.7 Synthetic environment

In order to isolate the possible errors caused by the positioning algorithm, the algorithm is not only applied to the collection of GSV panoramas but also to a collection of panoramas depicting a synthetic environment. The poles have been placed at what are assumed to be the true positions in the 3D environment created and visualized in NVIDIA Omniverse Create as described in section 3.1. From this scenario, panoramic images picturing the synthetic world can be extracted with a fish-eye lens from the exact camera positions along the road. This eliminates the camera position errors and the true pole position errors. For these Omniverse panoramas, the camera characteristics cause the depth estimates to be scaled with 0.35 instead of 0.25 as for the GSV panoramas. This was concluded with a method similar to that used for the GSV panoramas: from a few test cases in the Kista synthetic environment, the scaling factor giving the least mean absolute error to some of the true positions was chosen.

The positioning from the panoramas of the synthetic environment is varied to see how the level of detail in the environment affects the accuracy. With a fully populated model, it includes buildings with textured and colored facades as well as trees and cars. Sequentially removing the trees, then the textured facades and finally also the buildings gives a picture of the positioning algorithm's performance and sensitivity to different error contributions.

4 Results

In this chapter, the step-by-step results from the implementation of the workflow described in chapter 3 are presented and discussed. To begin with, the output from the image extraction is covered. Afterwards, these images are used in the object detection algorithm. Then, the obtained positions for the poles are presented. The quantified positioning results from both the real-world and synthetic data are included to compare possible error contributions.

4.1 Image extraction

As described in subsection 3.2.1, panoramic images were extracted from GSV along the road Torshamnsgatan in Kista, Stockholm. For each of the 50 extracted panoramas, the camera position is provided in latitude and longitude coordinates. The camera positions are shown in Figure 4.1 overlaid on a Google Earth satellite image. Although the GSV car drives in one of the lanes along the road when capturing the imagery, some camera positions are placed on the pavement or in the opposite driving lane. Furthermore, there can be deviations in the positions along the direction of the road. This is despite the filtering Google performs to smooth the camera positions along the road. As seen in Figure 4.1, the errors are systematic rather than random.
Consecutive camera positions can, for instance, be shifted in the same direction. This was clearly seen when consecutive panoramas showed that the real camera positions were in one of the road lanes, while the extracted camera positions were in the middle of the road instead.

One of the extracted panoramas is shown in Figure 4.2. The corners or smaller areas of the panoramas can be distorted, and some areas may be blurry because of either stitching problems or privacy reasons. The stitching problems occur despite using stitching algorithms from [41] to stitch the extracted image tiles into a single panorama. Furthermore, sunlight, shadows and occlusion from objects in the street, such as cars, might affect how well the panoramas cover the surroundings. Despite these issues, the area of interest in the panoramas is in most cases clear enough, as seen in Figure 4.2.

Figure 4.1: Extracted GSV camera positions shown in red overlaid on a satellite image of the area, with the green marker showing the position of the panorama in Figure 4.2.

Figure 4.2: An example panorama image extracted in Kista at the position marked with a green marker in Figure 4.1.

4.2 Pole detection

For each panoramic image extracted along the road in Kista, the detection algorithm is applied as described in chapter 3. In Figure 4.3a the detection result from the sample panorama in Figure 4.2 is shown. The segmentation masks have been applied as a layer on top of the original image. For each separable object, a distinct color shows the pixels in the image associated with it, and the corresponding label describes what it represents. For this thesis, the focus is on the poles, which can be seen in different colors along the road. There were a few cases when two poles were presented with the same classification number and color as if they were one pole. In these rare cases the masks were separated in post-processing. This separation was performed if one pole mask contained two or more submasks with a horizontal spacing of at least one pixel between them. As described in subsection 3.2.2, a pole mask is accepted only if it contains at least 30 pixels, which limits the effect of false detections and distant poles.

From the panoptic segmentation and the extraction of pole masks, a horizontal angle is estimated for each pole mask. As seen in Figure 4.3b, the estimated pole directions, drawn as blue lines, match well with the detected and masked poles in Figure 4.3a. The pole directions are referenced relative to North, indicated by the red line.

Figure 4.3: Panoptic segmentation output from the panorama image in Figure 4.2, shown as segmentation masks in Figure 4.3a and as blue pole direction estimates in Figure 4.3b, where red is North.

As described in subsection 2.2.4, the detected pole masks are further associated with a distance estimate in order to position the poles. The result of the monocular depth estimation is shown in Figure 4.4. The colors of the pixels represent a depth estimate described by the colorbar. For each detected pole mask with a known direction from the object detection, the associated depth values are extracted to give a depth estimate. This can be done since the output depth map matrix has the same dimensions as the segmentation output. As seen in the figure, details not too distant from the camera are clearly marked, whereas the resolution of the prediction decreases with increasing distance.
The poles closest to the camera have visible depth contours, even though the edges are not so clear. Hence, a pole prediction is accepted only if the depth prediction is within 20 meters, which is the region where the distance predictions are most often considered good enough. This limit is set to reduce the risk of keeping detections of lower quality than those at shorter distances.

For each camera position, the pole detections are summarized and exported to a common data file. The format of the file with comma-separated values (CSV file) is shown in Table 4.1. The index of each detection and the camera index are also noted; these are the same for the examples in Table 4.1 and are thus omitted from the table. The horizontal direction to each pole, given in degrees relative to true North, and the depth estimate from the group of pixels belonging to the pole are noted.

Figure 4.4: Results from the monocular depth estimation applied to the panorama image in Figure 4.2.

Table 4.1: The accepted pole detections from the direction output shown in Figure 4.3b and the depth output shown in Figure 4.4. Direction given in degrees from North, pole position in the image along the x-axis in pixels, and distance in meters from the camera.

Direction [degrees]   x [pixels]   Distance [m]
55.2                  802          6.32
64.00                 827          9.86
85.10                 887          10.29
105.13                944          16.13
214.12                230          11.10

Also, the pole position in the image along the x-axis in pixels is noted in the table for comparison with Figure 4.3a. In Figure 4.3b these pole directions are shown in blue, with angles given relative to true North, marked by the red line. The detections from all camera positions are gathered in the same file and exported as input to the positioning algorithm.

Furthermore, there were a few cases where a pole was not detected by the segmentation algorithm. This was due to either occlusion by a car, the pole being too far away for the image resolution to capture it, or distortion of the extracted image in that specific area. In addition, a few false detections occurred, where persons and tree trunks were falsely classified as poles. Examples of these are seen in Figure 4.5. However, considering the total number of poles detected in the 50 panoramas along the street in Kista, the missed and false detections were few. The mistakes were further reduced since detections were only accepted if the pole mask exceeded 30 pixels in total size in both horizontal and vertical directions. The missed detections due to occlusion or poles too far away are the reason for the reduced performance seen in the upcoming section. This is in some cases worsened by the triangulation positioning described earlier, which requires at least two detections of the same pole.

Figure 4.5: Examples of object detection difficulties. These include poles confused with the facade in Figure 4.5a and Figure 4.5b, occluded by cars or trees in Figure 4.5c, and affected by distortions from the image extraction in Figure 4.5e and Figure 4.5f. Also, Figure 4.5d shows how a tree trunk is detected as a pole.

Similar to the object detection algorithm, the depth estimation can also produce a few detections with especially limited accuracy. These were cases where it was difficult to separate the pole from the facade behind it or where the object in question was too far away. In addition, sunlight and image distortions could affect the estimation.
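For reference, the acceptance filtering and CSV export step described above could be sketched as follows. The dictionary layout and variable names are assumptions for illustration; the 30-pixel and 20-meter thresholds are the ones stated in this chapter.

```python
import csv

MIN_MASK_PIXELS = 30   # minimum mask size for a detection to be accepted
MAX_DISTANCE_M = 20.0  # depth limit beyond which predictions are not trusted

def export_detections(detections, path):
    """Write accepted pole detections to a common CSV file, one row per detection.

    Each detection is assumed to be a dict with the keys
    'camera_id', 'direction_deg', 'x_pixel', 'distance_m' and 'mask_pixels'.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["camera_id", "direction_deg", "x_pixel", "distance_m"])
        for det in detections:
            if det["mask_pixels"] < MIN_MASK_PIXELS:
                continue  # likely a false detection or a pole too far away
            if det["distance_m"] > MAX_DISTANCE_M:
                continue  # outside the range where depth estimates are good enough
            writer.writerow([det["camera_id"],
                             round(det["direction_deg"], 2),
                             det["x_pixel"],
                             round(det["distance_m"], 2)])
```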
4.3 Positioning

The detected pole direction and distance estimates are used in the two positioning algorithms described in subsection 3.2.4. First, the "separate predictions" approach is applied, followed by the "triangulation" approach. In the triangulation approach, the detections from pairwise camera positions are used to find possible intersections between rays, which are accepted if the two corresponding depth estimates are within 4.5 meters of each other. In addition, predictions that end up within 4 meters of each other are clustered to account for multiple detections of the same pole, under the assumption that poles are not too close to each other.

The predicted pole positions are compared to the assumed true pole positions. The true positions are obtained from visual inspection of aerial and street view images of the area. In total, there are 76 true pole positions in the road segment, including lamp posts, flag poles and street sign poles. These are shown together with the predicted pole positions from the triangulation algorithm in Figure 4.6. The CRS has EPSG code 32633, and the midpoint used as reference is chosen as (667557, 6588849) to match the existing description of the environment.

Figure 4.6: Predicted pole positions through triangulation from GSV panoramas shown as magenta crosses, compared to the true positions in black along the street in Kista.

For each predicted pole position, the distance to the closest true position is used to calculate the mean absolute error (MAE) of the predicted positions. Of the 55 poles detected in the area with the triangulation method, 87 % were matched with true positions. For those matched, the MAE was 3.56 meters, as stated in Table 4.2. The distribution of the absolute errors for this method is shown in Figure 4.7.

Figure 4.7: Absolute error distribution for the pole predictions from triangulation compared to the true positions along the street in Kista.

Table 4.2: Quantified performance of the positioning with separate predictions and triangulation applied on GSV panoramas.

Method                 Predictions   Matched pred.   MAE
Separate predictions   102           68.6 %          3.68 m
Triangulation          55            87.3 %          3.56 m

The result from the triangulation algorithm, with the intersections used for the prediction, is compared to the simpler separate predictions approach. As seen in Table 4.2, the calculated MAE is better for the triangulation method, but of similar size for both methods. Additionally, the percentage of the total predictions matched with a true pole position is lower for the separate predictions method. This means that there are many more predictions not matched with true values, causing a worse overall prediction than for the triangulation method. This is because all predictions are included; no pairwise matching is used to ensure that a pole is detected at least twice before being used as a final prediction.

As a comment on this result, it has to be said that, according to the error estimation methodology described in subsection 3.2.5, the GPS accuracy of the extracted camera positions is approximately 2.5 meters. Additionally, the assumed true positions are placed from visual inspection of aerial and GSV images, leading to an error contribution that should be small but could still have an effect. Furthermore, errors may occur due to missed and false detections as in Figure 4.5. Further, there might be multiple detections of the same true pole position.
In addition, poles to the lower left are farther than 20 m from the lane in which the car is driving, which is outside the region where the detections are good enough. This is a shortcoming of the current prediction and could possibly be improved if images were captured from both lanes of the road. Additionally, a shorter distance between the camera positions, and thus more detections, could possibly improve the results.

4.4 Synthetic environment

Finally, the poles are included in a 3D environment used as a base for the radio propagation simulations. This 3D environment is created from terrain data for the area combined with shapefiles including the outlines of buildings and street networks, as described in section 3.1. In the created 3D environment, the poles have been added at the true pole positions, see Figure 4.8b. While moving the camera in the 3D world, panoramic images are extracted with headings matching the existing real-world panoramas.

Figure 4.8: Corresponding panorama images from GSV in Figure 4.8a and the synthetic Omniverse environment in Figure 4.8b, with poles at the true positions.

In the most complex synthetic 3D environment, there are detailed buildings, a street network, a vegetation layer and of course the poles at the true positions. In this synthetic environment, panoramas are extracted exactly at the assumed camera positions. This enables testing of the positioning algorithm while the error contributions from camera positions and true pole positions are eliminated, as described in subsection 3.2.7. In Figure 4.8 a sunny sky in the synthetic environment is displayed to match the GSV image. However, for the positioning a gray sky has been used to avoid the effect of distracting sunlight and sharp shadows.

The fully detailed environment includes the street network, buildings with clearly textured facades and a foliage layer with trees placed. This will be referred to as case C1. In further tests, the trees are first removed (case C2), followed by the removal of the textured facades (case C3) and finally also the buildings (case C4). The extraction of panoramic imagery from the synthetic environment, and the application of the positioning algorithm, is performed for all cases. The positioning results are presented in the form of MAE and percentage of matched predictions in Table 4.3.

Table 4.3: Quantified performance of the positioning algorithm applied on Omniverse panoramas showing environments with varying levels of detail.

Case   Data                Predictions   Matched pred.   MAE
C1     Fully detailed      73            91.7 %          1.38 m
C2     No vegetation       67            98.5 %          0.94 m
C3     No facade texture   74            95.9 %          1.69 m
C4     No buildings        83            80.7 %          2.68 m

As seen in Table 4.3, the positioning algorithm gives the best result when the synthetic environment does not include trees (case C2). The result from the positioning algorithm based on this data is shown in Figure 4.9. By inspection of the detection output, this is mostly due to two reasons. The first is that the trees lead to occlusion of the poles such that they are not detected. The second is the similarity of the tree trunks to poles in this synthetic environment, which confused the object detection algorithm. These missed and false detections in case C1 were shown to cause more, and worse, position predictions compared to case C2.

Figure 4.9: Predicted pole positions from Omniverse panoramas of an environment without trees, shown as magenta crosses, compared to the true positions in black along the street in Kista.

There is a slight increase in errors when also removing the more textured and detailed facades in case C3. In the last case (C4), when the buildings were removed, the errors increased even more. This is mostly due to problems with the depth estimation algorithm. Since the MDE algorithm was trained on real-world scenarios, the differences compared to this relatively empty environment may have been too large, leading to poor-quality predictions. This was seen in difficulties handling, for instance, street crossings, as well as in separating the specific distance to poles from the background. The empty environment in case C4 did, however, make it easier for the object detection algorithm to detect poles at a longer range that were not detected before.

It should be noted that the application of the positioning algorithm on the synthetic environment indicates that the positioning errors of the camera positions and true positions have a relatively large influence on the result. However, the similarity to the real GSV scenarios is not perfect, since there are details that are not included in the synthetic scenarios. To name a few examples, the real world includes a greater number of cars, more extensive vegetation and of course also people, bicycles and fences around buildings or bridges. In addition, the synthetic poles may be easier to detect than real poles, both due to the pole design and due to the surrounding environment.

5 Simulation

In this chapter, an introduction to the performed ray tracing simulations is presented. This is followed by the implemented simulation scenario with the enriche