Perceptual Evaluation of Mitigation Approaches of Errors due to Spatial Undersampling in Binaural Renderings of Spherical Microphone Array Data

Tim Lübeck

Master's Thesis ACEX30-19-101
Department of Architecture and Civil Engineering
Division of Applied Acoustics, Audio Technology Group
Chalmers University of Technology, SE-412 96 Gothenburg, Sweden, 2019
Telephone +46 31 772 1000

Home University: TH Köln - University of Applied Sciences, Faculty of Information, Media and Electrical Engineering, Institute of Communications Engineering

© Tim Lübeck, 2019.

Supervisor: Jens Ahrens, Chalmers University of Technology, Department of Architecture and Civil Engineering
Examiner: Jens Ahrens, Chalmers University of Technology, Department of Architecture and Civil Engineering; Christoph Pörschmann, TH Köln, Institute of Communications Engineering

Cover: Python simulation of a plane wave impinging on a spherical microphone array.
Typeset in LaTeX
Gothenburg, Sweden 2019

Abstract

High fidelity virtual acoustic environments (VAEs) have become of increasing interest in the context of immersive virtual and augmented reality applications and have developed into a reputable field of research. To present a real-life sound scene to a listener as a VAE, the spatial sound field has to be captured. Traditionally, spatial sound fields are measured sequentially by means of a dummy head. Spherical microphone array recordings are a promising alternative to time-consuming dummy head measurements. They enable dynamic real-time applications that involve recording and reproducing spatial sound scenes, for example live concerts or telephone conferences. However, the quality of the rendered VAE is significantly limited by the density of the microphone distribution provided by the array. A limited number of microphones leads to spatial undersampling of the surrounding sound field and creates audible artifacts in the VAE. This work investigates the errors due to spatial undersampling and state-of-the-art approaches to mitigate them. In a concluding listening experiment, the improvements that can be achieved with these approaches were perceptually evaluated. It was found that a few algorithms significantly improve the binaural auralization of spherical microphone array data.

Keywords: Spatial Aliasing, SH Order Truncation, Spherical Harmonics, Plane Wave Decomposition, Spherical Microphone Arrays

Acknowledgements

First of all, I would like to thank my supervisor Jens Ahrens, whose door was always open to me and my questions.
Through him, I have gained a lot of expertise for my future career. Many thanks go to my second supervisor Christoph Pörschmann. He has accompanied me throughout my entire studies, played a decisive role in my graduation, and opened up important opportunities for my personal and professional development. I would also like to thank the entire Division of Applied Acoustics at Chalmers University for giving me the opportunity to work on this project. The pleasant atmosphere made my time at Chalmers a great experience and a perfect completion of my university studies. At this point, I also want to thank all the people at the Division who participated in my final listening experiment. Further, I would like to thank my office colleague Hannes Helmholz for the enjoyable cooperation; hopefully, there will be more joint projects in the future. Last but not least, I would like to thank Johannes Arend, who taught me the fundamentals of audio processing and scientific writing and still has the patience to guide me through the statistical evaluations. As this work was done during an Erasmus semester abroad at Chalmers University, many thanks go to the Erasmus program for its financial support.

Tim Lübeck, Gothenburg, August 2019

Contents

1 Introduction
  1.1 Motivation
  1.2 Related Works
2 Theory
  2.1 Mathematical Derivation of Capturing and Reproducing Spatial Sound Fields
    2.1.1 The Acoustic Wave Equation and its Solutions
      2.1.1.1 Spherical Bessel Functions
      2.1.1.2 Solutions of the Angular Components - Spherical Harmonics
    2.1.2 The Spatial Fourier Transform
    2.1.3 The Plane Wave Composition
    2.1.4 Binaural Reproduction
  2.2 Spherical Array Processing
    2.2.1 Discretization - Sampling Scheme
    2.2.2 The Discrete Spatial Fourier Transform
  2.3 Technical Constraints
    2.3.1 Spatial Aliasing
    2.3.2 Truncation Error
    2.3.3 Consequences of Undersampled Microphone Array Data Renderings
    2.3.4 Reducing the Impairment of Spatial Undersampling
      2.3.4.1 Spectral Equalization
      2.3.4.2 Bandwidth Extension Algorithm for Microphone Arrays
      2.3.4.3 Magnitude Least Squares
      2.3.4.4 Tapering
      2.3.4.5 Matrix Regularization
      2.3.4.6 Spatial Anti-Aliasing Filters
    2.3.5 Analysis of the Resulting Binaural Signals
      2.3.5.1 MagLS
      2.3.5.2 Spherical Head Filter - SHF
      2.3.5.3 Tapering
      2.3.5.4 BEMA
      2.3.5.5 Global Equalization Filter - GEQ
      2.3.5.6 Overall Comparison
3 Implementation
  3.1 Rigid Sphere Impulse Response Measurements
  3.2 Signal Processing
  3.3 Auralization
4 Perceptual Evaluation
  4.1 Introduction
  4.2 Experimental Design
    4.2.1 Stimuli
    4.2.2 Experimental Paradigm
    4.2.3 Setup
    4.2.4 Procedure
  4.3 Results
  4.4 Discussion
5 Conclusion
  5.1 Summary
  5.2 Follow-Up Studies
Bibliography
A Appendix
  A.1 Radial Filter
  A.2 Comparison of Algorithm Improvements
  A.3 Results of the Listening Experiment
    A.3.1 Overall Descriptive Values
    A.3.2 Nested ANOVA Results

1 Introduction

1.1 Motivation

In recent decades, the rising number of virtual and augmented reality applications has created a demand for high fidelity virtual acoustic environments (VAEs). Presenting immersive spatial sound fields to a listener has become a reputable field of research. One common way of presenting a VAE is binaural synthesis. This headphone-based auralization stimulates the binaural spatial hearing of the human auditory system and ideally reproduces the same sound pressure at the ear drums that would be perceived in a real-life sound scene. Binaural renderings are based on appropriate head-related impulse responses (HRIRs) or binaural room impulse responses (BRIRs) that describe the transformation of sound reaching the human ear drums. This information is usually obtained with a dummy head with two microphones in the ear canals which capture the surrounding sound field. For presenting a full-spherical VAE, in particular when the head orientation of the listener is taken into account, time-consuming dummy head measurements with adequately high resolution are unavoidable.

For dynamic auralizations of spatial sound fields, for example broadcasts of live concerts or spatial acoustic telephone conferences, dummy head measurements are infeasible. An alternative to dummy head measurements is the recording of spatial sound fields by means of spherical microphone arrays. Microphone arrays capture the entire surrounding sound field at once, instead of measuring it sequentially with a dummy head. The reproduction of array recordings is not restricted to binaural auralization, but could also involve wave field synthesis, Ambisonic reproduction, or any other object-based VAE.
Furthermore, spherical microphone arrays can be used for a wide range of further applications and research areas, for example sound field analysis methods such as beamforming or acoustic holography.

This work focuses on binaural reproductions of spherical microphone array recordings. Bernschütz (2016) elaborated a closed mathematical description for recording a spatial sound field by means of spherical microphone arrays and reproducing it via dynamic binaural synthesis. Theoretically, this reproduction works without any loss of quality compared to dummy head auralizations. However, real-life implementations are subject to technical constraints, mainly caused by the limited number of microphones, which leads to spatial undersampling of the surrounding sound field. Two technical constraints, namely the spatial aliasing and the order truncation effect, lead to audible artifacts in the binaural rendering. They are the spatial analogs of the artifacts that occur in time-frequency sampling.

This work investigates the spatial aliasing and order truncation effects and their consequences for the perceived quality of spatial audio reproductions. Moreover, state-of-the-art algorithms to reduce the influence of these artifacts are studied, implemented, and compared in a concluding listening experiment.

1.2 Related Works

Benjamin Bernschütz developed a comprehensive signal processing chain covering the microphone array capture of spatial sound fields and the reproduction of binaural signals as part of his doctoral dissertation (Bernschütz, 2016). Most of the theoretical derivations in this work are based on this dissertation. The Sound Field Analysis MATLAB toolbox (SOFiA) provides the implementation of the signal processing described in his thesis, see Bernschütz et al. (2011). The Python port1 is an equivalent implementation realized by Christoph Hohnerlein, see Hohnerlein and Ahrens (2017). All implementations, visualizations, and the final generation of the listening experiment stimuli are done with tools from this Python Sound Field Analysis toolbox (SFA).

The auralization of the binaural signals is done with the SoundScape Renderer (SSR), developed by Geier et al. (2008). It allows rendering arbitrary HRIR or BRIR data sets defined on a purely horizontal sampling grid with 1° resolution.

To achieve the main goal of spherical microphone array renderings, namely real-time auralization, Helmholz et al. (2019) developed a Python application which dynamically captures and reproduces spatial sound fields. Although this work focuses on impulse response based renderings only, the findings are intended to improve this real-time auralization as much as possible.

1https://github.com/AppliedAcousticsChalmers/sound_field_analysis-py

2 Theory

2.1 Mathematical Derivation of Capturing and Reproducing Spatial Sound Fields

This chapter briefly summarizes the theory of capturing spatial sound fields with spherical microphone arrays and reproducing these captures as a virtual acoustic environment. The first section explains the mathematical fundamentals of sound field capturing. Section 2.2 describes the limitations that arise when applying the mathematical derivations to real-life spherical microphone arrays. Finally, the consequences of these technical constraints are discussed, and a number of state-of-the-art reduction approaches are introduced.
2.1.1 The Acoustic Wave Equation and its Solutions

In linear acoustics, the sound pressure p(\mathbf{x}, t) at any point \mathbf{x} in free space and time t satisfies the homogeneous acoustic wave equation

\nabla^2 p(\mathbf{x}, t) - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} p(\mathbf{x}, t) = 0,  (2.1)

with c denoting the speed of sound in air and \nabla^2 the Laplacian in Cartesian coordinates. This holds for every homogeneous fluid with no viscosity, see e.g. Kinsler et al. (1999) or Williams (1999, Eq. 2.1, p. 15). For a sound field consisting of a single frequency, the pressure p(\mathbf{x}, t) can be expressed as

p(\mathbf{x}, t) = p(\mathbf{x}) \, e^{j\omega t},  (2.2)

with the angular frequency \omega = 2\pi f and the temporal frequency f. Using this description, the time-frequency transform of the wave equation yields the Helmholtz equation

\nabla^2 p(\mathbf{x}, \omega) + \left(\frac{\omega}{c}\right)^2 p(\mathbf{x}, \omega) = 0,  (2.3)

which delivers stationary solutions in the frequency domain. \omega/c is commonly denoted as the wave number k.

As spatial sound fields in this work are measured by means of spherical microphone arrays, it is reasonable to use spherical coordinates to describe a point \mathbf{r} = (r, \phi, \theta) in three-dimensional space, where \phi denotes the azimuth angle ranging from 0 to 2\pi and \theta the colatitude ranging from 0 to \pi. Expressing the Helmholtz equation in spherical coordinates and assuming a separable solution p(r, \phi, \theta) = R(r) \cdot \Phi(\phi) \cdot \Theta(\theta), the Helmholtz equation can be decomposed into three separate equations

\frac{d}{dr}\left(r^2 \frac{dR}{dr}\right) + \left[\left(\frac{\omega}{c}\right)^2 r^2 - n(n+1)\right] R = 0  (2.4)

\frac{1}{\sin\theta} \frac{d}{d\theta}\left(\sin\theta \frac{d\Theta}{d\theta}\right) + \left[n(n+1) - \frac{m^2}{\sin^2\theta}\right] \Theta = 0  (2.5)

\frac{d^2\Phi}{d\phi^2} + m^2 \Phi = 0.  (2.6)

A detailed derivation can be found in Rafaely (2015, pp. 31-34). These equations depend on r, \phi, or \theta only, and their solutions are well known. Equation 2.4 describes the radial behavior and can be transformed into the spherical Bessel equation, which is solved by the spherical Bessel functions. Equation 2.5 and Equation 2.6 describe the angular components with respect to \phi and \theta and can be solved by the spherical harmonics.

2.1.1.1 Spherical Bessel Functions

With respect to the present problem, the radial behavior of the wave equation can be described by n-th order solutions. There are spherical Bessel functions of the first kind j_n, used for interior problems, where all sources surround the sensor surface, and spherical Bessel functions of the second kind, denoted as y_n. Furthermore, there are spherical Hankel functions of the first kind, h_n^{(1)}(z) = j_n(z) + i y_n(z), and of the second kind, h_n^{(2)}(z) = j_n(z) - i y_n(z), used for exterior problems, where the sensors completely enclose the sources to be captured. Hankel functions of the second kind are also denoted as Bessel functions of the third kind.

2.1.1.2 Solutions of the Angular Components - Spherical Harmonics

Equation 2.5 is the associated Legendre equation, which is solved by the associated Legendre functions P_n^m(\cos\theta) of the first kind with order n and mode m. Equation 2.6 is an ordinary second-order differential equation which depends on a specific integer constant m, resulting from the periodicity of \Phi as a function over the unit circle. Both angular components of the Helmholtz equation, \Theta(\theta) and \Phi(\phi), are usually described by the spherical harmonics (SH)

Y_n^m(\theta, \phi) = \sqrt{\frac{2n+1}{4\pi} \frac{(n-m)!}{(n+m)!}} \, P_n^m(\cos\theta) \, e^{jm\phi}.  (2.7)

There exist several ways of defining the spherical harmonics, see e.g. Williams (1999, Eq. 6.20, p. 186) or Jackson (1962, Eq. 3.53, p. 108), which, however, has no effect on the fundamental theory.
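Numerically, the SH basis functions of Eq. (2.7) are available in common scientific libraries. The following minimal sketch, assuming NumPy and SciPy, evaluates all SHs up to a maximum order N at given directions and stacks them in one common channel ordering (ACN, channel index n² + n + m). Note that scipy.special.sph_harm takes the azimuth before the colatitude and includes the Condon-Shortley phase, so results may differ from other conventions by a factor of (-1)^m.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(N, azi, col):
    """Evaluate all SH basis functions Y_n^m of Eq. (2.7) up to order N.

    azi, col: (M,) azimuth and colatitude angles in radians.
    Returns an (M, (N+1)**2) complex matrix, one column per (n, m) channel."""
    azi = np.atleast_1d(np.asarray(azi, dtype=float))
    col = np.atleast_1d(np.asarray(col, dtype=float))
    Y = np.zeros((azi.size, (N + 1) ** 2), dtype=complex)
    for n in range(N + 1):
        for m in range(-n, n + 1):
            # ACN channel index of degree n and mode m
            Y[:, n * n + n + m] = sph_harm(m, n, azi, col)
    return Y
```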
One important property of the spherical harmonics is that they form an orthonormal basis on the sphere,

\langle Y_n^m, Y_{n'}^{m'} \rangle = \int_0^{2\pi} \int_0^{\pi} Y_n^m(\theta, \phi) \, Y_{n'}^{m'}(\theta, \phi)^* \sin\theta \, d\theta \, d\phi = \delta_{nn'} \delta_{mm'},  (2.8)

where (\cdot)^* denotes the complex conjugate and \delta_{nn'}, \delta_{mm'} are Kronecker deltas, which equal one for n = n' and m = m' and zero otherwise.

2.1.2 The Spatial Fourier Transform

Spherical harmonics are a closed set of solutions for the angular component of the Laplacian and thus of the Helmholtz equation. Furthermore, they form a complete set of orthogonal basis functions on the sphere. Therefore, arbitrary functions fulfilling the wave equation can be expanded over a sphere by means of the spherical harmonics. The time-frequency Fourier transform expands functions as a linear combination of the orthogonal sine and cosine functions over a circle. Similarly, the spatial or spherical Fourier transform (SFT) expands spatial functions as a linear combination of the orthogonal SHs over a sphere. The spatial Fourier transform is often denoted as spherical harmonic expansion, and the domain described by the spatial Fourier expansion coefficients (SH coefficients) as the spherical wave spectrum or SH domain. For an arbitrary function G(r, \phi, \theta, \omega), the SH coefficients \mathring{G}_{nm}(r, \omega) can be calculated with

\mathring{G}_{nm}(r, \omega) = \int_0^{2\pi} \int_0^{\pi} G(r, \phi, \theta, \omega) \, Y_n^m(\theta, \phi)^* \sin\theta \, d\theta \, d\phi.  (2.9)

Throughout this work, (\mathring{\cdot}) denotes SH coefficients in spherical coordinates. The inverse spatial Fourier transform (ISFT) is given by

G(r, \phi, \theta, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \mathring{G}_{nm}(r, \omega) \, Y_n^m(\theta, \phi).  (2.10)

2.1.3 The Plane Wave Composition

Theoretically, a plane wave arises from an infinitely distant point source, such that a straight plane wavefront can be assumed at the point of interest. According to Rafaely (2003), every sound field can be described as a superposition of multiple plane waves. For the further processing, it becomes useful to describe sound fields as such a plane wave continuum and to find the so-called plane wave (PW) components D(\phi_d, \theta_d, \omega). These PW components or PW coefficients describe which portion of a unit-amplitude plane wave impinging on a sphere from (\phi_d, \theta_d) would generate the same sound pressure as the surrounding sound field at that point. According to Williams (1999, p. 259), the SH expansion with respect to the position (r_0, \phi, \theta) of a unit-amplitude plane wave impinging on an open sphere from (\phi_d, \theta_d) can be expressed as

p(r_0, \phi, \theta, \phi_d, \theta_d, \omega) = 4\pi \sum_{n=0}^{\infty} \sum_{m=-n}^{n} i^n j_n\!\left(\frac{\omega}{c} r_0\right) Y_n^m(\theta, \phi) \, Y_n^m(\theta_d, \phi_d)^*,  (2.11)

with the imaginary unit i and the spherical Bessel function j_n. The sound field s(r_0, \phi, \theta, \omega) can now be expressed as the summation of an infinite number of weighted plane waves impinging on a sensor surface S_0 with radius r_0 from all possible directions (\phi_d, \theta_d),

s(r_0, \phi, \theta, \omega) = \int_0^{2\pi} \int_0^{\pi} D(\phi_d, \theta_d, \omega) \, p(r_0, \phi, \theta, \phi_d, \theta_d, \omega) \sin\theta_d \, d\theta_d \, d\phi_d,  (2.12)

where D(\phi_d, \theta_d, \omega) are the desired frequency-dependent weighting coefficients of the unit-amplitude plane waves p(r_0, \phi, \theta, \phi_d, \theta_d, \omega). Substituting Equation 2.11 into Equation 2.12 and rearranging in favor of D(\phi_d, \theta_d, \omega) yields

D(\phi_d, \theta_d, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \frac{1}{4\pi i^n j_n\!\left(\frac{\omega}{c} r_0\right)} \mathring{S}_{nm}(r_0, \omega) \, Y_n^m(\theta_d, \phi_d),  (2.13)

where \mathring{S}_{nm}(r_0, \omega) denotes the SH coefficients of the sound field s(r_0, \phi, \theta, \omega). This equation is known as the conventional plane wave decomposition (PWD), see e.g. Bernschütz (2016, p. 63) or Rafaely (2003, pp. 42-45).
Considering Equation 2.11 and keeping the property of orthogonality (Eq. 2.8) in mind, the PWD can be interpreted as a spatial Dirac impulse of varying direction (\phi_d, \theta_d) that samples the spatial sound field. Equation 2.13 refers to a sound field at an open sphere. The term

d_n = \frac{1}{4\pi i^n j_n\!\left(\frac{\omega}{c} r_0\right)}  (2.14)

is denoted as the radial filter d_n and contains the spherical Bessel functions j_n discussed in Section 2.1.1.1. These radial filters basically remove the dependency on the expansion surface and thus depend on the array setup. Expressing the PWD more generally leads to

D(\phi_d, \theta_d, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(r_0, \omega) \, Y_n^m(\theta_d, \phi_d).  (2.15)

When calculating the PW components at a rigid sphere, the radial filters d_n have to compensate for scattering effects of the array body and are thus extended to

d_n = \frac{1}{4\pi i^n \left( j_n\!\left(\frac{\omega}{c} r_0\right) - \dfrac{j_n'\!\left(\frac{\omega}{c} r_0\right)}{h_n^{(2)\prime}\!\left(\frac{\omega}{c} r_0\right)} \, h_n^{(2)}\!\left(\frac{\omega}{c} r_0\right) \right)}.  (2.16)

Radial filters, which are usually realized as FIR filters, have a significant influence on the auralization quality and have been discussed extensively in past investigations, e.g. Bernschütz (2016, pp. 90-118), Rafaely (2015, pp. 34-38), or Lösler and Zotter (2015). However, this work will not cover the radial filters in more detail.1

Figure 2.1 shows the magnitudes of the PW coefficients of a simulated broadband plane wave impinging on an ideal array at \phi = 0°, \theta = 90°, with respect to frequency and azimuth direction.

1Diagrams of the radial filters used for the auralization in this work are depicted in the appendix, see Figure A.1.

Figure 2.1: A simulated broadband plane wave impinging on an ideal array with a radius of 10 cm from the direction \phi = 0°, \theta = 90°. The figure shows the magnitudes of the plane wave coefficients for different azimuth angles at the array surface with respect to frequency. The plane wave coefficients were rendered up to an SH order of 7.

2.1.4 Binaural Reproduction

To reproduce the spatial sound field recording as a binaural VAE, the PW components have to be combined with corresponding head-related transfer functions (HRTFs) in an appropriate way. An HRTF can be understood as the transformation of a single plane wave from a sound source to the human ears. Thus, the binaural signals Y^{l,r}(\omega) that a sound field would generate at the eardrums can be calculated by weighting every PW coefficient D from direction (\phi, \theta) with the corresponding HRTF from the same direction and integrating over all directions,

Y^{l,r}(\omega) = \frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} H^{l,r}(\phi, \theta, \omega) \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(r_0, \omega) \, Y_n^m(\theta, \phi) \sin\theta \, d\theta \, d\phi,  (2.17)

where H^{l,r}(\phi, \theta, \omega) denotes the HRTFs. A slightly different method for calculating the binaural signals can be found, for example, in Ben-Hur et al. (2018). The HRTF set H^{l,r}(\phi, \theta, \omega) is transformed into the wave spectrum domain as well. By employing the property of orthogonality of the spherical harmonics (Eq. 2.8), Equation 2.17 can be rearranged to

Y^{l,r}(\omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(r_0, \omega) \, \mathring{H}_{nm}^{l,r}(\omega),  (2.18)

where \mathring{H}_{nm}^{l,r} denotes the HRTF SH coefficients. This way of binaural reproduction depends strongly on the convention of the SH basis functions, which was presented in detail by Andersson (2017, p. 7).
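In the SH domain, Eq. (2.18) reduces to a channel-wise multiply-and-sum. The following minimal sketch assumes that the sound field coefficients, the HRTF coefficients, and the radial filters are already available as NumPy arrays in the ACN channel ordering introduced above; depending on the SH convention, an additional conjugation or (-1)^m weighting may be required.

```python
import numpy as np

def render_binaural_spectrum(S_nm, H_nm, d_n):
    """Evaluate Eq. (2.18) for one ear: Y(w) = sum_{n,m} d_n S_nm H_nm.

    S_nm, H_nm: ((N+1)**2, num_bins) complex SH coefficients (ACN order).
    d_n:        (N+1, num_bins) complex radial filter spectra.
    Returns the (num_bins,) spectrum of the ear signal."""
    N = d_n.shape[0] - 1
    # expand d_n from per-order to per-(n, m) channel: n = 0, 1, 1, 1, 2, ...
    n_of_channel = np.repeat(np.arange(N + 1), 2 * np.arange(N + 1) + 1)
    return np.sum(d_n[n_of_channel] * S_nm * H_nm, axis=0)
```

An inverse temporal Fourier transform of the returned spectrum then yields the impulse response of that ear.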
2.2 Spherical Array Processing

2.2.1 Discretization - Sampling Scheme

So far, continuous knowledge of the sound pressure distribution on the array surface was assumed for the mathematical derivations. However, real-world microphone arrays do not provide a continuous sensor surface S_0, so capturing a continuous sound pressure distribution is impossible. Instead, spherical microphone arrays provide a finite number of microphones: for example, first-order arrays such as the Core Sound TetraMic or the Sennheiser AMBEO, described in Gonzalez et al. (2018), or higher-order microphone arrays such as the 19-channel Zylia2, the 32-channel EIGENMIKE (Meyer and Elko, 2002), or the planned 7th-order HØSMA 7N (Dziwis et al., 2019).

The positions of the microphones on the array surface, denoted as sampling points or sampling nodes, are defined by the sampling scheme. In the following, M_sg (sg: sampling grid) describes the number of microphones, which are placed at the positions (\phi_{sp}, \theta_{sp}) (sp denotes the index of the sampling point). Each microphone is assigned one weight w_sp, which indicates the portion of the array surface covered by the corresponding microphone. It can be shown that the arrangement of the microphone positions has a significant influence on the following derivations, even if the number of microphones remains the same. Different sampling schemes are discussed in previous studies, for example Zotter (2009a) or Zotter (2009b). Every sampling scheme has an individual order N_sg, dependent on the scheme efficiency \eta_sg. It can be calculated with M_sg ≈ \eta_sg (N_sg + 1)^2. Thus, to expand up to an SH order of N_sg, at least M_sg microphones have to be provided by the microphone array.

For simplification, the following derivations only consider arrangements with an efficiency of one and microphone positions on a single radius r_0. Thus, the dependency on r_0 vanishes. Furthermore, the direction (\phi, \theta) will be denoted as \Omega.

2https://www.zylia.co/

2.2.2 The Discrete Spatial Fourier Transform

Instead of the continuous SFT given by Equation 2.9, sampling at discrete microphone positions yields the discrete SFT (DSFT)

\mathring{G}_{nm}(\omega) = \sum_{sp=1}^{M_{sg}} w_{sp} \, G(\Omega_{sp}, \omega) \, Y_n^m(\Omega_{sp})^*.  (2.19)

Consequently, the limited number of sound field SH coefficients limits the modal order of the PWD to N_sg,

D(\Omega_d, \omega) = \sum_{n=0}^{N_{sg}} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(\omega) \, Y_n^m(\Omega_d).  (2.20)

The ideal directional Dirac impulse becomes broader, which leads to a reduced spatial resolution. Figure 2.2 shows the influence of the limitation of the PWD order. For small orders N, the number of side-lobes decreases, but their magnitudes become larger. For higher orders, the number of side-lobes increases, but their magnitude levels decrease. For ideal sampling, and thus an infinite PWD order, all side-lobes vanish.

Figure 2.2: Normalized magnitudes of PW coefficients with respect to the azimuth direction for SH rendering orders N = 5, N = 10, N = 20, and N = 30. The figure is taken from Bernschütz (2016, p. 73) to illustrate the influence of SH order truncation. Smaller N leads to fewer side-lobes with higher magnitudes, larger N to an increasing number of side-lobes, but with decreasing magnitudes. An ideal PWD with an infinite SH order leads to an ideal spatial Dirac impulse.
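A minimal sketch of the DSFT of Eq. (2.19), assuming NumPy/SciPy and quadrature weights w_sp normalized such that they sum to 4π (normalization conventions differ between toolboxes):

```python
import numpy as np
from scipy.special import sph_harm

def dsft(p, azi, col, weights, N):
    """Discrete SFT per Eq. (2.19).

    p:       (M, num_bins) complex sound pressures at the microphones.
    azi/col: (M,) azimuth and colatitude of the sampling points in radians.
    weights: (M,) quadrature weights of the sampling scheme.
    Returns ((N+1)**2, num_bins) SH coefficients in ACN channel order."""
    G_nm = np.zeros(((N + 1) ** 2, p.shape[1]), dtype=complex)
    for n in range(N + 1):
        for m in range(-n, n + 1):
            Y_nm = sph_harm(m, n, azi, col)   # SH values at the mic positions
            G_nm[n * n + n + m] = (weights * np.conj(Y_nm)) @ p
    return G_nm
```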
2.3 Technical Constraints

2.3.1 Spatial Aliasing

Besides other constraints, such as the gain limitation of the radial filters at low frequencies to avoid noise amplification, the major problem of real-world microphone arrays is the limited microphone density, which leads to spatial undersampling. One consequence of such undersampled sound field captures is called spatial aliasing and is discussed in this section.

Sampling a continuous-time signal leads to replications of the corresponding spectrum at multiples of the sampling frequency. When sampling at a rate lower than twice the highest signal frequency, frequency components of the spectral replications are aliased into the lower frequency bands (Shannon, 1998). This temporal aliasing effect is illustrated in Figure 2.3. The same applies when spatially sampling a space-continuous sound field. Sampling with a limited number of sampling points results in replications of the modal spectra in the wave spectrum domain. These replications are aliased into the lower modal wave spectra, as can be seen in Figure 2.4. However, for spatial aliasing there is no temporal frequency bound comparable to the Nyquist frequency of continuous-time sampling below which no aliasing artifacts arise. Natural sound fields are not band-limited, and thus spatial aliasing is present at every temporal frequency. Nevertheless, higher temporal frequencies are more affected by spatial aliasing than lower ones. According to Rafaely (2005), a temporal frequency f_A can be estimated above which spatial aliasing artifacts increase rapidly,

f_A = \frac{N_{sg} \, c}{2\pi r_0}.  (2.21)

Figure 2.3: Sampling a continuous-time signal with a sampling rate smaller than twice the highest signal frequency leads to the time-aliasing effect: components of the spectral replications are aliased into the lower frequency bands. The figure is taken from Ahrens (2012, p. 119).

Figure 2.4: Sampling a space-continuous signal results in replications in the wave spectrum domain. Analogous to time aliasing, portions of the wave spectrum replications of higher modes are aliased into lower modes. The figure is taken from Ahrens (2012, p. 125).

Thus, SH coefficients of orders N < N_sg are not noticeably affected by spatial aliasing; in other words, to avoid spatial aliasing up to a certain temporal frequency f, a grid order of N_sg > (2\pi f r_0)/c = k r_0 is necessary. The frequency f_A will be denoted as the spatial alias frequency in the following. According to Ben-Hur et al. (2018), the aliasing error can be expressed mathematically as

\sum_{sp=1}^{M_{sg}} w_{sp} \, Y_n^m(\Omega_{sp})^* \, Y_{n'}^{m'}(\Omega_{sp}) = \delta_{nn'} \delta_{mm'} + \varepsilon(n, m, n', m').  (2.22)

After spatial sampling, the SH basis functions are no longer completely orthogonal and thus produce an additional error \varepsilon. Neglecting trivial solutions such as increasing the number of microphones or decreasing the array radius r_0 to improve the microphone density, there exists a small number of approaches to reduce the impact of spatial aliasing. These approaches are discussed in Section 2.3.4.

2.3.2 Truncation Error

A further essential constraint induced by sparse microphone arrangements is the SH series truncation error. When describing HRIRs or spatial sound fields using SH coefficients, real-life implementations can use only a limited modal order of SH basis functions. The DSFT is limited by M_sg, as can be seen in Equation 2.19.
However, HRIRs and natural sound fields are not limited in their spatial resolution and would thus require an SH representation of infinite modal order. It is unavoidable to truncate the SH series at a certain order and cut away the higher SH modes that contain important spatial details.

This is highly relevant, especially for HRIRs. The geometry of the human pinna induces relevant localization cues in the high-frequency components. Representing this localization information in the SH domain requires a high spatial resolution and hence high SH orders. Neglecting these modes leads to audible artifacts in binaural renderings, such as reduced spaciousness or coloration effects. In particular, the contralateral ear signal, which is not exposed to direct sound incidence, is more affected by the truncation error. HRIR data sets are usually measured on sampling grids with an adequate resolution to store all necessary spatial details. However, only a limited number of sound field signals is obtained by spherical microphone arrays. When merging high modal resolution HRIRs with limited modal resolution sound field data, the truncation effect, and thus a loss of spatial detail, is unavoidable.

Furthermore, similar to windowing a time signal with a harsh rectangular window, which results in side-lobes in the frequency domain, harsh truncation of the SH series results in artifacts in the wave spectrum domain. This effect is illustrated in Figure 2.2 and Figure 2.5. Truncating the SH series at lower orders leads to fewer side-lobes, but with higher amplitude and energy; truncating at higher orders yields multiple side-lobes with decreased magnitudes. Figure 2.5 (bottom right) was generated by rendering up to an SH order of N = 85 to simulate a nearly ideal spatial Dirac impulse.

Figure 2.5: Normalized logarithmic magnitudes of SH coefficients of a 0° spatial Dirac impulse computed with Equation 2.8 for maximum SH orders N = 3 (black), N = 5 (blue), N = 7 (red), and N = 85 (grey).

2.3.3 Consequences of Undersampled Microphone Array Data Renderings

Using undersampled microphone array recordings for rendering a VAE results in the spatial aliasing effect and, at the same time, in SH order truncation. Therefore, it is hard to distinguish between the consequences of these effects. However, spatial aliasing solely depends on the number of microphones, or more precisely on the density of the microphone arrangement, whereas the truncation effect depends only on the SH rendering order. Thus, both effects result in different and even contrary audible artifacts. Ben-Hur et al. (2019) denoted the overall error caused by sparse microphone constellations as the sparsity error and mathematically broke down the differences between the aliasing and truncation effects.

Spatial undersampling yields two different artifacts in binaural reproductions: on the one hand a spectral modification, and on the other hand a loss of spatial information. Figure 2.6 illustrates both artifacts. Similar to Figure 2.1, it shows the magnitudes of the PW coefficients of a simulated broadband plane wave impinging on an array at \phi = 0°, \theta = 90° with respect to frequency and azimuth angle, this time on a real-life array with a 10 cm radius and a 5th order Lebedev sampling scheme.
Whereas for the ideal array the magnitudes for certain azimuth directions remain constant over all frequencies, the spatial resolution of the real-life array is totally lost at high frequencies. Additionally, an overall increased level of the PW coefficients can be observed at high frequencies.

The influence of spatial aliasing and truncation error on the time-frequency spectrum is illustrated in Figure 2.7. Both diagrams show the frequency responses of 90° BRIRs to the left ear resulting from binaural reproductions of array impulse responses measured in Control Room 7 at the WDR broadcasting studios (Section 3.1 gives a more detailed explanation of the array measurements by Stade et al. (2012) used for the renderings in this work).

Figure 2.6: Plane wave impact at \phi = 0°, \theta = 90° on a real-life microphone array with a radius of 10 cm and a 7th order Lebedev grid. The figure shows the magnitudes of the PW coefficients for different azimuth angles at the array surface with respect to frequency. The PW coefficients were rendered up to an SH order of 7.

The left-hand figure depicts BRTFs (binaural room transfer functions) based on 1202 sampling point Lebedev grid array impulse responses. The black BRTF was rendered up to an SH order of 29, the blue one up to an SH order of 5. Hence, both BRTFs are not notably affected by spatial aliasing, but the blue curve is significantly more affected by the truncation error. It can be seen that the truncation error yields a decreased level at higher frequencies and thus has a low-pass characteristic. The right-hand diagram shows BRTFs based on a 1202 Lebedev grid (black curve) and a 50 sampling point Lebedev grid (blue curve), both rendered up to an SH order of 5. Hence, both renderings contain the same truncation error, but only the blue BRTF is influenced by spatial aliasing. It can be observed that spatial aliasing leads to an increased level at higher frequencies and thus has a high-pass characteristic. Both illustrations show the effect on the time-frequency spectrum for one direction only. However, it can be shown that spatial aliasing is present in the same amount for all directions, whereas the truncation error is highly direction dependent and yields spectral artifacts in particular for lateral directions, see e.g. Zaunschirm et al. (2018) or Hold et al. (2019).

2.3.4 Reducing the Impairment of Spatial Undersampling

This section gives an overview of common approaches to reduce the perceptual artifacts of spatially undersampled array auralizations. It distinguishes between aliasing and truncation mitigation approaches; however, some of them address both effects.

Figure 2.7: 90° BRTF renderings to the left ear based on array impulse responses from Control Room 7 at the WDR studios. The left-hand figure displays two BRTFs resulting from measurements on a 1202 sampling point Lebedev grid, the black one rendered up to a maximum SH order of 29, the blue one up to an SH order of 5. The differences at higher frequencies illustrate the spectral influence of the truncation error.
The right-hand figure shows two BRTFs resulting from renderings up to an SH order of 5, the blue curve from a measurement on a Lebedev grid with 50 sampling points, the black one from a Lebedev grid with 1202 sampling points. This illustrates the spectral influence of spatial aliasing. Both BRTFs were fractional-octave smoothed over 1/3 octave (this holds for every figure displaying frequency responses in this work).

2.3.4.1 Spectral Equalization

As already mentioned, one artifact of spatial aliasing is an increased sound pressure level at higher frequencies. The simplest way to compensate for this spectral artifact is to equalize the resulting binaural signals. Bernschütz (2016, pp. 129-130) found that the spectral increase follows a certain regularity. Figure 2.8 (left), taken from Bernschütz (2016, p. 130), shows the logarithmic deviation between 7th order array renderings and the corresponding dummy head measurements, averaged over 360 azimuth directions. For the array renderings, the author used HRTFs with a limited modal resolution (RHRTFs) (Bernschütz, 2016, pp. 83-86). These RHRTFs are based on SH interpolation and prevent the truncation effect, so the depicted frequency responses are affected by spatial aliasing only. Figure 2.8 illustrates that for diffuse sound fields, the deviation increases nearly linearly with a 6 dB per octave slope for frequencies above the spatial alias frequency. The right-hand figure shows the resulting compensation filter, generated by inverting the average deviation curve. To calculate a generative filter with respect to the array configuration, the author proposed an ordinary first-order low-pass filter with a cut-off at the spatial alias frequency.

To evaluate this, Figure 2.9 was created. It shows the average logarithmic deviation of binaural signals rendered from 1202 and 50 sampling point array impulse responses at the same SH order. Just like in Figure 2.7 (right), the 50 sampling point rendering is affected by spatial aliasing. The blue curve in the upper figure shows the logarithmic deviation of a 90° BRTF to the left ear, the black curve the logarithmic deviation averaged over left and right BRTFs for 360 azimuth directions. Bernschütz (2016) compared dummy head BRIRs to RHRTF array renderings and found that the deviation follows a 6 dB per octave slope. In Figure 2.9, however, it can be observed that the deviation increases nearly linearly with 12 dB per octave. A generative low-pass filter matching the inverse of this curve can be realized as a second-order low-pass filter, depicted in the bottom figure for the array orders 5, 7, and 10. Nevertheless, for consistency, the discussion and the preparation of the listening experiment stimuli refer to the first-order equalization filters as proposed by Bernschütz (2016). Figure 2.10 illustrates the result of the equalization using second-order low-pass filters. Again, frequency responses of BRIRs rendered with 1202 and 50 sampling point array data are compared. Additionally, the blue curve shows the equalized response, which closely matches the desired grey 1202 sampling point BRTF.
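The following minimal sketch derives the cut-off from the spatial alias frequency of Eq. (2.21) and applies a second-order Butterworth low-pass to a rendered binaural signal, assuming SciPy; it is a stand-in for the generative filters above, which are matched to the measured deviation curves.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def aliasing_eq(brir, N_sg, r0, fs, c=343.0):
    """Equalize the ~12 dB/octave aliasing-induced high-frequency boost with a
    second-order low-pass cut off at the spatial alias frequency, Eq. (2.21)."""
    f_a = N_sg * c / (2 * np.pi * r0)
    sos = butter(2, f_a, btype="low", fs=fs, output="sos")
    return sosfilt(sos, brir)

# e.g. for the 50-node renderings: N_sg = 5, r0 = 0.1 m, fs = 48 kHz
# brir_eq = aliasing_eq(brir, N_sg=5, r0=0.1, fs=48000)
```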
Figure 2.8: On the left-hand side, the average logarithmic deviations of 7th order array renderings and corresponding dummy head measurements are depicted (simulation, broadcast studio, and control room measurements). The deviation matches a linear 6 dB per octave increase for frequencies above the alias frequency. The right-hand diagram illustrates the resulting spectral compensation filter derived from inverting the average deviation curve. The figure is taken from Bernschütz (2016, p. 130).

Figure 2.9: Logarithmic deviation curves of 5th order array renderings done with 50 and 1202 sampling point array measurements. The blue curve depicts the logarithmic deviation for a 90° BRTF to the left ear, the black curve the deviation averaged over both ears and 360 azimuth directions. The bottom figure displays generative second-order low-pass filters for the SH orders 5, 7, and 10.

Figure 2.10: 90° BRTF to the left ear based on array renderings of 1202 (grey) and 50 (black and blue) sampling point Lebedev grid measurements. Both were rendered up to an SH order of 5. The black curve is significantly affected by spatial aliasing; the blue curve was equalized by a second-order low-pass filter.

Similarly, the spectral modification caused by the truncation error can be equalized by appropriate compensation filters. Ben-Hur et al. (2017) introduced the so-called Spherical Head Filters (SHF), which are also applied to the binaural signals. Similar to the assumptions for the spatial aliasing compensation filters, Ben-Hur et al. (2017) assume the average response of a diffuse sound field, but additionally approximate the human head as a rigid sphere. Figure 2.11 depicts the proposed SHF for different SH orders N. These filters were designed under the assumption of negligible spatial aliasing.

Figure 2.11: The so-called Spherical Head Filters for the SH orders 1, 3, 6, and 12 introduced by Ben-Hur et al. (2017). They assume negligible spatial aliasing and compensate for spectral artifacts induced by SH order truncation.

The two spectral compensation filters are proposed either as truncation or as spatial aliasing compensation filters. In real life, however, both effects are present at the same time. In Section 2.3.5.5, a combination of both filters compensating the overall undersampling error is discussed.

2.3.4.2 Bandwidth Extension Algorithm for Microphone Arrays

The Bandwidth Extension Algorithm for Microphone Arrays (BEMA) has been developed as a spatial aliasing mitigation approach, see Bernschütz (2012) or Bernschütz (2016). The sound field SH coefficients of the higher frequency bands that are affected by spatial aliasing are substituted with synthesized SH coefficients. As all SH coefficients above the alias frequency are replaced, the BEMA approach compensates not just the spatial aliasing artifacts, but also the truncation error. The basic idea is to use certain properties of the reliably obtainable frequency bands to estimate the SH coefficients of the higher frequencies.
To this end, spatial and spectral properties are treated separately. The spatial energy distribution is extracted from the SH coefficients of frequency bands below the alias frequency, captured by the microphone array. The total energy of the higher frequencies is obtained by an additional omnidirectional microphone, ideally located in the array center. This approach is based on the assumption of a natural spatial sound field, where the energy distributions of adjacent frequency bands are similar.

The sound field SH coefficients \mathring{S}_{nm}(\omega) with respect to the frequencies (\omega_0, \omega_1, \ldots, \omega_K) can be expressed as the (N+1)^2 \times (K+1) matrix

\mathring{S}_{nm}(\omega) = \begin{pmatrix} \mathring{S}_{0,0}(\omega_0) & \mathring{S}_{0,0}(\omega_1) & \ldots & \mathring{S}_{0,0}(\omega_K) \\ \mathring{S}_{1,-1}(\omega_0) & \mathring{S}_{1,-1}(\omega_1) & \ldots & \mathring{S}_{1,-1}(\omega_K) \\ \mathring{S}_{1,0}(\omega_0) & \mathring{S}_{1,0}(\omega_1) & \ldots & \mathring{S}_{1,0}(\omega_K) \\ \mathring{S}_{1,1}(\omega_0) & \mathring{S}_{1,1}(\omega_1) & \ldots & \mathring{S}_{1,1}(\omega_K) \\ \vdots & \vdots & \ddots & \vdots \\ \mathring{S}_{N,N}(\omega_0) & \mathring{S}_{N,N}(\omega_1) & \ldots & \mathring{S}_{N,N}(\omega_K) \end{pmatrix}.  (2.23)

The matrix \mathring{S}^a_{nm}(\omega) (Eq. 2.24), as a subset of \mathring{S}_{nm}(\omega) with respect to the frequencies (\omega_a, \ldots, \omega_{K-1}, \omega_K), holds the aliased SH coefficients, which are replaced with the synthesized BEMA SH coefficients. According to Equation 2.21, \omega_a can be approximated as \omega_a \approx N_{sg} c / r_0.

\mathring{S}^a_{nm}(\omega) = \begin{pmatrix} \mathring{S}_{0,0}(\omega_a) & \ldots & \mathring{S}_{0,0}(\omega_{K-1}) & \mathring{S}_{0,0}(\omega_K) \\ \mathring{S}_{1,-1}(\omega_a) & \ldots & \mathring{S}_{1,-1}(\omega_{K-1}) & \mathring{S}_{1,-1}(\omega_K) \\ \vdots & \ddots & \vdots & \vdots \\ \mathring{S}_{N,N}(\omega_a) & \ldots & \mathring{S}_{N,N}(\omega_{K-1}) & \mathring{S}_{N,N}(\omega_K) \end{pmatrix}  (2.24)

To extract the reliable spatial information, denoted as the spatiotemporal image \mathring{I}_{nm}, an appropriate source band has to be defined. It is preferable to choose source bands as close as possible to the synthetic bands. In the following, the W source bins below the alias frequency are denoted as \omega_{a-\mu}, \mu = 1, \ldots, W. The spatial image can be computed with

\mathring{I}_{nm} = \frac{1}{W} \sum_{\mu=1}^{W} d_n\!\left(\frac{\omega_{a-\mu}}{c} r_0\right) \mathring{S}_{nm}(\omega_{a-\mu}).  (2.25)

Through the multiplication with d_n, the dependency on the radial functions vanishes. The spatiotemporal image should hold only information about the spatial energy distribution, as the total energy is taken from the center microphone. Therefore, it has to be power normalized according to

p = \frac{1}{W} \sum_{n=0}^{N} \sum_{m=-n}^{n} \sum_{\mu=1}^{W} \left| d_n\!\left(\frac{\omega_{a-\mu}}{c} r_0\right) \mathring{S}_{nm}(\omega_{a-\mu}) \right|^2  (2.26)

\tilde{p} = \sum_{n=0}^{N} \sum_{m=-n}^{n} \left| \mathring{I}_{nm} \right|^2  (2.27)

\mathring{I}^{\mathrm{norm}}_{nm} = \sqrt{\frac{p}{\tilde{p}}} \, \mathring{I}_{nm}.  (2.28)

The center signal c_{\mathrm{norm}}(\omega) is normalized to its own average level in the source band, and the synthetic BEMA SH coefficients can be calculated with

\mathring{S}^{\mathrm{BEMA}}_{nm}(\omega) = \frac{1}{d_n\!\left(\frac{\omega}{c} r_0\right)} \underbrace{\mathring{I}^{\mathrm{norm}}_{nm}}_{\text{spatial information}} \underbrace{c_{\mathrm{norm}}(\omega)}_{\text{spectral information}}.  (2.29)

Here, the corresponding radial filter d_n is taken into account again. The author proposes to optimize the phase of the synthetic SH coefficients. For this, the phase of the synthetic SH coefficients is shifted by a phase offset \Delta\phi obtained at the transition bin,

\Delta\phi = \arg\!\left(\frac{\mathring{S}_{00}(\omega_a)}{\mathring{S}^{\mathrm{BEMA}}_{00}(\omega_a)}\right)  (2.30)

\mathring{S}^{\mathrm{BEMA}\prime}_{nm}(\omega) = \mathring{S}^{\mathrm{BEMA}}_{nm}(\omega) \, e^{j\Delta\phi}.  (2.31)

Further, it is reasonable to fade in the synthetic coefficients instead of using a harsh crossover. Bernschütz (2012) introduced one feasible crossover method based on a smooth overlap of synthetic and original SH coefficients, which will not be derived in more detail here.

For a single plane wave, knowledge of the pressure distribution on the array surface at a single frequency bin allows obtaining the pressure distribution over the entire frequency range. Hence, BEMA works perfectly for a single plane wave. However, Bernschütz (2016, pp. 144-146) found that even for three time-shifted plane waves impinging on the array surface, the BEMA algorithm results in notable artifacts. It thus remains unclear whether BEMA improves the auralization of array recordings of diffuse sound fields.
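A minimal sketch of the spatiotemporal image computation of Eqs. (2.25)-(2.28), assuming NumPy arrays in ACN channel order and radial filter spectra given per order n; the center-signal processing, phase optimization, and crossover are omitted.

```python
import numpy as np

def bema_spatial_image(S_nm, d_n, bin_a, W):
    """Power-normalized spatiotemporal image per Eqs. (2.25)-(2.28).

    S_nm:  ((N+1)**2, K) sound field SH coefficients (ACN channel order).
    d_n:   (N+1, K) radial filter spectra.
    bin_a: index of the alias bin, W: number of source bins below it."""
    N = d_n.shape[0] - 1
    n_of_channel = np.repeat(np.arange(N + 1), 2 * np.arange(N + 1) + 1)
    src = np.arange(bin_a - W, bin_a)                 # source bins w_(a-mu)
    filtered = d_n[n_of_channel][:, src] * S_nm[:, src]
    I_nm = filtered.mean(axis=1)                      # Eq. (2.25)
    p = np.sum(np.abs(filtered) ** 2) / W             # Eq. (2.26)
    p_img = np.sum(np.abs(I_nm) ** 2)                 # Eq. (2.27)
    return I_nm * np.sqrt(p / p_img)                  # Eq. (2.28)
```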
2.3.4.3 Magnitude Least Squares

Schörkhuber et al. (2018) introduced a method for reducing the impact of order truncation denoted as Magnitude Least-Squares (MagLS). This method optimizes the HRTF set in such a way that it leads to lower truncation errors after the binaural rendering. The MagLS approach consists of two parts. First, the HRTFs are modified such that the energy in higher SH orders is reduced without notably decreasing the perceptual quality.

This modification is an advancement of the time alignment (TA) approach by Zaunschirm et al. (2018). According to the Duplex Theory by Rayleigh (1907), at high frequencies the interaural time differences (ITDs) become perceptually less important than the interaural level differences (ILDs). At the same time, most of the energy in the higher modes is caused by rapid phase changes towards higher frequencies. Thus, removing the linear phase at higher frequencies decreases the energy in higher modes without significantly modifying the ILDs,

H_{\mathrm{TA}}(\Omega, \omega) = \begin{cases} H(\Omega, \omega) & \text{if } \omega \le \omega_c \\ H(\Omega, \omega) \, e^{-j\phi_l(\Omega, \omega)} & \text{if } \omega > \omega_c \end{cases},  (2.32)

where \phi_l(\Omega, \omega) is the linear phase with respect to the ITD for direction \Omega, and \omega_c the cut-off frequency, which was chosen empirically by the authors. MagLS aims at finding the optimum phase instead of completely removing the linear phase. This can be achieved by solving the least-squares problem that minimizes the distance between the magnitudes of the modified HRTF set and a reference set,

\min_{\mathring{H}_{nm}(\omega)} \sum_{sp=1}^{M_{sg}} \left[ \left| \sum_{n=0}^{N} \sum_{m=-n}^{n} Y_n^m(\Omega_{sp}) \mathring{H}_{nm}(\omega) \right| - \left| H(\Omega_{sp}, \omega) \right| \right]^2.  (2.33)

Deriving this least-squares procedure would be beyond the scope of this thesis; the reader is referred to Schörkhuber et al. (2018).

After modifying the phase at higher frequencies with either the TA or the MagLS approach, in a second step the HRTFs are optimized using a so-called covariance constraint matrix R(\omega). This matrix equalizes the spectrum based on a statistical diffuse sound field expectation. For a detailed derivation of R(\omega), the reader is referred to Zaunschirm et al. (2018) or Zotter and Frank (2019).

Figure 2.12: Normalized logarithmic energy distributions of HRTF SH coefficients rendered up to the orders N = 1 to N = 30. All energy plots are based on a KU100 HRIR dataset. The left-hand figure depicts the energy distribution of the untreated HRIR set, the right-hand figures the distributions of the MagLS pre-processed HRIRs for the target rendering orders N = 3, N = 5, and N = 7. The color encodes the energy ranging from -40 dB to 0 dB, normalized to the maximum value for each frequency bin.

Figure 2.12 illustrates the efficient energy concentration of the MagLS approach for the different target rendering orders N = 3, N = 5, and N = 7. All four plots are based on the KU100 2702 HRIR dataset by Bernschütz (2013). They display the energies of the SH coefficients rendered for orders N = 1 up to N = 30 with respect to frequency, normalized to the maximum value for each frequency bin. The left-hand figure was processed with the untreated HRIR set, the right-hand figures with HRIRs that were pre-processed for the target SH rendering orders 3, 5, and 7. The color indicates the energies ranging from -40 dB to 0 dB. It can clearly be seen that for the pre-processed HRIR sets the energy is concentrated in the lower orders. A number of further HRIR pre-processing approaches have been discussed and evaluated by Brinkmann and Weinzierl (2018).
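A minimal sketch of the magnitude least-squares fit of Eq. (2.33) using the common alternating iteration (see e.g. Zotter and Frank, 2019): alternately project onto the order-limited SH subspace and re-impose the measured magnitudes while keeping the phases of the current fit. It assumes an SH matrix Y evaluated at the HRTF grid directions; below the cut-off ω_c, the ordinary complex least-squares solution is kept instead.

```python
import numpy as np

def magls_bin(Y, H, n_iter=20):
    """Magnitude least-squares SH coefficients for one ear and one frequency
    bin above the cut-off. Y: (M, (N+1)**2) SH matrix at the HRTF directions,
    H: (M,) measured HRTFs at this bin. Returns ((N+1)**2,) coefficients."""
    Y_pinv = np.linalg.pinv(Y)
    h_nm = Y_pinv @ H                       # complex least-squares start
    for _ in range(n_iter):
        phase = np.angle(Y @ h_nm)          # keep the phases of the fit ...
        h_nm = Y_pinv @ (np.abs(H) * np.exp(1j * phase))   # ... match |H|
    return h_nm
```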
2.3.4.4 Tapering

Hold et al. (2019) introduced a method denoted as spherical harmonic tapering to reduce the side-lobes induced by order truncation. Hard truncation of the SH series is equivalent to windowing the SH coefficients with a rectangular window, which results in unwanted side-lobes. Windowing the SH coefficients with appropriately smoother window functions yields a fade-out of the higher-order modes, which ideally results in side-lobe suppression. Hold et al. (2019) proposed a half-sided Hanning window to achieve this softer truncation; a sketch of the window computation is given below. Furthermore, the authors propose to equalize the binaural signals using the Spherical Head Filters discussed in the previous section. However, the Hanning tapering requires slightly modified SHF. These modified filters are depicted in Figure 2.14.

Figure 2.13: Normalized logarithmic magnitudes of SH coefficients of a spatial Dirac impulse in 0° direction computed with Equation 2.8 for maximum SH orders N = 3, N = 5, and N = 7. For each order, the rectangular windowed (grey dashed) and Hanning windowed (colored) SH coefficients are depicted. SH tapering using a Hanning window results in significant side-lobe suppression, but also in a wider main-lobe with an increased level.

Figure 2.14: Hold et al. (2019) propose to equalize the resulting binaural signals after weighting the sound field SH coefficients with a tapering window. The Spherical Head Filters have to be adapted with respect to the window, which is shown in the figure taken from Hold et al. (2019). The adaptation basically results in a shift of the cut-off towards higher frequencies.
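A minimal sketch of such per-order tapering weights, assuming NumPy; the length of the fade region is a design parameter, and the exact window definition in Hold et al. (2019) may differ in detail.

```python
import numpy as np

def hann_taper(N, fade=None):
    """Per-order weights: unity up to the fade region, then the falling half
    of a Hann window over the last `fade` orders (half-sided Hanning taper)."""
    fade = (N + 1) // 2 if fade is None else fade
    w = np.ones(N + 1)
    half = np.hanning(2 * fade + 1)[fade:]    # values 1 ... 0, length fade + 1
    w[N + 1 - fade:] = half[1:]               # taper the highest orders
    return w

# apply to SH coefficients in ACN order, e.g. for N = 5:
# w_acn = np.repeat(hann_taper(5), 2 * np.arange(6) + 1)
# S_nm_tapered = w_acn[:, np.newaxis] * S_nm
```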
2.3.4.5 Matrix Regularization

Alon and Rafaely (2012; 2017) developed a method for spatial aliasing cancellation based on matrix regularization. For this, it is advantageous to introduce the matrix notation of the PWD: according to Equation 2.15, the PW components D(\omega) can be calculated from the sound pressure SH coefficients \mathring{P}_{nm} as

D = Y \, \mathrm{diag}(b) \, \mathring{P}_{nm},  (2.34)

with the radial functions b and the SH basis function matrix

Y = \begin{pmatrix} Y_0^0(\Omega_1) & Y_1^{-1}(\Omega_1) & \ldots & Y_{N_{sg}}^{N_{sg}}(\Omega_1) \\ Y_0^0(\Omega_2) & Y_1^{-1}(\Omega_2) & \ldots & Y_{N_{sg}}^{N_{sg}}(\Omega_2) \\ \vdots & \vdots & \ddots & \vdots \\ Y_0^0(\Omega_{M_{sg}}) & Y_1^{-1}(\Omega_{M_{sg}}) & \ldots & Y_{N_{sg}}^{N_{sg}}(\Omega_{M_{sg}}) \end{pmatrix}.  (2.35)

As long as the sound field is limited to an SH order of N, such that M_sg \ge (N+1)^2 is fulfilled, the computation of the PW components from the sound pressure signals P(\Omega, \omega) can be assumed to be error-free. By rearranging this equation, an expression for the sound field SH coefficients can be found,

\mathring{P}_{nm} = \mathrm{diag}(b)^{-1} Y^* D.  (2.36)

However, real-life sound fields are not limited in their expansion order. Hence, for undersampled sound field recordings with M_sg \le (N+1)^2, the PWD produces an error and Equation 2.34 transforms to

D = \tilde{Y} \, \mathrm{diag}(\tilde{b}) \, \tilde{\mathring{P}}_{nm},  (2.37)

with \tilde{Y} = [Y, Y_\Delta], \tilde{b} = [b^T, b_\Delta^T]^T, and \tilde{\mathring{P}}_{nm} = [\mathring{P}_{nm}^T, \mathring{P}_\Delta^T]^T, where (\cdot)_\Delta denotes the higher-order parts affected by spatial aliasing. Consequently, the aliased SH coefficients can be expressed as

\mathring{P}^{\mathrm{alias}}_{nm} = \mathrm{diag}(b)^{-1} Y^* D = \underbrace{\underbrace{\mathrm{diag}(b)^{-1} Y^*}_{\text{stable}} \, \underbrace{\tilde{Y} \, \mathrm{diag}(\tilde{b})}_{\text{aliased}}}_{A_p} \, \tilde{\mathring{P}}_{nm},  (2.38)

where the matrix A_p is denoted as the aliasing projection matrix. The first part of A_p consists of the stably obtainable wave spectrum information, the second part describes the aliased information. It can be written as [1, \Delta_\varepsilon] (1 = unit matrix, \Delta_\varepsilon = aliasing error pattern). \mathring{P}^{\mathrm{alias}}_{nm} can thus be broken down to

\mathring{P}^{\mathrm{alias}}_{nm} = \mathring{P}_{nm} + \Delta_\varepsilon \, \tilde{\mathring{P}}_\Delta.  (2.39)

The aim of the matrix regularization is to minimize the term \Delta_\varepsilon \tilde{\mathring{P}}_\Delta. For this purpose, a regularization matrix C_{AC} is introduced,

\mathring{P}_{nm} \overset{!}{=} C_{AC} \, \mathring{P}^{\mathrm{alias}}_{nm} = C_{AC} \, A_p \, \tilde{\mathring{P}}_{nm} = C_{AC} \, \mathring{P}_{nm} + C_{AC} \, \Delta_\varepsilon \, \tilde{\mathring{P}}_\Delta.  (2.40)

One measure to express the overall aliasing error \varepsilon_{\mathrm{PWD}} is the squared Euclidean distance between the ideal SH coefficients \mathring{P} and the aliased coefficients \tilde{\mathring{P}}: \varepsilon_{\mathrm{PWD}} = \| \tilde{\mathring{P}} - \mathring{P} \|^2. Thus, ideally, multiplication with C_{AC} should zero the spatial aliasing error. Substituting C_{AC} A_p and calculating the minimum by differentiating and setting to zero yields the final aliasing regularization matrix

C_{AC} = (A_p A_p^H)^{-1} A_p,  (2.41)

with A_p = \mathrm{diag}(b)^{-1} Y^* \tilde{Y} \, \mathrm{diag}(\tilde{b}) and (\cdot)^H the Hermitian operator. Alon and Rafaely (2017) showed that this approach works well for simulated plane wave scenarios. However, it is unclear whether it produces satisfying improvements for real-life diffuse sound fields. The matrix regularization approach would be highly suitable for real-time applications, because the C_AC matrix depends only on the radial filters and could thus be completely pre-rendered. Furthermore, there are a number of further regularization approaches discussed in the literature that prevent different artifacts, for example the reduction of noise gain. These regularizations could easily be combined.
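The regularization idea can be sketched as a least-squares fit: find C_AC such that C_AC A_p is as close as possible to [1, 0], i.e. pass the non-aliased low-order coefficients and suppress the aliased remainder. The helper below is a hypothetical simplification of Eq. (2.41), assuming NumPy; the exact construction of A_p from the radial filters follows Alon and Rafaely (2017).

```python
import numpy as np

def aliasing_cancellation_matrix(A_p, n_low):
    """Least-squares regularization matrix: C_AC @ A_p ~ [I, 0], i.e. keep the
    n_low non-aliased SH channels and suppress the higher-order remainder.
    A_p: (n_low, n_high) aliasing projection matrix with n_high >= n_low."""
    target = np.zeros((n_low, A_p.shape[1]), dtype=complex)
    target[:, :n_low] = np.eye(n_low)
    return target @ np.linalg.pinv(A_p)   # minimizes the Frobenius-norm error
```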
A further limitation is the realization of the additional expanded transducer itself. The expansion of the sensing membrane could be simulated by multiple narrow microphones; however, such a design is highly challenging. Bernschütz (2016, pp. 132-138) extensively discusses expanded transducers and mentions an approach for using such a sub-array to implement the spatial anti-aliasing filtering.

Figure 2.15: The left-hand figure, taken from Rafaely et al. (2007), illustrates the magnitudes of the modal filters designed for reducing the energy of SH orders greater than 4. It shows the ideal spatial anti-aliasing filter, as well as rectangular and Hamming window spatially constrained filters. The right-hand figure, taken from Elahi et al. (2019), shows similar filter magnitude responses, but designed for damping SH orders larger than 5. Additionally, the proposed modal filter based on a Slepian eigenfunction window is depicted.

2.3.5 Analysis of the Resulting Binaural Signals

Informal prior listening tests showed that not all approaches described in Section 2.3.4 yield satisfying audible improvements. Thus, only a few algorithms will be investigated in more detail. This section discusses the most promising approaches and their influence on the binaural signals.

2.3.5.1 MagLS

The Magnitude Least-Squares algorithm (MagLS) is an approach for mitigating the truncation error. Figure 2.16 and Figure 2.17 show the comparison of two renderings based on array impulse response measurements in the CR7 on a 1202 sampling node Lebedev grid, for the SH orders 29 and 5. This ensures that both renderings are affected by the same spatial aliasing, but by different truncation errors. The figures show left (blue) and right (red) ear signals separately, which makes it easier to illustrate the directional dependency of the truncation error. In particular, the 90° BRTFs (Fig. 2.17) demonstrate that the contralateral left ear signals are more affected by the truncation error than the ipsilateral right ear signals. The shape of the frequency response at the left ear is completely modified, whereas the right ear signal is merely damped with a nearly constant offset towards higher frequencies. Comparing the 0° and 90° ear signals shows that the truncation error has a different impact for different directions. However, the MagLS algorithm yields a significant improvement for both directions.

Figure 2.16: MagLS: Frequency responses of 0° left and right ear signals resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (colored curves). The upper diagrams show the influence of the truncation error for the right and left ear separately. The bottom diagrams depict the MagLS improvements. It can be seen that they work equally well for both ear signals.
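Although the exact implementation used in this work was provided as MATLAB code by its authors (Zaunschirm et al., 2018), the core of the MagLS idea fits in a few lines. The sketch below, with hypothetical inputs, fits the HRTF SH coefficients by complex least squares below a crossover frequency and, above it, matches only the magnitudes while propagating the phase from the previous frequency bin.

import numpy as np

def magls_coefficients(H, Y, f, f_c):
    """Sketch of the MagLS idea (after Zaunschirm et al., 2018).

    H   : (num_freqs, num_dirs) complex HRTFs on a measurement grid
    Y   : (num_dirs, num_coeffs) SH basis matrix of that grid
    f   : (num_freqs,) frequency vector
    f_c : crossover frequency; below it a conventional least-squares
          fit is used, above it only the magnitude is matched and the
          phase is taken from the previous bin's reconstruction.
    """
    Y_pinv = np.linalg.pinv(Y)
    H_nm = np.zeros((len(f), Y.shape[1]), dtype=complex)
    for k, fk in enumerate(f):
        if fk < f_c or k == 0:
            H_nm[k] = Y_pinv @ H[k]            # complex least squares
        else:
            phase = np.angle(Y @ H_nm[k - 1])  # phase of previous bin's fit
            H_nm[k] = Y_pinv @ (np.abs(H[k]) * np.exp(1j * phase))
    return H_nm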
2.3.5.2 Spherical Head Filter - SHF

The Spherical Head Filters (SHF) are global filters applied to the binaural ear signals, compensating for the spectral artifacts induced by SH order truncation. Figure 2.18 (left) depicts the SHFs for the orders N = 3, 5, and 7. The improvements that can be achieved with the SHF filtering are illustrated in Figure 2.19. The binaural signals were rendered based on the same array configurations as for the MagLS Figures 2.16 and 2.17, however, just for the left ear.

Figure 2.17: MagLS: Frequency responses for 90° left (contralateral) and right (ipsilateral) ear signals resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (colored curves). The upper diagrams show the influence of the truncation error for the right and left ear separately. The bottom diagrams depict the MagLS improvements. Together with Figure 2.16, this shows that MagLS achieves remarkable improvements independent of the direction.

The ear signal for 0° (left) shows a significant improvement of the frequency response. Comparing it to the 90° signals demonstrates that the global filter, applied to all directions equally, yields unwanted effects for the more critical lateral frequency responses. It thus compensates the coloration for frontal directions more than for lateral directions. However, on average, the SHF filtering leads to a spectral improvement for all directions.

2.3.5.3 Tapering

The Tapering algorithm attempts to mitigate the harsh truncation of the SH order series by applying a Hanning window based weighting of the SH coefficients. Additionally, the investigators propose to apply a slightly modified version of the SHF. Figure 2.20 displays the binaural signals resulting from the same renderings as before (e.g. Fig. 2.19), but involving the Tapering algorithm. Since the observation of frequency responses for single directions mainly illustrates the spectral improvements, Figure 2.20 basically indicates the influence of the modified SHF filtering. However, informal listening probes revealed that the Tapering yields audible improvements for lateral directions. The final listening experiment will show whether the Tapering has a perceivable influence, or whether the SHF filtering yields the more significant improvement.

Figure 2.18: SHF: Spherical Head Filters and aliasing compensation filters (left) for multiple SH orders N and grid orders Nsg. The aliasing compensation filters (with negative slope) depend on the grid order, the SHF (with positive slope) solely on the SH order. The right-hand figure depicts the combination of both filters for different N-Nsg configurations, resulting in the global undersampling filters (GEQ).
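A minimal sketch of such a tapering window is given below, assuming the half-sided Hanning scheme of Hold et al. (2019) with roughly the upper half of the orders faded out; the exact window layout in the reference implementation (spaudiopy) may differ.

import numpy as np

def half_hann_taper(N):
    """Per-order tapering weights: ones for the lower orders, then the
    falling half of a Hann window over roughly the upper half of the
    orders (fading the highest order towards zero)."""
    weights = np.ones(N + 1)
    n_taper = (N + 1) // 2                 # taper roughly the upper half
    hann = np.hanning(2 * n_taper + 1)     # symmetric Hann window
    weights[N + 1 - n_taper:] = hann[n_taper + 1:]
    return weights

def expand_to_modes(order_weights):
    """Repeat each order weight for its 2n+1 degrees, matching the
    (n, m) coefficient ordering of Eq. (2.35)."""
    return np.concatenate([np.full(2 * n + 1, w)
                           for n, w in enumerate(order_weights)])

# Example: weights for SH coefficients truncated at N = 5
w_nm = expand_to_modes(half_hann_taper(5))   # length (5+1)**2 = 36
# tapered coefficients: w_nm * P_nm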
2.3.5.4 BEMA

Figure 2.21 and Figure 2.22 were generated to study the improvement of BEMA. For this purpose, array measurements in the CR7 on a 1202 sampling node grid (black curves) and a 50 sampling node grid (colored curves) are rendered up to the same SH order of 5. They are affected by the same truncation error, but by different amounts of spatial aliasing. The four upper diagrams show an increased level for the 50 node rendering and thus demonstrate the impact of spatial aliasing. The bottom diagrams depict the BEMA-treated signals, which are mostly slightly improved compared to the untreated signals. However, some signals, for example the right ear signal for 90°, are even worse than the raw array rendering. In prior informal listening tests, it turned out that BEMA mostly yields unsatisfying results. Especially for highly reverberant rooms, BEMA leads to more audible artifacts than improvements. Nevertheless, it was investigated as part of the listening test.

2.3.5.5 Global Equalization Filter - GEQ

Except for the BEMA algorithm, all approaches treat either the truncation or the spatial aliasing error. To compensate for the spectral artifacts induced by both spatial aliasing and SH order truncation, a global undersampling compensation filter (GEQ) was derived. For this, the SHF and the generative first-order low-pass AEQ (Sec. 2.3.4.1) are combined. Figure 2.18 depicts AEQ and SHF (left), as well as the resulting GEQ (right). The high-shelf shape of the GEQ makes clear that the low-pass characteristic of the truncation error has more impact than the high-pass characteristic of spatial aliasing. An important note at this point: the SHFs depend only on the SH rendering order N, whereas the design of the AEQ depends on the spatial aliasing frequency (approximated by Eq. 2.21) and thus on the sampling scheme order Nsg. Hence, two binaural renderings, one up to SH order 3 and another up to order 5, require the same AEQ, but different SHFs and thus different shapes of the global equalization filter.
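A sketch of how such a global filter could be assembled is given below. Equation 2.21 is not repeated here; the helper function assumes the common approximation f_A = Nsg·c/(2πr) for the spatial alias frequency, and the SHF and AEQ magnitude responses are taken as given. The FIR design mirrors the firwin-based approach described in Section 3.2, here via scipy.signal.firwin2.

import numpy as np
from scipy.signal import firwin2

def alias_frequency(N_sg, r=0.0875, c=343.0):
    """Spatial alias frequency for a grid of order N_sg and array radius
    r in meters, using the common approximation f_A = N_sg * c / (2*pi*r)
    (assumed here to correspond to Eq. 2.21)."""
    return N_sg * c / (2 * np.pi * r)

def design_geq(num_taps, freqs_norm, shf_mag, aeq_mag):
    """Linear-phase FIR of the global equalization filter (GEQ).

    freqs_norm must run from 0 to 1 (1 = Nyquist); shf_mag and aeq_mag
    are the SHF and AEQ magnitude responses sampled at these frequencies
    (both assumed to be given). Cascading two filters corresponds to
    multiplying their magnitude responses.
    """
    return firwin2(num_taps, freqs_norm, shf_mag * aeq_mag)

# Example: alias frequency of a 5th order grid on the r = 8.75 cm array
print(alias_frequency(5))  # approx. 3.1 kHz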
Figure 2.19: SHF: Frequency responses of 0° (left) and 90° (right) ear signals at the left ear resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (blue curves). The upper diagrams show the influence of the truncation error for two different directions. The bottom diagrams depict the improvements of the Spherical Head Filter equalization. It can be seen that the global equalization achieves different improvements depending on the direction.

2.3.5.6 Overall Comparison

The previous discussion of the improvement algorithms focused either on truncation or spatial aliasing reduction. Therefore, certain rendering configurations were used to illustrate the influence of the single effects. However, in real life, rendering with an SH order of 5 when 1202 microphone signals are available is not meaningful. Moreover, the listening test will investigate the overall perceived improvement of array renderings. To do this, they will be compared to dummy head measurements. Figure 2.23 and Figure 2.24 compare 3rd, 5th and 7th order array renderings based on array measurements in the SBS to corresponding dummy head BRTFs. Figure 2.23 compares the BRTFs for 0° orientation at the left ear, Figure 2.24 the 90° BRTFs (similar comparisons for the LBS, which was also tested in the listening experiment, can be found in the appendix, see Figure A.3 and Figure A.4). The black curve depicts the dummy head BRTF, the dashed grey curve the raw untreated array rendering. Considering the 3rd order rendering at 0°, it can be found that the Tapering and SHF lead to too high a gain of the higher frequency components. MagLS and GEQ match the dummy head frequency response best. For SH orders 5 and 7, it cannot be clearly determined which algorithm generates the best match of the frequency response. In addition to the frequency responses, the spatial alias frequency according to Equation 2.21 is marked. Consistent with the derivations in Section 2.3.1, this frequency is shifted to higher frequencies for higher-order renderings. Furthermore, it can be seen that the deviation between the raw rendering and the dummy head reference decreases for higher-order renderings.

Figure 2.20: Tapering: Frequency responses of 0° (left) and 90° (right) ear signals at the left ear resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (blue curves). The upper diagrams show the influence of the truncation error for two different directions. The bottom diagrams depict the improvements of the Tapering equalization. Similar to the SHF, Tapering achieves different improvements with respect to the sound source direction.

Figure 2.21: BEMA: 0° BRTFs rendered at SH order N = 5 based on 1202 (black) and 50 (colored) sampling node grids in the CR7. The upper diagrams illustrate the influence of spatial aliasing, the bottom diagrams the BEMA improvement. It can be seen that BEMA has a different impact on the ipsilateral right (red) and contralateral left (blue) ear signals.

Figure 2.22: BEMA: 90° BRTFs rendered at SH order N = 5 based on 1202 (black) and 50 (colored) sampling node grids in the CR7. The upper diagrams illustrate the influence of spatial aliasing, the bottom diagrams the BEMA improvement. Together with Figure 2.21, this shows that BEMA achieves different improvements with respect to the direction.

Figure 2.23: Comparison of 0° BRTFs at the left ear resulting from 3rd, 5th and 7th order binaural array renderings and dummy head (DH) measurements in the SBS. The array renderings were processed with each of the proposed improvement algorithms.
Figure 2.24: Comparison of 90° BRTFs at the contralateral left ear resulting from 3rd, 5th and 7th order binaural array renderings and dummy head (DH) measurements in the SBS. The array renderings were processed with each of the proposed improvement algorithms.

3 Implementation

3.1 Rigid Sphere Impulse Response Measurements

Although the main purpose of microphone array renderings is dynamic real-time auralization, this work focuses on static, impulse response based renderings. The entire development, as well as the preparation of the listening experiment stimuli, is based on renderings of measured array impulse responses. Stade et al. (2012) provide a comprehensive measurement database that is used by a large number of researchers studying spherical microphone array processing.

Stade et al. (2012) captured dummy head room impulse responses and spherical microphone array impulse responses at the same position in four different rooms at the WDR broadcasting studios in Cologne. The dummy head captures were performed using a Neumann KU100 on a purely horizontal grid with an angular resolution of 1°. The array measurements were done using the VariSphere Array (VSA) developed by Bernschütz et al. (2010). It is a fully automated robotic measurement system that allows impulse responses to be captured sequentially for simulating a rigid sphere microphone array. Arbitrary numbers of microphone positions on arbitrary sampling schemes are supported. Stade et al. (2012) captured array responses on 50, 86, 110, and 1202 sampling point Lebedev grids in the Large and Small Broadcasting Studios (LBS, SBS), as well as in the Control Rooms 1 and 7 (CR1, CR7). For all sampling schemes, the microphone capsules were mounted on a rigid sphere body extension with a radius of 8.75 cm. For a more detailed description of the measurement setup, such as the specifications of the microphone capsules, the loudspeaker used for excitation, or the exact measurement positions, the reader is referred to Stade et al. (2012). Furthermore, Bernschütz (2013) provides an HRTF dataset measured on a 2702 sampling node Lebedev grid, which is used for all array renderings in this work. This measurement was done using the KU100 dummy head as well.

3.2 Signal Processing

A complete signal processing chain for the binaural auralization of array impulse responses was first implemented by Bernschütz et al. (2011). For this, the MATLAB Sound Field Analysis toolbox (SOFiA) was developed. It allows binaural signals to be rendered for any direction based on array impulse responses measured on arbitrary sampling schemes. Besides a number of additional valuable tools, e.g. for simulating or plotting sound field data, all necessary signal processing steps introduced in Section 2.1 were implemented. Hohnerlein and Ahrens (2017) ported the SOFiA toolbox to Python. All signal processing presented in this work is based on this Python port. A basic overview of the signal flow of the binaural reproduction of array impulse responses is shown in Figure 3.1.
The microphone impulse responses measured on an $M_{sg}^{\mathrm{array}}$ sampling grid are transformed to the frequency domain first, and then, using the DSFT (Eq. 2.19), to the SH domain with an SH order of N. According to the array setup, basically involving the choice of an open or rigid sphere configuration as well as the array radius, the N + 1 radial filters are calculated in the frequency domain. Likewise, the $M_{sg}^{\mathrm{HRIR}}$ sampling node HRIR dataset is transformed to the SH domain with the same SH order N. The reproduction of the binaural signals is done for certain directions. In the case of a dynamic auralization using the SSR, 360 BRTFs on a horizontal grid, denoted as the decomposition grid, are calculated. This calculation is realized using Equation 2.18 by weighting the sound field SH coefficients with the HRTF SH coefficients in the SH domain. A final IFFT yields the binaural impulse responses used for convolution in the SSR.

Figure 3.1: Basic signal flow of binaural renderings of microphone array signals. Array impulse responses measured on an $M_{sg}^{\mathrm{array}}$ sampling point grid, as well as the $M_{sg}^{\mathrm{HRIR}}$ HRIR dataset, are transformed to the frequency domain first and then to the SH domain by means of the DSFT up to an SH order N. With respect to the array configuration, N + 1 radial filters are calculated in the frequency domain. Using Equation 2.18, the binaural signals are calculated for the $M_{sg}^{\mathrm{DG}}$ directions defined by the decomposition grid. A final IFFT yields the time domain BRIRs.

The existing sound field analysis Python toolbox was extended during the work of this master's thesis. BEMA was directly ported from the SOFiA toolbox. The tapering algorithm was implemented according to Hold et al. (2019) (https://github.com/chris-hld/spaudiopy/tree/ICASSP2019). To realize the MagLS algorithm, the authors (Zaunschirm et al., 2018) provided their MATLAB code. This made it possible to prepare the KU100 HRIR set for renderings at arbitrary SH orders. The Spherical Head Filters, as well as the overall spectral equalization filters, were realized as linear-phase FIR filters generated with the firwin function available in the Python SciPy library. Furthermore, the matrix regularization approach was implemented; however, it turned out that at this stage of development it yields no perceptually satisfying results.
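The chain of Figure 3.1 can be summarized for one ear in a few NumPy lines. The following is a schematic sketch, not the toolbox implementation itself: the SH matrices, radial filter spectra, and HRTF SH coefficients are assumed to be precomputed, and the conjugation and normalization conventions of Equations 2.18 and 2.19 are glossed over.

import numpy as np

def render_brir_one_ear(array_irs, Y_array_pinv, radial_inv, H_nm_ear,
                        n_fft=4096):
    """Schematic version of the signal flow in Figure 3.1 for one ear.

    array_irs    : (M_array, taps) measured array impulse responses
    Y_array_pinv : ((N+1)**2, M_array) pseudo-inverse SH matrix of the
                   array grid, implementing the DSFT (Eq. 2.19)
    radial_inv   : (n_fft//2 + 1, (N+1)**2) inverse radial filter
                   spectra, expanded from the N + 1 orders to all modes
    H_nm_ear     : (n_fft//2 + 1, (N+1)**2) HRTF SH coefficients of one
                   ear for the desired look direction
    """
    P = np.fft.rfft(array_irs, n_fft, axis=-1)   # time -> frequency domain
    P_nm = Y_array_pinv @ P                      # frequency -> SH domain
    ear = np.sum(radial_inv.T * P_nm * H_nm_ear.T, axis=0)  # Eq. 2.18 weighting
    return np.fft.irfft(ear, n_fft)              # frequency -> time (BRIR)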
3.3 Auralization

The auralization of the array data based BRIRs, as well as of the measured dummy head BRIRs, is done with the SoundScape Renderer (SSR). It allows the real-time convolution of arbitrary HRIRs or BRIRs, defined on a purely horizontal grid with 1° resolution, with arbitrary input audio signals. The dry audio signal to be auralized can be played back with any audio player and routed into the SSR via JACK audio (http://jackaudio.org/). To take the head orientation of the listener into account, the SSR supports a variety of head-tracking systems. This enables the presentation of a dynamic binaural synthesis where the entire sound scene remains static while the head is moved. The SSR offers the possibility to load multiple BRIR datasets into the working memory and switch between them dynamically during the playback of the dry audio signal. Furthermore, the SSR can handle commands sent via TCP, which allows the SSR to be controlled remotely by arbitrary applications. The listening experiment introduced in the next section is based on a Python application with a QT5 graphical user interface that controls the SSR via a TCP connection.

Figure 3.2: The graphical user interface of the SoundScape Renderer in the Binaural Room Synthesis (BRS) mode, showing an example scene consisting of four sound sources. Every sound source represents one BRIR set loaded into the working memory. Every BRIR set can be unmuted dynamically, which enables the real-time rendering of a dynamic binaural synthesis.
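In its simplest form, such a remote control is only a few lines of Python. The sketch below is illustrative only: port 4711 and the zero-terminated XML message format are assumed defaults, and the exact command string should be checked against the SSR documentation for the version in use.

import socket

def ssr_send(message, host="localhost", port=4711):
    """Send one command to a running SSR instance over TCP.

    The port number and the zero-terminated XML message format are
    assumptions; verify them against the SSR documentation.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(message.encode("utf-8") + b"\0")

# Illustrative command: unmute the source that holds the BRIR set of
# the next stimulus (source ids are assigned by the loaded scene).
ssr_send('<request><source id="2" mute="false"/></request>')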
4 Perceptual Evaluation

4.1 Introduction

Ideally, spherical microphone array recordings and subsequent binaural reproductions should realize auralizations that are perceptually similar to those achieved with dummy head measurements in the same scenario. Bernschütz (2016) and Ahrens and Andersson (2019) conducted listening experiments comparing headphone-based binaural syntheses using dummy head BRIR measurements to headphone-based auralizations based on microphone array impulse response measurements with different numbers of microphones. Both the dummy head and the array impulse response measurements were performed under the same conditions. Bernschütz (2016) found that for SH rendering orders above 8, most of the perceptual differences decrease significantly. Ahrens and Andersson (2019) confirmed this finding in their follow-up study. Rendering orders up to 8 yield audible differences compared to the dummy head auralizations. In particular, remarkable differences arise for lateral source directions. The latter can be explained by the truncation error. Further, the investigators showed that spatial aliasing has a significant influence on the auralization quality.

The present investigation can be seen as a continuation of these studies: Section 2.3.4 introduced a number of truncation error and spatial aliasing error mitigation approaches that were developed over the last few years. Section 2.3.5 discussed the most promising algorithms, namely MagLS, Tapering, SHF and GEQ, and illustrated their influence on the binaural renderings. This work presents a comparison of the perceivable improvements that can be achieved with this selection of methods by means of a listening experiment. The influences of the spatial aliasing and truncation impairments will not be treated separately; instead, the overall quality improvement of array renderings compared to corresponding dummy head measurements will be tested.

4.2 Experimental Design

4.2.1 Stimuli

To compare the algorithms in a listening experiment, different BRIR renderings were presented to the listeners using a dynamic binaural synthesis. Each BRIR was rendered based on array impulse responses measured in the WDR broadcasting studios, as described in Section 3. Ahrens and Andersson (2019) showed that audible differences arise up to an SH rendering order of 8. As this study aims at investigating undersampled array constellations, renderings of orders 3, 5 and 7 are compared. The rendering orders 3 and 5 are based on the 50 sampling node Lebedev grid measurement, the 7th order rendering on the 86 sampling node Lebedev grid. To study the dependency on the environment, the binaural renderings were done for array impulse responses in the LBS (approx. RT60 = 1.8 s at 1 kHz) and the SBS (approx. RT60 = 1.1 s at 1 kHz). Each of these rendering configurations was processed with each improvement algorithm. A dynamic binaural synthesis was presented in which the participants were free to rotate their heads and listen in all directions. To simulate an initial lateral head orientation, which allows the direction-dependent performance of the algorithms to be studied, each BRIR was rotated clockwise by 90°. This is of high interest in particular for the algorithms reducing the truncation error. To ensure consistency with the study of Ahrens and Andersson (2019), the radial filters were soft-limited to 0 dB for all renderings (the soft-limited radial filters are depicted in the appendix, Figure A.1).

The use of linear-phase filters results in BRIRs that are time-delayed in relation to the BRIRs measured with the dummy head. To provide continuous playback while switching the stimuli during the listening experiments, all delays were compensated. This was achieved by shifting the BRIRs according to their peak, measured with a simple onset detection provided by the AKtools MATLAB toolbox (Brinkmann and Weinzierl, 2017; www.ak.tu-berlin.de/aktools).

As the test signal, acoustically dry drum recordings were presented. Drum recordings have a wide spectrum and typically contain strong transients, which makes drums a rather critical test signal. Furthermore, Ahrens and Andersson (2019) used the same test signal, which supports the consistency of the experimental results.
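The delay compensation described above can be sketched as follows; the threshold-based onset detector is a simple stand-in for the AKtools onset detection actually used.

import numpy as np

def align_brirs(brirs, threshold_db=-20.0):
    """Time-align a set of BRIRs via a simple onset detection: the onset
    is the first sample within `threshold_db` of the absolute peak.

    brirs: list of (2, taps) arrays; all BRIRs are shifted so that their
    earliest channel onset lands at the same sample index. np.roll wraps
    around, which is unproblematic as long as the shifted-out region is
    silence.
    """
    def onset(ir):
        env = np.abs(ir)
        thresh = env.max() * 10 ** (threshold_db / 20)
        return int(np.argmax(env >= thresh))   # index of first True

    onsets = [min(onset(ir[0]), onset(ir[1])) for ir in brirs]
    target = min(onsets)
    return [np.roll(ir, target - o, axis=-1) for ir, o in zip(brirs, onsets)]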
4.2.2 Experimental Paradigm

To compare the different algorithms, a listening test based on the multiple stimulus with hidden reference and anchor (MUSHRA) test paradigm was used. The MUSHRA methodology is a multi-stimulus test design proposed by the International Telecommunication Union (ITU), see ITU-R BS.1534-3 (2015). It was developed for evaluating the perceptual quality of audio systems and codecs. It allows comparing up to 12 different systems, or different levels of audio systems, presented at the same time. For this, the reference stimulus, labeled as such, as well as multiple further stimuli are presented per trial. One of the stimuli is the hidden reference, another one a hidden anchor stimulus. All stimuli are compared to the reference stimulus in terms of perceived quality. The anchor is a stimulus that should be perceived as the worst by the listener and is used for screening the ratings. Each rating is done with a continuously movable slider ranging from 0 to 100, which allows very small differences to be registered. The sliders are divided into five evenly spaced segments labeled 'Excellent' (100-80), 'Good' (80-60), 'Fair' (60-40), 'Poor' (40-20), and 'Bad' (20-0).

In contrast to the MUSHRA design introduced by the ITU, the aim was to evaluate the overall perceived difference to the reference stimulus instead of the intermediate quality. Therefore, a slightly adapted test design was used in which the slider labels were replaced with 'No difference' (100), 'Small difference' (75), 'Moderate difference' (50), 'Significant difference' (25), and 'Huge difference' (0). Furthermore, the placements of the labels were adjusted such that 'No' and 'Huge' difference represent the extreme ratings 100 and 0. Figure 4.1 depicts the PyQT based experimental graphical user interface developed for the listening test. All stimuli described above had to be compared to the corresponding dummy head auralization in terms of overall perceived differences.

For the anchor stimulus, a 3 kHz low-pass filtered diotic signal based on the associated dummy head BRIR was generated. Hence, for each trial, BRIRs processed with the algorithms MagLS, BEMA, Tapering, SHF, and GEQ, as well as the untreated BRIR rendering (raw), the hidden dummy head reference, and the anchor stimulus were presented. These conditions were rendered for two rooms, three SH orders, and two directions. Table 4.1 summarizes these 12 trials with 8 conditions each.

Table 4.1: Overview of the stimuli tested in the final listening experiment. All of the conditions were presented per trial, whereas the reference stimulus (DH Ref), the anchor stimulus, and the raw rendering are listed as algorithms. Each of these sets of conditions was rendered for 3 SH orders, 2 directions, and 2 rooms; in total 12 trials with 8 conditions each.

Algorithms   SH orders (grid)   Directions   Rooms
BEMA         3 (50)             0°           LBS
MagLS        5 (50)             90°          SBS
SHF          7 (86)
Tapering
GEQ
Raw
DH Ref
Anchor

4.2.3 Setup

The experiment was conducted in a quiet, acoustically damped audio laboratory at Chalmers University. The participants were presented a dynamic binaural synthesis using AKG K702 headphones with a Lake People G109 headphone amplifier at a playback level of 66 dB(A). The head orientation was tracked with a Polhemus Patriot. The rendering of the binaural VAE was done with the SSR in BRS mode. A Python application with a QT5 graphical user interface, which controls the SSR via TCP commands, was developed for the listening experiment. The output signals of the SSR were routed to an Antelope Audio Orion 32-channel DA converter via a 48 kHz JACK audio server with 512 samples per buffer. To compensate for the binaural chain of the AKG headphones and the KU100 dummy head, a headphone equalization according to Stade et al. (2012) was performed. The entire rendering and execution of the listening experiment was done on an iMac Pro 1.1.

Figure 4.1: PyQT5 implementation of the graphical user interface for the listening test. 8 stimuli, one of them the hidden reference and one the hidden anchor, are compared to the reference stimulus. Each slider allows rating the perceived difference to the dummy head reference.

4.2.4 Procedure

12 participants, 3 of them female, aged between 22 and 50, took part in the experiment. Most of them were master's students or staff at the Division of Applied Acoustics of Chalmers University. 9 participants had experience with binaural synthesis and 8 reported having participated in a listening experiment before. The subjects sat in front of a computer screen with a keyboard and mouse. Their task was to rate the similarity of each stimulus compared to the reference, according to the stimuli preparation discussed above. Each stimulus, as well as the reference, could be listened to as often as desired. The participants were allowed and strongly encouraged to move their heads during the presentation of the stimuli. At the beginning of each experiment, the subjects had to rate four training stimuli covering the entire similarity range of the stimuli presented in the following test. It took the participants about 30 minutes to rate all 12 trials.

4.3 Results

According to the recommendation of ITU-R BS.1534-3 (2015), all anchor and reference ratings were post-screened before applying statistical analysis. All anchor ratings were below 30 and most of the reference ratings were higher than 80.
Only two reference ratings of 50 and 49 were conspicuous, which, however, is a relatively small portion of the 96 ratings per participant in total. There were no further noticeable abnormalities in these ratings; therefore, none of the participants was excluded from the statistics.

A first overview of the results is presented as boxplots in Figure 4.2, illustrating the ratings for the SBS and LBS at 0° and 90° separately. Each of the four figures shows a boxplot for each algorithm and the SH orders 3 (red), 5 (black), and 7 (blue). The boxes display the 25th and 75th percentiles and the corresponding median value as a black line. Furthermore, outliers are indicated as grey circles, and the minimum and maximum ratings not identified as outliers by the upper and lower whiskers. The boxplots refer to median values, which are presented in Table A.1 in the appendix. Additionally, the table shows a number of further descriptive statistical values.

The boxplots confirm that for most conditions the anchor and reference stimuli were identified as such by the listeners: the anchor median value over all conditions is 1 and the reference rating median value is 100. Furthermore, the boxplots for the untreated renderings illustrate that for all four room-direction combinations, higher-order renderings yielded higher similarity ratings. The 5th order rendering consequently achieved higher ratings than the 3rd order rendering; however, the 7th order renderings in the SBS achieved lower ratings than the rendering at SH order 5. Apart from this, it is remarkable that all improvement algorithms except BEMA achieved higher ratings than the raw renderings. Hence, all other algorithms achieved perceptual improvements.

For a more detailed analysis, repeated measures ANOVAs were performed. To test the data requirements for the ANOVA, a Lilliefors test for normality was applied. It failed to reject the null hypothesis in 5 of 72 conditions at a significance level of .05. However, parametric tests such as ANOVAs are generally robust to violations of the normal distribution assumption. For the further analysis, Greenhouse-Geisser corrected p-values are considered (the associated ε values for the correction of the degrees of freedom of the F-distribution are reported as well).

A four-way repeated measures ANOVA with the within-subject factors algorithm (BEMA, MagLS, Tapering, SHF, GEQ and raw), order N (3, 5, 7), room (SBS, LBS), and direction (0°, 90°) was performed. Even though the denotation is rather inappropriate, the raw rendering is one factor step of the within-subject factor algorithm. The ANOVA results are presented in Table 4.2. In addition, a number of nested repeated measures ANOVAs were performed, whose results are presented in the appendix. For each of the six algorithms, a three-way ANOVA with the factors order (3, 5, 7), room (SBS, LBS), and direction (0°, 90°) was applied, as well as a four-way ANOVA with the subset of MagLS, Tapering, SHF, and GEQ for the factor algorithm and the factors order (3, 5, 7), room (SBS, LBS), and direction (0°, 90°).
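Such a four-way repeated measures ANOVA can be reproduced, for example, with statsmodels. The sketch below assumes a hypothetical long-format table ratings.csv with one row per rating; note that statsmodels' AnovaRM reports uncorrected p-values only, so the Greenhouse-Geisser correction used in Table 4.2 has to be applied separately.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# `ratings.csv` is a hypothetical long-format export of the results:
# one row per rating, with the reference and anchor conditions excluded,
# leaving exactly one rating per subject and factor combination.
df = pd.read_csv("ratings.csv")

res = AnovaRM(
    df,
    depvar="rating",
    subject="subject",
    within=["algorithm", "order", "direction", "room"],
).fit()
print(res)  # uncorrected p-values; apply the GG correction separately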
Since ANOVAs refer to mean values, Figure 4.3 was generated to support the ANOVA results. It depicts mean plots for each algorithm and the orders 3 (red), 5 (black), and 7 (blue) separately. Each mean value was calculated by averaging the ratings for both rooms, both directions, and all participants. Furthermore, 95% within-subject confidence intervals as proposed by Loftus (2002) and Jarmasz and Hollands (2009), based on the algorithm main effect, are shown.

Figure 4.2: Boxplots illustrating the ratings for each room and direction separately. Each figure shows the boxplots for each algorithm and the SH orders 3 (red), 5 (black) and 7 (blue). Each box indicates the 25th and 75th percentiles and the median value as a black line. The outliers are marked as grey circles and the minimum and maximum ratings not identified as outliers by the whiskers.

The figure confirms the observations taken from the boxplots: for the raw rendering, higher SH orders achieved better ratings than lower-order renderings, and, except for BEMA, all algorithms led to an improvement compared to the raw auralization. More clearly than the boxplots, the mean plots suggest that the perceptual quality of the array rendering does not scale linearly with the SH order. Considering just the raw condition shows that the mean ratings for the 5th and 7th order renderings are closer to each other than the 5th and 3rd order ratings.

Similar to the boxplots, the mean plots indicate that the 7th order renderings involving the Tapering algorithm were rated worse than the 5th order Tapering renderings. Additionally, it can be seen that the average ratings of the algorithms SHF, GEQ, Tapering and MagLS lie in the same range. This leads to the assumption that these four algorithms achieve similar auralization improvements. Their average ratings for all SH rendering orders lie between 69.65 and 84.46 (GEQ: 74.44 (N = 3), 80.52 (N = 5), 81.02 (N = 7); SHF: 69.65 (N = 3), 73.48 (N = 5), 78.27 (N = 7); MagLS: 72.4 (N = 3), 78.67 (N = 5), 79.29 (N = 7); Tapering: 73.75 (N = 3), 84.46 (N = 5), 77.73 (N = 7)).

Table 4.2: Results of the four-way repeated measures ANOVA with the within-subject factors algorithm (BEMA, MagLS, Tapering, SHF, GEQ, Raw), order (3, 5, 7), direction (0°, 90°), and room (SBS, LBS).

Effect                                    df    F        εGG    η²p    p        pGG
Algorithm                                 5     97.964   .572   .899   <.001*   <.001*
Order                                     2     27.681   .808   .716   <.001*   <.001*
Direction                                 1     1.014    1      .084   .335     .335
Room                                      1     .462     1      .040   .511     .511
Algorithm × Order                         10    4.989    .514   .312   <.001*   <.001*
Algorithm × Direction                     5     3.302    .703   .231   .011*    .024*
Order × Direction                         2     1.811    .876   .141   .187     .193
Algorithm × Room                          5     1.267    .683   .103   .291     .300
Order × Room                              2     .224     .677   .020   .801     .715
Direction × Room                          1     .016     1      .001   .902     .902
Algorithm × Order × Direction             10    2.329    .518   .175   .016     .052
Algorithm × Order × Room                  10    1.817    .481   .142   .066     .128
Algorithm × Direction × Room              5     1.544    .575   .123   .191     .223
Order × Direction × Room                  2     .372     .917   .033   .693     .676
Algorithm × Order × Direction × Room      10    1.589    .424   .126   .119     .190

εGG: Greenhouse-Geisser epsilons. pGG: Greenhouse-Geisser corrected p-values. Statistical significance at the 5% level is indicated by asterisks.

The following analysis refers to main effects and first-order interactions only, and neglects the three- and four-way interaction effects. For the four-way ANOVA involving all algorithms, it was found that the algorithm and order main effects, as well as the first-order interaction effects algorithm × order and algorithm × direction, were significant.
These significances will be examined successively in the following.

Algorithm and Order Dependency

The significance of the main effects order (F(2, 22) = 27.681, p < .001, η²p = .716, ε = .808) and algorithm (F(5, 55) = 97.964, p < .001, η²p = .899, ε = .572) matches the findings observed in Figure 4.2 and Figure 4.3. Higher SH orders lead to different, and mostly higher, similarity ratings. Hence, the auralization quality strongly depends on the rendering order and the algorithm. The first-order interaction effect algorithm × order (F(10, 110) = 4.989, p < .001, η²p = .312, ε = .514) shows that not only does the SH order influence the quality of the untreated rendering, but the algorithms also perform differently with respect to the SH order. The single-algorithm ANOVAs support this: in every ANOVA, the order was indicated as a significant effect.

Figure 4.3: Mean values for each algorithm and SH order separately. Every mean value was calculated by averaging all room and direction ratings for each participant. The 95% within-subject confidence intervals for the main effect algorithm were calculated according to Loftus (2002) and Jarmasz and Hollands (2009). Red colors indicate order 3 renderings, black colors order 5, and blue order 7 renderings. The highest mean values for orders 3 and 7 were achieved by the Global Equalization Filter (74.44, 81.02), for order 5 by the Tapering approach (84.46).

To validate the observation that the algorithms SHF, GEQ, Tapering and MagLS achieved similar improvements, the four-way repeated measures ANOVA was applied to just these algorithms. The factor algorithm was no longer indicated as significant (p = .124). Thus, in general, all algorithms except BEMA achieved similar perceptual improvements.

Directional Dependency

The four-way ma