Perceptual Evaluation of Mitigation Approaches of Errors due to Spatial Undersampling in Binaural Renderings of Spherical Microphone Array Data

Tim Lübeck

Master's Thesis ACEX30-19-101
Department of Architecture and Civil Engineering
Division of Applied Acoustics, Audio Technology Group
Chalmers University of Technology, SE-412 96 Gothenburg, Sweden, 2019
Telephone +46 31 772 1000

Home University: TH Köln - University of Applied Sciences, Faculty of Information, Media and Electrical Engineering, Institute of Communications Engineering

© Tim Lübeck, 2019.

Supervisor: Jens Ahrens, Chalmers University of Technology, Department of Architecture and Civil Engineering
Examiner: Jens Ahrens, Chalmers University of Technology, Department of Architecture and Civil Engineering; Christoph Pörschmann, TH Köln, Institute of Communications Engineering

Cover: Python simulation of a plane wave impinging on a spherical microphone array.
Typeset in LaTeX
Gothenburg, Sweden 2019

Abstract

High fidelity virtual acoustic environments (VAEs) have become of increasing interest in the context of immersive virtual and augmented reality applications and have developed into a reputable field of research. To present a real-life sound scene to a listener as a VAE, the spatial sound field has to be captured. Traditionally, spatial sound fields are measured sequentially by means of a dummy head. Spherical microphone array recordings are a promising alternative to time-consuming dummy head measurements. They enable dynamic real-time applications that involve recording and reproducing spatial sound scenes, for example live concerts or telephone conferences. However, the quality of the rendered VAE is significantly limited by the density of the microphone distribution provided by the array. A limited number of microphones leads to spatial undersampling of the surrounding sound field and creates audible artifacts in the VAE. This work investigates the errors due to spatial undersampling and state-of-the-art approaches to mitigate them. In a concluding listening experiment, the improvements that can be achieved with these approaches were perceptually evaluated. It was found that a few algorithms significantly improve the binaural auralization of spherical microphone array data.

Keywords: Spatial Aliasing, SH Order Truncation, Spherical Harmonics, Plane Wave Decomposition, Spherical Microphone Arrays

Acknowledgements

First of all, I would like to thank my supervisor Jens Ahrens, whose door was always open to me and my questions.
Through him, I have gained a lot of expertise for my future career. Many thanks go to my second supervisor Christoph Pörschmann. He has accompanied me throughout my entire studies, played a decisive role in my graduation, and opened up important opportunities for my personal and professional development. I would also like to thank the entire Division of Applied Acoustics at Chalmers University for giving me the opportunity to work on this project. The pleasant atmosphere made my time at Chalmers a great experience and a perfect completion of my university studies. At this point, I also want to thank all the people at the Division who participated in my final listening experiment. Further, I would like to thank my office colleague Hannes Helmholz for the enjoyable cooperation; hopefully, there will be more joint projects in the future. Last but not least, I would like to thank Johannes Arend, who taught me the fundamentals of audio processing and scientific writing and still has the patience to guide me through the statistical evaluations. As this work was done during an Erasmus semester abroad at Chalmers University, many thanks go to the Erasmus program for its financial support.

Tim Lübeck, Gothenburg, August 2019

Contents

1 Introduction
  1.1 Motivation
  1.2 Related Works
2 Theory
  2.1 Mathematical Derivation of Capturing and Reproducing Spatial Sound Fields
    2.1.1 The Acoustic Wave Equation and its Solutions
      2.1.1.1 Spherical Bessel Functions
      2.1.1.2 Solutions of the Angular Components - Spherical Harmonics
    2.1.2 The Spatial Fourier Transform
    2.1.3 The Plane Wave Composition
    2.1.4 Binaural Reproduction
  2.2 Spherical Array Processing
    2.2.1 Discretization - Sampling Scheme
    2.2.2 The Discrete Spatial Fourier Transform
  2.3 Technical Constraints
    2.3.1 Spatial Aliasing
    2.3.2 Truncation Error
    2.3.3 Consequences of Undersampled Microphone Array Data Renderings
    2.3.4 Reducing the Impairment of Spatial Undersampling
      2.3.4.1 Spectral Equalization
      2.3.4.2 Bandwidth Extension Algorithm for Microphone Arrays
      2.3.4.3 Magnitude Least Squares
      2.3.4.4 Tapering
      2.3.4.5 Matrix Regularization
      2.3.4.6 Spatial Anti-Aliasing Filters
    2.3.5 Analysis of the Resulting Binaural Signals
      2.3.5.1 MagLS
      2.3.5.2 Spherical Head Filter - SHF
      2.3.5.3 Tapering
      2.3.5.4 BEMA
      2.3.5.5 Global Equalization Filter - GEQ
      2.3.5.6 Overall Comparison
3 Implementation
  3.1 Rigid Sphere Impulse Response Measurements
  3.2 Signal Processing
  3.3 Auralization
4 Perceptual Evaluation
  4.1 Introduction
  4.2 Experimental Design
    4.2.1 Stimuli
    4.2.2 Experimental Paradigm
    4.2.3 Setup
    4.2.4 Procedure
  4.3 Results
  4.4 Discussion
5 Conclusion
  5.1 Summary
  5.2 Follow-Up Studies
Bibliography
A Appendix
  A.1 Radial Filter
  A.2 Comparison of Algorithm Improvements
  A.3 Results of the Listening Experiment
    A.3.1 Overall Descriptive Values
    A.3.2 Nested ANOVA Results

1 Introduction

1.1 Motivation

In recent decades, the rising number of virtual and augmented reality applications has created a demand for high fidelity virtual acoustic environments (VAEs). Presenting immersive spatial sound fields to a listener has become a reputable field of research. One common way of presenting a VAE is binaural synthesis. This headphone-based auralization stimulates the binaural spatial hearing of the human auditory system and ideally reproduces the same sound pressure at the ear drums that would be perceived in a real-life sound scene. Binaural renderings are based on appropriate head-related impulse responses (HRIRs) or binaural room impulse responses (BRIRs) that describe the transformation of sound reaching the human ear drums. This information is usually obtained with a dummy head with two microphones in the ear canals which capture the surrounding sound field. For presenting a full-spherical VAE, in particular when the head orientation of the listener is taken into account, time-consuming dummy head measurements with adequately high resolution are unavoidable.

For dynamic auralizations of spatial sound fields, for example broadcasts of live concerts or spatial acoustic telephone conferences, dummy head measurements are infeasible. An alternative to dummy head measurements is the recording of spatial sound fields by means of spherical microphone arrays. Microphone arrays capture the entire surrounding sound field at once, instead of measuring it sequentially with a dummy head. The reproduction of array recordings is not restricted to binaural auralization, but could also involve wave field synthesis, Ambisonic reproduction, or any other object-based VAE.
Furthermore, spherical microphone arrays can be used for a wide range of further applications and research areas, for example sound field analysis methods such as beamforming or acoustic holography.

This work focuses on binaural reproductions of spherical microphone array recordings. Bernschütz (2016) elaborated a closed mathematical description for recording a spatial sound field by means of spherical microphone arrays and reproducing it via dynamic binaural synthesis. Theoretically, this reproduction works without any loss of quality compared to dummy head auralizations. However, real-life implementations are subject to technical constraints, mainly caused by the limited number of microphones, which leads to spatial undersampling of the surrounding sound field. Two technical constraints, namely the spatial aliasing and the order truncation effect, lead to audible artifacts in the binaural rendering. They are the spatial analogs of the artifacts that occur in time-frequency sampling.

This work investigates the spatial aliasing and order truncation effects and their consequences for the perceived quality of spatial audio reproductions. Moreover, state-of-the-art algorithms to reduce the influence of these artifacts are studied, implemented, and compared in a concluding listening experiment.

1.2 Related Works

Benjamin Bernschütz developed a comprehensive signal processing chain covering the microphone array capture of spatial sound fields and the reproduction of binaural signals as part of his doctoral dissertation (Bernschütz, 2016). Most of the theoretical derivations in this work are based on this dissertation. The Sound Field Analysis MATLAB toolbox (SOFiA) provides the implementation of the signal processing described in his thesis, see Bernschütz et al. (2011). The Python port1 is an equivalent implementation realized by Christoph Hohnerlein, see Hohnerlein and Ahrens (2017). All implementations, visualizations, and the final generation of the listening experiment stimuli are done with tools from this Python Sound Field Analysis toolbox (SFA).

The auralization of the binaural signals is done with the SoundScape Renderer (SSR), developed by Geier et al. (2008). It allows rendering arbitrary HRIR or BRIR data sets defined on a purely horizontal sampling grid with 1° resolution.

To achieve the main goal of spherical microphone array renderings, namely real-time auralization, Helmholz et al. (2019) developed a Python application which dynamically captures and reproduces spatial sound fields. Although this work focuses on impulse response based renderings only, the findings are intended to improve this real-time auralization as much as possible.

1https://github.com/AppliedAcousticsChalmers/sound_field_analysis-py

2 Theory

2.1 Mathematical Derivation of Capturing and Reproducing Spatial Sound Fields

This chapter briefly summarizes the theory of capturing spatial sound fields with spherical microphone arrays and reproducing these captures as a virtual acoustic environment. The first section explains the mathematical fundamentals of sound field capturing. Section 2.2 describes the limitations that arise when applying the mathematical derivations to real-life spherical microphone arrays. Finally, the consequences of these technical constraints are discussed, and a number of state-of-the-art reduction approaches are introduced.
2.1.1 The Acoustic Wave Equation and its Solutions

In linear acoustics, the sound pressure p(\mathbf{x}, t) at any point \mathbf{x} in free space and time t satisfies the homogeneous acoustic wave equation

\nabla^2 p(\mathbf{x}, t) - \frac{1}{c^2} \frac{\partial^2}{\partial t^2} p(\mathbf{x}, t) = 0,  (2.1)

with c denoting the speed of sound in air and \nabla^2 the Laplacian in Cartesian coordinates. This holds for every homogeneous fluid with no viscosity, see e.g. Kinsler et al. (1999) or Williams (1999, Eq. 2.1, p. 15). For a sound field consisting of a single frequency, the pressure p(\mathbf{x}, t) can be expressed as

p(\mathbf{x}, t) = p(\mathbf{x}) \, e^{j\omega t},  (2.2)

with the angular frequency \omega = 2\pi f and the temporal frequency f. Using this description, the time-frequency transform of the wave equation yields the Helmholtz equation

\nabla^2 p(\mathbf{x}, \omega) + \left(\frac{\omega}{c}\right)^2 p(\mathbf{x}, \omega) = 0,  (2.3)

which delivers stationary solutions in the frequency domain. \omega/c is commonly denoted as the wave number k.

As spatial sound fields in this work are measured by means of spherical microphone arrays, it is reasonable to use spherical coordinates to describe a point \mathbf{r} = (r, \phi, \theta) in three-dimensional space, where \phi denotes the azimuth angle ranging from 0 to 2\pi and \theta the colatitude ranging from 0 to \pi. Expressing the Helmholtz equation in spherical coordinates and assuming a separable solution p(r, \phi, \theta) = R(r) \cdot \Phi(\phi) \cdot \Theta(\theta), the Helmholtz equation can be decomposed into three separate equations

\frac{d}{dr}\left(r^2 \frac{dR}{dr}\right) + \left[\left(\frac{\omega}{c}\right)^2 r^2 - n(n+1)\right] R = 0  (2.4)

\frac{1}{\sin\theta} \frac{d}{d\theta}\left(\sin\theta \frac{d\Theta}{d\theta}\right) + \left[n(n+1) - \frac{m^2}{\sin^2\theta}\right] \Theta = 0  (2.5)

\frac{d^2\Phi}{d\phi^2} + m^2 \Phi = 0.  (2.6)

A detailed derivation can be found in Rafaely (2015, pp. 31-34). These equations depend on r, \phi, or \theta only, and their solutions are well known. Equation 2.4 describes the radial behavior and can be transformed into the spherical Bessel equation, which is solved by the spherical Bessel functions. Equation 2.5 and Equation 2.6 describe the angular components with respect to \phi and \theta and can be solved by the spherical harmonics.

2.1.1.1 Spherical Bessel Functions

With respect to the present problem, the radial behavior of the wave equation can be described by n-th order solutions. There are spherical Bessel functions of the first kind j_n, used for interior problems, where all sources surround the sensor surface, and spherical Bessel functions of the second kind, denoted as y_n. Furthermore, there are spherical Hankel functions of the first kind, h_n^{(1)}(z) = j_n(z) + i y_n(z), and of the second kind, h_n^{(2)}(z) = j_n(z) - i y_n(z), used for exterior problems, where the sensors completely enclose the sources to be captured. Hankel functions of the second kind are also denoted as Bessel functions of the third kind.

2.1.1.2 Solutions of the Angular Components - Spherical Harmonics

Equation 2.5 is the associated Legendre equation, which is solved by the associated Legendre functions P_n^m(\cos\theta) of the first kind with order n and mode m. Equation 2.6 is an ordinary second-order differential equation which depends on a specific integer constant m, resulting from the periodicity of \Phi as a function over the unit circle. Both angular components of the Helmholtz equation, \Theta(\theta) and \Phi(\phi), are usually described by the spherical harmonics (SH)

Y_n^m(\theta, \phi) = \sqrt{\frac{2n+1}{4\pi} \frac{(n-m)!}{(n+m)!}} \, P_n^m(\cos\theta) \, e^{jm\phi}.  (2.7)

There exist several ways of defining the spherical harmonics, see e.g. Williams (1999, Eq. 6.20, p. 186) or Jackson (1962, Eq. 3.53, p. 108), which, however, has no effect on the fundamental theory.
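Numerically, the SH basis functions of Eq. (2.7) are available in common scientific libraries. The following minimal sketch, assuming NumPy and SciPy, evaluates all SHs up to a maximum order N at given directions and stacks them in one common channel ordering (ACN, channel index n² + n + m). Note that scipy.special.sph_harm takes the azimuth before the colatitude and includes the Condon-Shortley phase, so results may differ from other conventions by a factor of (-1)^m.

```python
import numpy as np
from scipy.special import sph_harm

def sh_matrix(N, azi, col):
    """Evaluate all SH basis functions Y_n^m of Eq. (2.7) up to order N.

    azi, col: (M,) azimuth and colatitude angles in radians.
    Returns an (M, (N+1)**2) complex matrix, one column per (n, m) channel."""
    azi = np.atleast_1d(np.asarray(azi, dtype=float))
    col = np.atleast_1d(np.asarray(col, dtype=float))
    Y = np.zeros((azi.size, (N + 1) ** 2), dtype=complex)
    for n in range(N + 1):
        for m in range(-n, n + 1):
            # ACN channel index of degree n and mode m
            Y[:, n * n + n + m] = sph_harm(m, n, azi, col)
    return Y
```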
One important property of the spherical harmonics is that they form an orthonormal basis on the sphere,

\langle Y_n^m, Y_{n'}^{m'} \rangle = \int_0^{2\pi} \int_0^{\pi} Y_n^m(\theta, \phi) \, Y_{n'}^{m'}(\theta, \phi)^* \sin\theta \, d\theta \, d\phi = \delta_{nn'} \delta_{mm'},  (2.8)

where (\cdot)^* denotes the complex conjugate and \delta_{nn'}, \delta_{mm'} are Kronecker deltas, which equal one for n = n' and m = m' and zero otherwise.

2.1.2 The Spatial Fourier Transform

Spherical harmonics are a closed set of solutions for the angular component of the Laplacian and thus of the Helmholtz equation. Furthermore, they form a complete set of orthogonal basis functions on the sphere. Therefore, arbitrary functions fulfilling the wave equation can be expanded over a sphere by means of the spherical harmonics. The time-frequency Fourier transform expands functions as a linear combination of the orthogonal sine and cosine functions over a circle. Similarly, the spatial or spherical Fourier transform (SFT) expands spatial functions as a linear combination of the orthogonal SHs over a sphere. The spatial Fourier transform is often denoted as spherical harmonic expansion, and the domain described by the spatial Fourier expansion coefficients (SH coefficients) as the spherical wave spectrum or SH domain. For an arbitrary function G(r, \phi, \theta, \omega), the SH coefficients \mathring{G}_{nm}(r, \omega) can be calculated with

\mathring{G}_{nm}(r, \omega) = \int_0^{2\pi} \int_0^{\pi} G(r, \phi, \theta, \omega) \, Y_n^m(\theta, \phi)^* \sin\theta \, d\theta \, d\phi.  (2.9)

Throughout this work, (\mathring{\cdot}) denotes SH coefficients in spherical coordinates. The inverse spatial Fourier transform (ISFT) is given by

G(r, \phi, \theta, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \mathring{G}_{nm}(r, \omega) \, Y_n^m(\theta, \phi).  (2.10)

2.1.3 The Plane Wave Composition

Theoretically, a plane wave arises from an infinitely distant point source, such that a straight plane wavefront can be assumed at the point of interest. According to Rafaely (2003), every sound field can be described as a superposition of multiple plane waves. For the further processing, it becomes useful to describe sound fields as such a plane wave continuum and to find the so-called plane wave (PW) components D(\phi_d, \theta_d, \omega). These PW components or PW coefficients describe which portion of a unit-amplitude plane wave impinging on a sphere from (\phi_d, \theta_d) would generate the same sound pressure as the surrounding sound field at that point. According to Williams (1999, p. 259), the SH expansion with respect to the position (r_0, \phi, \theta) of a unit-amplitude plane wave impinging on an open sphere from (\phi_d, \theta_d) can be expressed as

p(r_0, \phi, \theta, \phi_d, \theta_d, \omega) = 4\pi \sum_{n=0}^{\infty} \sum_{m=-n}^{n} i^n j_n\!\left(\frac{\omega}{c} r_0\right) Y_n^m(\theta, \phi) \, Y_n^m(\theta_d, \phi_d)^*,  (2.11)

with the imaginary unit i and the spherical Bessel function j_n. The sound field s(r_0, \phi, \theta, \omega) can now be expressed as the summation of an infinite number of weighted plane waves impinging on a sensor surface S_0 with radius r_0 from all possible directions (\phi_d, \theta_d),

s(r_0, \phi, \theta, \omega) = \int_0^{2\pi} \int_0^{\pi} D(\phi_d, \theta_d, \omega) \, p(r_0, \phi, \theta, \phi_d, \theta_d, \omega) \sin\theta_d \, d\theta_d \, d\phi_d,  (2.12)

where D(\phi_d, \theta_d, \omega) are the desired frequency-dependent weighting coefficients of the unit-amplitude plane waves p(r_0, \phi, \theta, \phi_d, \theta_d, \omega). Substituting Equation 2.11 into Equation 2.12 and rearranging in favor of D(\phi_d, \theta_d, \omega) yields

D(\phi_d, \theta_d, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} \frac{1}{4\pi i^n j_n\!\left(\frac{\omega}{c} r_0\right)} \mathring{S}_{nm}(r_0, \omega) \, Y_n^m(\theta_d, \phi_d),  (2.13)

where \mathring{S}_{nm}(r_0, \omega) denotes the SH coefficients of the sound field s(r_0, \phi, \theta, \omega). This equation is known as the conventional plane wave decomposition (PWD), see e.g. Bernschütz (2016, p. 63) or Rafaely (2003, pp. 42-45).
Considering Equation 2.11 and keeping the property of orthogonality (Eq. 2.8) in mind, the PWD can be interpreted as a spatial Dirac impulse of varying direction (\phi_d, \theta_d) that samples the spatial sound field. Equation 2.13 refers to a sound field at an open sphere. The term

d_n = \frac{1}{4\pi i^n j_n\!\left(\frac{\omega}{c} r_0\right)}  (2.14)

is denoted as the radial filter d_n and contains the spherical Bessel functions j_n discussed in Section 2.1.1.1. These radial filters basically remove the dependency on the expansion surface and thus depend on the array setup. Expressing the PWD more generally leads to

D(\phi_d, \theta_d, \omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(r_0, \omega) \, Y_n^m(\theta_d, \phi_d).  (2.15)

When calculating the PW components at a rigid sphere, the radial filters d_n have to compensate for scattering effects of the array body and are thus extended to

d_n = \frac{1}{4\pi i^n \left( j_n\!\left(\frac{\omega}{c} r_0\right) - \dfrac{j_n'\!\left(\frac{\omega}{c} r_0\right)}{h_n^{(2)\prime}\!\left(\frac{\omega}{c} r_0\right)} \, h_n^{(2)}\!\left(\frac{\omega}{c} r_0\right) \right)}.  (2.16)

Radial filters, which are usually realized as FIR filters, have a significant influence on the auralization quality and have been discussed extensively in past investigations, e.g. Bernschütz (2016, pp. 90-118), Rafaely (2015, pp. 34-38), or Lösler and Zotter (2015). However, this work will not cover the radial filters in more detail.1

Figure 2.1 shows the magnitudes of the PW coefficients of a simulated broadband plane wave impinging on an ideal array at \phi = 0°, \theta = 90°, with respect to frequency and azimuth direction.

1Diagrams of the radial filters used for the auralization in this work are depicted in the appendix, see Figure A.1.

Figure 2.1: A simulated broadband plane wave impinging on an ideal array with a radius of 10 cm from the direction \phi = 0°, \theta = 90°. The figure shows the magnitudes of the plane wave coefficients for different azimuth angles at the array surface with respect to frequency. The plane wave coefficients were rendered up to an SH order of 7.

2.1.4 Binaural Reproduction

To reproduce the spatial sound field recording as a binaural VAE, the PW components have to be combined with corresponding head-related transfer functions (HRTFs) in an appropriate way. An HRTF can be understood as the transformation of a single plane wave from a sound source to the human ears. Thus, the binaural signals Y^{l,r}(\omega) that a sound field would generate at the eardrums can be calculated by weighting every PW coefficient D from direction (\phi, \theta) with the corresponding HRTF from the same direction and integrating over all directions,

Y^{l,r}(\omega) = \frac{1}{4\pi} \int_0^{2\pi} \int_0^{\pi} H^{l,r}(\phi, \theta, \omega) \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(r_0, \omega) \, Y_n^m(\theta, \phi) \sin\theta \, d\theta \, d\phi,  (2.17)

where H^{l,r}(\phi, \theta, \omega) denotes the HRTFs. A slightly different method for calculating the binaural signals can be found, for example, in Ben-Hur et al. (2018). The HRTF set H^{l,r}(\phi, \theta, \omega) is transformed into the wave spectrum domain as well. By employing the property of orthogonality of the spherical harmonics (Eq. 2.8), Equation 2.17 can be rearranged to

Y^{l,r}(\omega) = \sum_{n=0}^{\infty} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(r_0, \omega) \, \mathring{H}_{nm}^{l,r}(\omega),  (2.18)

where \mathring{H}_{nm}^{l,r} denotes the HRTF SH coefficients. This way of binaural reproduction depends strongly on the convention of the SH basis functions, which was presented in detail by Andersson (2017, p. 7).
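In the SH domain, Eq. (2.18) reduces to a channel-wise multiply-and-sum. The following minimal sketch assumes that the sound field coefficients, the HRTF coefficients, and the radial filters are already available as NumPy arrays in the ACN channel ordering introduced above; depending on the SH convention, an additional conjugation or (-1)^m weighting may be required.

```python
import numpy as np

def render_binaural_spectrum(S_nm, H_nm, d_n):
    """Evaluate Eq. (2.18) for one ear: Y(w) = sum_{n,m} d_n S_nm H_nm.

    S_nm, H_nm: ((N+1)**2, num_bins) complex SH coefficients (ACN order).
    d_n:        (N+1, num_bins) complex radial filter spectra.
    Returns the (num_bins,) spectrum of the ear signal."""
    N = d_n.shape[0] - 1
    # expand d_n from per-order to per-(n, m) channel: n = 0, 1, 1, 1, 2, ...
    n_of_channel = np.repeat(np.arange(N + 1), 2 * np.arange(N + 1) + 1)
    return np.sum(d_n[n_of_channel] * S_nm * H_nm, axis=0)
```

An inverse temporal Fourier transform of the returned spectrum then yields the impulse response of that ear.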
2.2 Spherical Array Processing

2.2.1 Discretization - Sampling Scheme

So far, continuous knowledge of the sound pressure distribution on the array surface was assumed for the mathematical derivations. However, real-world microphone arrays do not provide a continuous sensor surface S_0, so capturing a continuous sound pressure distribution is impossible. Instead, spherical microphone arrays provide a finite number of microphones: for example, first-order arrays such as the Core Sound TetraMic or the Sennheiser AMBEO, described in Gonzalez et al. (2018), or higher-order microphone arrays such as the 19-channel Zylia2, the 32-channel EIGENMIKE (Meyer and Elko, 2002), or the planned 7th-order HØSMA 7N (Dziwis et al., 2019).

The positions of the microphones on the array surface, denoted as sampling points or sampling nodes, are defined by the sampling scheme. In the following, M_sg (sg: sampling grid) describes the number of microphones, which are placed at the positions (\phi_{sp}, \theta_{sp}) (sp denotes the index of the sampling point). Each microphone is assigned one weight w_sp, which indicates the portion of the array surface covered by the corresponding microphone. It can be shown that the arrangement of the microphone positions has a significant influence on the following derivations, even if the number of microphones remains the same. Different sampling schemes are discussed in previous studies, for example Zotter (2009a) or Zotter (2009b). Every sampling scheme has an individual order N_sg, dependent on the scheme efficiency \eta_sg. It can be calculated with M_sg ≈ \eta_sg (N_sg + 1)^2. Thus, to expand up to an SH order of N_sg, at least M_sg microphones have to be provided by the microphone array.

For simplification, the following derivations only consider arrangements with an efficiency of one and microphone positions on a single radius r_0. Thus, the dependency on r_0 vanishes. Furthermore, the direction (\phi, \theta) will be denoted as \Omega.

2https://www.zylia.co/

2.2.2 The Discrete Spatial Fourier Transform

Instead of the continuous SFT given by Equation 2.9, sampling at discrete microphone positions yields the discrete SFT (DSFT)

\mathring{G}_{nm}(\omega) = \sum_{sp=1}^{M_{sg}} w_{sp} \, G(\Omega_{sp}, \omega) \, Y_n^m(\Omega_{sp})^*.  (2.19)

Consequently, the limited number of sound field SH coefficients limits the modal order of the PWD to N_sg,

D(\Omega_d, \omega) = \sum_{n=0}^{N_{sg}} \sum_{m=-n}^{n} d_n \, \mathring{S}_{nm}(\omega) \, Y_n^m(\Omega_d).  (2.20)

The ideal directional Dirac impulse becomes broader, which leads to a reduced spatial resolution. Figure 2.2 shows the influence of the limitation of the PWD order. For small orders N, the number of side-lobes decreases, but their magnitudes become larger. For higher orders, the number of side-lobes increases, but their magnitude levels decrease. For ideal sampling, and thus an infinite PWD order, all side-lobes vanish.

Figure 2.2: Normalized magnitudes of PW coefficients with respect to the azimuth direction for SH rendering orders N = 5, N = 10, N = 20, and N = 30. The figure is taken from Bernschütz (2016, p. 73) to illustrate the influence of SH order truncation. Smaller N leads to fewer side-lobes with higher magnitudes, larger N to an increasing number of side-lobes, but with decreasing magnitudes. An ideal PWD with an infinite SH order leads to an ideal spatial Dirac impulse.
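A minimal sketch of the DSFT of Eq. (2.19), assuming NumPy/SciPy and quadrature weights w_sp normalized such that they sum to 4π (normalization conventions differ between toolboxes):

```python
import numpy as np
from scipy.special import sph_harm

def dsft(p, azi, col, weights, N):
    """Discrete SFT per Eq. (2.19).

    p:       (M, num_bins) complex sound pressures at the microphones.
    azi/col: (M,) azimuth and colatitude of the sampling points in radians.
    weights: (M,) quadrature weights of the sampling scheme.
    Returns ((N+1)**2, num_bins) SH coefficients in ACN channel order."""
    G_nm = np.zeros(((N + 1) ** 2, p.shape[1]), dtype=complex)
    for n in range(N + 1):
        for m in range(-n, n + 1):
            Y_nm = sph_harm(m, n, azi, col)   # SH values at the mic positions
            G_nm[n * n + n + m] = (weights * np.conj(Y_nm)) @ p
    return G_nm
```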
2.3 Technical Constraints

2.3.1 Spatial Aliasing

Besides other constraints, such as the gain limitation of the radial filters at low frequencies to avoid noise amplification, the major problem of real-world microphone arrays is the limited microphone density, which leads to spatial undersampling. One consequence of such undersampled sound field captures is called spatial aliasing and is discussed in this section.

Sampling a continuous-time signal leads to replications of the corresponding spectrum at multiples of the sampling frequency. When sampling at a rate lower than twice the highest signal frequency, frequency components of the spectral replications are aliased into the lower frequency bands (Shannon, 1998). This temporal aliasing effect is illustrated in Figure 2.3. The same applies when spatially sampling a space-continuous sound field. Sampling with a limited number of sampling points results in replications of the modal spectra in the wave spectrum domain. These replications are aliased into the lower modal wave spectra, as can be seen in Figure 2.4. However, for spatial aliasing there is no temporal frequency bound comparable to the Nyquist frequency of continuous-time sampling below which no aliasing artifacts arise. Natural sound fields are not band-limited, and thus spatial aliasing is present at every temporal frequency. Nevertheless, higher temporal frequencies are more affected by spatial aliasing than lower ones. According to Rafaely (2005), a temporal frequency f_A can be estimated above which spatial aliasing artifacts increase rapidly,

f_A = \frac{N_{sg} \, c}{2\pi r_0}.  (2.21)

Figure 2.3: Sampling a continuous-time signal with a sampling rate smaller than twice the highest signal frequency leads to the time-aliasing effect: components of the spectral replications are aliased into the lower frequency bands. The figure is taken from Ahrens (2012, p. 119).

Figure 2.4: Sampling a space-continuous signal results in replications in the wave spectrum domain. Analogous to time aliasing, portions of the wave spectrum replications of higher modes are aliased into lower modes. The figure is taken from Ahrens (2012, p. 125).

Thus, SH coefficients of orders N < N_sg are not noticeably affected by spatial aliasing; in other words, to avoid spatial aliasing up to a certain temporal frequency f, a grid order of N_sg > (2\pi f r_0)/c = k r_0 is necessary. The frequency f_A will be denoted as the spatial alias frequency in the following. According to Ben-Hur et al. (2018), the aliasing error can be expressed mathematically as

\sum_{sp=1}^{M_{sg}} w_{sp} \, Y_n^m(\Omega_{sp})^* \, Y_{n'}^{m'}(\Omega_{sp}) = \delta_{nn'} \delta_{mm'} + \varepsilon(n, m, n', m').  (2.22)

After spatial sampling, the SH basis functions are no longer completely orthogonal and thus produce an additional error \varepsilon. Neglecting trivial solutions such as increasing the number of microphones or decreasing the array radius r_0 to improve the microphone density, there exists a small number of approaches to reduce the impact of spatial aliasing. These approaches are discussed in Section 2.3.4.

2.3.2 Truncation Error

A further essential constraint induced by sparse microphone arrangements is the SH series truncation error. When describing HRIRs or spatial sound fields using SH coefficients, real-life implementations can use only a limited modal order of SH basis functions. The DSFT is limited by M_sg, as can be seen in Equation 2.19.
However, HRIRs and natural sound fields are not limited in their spatial resolution and would thus require an SH representation of infinite modal order. It is unavoidable to truncate the SH series at a certain order and cut away the higher SH modes that contain important spatial details.

This is highly relevant, especially for HRIRs. The geometry of the human pinna induces relevant localization cues in the high-frequency components. Representing this localization information in the SH domain requires a high spatial resolution and hence high SH orders. Neglecting these modes leads to audible artifacts in binaural renderings, such as reduced spaciousness or coloration effects. In particular, the contralateral ear signal, which is not exposed to direct sound incidence, is more affected by the truncation error. HRIR data sets are usually measured on sampling grids with an adequate resolution to store all necessary spatial details. However, only a limited number of sound field signals is obtained by spherical microphone arrays. When merging high modal resolution HRIRs with limited modal resolution sound field data, the truncation effect, and thus a loss of spatial detail, is unavoidable.

Furthermore, similar to windowing a time signal with a harsh rectangular window, which results in side-lobes in the frequency domain, harsh truncation of the SH series results in artifacts in the wave spectrum domain. This effect is illustrated in Figure 2.2 and Figure 2.5. Truncating the SH series at lower orders leads to fewer side-lobes, but with higher amplitude and energy; truncating at higher orders yields multiple side-lobes with decreased magnitudes. Figure 2.5 (bottom right) was generated by rendering up to an SH order of N = 85 to simulate a nearly ideal spatial Dirac impulse.

Figure 2.5: Normalized logarithmic magnitudes of SH coefficients of a 0° spatial Dirac impulse computed with Equation 2.8 for maximum SH orders N = 3 (black), N = 5 (blue), N = 7 (red), and N = 85 (grey).

2.3.3 Consequences of Undersampled Microphone Array Data Renderings

Using undersampled microphone array recordings for rendering a VAE results in the spatial aliasing effect and, at the same time, in SH order truncation. Therefore, it is hard to distinguish between the consequences of these effects. However, spatial aliasing solely depends on the number of microphones, or more precisely on the density of the microphone arrangement, whereas the truncation effect depends only on the SH rendering order. Thus, both effects result in different and even contrary audible artifacts. Ben-Hur et al. (2019) denoted the overall error caused by sparse microphone constellations as the sparsity error and mathematically broke down the differences between the aliasing and truncation effects.

Spatial undersampling yields two different artifacts in binaural reproductions: on the one hand a spectral modification, and on the other hand a loss of spatial information. Figure 2.6 illustrates both artifacts. Similar to Figure 2.1, it shows the magnitudes of the PW coefficients of a simulated broadband plane wave impinging on an array at \phi = 0°, \theta = 90° with respect to frequency and azimuth angle, this time on a real-life array with a 10 cm radius and a 5th order Lebedev sampling scheme.
Whereas for the ideal array the magnitudes for certain azimuth directions remain constant over all frequencies, the spatial resolution of the real-life array is totally lost at high frequencies. Additionally, an overall increased level of the PW coefficients can be observed at high frequencies.

The influence of spatial aliasing and truncation error on the time-frequency spectrum is illustrated in Figure 2.7. Both diagrams show the frequency responses of 90° BRIRs to the left ear resulting from binaural reproductions of array impulse responses measured in Control Room 7 at the WDR broadcasting studios (Section 3.1 gives a more detailed explanation of the array measurements by Stade et al. (2012) used for the renderings in this work).

Figure 2.6: Plane wave impact at \phi = 0°, \theta = 90° on a real-life microphone array with a radius of 10 cm and a 7th order Lebedev grid. The figure shows the magnitudes of the PW coefficients for different azimuth angles at the array surface with respect to frequency. The PW coefficients were rendered up to an SH order of 7.

The left-hand figure depicts BRTFs (binaural room transfer functions) based on 1202 sampling point Lebedev grid array impulse responses. The black BRTF was rendered up to an SH order of 29, the blue one up to an SH order of 5. Hence, both BRTFs are not notably affected by spatial aliasing, but the blue curve is significantly more affected by the truncation error. It can be seen that the truncation error yields a decreased level at higher frequencies and thus has a low-pass characteristic. The right-hand diagram shows BRTFs based on a 1202 Lebedev grid (black curve) and a 50 sampling point Lebedev grid (blue curve), both rendered up to an SH order of 5. Hence, both renderings contain the same truncation error, but only the blue BRTF is influenced by spatial aliasing. It can be observed that spatial aliasing leads to an increased level at higher frequencies and thus has a high-pass characteristic. Both illustrations show the effect on the time-frequency spectrum for one direction only. However, it can be shown that spatial aliasing is present in the same amount for all directions, whereas the truncation error is highly direction dependent and yields spectral artifacts in particular for lateral directions, see e.g. Zaunschirm et al. (2018) or Hold et al. (2019).

2.3.4 Reducing the Impairment of Spatial Undersampling

This section gives an overview of common approaches to reduce the perceptual artifacts of spatially undersampled array auralizations. It distinguishes between aliasing and truncation mitigation approaches; however, some of them address both effects.

Figure 2.7: 90° BRTF renderings to the left ear based on array impulse responses from Control Room 7 at the WDR studios. The left-hand figure displays two BRTFs resulting from measurements on a 1202 sampling point Lebedev grid, the black one rendered up to a maximum SH order of 29, the blue one up to an SH order of 5. The differences at higher frequencies illustrate the spectral influence of the truncation error.
The right-hand figure shows two BRTFs resulting from renderings up to an SH order of 5, the blue curve from a measurement on a Lebedev grid with 50 sampling points, the black one from a Lebedev grid with 1202 sampling points. This illustrates the spectral influence of spatial aliasing. Both BRTFs were fractional-octave smoothed over 1/3 octave (this holds for every figure displaying frequency responses in this work).

2.3.4.1 Spectral Equalization

As already mentioned, one artifact of spatial aliasing is an increased sound pressure level at higher frequencies. The simplest way to compensate for this spectral artifact is to equalize the resulting binaural signals. Bernschütz (2016, pp. 129-130) found that the spectral increase follows a certain regularity. Figure 2.8 (left), taken from Bernschütz (2016, p. 130), shows the logarithmic deviation between 7th order array renderings and the corresponding dummy head measurements, averaged over 360 azimuth directions. For the array renderings, the author used HRTFs with a limited modal resolution (RHRTFs) (Bernschütz, 2016, pp. 83-86). These RHRTFs are based on SH interpolation and prevent the truncation effect, so the depicted frequency responses are affected by spatial aliasing only. Figure 2.8 illustrates that for diffuse sound fields, the deviation increases nearly linearly with a 6 dB per octave slope for frequencies above the spatial alias frequency. The right-hand figure shows the resulting compensation filter, generated by inverting the average deviation curve. To calculate a generative filter with respect to the array configuration, the author proposed an ordinary first-order low-pass filter with a cut-off at the spatial alias frequency.

To evaluate this, Figure 2.9 was created. It shows the average logarithmic deviation of binaural signals rendered from 1202 and 50 sampling point array impulse responses at the same SH order. Just like in Figure 2.7 (right), the 50 sampling point rendering is affected by spatial aliasing. The blue curve in the upper figure shows the logarithmic deviation of a 90° BRTF to the left ear, the black curve the logarithmic deviation averaged over left and right BRTFs for 360 azimuth directions. Bernschütz (2016) compared dummy head BRIRs to RHRTF array renderings and found that the deviation follows a 6 dB per octave slope. In Figure 2.9, however, it can be observed that the deviation increases nearly linearly with 12 dB per octave. A generative low-pass filter matching the inverse of this curve can be realized as a second-order low-pass filter, depicted in the bottom figure for the array orders 5, 7, and 10. Nevertheless, for consistency, the discussion and the preparation of the listening experiment stimuli refer to the first-order equalization filters as proposed by Bernschütz (2016). Figure 2.10 illustrates the result of the equalization using second-order low-pass filters. Again, frequency responses of BRIRs rendered with 1202 and 50 sampling point array data are compared. Additionally, the blue curve shows the equalized response, which closely matches the desired grey 1202 sampling point BRTF.
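The following minimal sketch derives the cut-off from the spatial alias frequency of Eq. (2.21) and applies a second-order Butterworth low-pass to a rendered binaural signal, assuming SciPy; it is a stand-in for the generative filters above, which are matched to the measured deviation curves.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def aliasing_eq(brir, N_sg, r0, fs, c=343.0):
    """Equalize the ~12 dB/octave aliasing-induced high-frequency boost with a
    second-order low-pass cut off at the spatial alias frequency, Eq. (2.21)."""
    f_a = N_sg * c / (2 * np.pi * r0)
    sos = butter(2, f_a, btype="low", fs=fs, output="sos")
    return sosfilt(sos, brir)

# e.g. for the 50-node renderings: N_sg = 5, r0 = 0.1 m, fs = 48 kHz
# brir_eq = aliasing_eq(brir, N_sg=5, r0=0.1, fs=48000)
```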
Figure 2.8: On the left-hand side, the average logarithmic deviations of 7th order array renderings and corresponding dummy head measurements are depicted (simulation, broadcast studio, and control room measurements). The deviation matches a linear 6 dB per octave increase for frequencies above the alias frequency. The right-hand diagram illustrates the resulting spectral compensation filter derived from inverting the average deviation curve. The figure is taken from Bernschütz (2016, p. 130).

Figure 2.9: Logarithmic deviation curves of 5th order array renderings done with 50 and 1202 sampling point array measurements. The blue curve depicts the logarithmic deviation for a 90° BRTF to the left ear, the black curve the deviation averaged over both ears and 360 azimuth directions. The bottom figure displays generative second-order low-pass filters for the SH orders 5, 7, and 10.

Figure 2.10: 90° BRTF to the left ear based on array renderings of 1202 (grey) and 50 (black and blue) sampling point Lebedev grid measurements. Both were rendered up to an SH order of 5. The black curve is significantly affected by spatial aliasing; the blue curve was equalized by a second-order low-pass filter.

Similarly, the spectral modification caused by the truncation error can be equalized by appropriate compensation filters. Ben-Hur et al. (2017) introduced the so-called Spherical Head Filters (SHF), which are also applied to the binaural signals. Similar to the assumptions for the spatial aliasing compensation filters, Ben-Hur et al. (2017) assume the average response of a diffuse sound field, but additionally approximate the human head as a rigid sphere. Figure 2.11 depicts the proposed SHF for different SH orders N. These filters were designed under the assumption of negligible spatial aliasing.

Figure 2.11: The so-called Spherical Head Filters for the SH orders 1, 3, 6, and 12 introduced by Ben-Hur et al. (2017). They assume negligible spatial aliasing and compensate for spectral artifacts induced by SH order truncation.

The two spectral compensation filters are proposed either as truncation or as spatial aliasing compensation filters. In real life, however, both effects are present at the same time. In Section 2.3.5.5, a combination of both filters compensating the overall undersampling error is discussed.

2.3.4.2 Bandwidth Extension Algorithm for Microphone Arrays

The Bandwidth Extension Algorithm for Microphone Arrays (BEMA) has been developed as a spatial aliasing mitigation approach, see Bernschütz (2012) or Bernschütz (2016). The sound field SH coefficients of the higher frequency bands that are affected by spatial aliasing are substituted with synthesized SH coefficients. As all SH coefficients above the alias frequency are replaced, the BEMA approach compensates not just the spatial aliasing artifacts, but also the truncation error. The basic idea is to use certain properties of the reliably obtainable frequency bands to estimate the SH coefficients of the higher frequencies.
To this end, spatial and spectral properties are treated separately. The spatial energy distribution is extracted from the SH coefficients of frequency bands below the alias frequency, captured by the microphone array. The total energy of the higher frequencies is obtained by an additional omnidirectional microphone, ideally located in the array center. This approach is based on the assumption of a natural spatial sound field, where the energy distributions of adjacent frequency bands are similar.

The sound field SH coefficients \mathring{S}_{nm}(\omega) with respect to the frequencies (\omega_0, \omega_1, \ldots, \omega_K) can be expressed as the (N+1)^2 \times (K+1) matrix

\mathring{S}_{nm}(\omega) = \begin{pmatrix} \mathring{S}_{0,0}(\omega_0) & \mathring{S}_{0,0}(\omega_1) & \ldots & \mathring{S}_{0,0}(\omega_K) \\ \mathring{S}_{1,-1}(\omega_0) & \mathring{S}_{1,-1}(\omega_1) & \ldots & \mathring{S}_{1,-1}(\omega_K) \\ \mathring{S}_{1,0}(\omega_0) & \mathring{S}_{1,0}(\omega_1) & \ldots & \mathring{S}_{1,0}(\omega_K) \\ \mathring{S}_{1,1}(\omega_0) & \mathring{S}_{1,1}(\omega_1) & \ldots & \mathring{S}_{1,1}(\omega_K) \\ \vdots & \vdots & \ddots & \vdots \\ \mathring{S}_{N,N}(\omega_0) & \mathring{S}_{N,N}(\omega_1) & \ldots & \mathring{S}_{N,N}(\omega_K) \end{pmatrix}.  (2.23)

The matrix \mathring{S}^a_{nm}(\omega) (Eq. 2.24), as a subset of \mathring{S}_{nm}(\omega) with respect to the frequencies (\omega_a, \ldots, \omega_{K-1}, \omega_K), holds the aliased SH coefficients, which are replaced with the synthesized BEMA SH coefficients. According to Equation 2.21, \omega_a can be approximated as \omega_a \approx N_{sg} c / r_0.

\mathring{S}^a_{nm}(\omega) = \begin{pmatrix} \mathring{S}_{0,0}(\omega_a) & \ldots & \mathring{S}_{0,0}(\omega_{K-1}) & \mathring{S}_{0,0}(\omega_K) \\ \mathring{S}_{1,-1}(\omega_a) & \ldots & \mathring{S}_{1,-1}(\omega_{K-1}) & \mathring{S}_{1,-1}(\omega_K) \\ \vdots & \ddots & \vdots & \vdots \\ \mathring{S}_{N,N}(\omega_a) & \ldots & \mathring{S}_{N,N}(\omega_{K-1}) & \mathring{S}_{N,N}(\omega_K) \end{pmatrix}  (2.24)

To extract the reliable spatial information, denoted as the spatiotemporal image \mathring{I}_{nm}, an appropriate source band has to be defined. It is preferable to choose source bands as close as possible to the synthetic bands. In the following, the W source bins below the alias frequency are denoted as \omega_{a-\mu}, \mu = 1, \ldots, W. The spatial image can be computed with

\mathring{I}_{nm} = \frac{1}{W} \sum_{\mu=1}^{W} d_n\!\left(\frac{\omega_{a-\mu}}{c} r_0\right) \mathring{S}_{nm}(\omega_{a-\mu}).  (2.25)

Through the multiplication with d_n, the dependency on the radial functions vanishes. The spatiotemporal image should hold only information about the spatial energy distribution, as the total energy is taken from the center microphone. Therefore, it has to be power normalized according to

p = \frac{1}{W} \sum_{n=0}^{N} \sum_{m=-n}^{n} \sum_{\mu=1}^{W} \left| d_n\!\left(\frac{\omega_{a-\mu}}{c} r_0\right) \mathring{S}_{nm}(\omega_{a-\mu}) \right|^2  (2.26)

\tilde{p} = \sum_{n=0}^{N} \sum_{m=-n}^{n} \left| \mathring{I}_{nm} \right|^2  (2.27)

\mathring{I}^{\mathrm{norm}}_{nm} = \sqrt{\frac{p}{\tilde{p}}} \, \mathring{I}_{nm}.  (2.28)

The center signal c_{\mathrm{norm}}(\omega) is normalized to its own average level in the source band, and the synthetic BEMA SH coefficients can be calculated with

\mathring{S}^{\mathrm{BEMA}}_{nm}(\omega) = \frac{1}{d_n\!\left(\frac{\omega}{c} r_0\right)} \underbrace{\mathring{I}^{\mathrm{norm}}_{nm}}_{\text{spatial information}} \underbrace{c_{\mathrm{norm}}(\omega)}_{\text{spectral information}}.  (2.29)

Here, the corresponding radial filter d_n is taken into account again. The author proposes to optimize the phase of the synthetic SH coefficients. For this, the phase of the synthetic SH coefficients is shifted by a phase offset \Delta\phi obtained at the transition bin,

\Delta\phi = \arg\!\left(\frac{\mathring{S}_{00}(\omega_a)}{\mathring{S}^{\mathrm{BEMA}}_{00}(\omega_a)}\right)  (2.30)

\mathring{S}^{\mathrm{BEMA}\prime}_{nm}(\omega) = \mathring{S}^{\mathrm{BEMA}}_{nm}(\omega) \, e^{j\Delta\phi}.  (2.31)

Further, it is reasonable to fade in the synthetic coefficients instead of using a harsh crossover. Bernschütz (2012) introduced one feasible crossover method based on a smooth overlap of synthetic and original SH coefficients, which will not be derived in more detail here.

For a single plane wave, knowledge of the pressure distribution on the array surface at a single frequency bin allows obtaining the pressure distribution over the entire frequency range. Hence, BEMA works perfectly for a single plane wave. However, Bernschütz (2016, pp. 144-146) found that even for three time-shifted plane waves impinging on the array surface, the BEMA algorithm results in notable artifacts. It thus remains unclear whether BEMA improves the auralization of array recordings of diffuse sound fields.
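A minimal sketch of the spatiotemporal image computation of Eqs. (2.25)-(2.28), assuming NumPy arrays in ACN channel order and radial filter spectra given per order n; the center-signal processing, phase optimization, and crossover are omitted.

```python
import numpy as np

def bema_spatial_image(S_nm, d_n, bin_a, W):
    """Power-normalized spatiotemporal image per Eqs. (2.25)-(2.28).

    S_nm:  ((N+1)**2, K) sound field SH coefficients (ACN channel order).
    d_n:   (N+1, K) radial filter spectra.
    bin_a: index of the alias bin, W: number of source bins below it."""
    N = d_n.shape[0] - 1
    n_of_channel = np.repeat(np.arange(N + 1), 2 * np.arange(N + 1) + 1)
    src = np.arange(bin_a - W, bin_a)                 # source bins w_(a-mu)
    filtered = d_n[n_of_channel][:, src] * S_nm[:, src]
    I_nm = filtered.mean(axis=1)                      # Eq. (2.25)
    p = np.sum(np.abs(filtered) ** 2) / W             # Eq. (2.26)
    p_img = np.sum(np.abs(I_nm) ** 2)                 # Eq. (2.27)
    return I_nm * np.sqrt(p / p_img)                  # Eq. (2.28)
```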
2.3.4.3 Magnitude Least Squares

Schörkhuber et al. (2018) introduced a method for reducing the impact of order truncation denoted as Magnitude Least-Squares (MagLS). This method optimizes the HRTF set in such a way that it leads to lower truncation errors after the binaural rendering. The MagLS approach consists of two parts. First, the HRTFs are modified such that the energy in higher SH orders is reduced without notably decreasing the perceptual quality.

This modification is an advancement of the time alignment (TA) approach by Zaunschirm et al. (2018). According to the Duplex Theory by Rayleigh (1907), at high frequencies the interaural time differences (ITDs) become perceptually less important than the interaural level differences (ILDs). At the same time, most of the energy in the higher modes is caused by rapid phase changes towards higher frequencies. Thus, removing the linear phase at higher frequencies decreases the energy in higher modes without significantly modifying the ILDs,

H_{\mathrm{TA}}(\Omega, \omega) = \begin{cases} H(\Omega, \omega) & \text{if } \omega \le \omega_c \\ H(\Omega, \omega) \, e^{-j\phi_l(\Omega, \omega)} & \text{if } \omega > \omega_c \end{cases},  (2.32)

where \phi_l(\Omega, \omega) is the linear phase with respect to the ITD for direction \Omega, and \omega_c the cut-off frequency, which was chosen empirically by the authors. MagLS aims at finding the optimum phase instead of completely removing the linear phase. This can be achieved by solving the least-squares problem that minimizes the distance between the magnitudes of the modified HRTF set and a reference set,

\min_{\mathring{H}_{nm}(\omega)} \sum_{sp=1}^{M_{sg}} \left[ \left| \sum_{n=0}^{N} \sum_{m=-n}^{n} Y_n^m(\Omega_{sp}) \mathring{H}_{nm}(\omega) \right| - \left| H(\Omega_{sp}, \omega) \right| \right]^2.  (2.33)

Deriving this least-squares procedure would be beyond the scope of this thesis; the reader is referred to Schörkhuber et al. (2018).

After modifying the phase at higher frequencies with either the TA or the MagLS approach, in a second step the HRTFs are optimized using a so-called covariance constraint matrix R(\omega). This matrix equalizes the spectrum based on a statistical diffuse sound field expectation. For a detailed derivation of R(\omega), the reader is referred to Zaunschirm et al. (2018) or Zotter and Frank (2019).

Figure 2.12: Normalized logarithmic energy distributions of HRTF SH coefficients rendered up to the orders N = 1 to N = 30. All energy plots are based on a KU100 HRIR dataset. The left-hand figure depicts the energy distribution of the untreated HRIR set, the right-hand figures the distributions of the MagLS pre-processed HRIRs for the target rendering orders N = 3, N = 5, and N = 7. The color encodes the energy ranging from -40 dB to 0 dB, normalized to the maximum value for each frequency bin.

Figure 2.12 illustrates the efficient energy concentration of the MagLS approach for the different target rendering orders N = 3, N = 5, and N = 7. All four plots are based on the KU100 2702 HRIR dataset by Bernschütz (2013). They display the energies of the SH coefficients rendered for orders N = 1 up to N = 30 with respect to frequency, normalized to the maximum value for each frequency bin. The left-hand figure was processed with the untreated HRIR set, the right-hand figures with HRIRs that were pre-processed for the target SH rendering orders 3, 5, and 7. The color indicates the energies ranging from -40 dB to 0 dB. It can clearly be seen that for the pre-processed HRIR sets the energy is concentrated in the lower orders. A number of further HRIR pre-processing approaches have been discussed and evaluated by Brinkmann and Weinzierl (2018).
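A minimal sketch of the magnitude least-squares fit of Eq. (2.33) using the common alternating iteration (see e.g. Zotter and Frank, 2019): alternately project onto the order-limited SH subspace and re-impose the measured magnitudes while keeping the phases of the current fit. It assumes an SH matrix Y evaluated at the HRTF grid directions; below the cut-off ω_c, the ordinary complex least-squares solution is kept instead.

```python
import numpy as np

def magls_bin(Y, H, n_iter=20):
    """Magnitude least-squares SH coefficients for one ear and one frequency
    bin above the cut-off. Y: (M, (N+1)**2) SH matrix at the HRTF directions,
    H: (M,) measured HRTFs at this bin. Returns ((N+1)**2,) coefficients."""
    Y_pinv = np.linalg.pinv(Y)
    h_nm = Y_pinv @ H                       # complex least-squares start
    for _ in range(n_iter):
        phase = np.angle(Y @ h_nm)          # keep the phases of the fit ...
        h_nm = Y_pinv @ (np.abs(H) * np.exp(1j * phase))   # ... match |H|
    return h_nm
```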
2.3.4.4 Tapering

Hold et al. (2019) introduced a method denoted as spherical harmonic tapering to reduce the side-lobes induced by order truncation. Hard truncation of the SH series is equivalent to windowing the SH coefficients with a rectangular window, which results in unwanted side-lobes. Windowing the SH coefficients with appropriately smoother window functions yields a fade-out of the higher-order modes, which ideally results in side-lobe suppression. Hold et al. (2019) proposed a half-sided Hanning window to achieve this softer truncation; a sketch of the window computation is given below. Furthermore, the authors propose to equalize the binaural signals using the Spherical Head Filters discussed in the previous section. However, the Hanning tapering requires slightly modified SHF. These modified filters are depicted in Figure 2.14.

Figure 2.13: Normalized logarithmic magnitudes of SH coefficients of a spatial Dirac impulse in 0° direction computed with Equation 2.8 for maximum SH orders N = 3, N = 5, and N = 7. For each order, the rectangular windowed (grey dashed) and Hanning windowed (colored) SH coefficients are depicted. SH tapering using a Hanning window results in significant side-lobe suppression, but also in a wider main-lobe with an increased level.

Figure 2.14: Hold et al. (2019) propose to equalize the resulting binaural signals after weighting the sound field SH coefficients with a tapering window. The Spherical Head Filters have to be adapted with respect to the window, which is shown in the figure taken from Hold et al. (2019). The adaptation basically results in a shift of the cut-off towards higher frequencies.
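A minimal sketch of such per-order tapering weights, assuming NumPy; the length of the fade region is a design parameter, and the exact window definition in Hold et al. (2019) may differ in detail.

```python
import numpy as np

def hann_taper(N, fade=None):
    """Per-order weights: unity up to the fade region, then the falling half
    of a Hann window over the last `fade` orders (half-sided Hanning taper)."""
    fade = (N + 1) // 2 if fade is None else fade
    w = np.ones(N + 1)
    half = np.hanning(2 * fade + 1)[fade:]    # values 1 ... 0, length fade + 1
    w[N + 1 - fade:] = half[1:]               # taper the highest orders
    return w

# apply to SH coefficients in ACN order, e.g. for N = 5:
# w_acn = np.repeat(hann_taper(5), 2 * np.arange(6) + 1)
# S_nm_tapered = w_acn[:, np.newaxis] * S_nm
```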
2.3.4.5 Matrix Regularization

Alon and Rafaely (2012; 2017) developed a method for spatial aliasing cancellation based on matrix regularization. For this, it is advantageous to introduce the matrix notation of the PWD: according to Equation 2.15, the PW components D(\omega) can be calculated from the sound pressure SH coefficients \mathring{P}_{nm} as

D = Y \, \mathrm{diag}(b) \, \mathring{P}_{nm},  (2.34)

with the radial functions b and the SH basis function matrix

Y = \begin{pmatrix} Y_0^0(\Omega_1) & Y_1^{-1}(\Omega_1) & \ldots & Y_{N_{sg}}^{N_{sg}}(\Omega_1) \\ Y_0^0(\Omega_2) & Y_1^{-1}(\Omega_2) & \ldots & Y_{N_{sg}}^{N_{sg}}(\Omega_2) \\ \vdots & \vdots & \ddots & \vdots \\ Y_0^0(\Omega_{M_{sg}}) & Y_1^{-1}(\Omega_{M_{sg}}) & \ldots & Y_{N_{sg}}^{N_{sg}}(\Omega_{M_{sg}}) \end{pmatrix}.  (2.35)

As long as the sound field is limited to an SH order of N, such that M_sg \ge (N+1)^2 is fulfilled, the computation of the PW components from the sound pressure signals P(\Omega, \omega) can be assumed to be error-free. By rearranging this equation, an expression for the sound field SH coefficients can be found,

\mathring{P}_{nm} = \mathrm{diag}(b)^{-1} Y^* D.  (2.36)

However, real-life sound fields are not limited in their expansion order. Hence, for undersampled sound field recordings with M_sg \le (N+1)^2, the PWD produces an error and Equation 2.34 transforms to

D = \tilde{Y} \, \mathrm{diag}(\tilde{b}) \, \tilde{\mathring{P}}_{nm},  (2.37)

with \tilde{Y} = [Y, Y_\Delta], \tilde{b} = [b^T, b_\Delta^T]^T, and \tilde{\mathring{P}}_{nm} = [\mathring{P}_{nm}^T, \mathring{P}_\Delta^T]^T, where (\cdot)_\Delta denotes the higher-order parts affected by spatial aliasing. Consequently, the aliased SH coefficients can be expressed as

\mathring{P}^{\mathrm{alias}}_{nm} = \mathrm{diag}(b)^{-1} Y^* D = \underbrace{\underbrace{\mathrm{diag}(b)^{-1} Y^*}_{\text{stable}} \, \underbrace{\tilde{Y} \, \mathrm{diag}(\tilde{b})}_{\text{aliased}}}_{A_p} \, \tilde{\mathring{P}}_{nm},  (2.38)

where the matrix A_p is denoted as the aliasing projection matrix. The first part of A_p consists of the stably obtainable wave spectrum information, the second part describes the aliased information. It can be written as [1, \Delta_\varepsilon] (1 = unit matrix, \Delta_\varepsilon = aliasing error pattern). \mathring{P}^{\mathrm{alias}}_{nm} can thus be broken down to

\mathring{P}^{\mathrm{alias}}_{nm} = \mathring{P}_{nm} + \Delta_\varepsilon \, \tilde{\mathring{P}}_\Delta.  (2.39)

The aim of the matrix regularization is to minimize the term \Delta_\varepsilon \tilde{\mathring{P}}_\Delta. For this purpose, a regularization matrix C_{AC} is introduced,

\mathring{P}_{nm} \overset{!}{=} C_{AC} \, \mathring{P}^{\mathrm{alias}}_{nm} = C_{AC} \, A_p \, \tilde{\mathring{P}}_{nm} = C_{AC} \, \mathring{P}_{nm} + C_{AC} \, \Delta_\varepsilon \, \tilde{\mathring{P}}_\Delta.  (2.40)

One measure to express the overall aliasing error \varepsilon_{\mathrm{PWD}} is the squared Euclidean distance between the ideal SH coefficients \mathring{P} and the aliased coefficients \tilde{\mathring{P}}: \varepsilon_{\mathrm{PWD}} = \| \tilde{\mathring{P}} - \mathring{P} \|^2. Thus, ideally, multiplication with C_{AC} should zero the spatial aliasing error. Substituting C_{AC} A_p and calculating the minimum by differentiating and setting to zero yields the final aliasing regularization matrix

C_{AC} = (A_p A_p^H)^{-1} A_p,  (2.41)

with A_p = \mathrm{diag}(b)^{-1} Y^* \tilde{Y} \, \mathrm{diag}(\tilde{b}) and (\cdot)^H the Hermitian operator. Alon and Rafaely (2017) showed that this approach works well for simulated plane wave scenarios. However, it is unclear whether it produces satisfying improvements for real-life diffuse sound fields. The matrix regularization approach would be highly suitable for real-time applications, because the C_AC matrix depends only on the radial filters and could thus be completely pre-rendered. Furthermore, there are a number of further regularization approaches discussed in the literature that prevent different artifacts, for example the reduction of noise gain. These regularizations could easily be combined.
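The regularization idea can be sketched as a least-squares fit: find C_AC such that C_AC A_p is as close as possible to [1, 0], i.e. pass the non-aliased low-order coefficients and suppress the aliased remainder. The helper below is a hypothetical simplification of Eq. (2.41), assuming NumPy; the exact construction of A_p from the radial filters follows Alon and Rafaely (2017).

```python
import numpy as np

def aliasing_cancellation_matrix(A_p, n_low):
    """Least-squares regularization matrix: C_AC @ A_p ~ [I, 0], i.e. keep the
    n_low non-aliased SH channels and suppress the higher-order remainder.
    A_p: (n_low, n_high) aliasing projection matrix with n_high >= n_low."""
    target = np.zeros((n_low, A_p.shape[1]), dtype=complex)
    target[:, :n_low] = np.eye(n_low)
    return target @ np.linalg.pinv(A_p)   # minimizes the Frobenius-norm error
```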
A further limitation is the realization of the additional expanded transducer itself. The expansion of the sensing membrane could be simulated by multiple narrow microphones; however, such a design is highly challenging. Bernschütz (2016, pp. 132-138) extensively discusses expanded transducers and mentions an approach for using such a sub-array to implement the spatial anti-aliasing filtering.

Figure 2.15: The left-hand figure, taken from Rafaely et al. (2007), illustrates the magnitudes of the modal filters designed for reducing the energy of SH orders greater than 4. It shows the ideal spatial anti-aliasing filter, as well as rectangular and Hamming window spatially constrained filters. The right-hand figure, taken from Elahi et al. (2019), shows similar filter magnitude responses, but designed for damping SH orders larger than 5. Additionally, the proposed modal filter based on a Slepian eigenfunction window is depicted.

2.3.5 Analysis of the Resulting Binaural Signals

Informal prior listening tests showed that not all approaches described in Section 2.3.4 yield satisfying audible improvements. Thus, only a few algorithms will be investigated in more detail. This section discusses the most promising approaches and their influence on the binaural signals.

2.3.5.1 MagLS

The Magnitude Least-Squares algorithm (MagLS) is an approach for mitigating the truncation error. Figure 2.16 and Figure 2.17 show the comparison of two renderings based on array impulse response measurements in the CR7 on a 1202 sampling node Lebedev grid, for the SH orders 29 and 5. This ensures that both renderings are affected by the same spatial aliasing, but by different truncation errors. The figures show left (blue) and right (red) ear signals separately, which makes it easier to illustrate the directional dependency of the truncation error. In particular, the 90° BRTFs (Fig. 2.17) demonstrate that the contralateral left ear signals are more affected by the truncation error than the ipsilateral right ear signals. The shape of the frequency response at the left ear is completely modified, whereas the right ear signal is merely damped with a nearly constant offset towards higher frequencies. Comparing the 0° and 90° ear signals shows that the truncation error has a different impact for different directions. However, the MagLS algorithm yields a significant improvement for both directions.

Figure 2.16: MagLS: Frequency responses of 0° left and right ear signals resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (colored curves). The upper diagrams show the influence of the truncation error for the right and left ear separately. The bottom diagrams depict the MagLS improvements. It can be seen that they work equally well for both ear signals.
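Although the exact implementation used in this work was provided as MATLAB code by its authors (Zaunschirm et al., 2018), the core of the MagLS idea fits in a few lines. The sketch below, with hypothetical inputs, fits the HRTF SH coefficients by complex least squares below a crossover frequency and, above it, matches only the magnitudes while propagating the phase from the previous frequency bin.

import numpy as np

def magls_coefficients(H, Y, f, f_c):
    """Sketch of the MagLS idea (after Zaunschirm et al., 2018).

    H   : (num_freqs, num_dirs) complex HRTFs on a measurement grid
    Y   : (num_dirs, num_coeffs) SH basis matrix of that grid
    f   : (num_freqs,) frequency vector
    f_c : crossover frequency; below it a conventional least-squares
          fit is used, above it only the magnitude is matched and the
          phase is taken from the previous bin's reconstruction.
    """
    Y_pinv = np.linalg.pinv(Y)
    H_nm = np.zeros((len(f), Y.shape[1]), dtype=complex)
    for k, fk in enumerate(f):
        if fk < f_c or k == 0:
            H_nm[k] = Y_pinv @ H[k]            # complex least squares
        else:
            phase = np.angle(Y @ H_nm[k - 1])  # phase of previous bin's fit
            H_nm[k] = Y_pinv @ (np.abs(H[k]) * np.exp(1j * phase))
    return H_nm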
2.3.5.2 Spherical Head Filter - SHF

The Spherical Head Filters (SHF) are global filters applied to the binaural ear signals, compensating for the spectral artifacts induced by SH order truncation. Figure 2.18 (left) depicts the SHFs for the orders N = 3, 5, and 7. The improvements that can be achieved with the SHF filtering are illustrated in Figure 2.19. The binaural signals were rendered based on the same array configurations as for the MagLS Figures 2.16 and 2.17, however, just for the left ear.

Figure 2.17: MagLS: Frequency responses for 90° left (contralateral) and right (ipsilateral) ear signals resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (colored curves). The upper diagrams show the influence of the truncation error for the right and left ear separately. The bottom diagrams depict the MagLS improvements. Together with Figure 2.16, this shows that MagLS achieves remarkable improvements independent of the direction.

The ear signal for 0° (left) shows a significant improvement of the frequency response. Comparing it to the 90° signals demonstrates that the global filter, applied to all directions equally, yields unwanted effects for the more critical lateral frequency responses. It thus compensates the coloration for frontal directions more than for lateral directions. However, on average, the SHF filtering leads to a spectral improvement for all directions.

2.3.5.3 Tapering

The Tapering algorithm attempts to mitigate the harsh truncation of the SH order series by applying a Hanning window based weighting of the SH coefficients. Additionally, the investigators propose to apply a slightly modified version of the SHF. Figure 2.20 displays the binaural signals resulting from the same renderings as before (e.g. Fig. 2.19), but involving the Tapering algorithm. Since the observation of frequency responses for single directions mainly illustrates the spectral improvements, Figure 2.20 basically indicates the influence of the modified SHF filtering. However, informal listening probes revealed that the Tapering yields audible improvements for lateral directions. The final listening experiment will show whether the Tapering has a perceivable influence, or whether the SHF filtering yields the more significant improvement.

Figure 2.18: SHF: Spherical Head Filters and aliasing compensation filters (left) for multiple SH orders N and grid orders Nsg. The aliasing compensation filters (with negative slope) depend on the grid order, the SHF (with positive slope) solely on the SH order. The right-hand figure depicts the combination of both filters for different N-Nsg configurations, resulting in the global undersampling filters (GEQ).
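A minimal sketch of such a tapering window is given below, assuming the half-sided Hanning scheme of Hold et al. (2019) with roughly the upper half of the orders faded out; the exact window layout in the reference implementation (spaudiopy) may differ.

import numpy as np

def half_hann_taper(N):
    """Per-order tapering weights: ones for the lower orders, then the
    falling half of a Hann window over roughly the upper half of the
    orders (fading the highest order towards zero)."""
    weights = np.ones(N + 1)
    n_taper = (N + 1) // 2                 # taper roughly the upper half
    hann = np.hanning(2 * n_taper + 1)     # symmetric Hann window
    weights[N + 1 - n_taper:] = hann[n_taper + 1:]
    return weights

def expand_to_modes(order_weights):
    """Repeat each order weight for its 2n+1 degrees, matching the
    (n, m) coefficient ordering of Eq. (2.35)."""
    return np.concatenate([np.full(2 * n + 1, w)
                           for n, w in enumerate(order_weights)])

# Example: weights for SH coefficients truncated at N = 5
w_nm = expand_to_modes(half_hann_taper(5))   # length (5+1)**2 = 36
# tapered coefficients: w_nm * P_nm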
2.3.5.4 BEMA

Figure 2.21 and Figure 2.22 were generated to study the improvement of BEMA. For this purpose, array measurements in the CR7 on a 1202 sampling node grid (black curves) and a 50 sampling node grid (colored curves) are rendered up to the same SH order of 5. They are affected by the same truncation error, but by different amounts of spatial aliasing. The four upper diagrams show an increased level for the 50 node rendering and thus demonstrate the impact of spatial aliasing. The bottom diagrams depict the BEMA-treated signals, which are mostly slightly improved compared to the untreated signals. However, some signals, for example the right ear signal for 90°, are even worse than the raw array rendering. In prior informal listening tests, it turned out that BEMA mostly yields unsatisfying results. Especially for highly reverberant rooms, BEMA leads to more audible artifacts than improvements. Nevertheless, it was investigated as part of the listening test.

2.3.5.5 Global Equalization Filter - GEQ

Except for the BEMA algorithm, all approaches treat either the truncation or the spatial aliasing error. To compensate for the spectral artifacts induced by both spatial aliasing and SH order truncation, a global undersampling compensation filter (GEQ) was derived. For this, the SHF and the generative first-order low-pass AEQ (Sec. 2.3.4.1) are combined. Figure 2.18 depicts AEQ and SHF (left), as well as the resulting GEQ (right). The high-shelf shape of the GEQ makes clear that the low-pass characteristic of the truncation error has more impact than the high-pass characteristic of spatial aliasing. An important note at this point: the SHFs depend only on the SH rendering order N, whereas the design of the AEQ depends on the spatial aliasing frequency (approximated by Eq. 2.21) and thus on the sampling scheme order Nsg. Hence, two binaural renderings, one up to SH order 3 and another up to order 5, require the same AEQ, but different SHFs and thus different shapes of the global equalization filter.
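A sketch of how such a global filter could be assembled is given below. Equation 2.21 is not repeated here; the helper function assumes the common approximation f_A = Nsg·c/(2πr) for the spatial alias frequency, and the SHF and AEQ magnitude responses are taken as given. The FIR design mirrors the firwin-based approach described in Section 3.2, here via scipy.signal.firwin2.

import numpy as np
from scipy.signal import firwin2

def alias_frequency(N_sg, r=0.0875, c=343.0):
    """Spatial alias frequency for a grid of order N_sg and array radius
    r in meters, using the common approximation f_A = N_sg * c / (2*pi*r)
    (assumed here to correspond to Eq. 2.21)."""
    return N_sg * c / (2 * np.pi * r)

def design_geq(num_taps, freqs_norm, shf_mag, aeq_mag):
    """Linear-phase FIR of the global equalization filter (GEQ).

    freqs_norm must run from 0 to 1 (1 = Nyquist); shf_mag and aeq_mag
    are the SHF and AEQ magnitude responses sampled at these frequencies
    (both assumed to be given). Cascading two filters corresponds to
    multiplying their magnitude responses.
    """
    return firwin2(num_taps, freqs_norm, shf_mag * aeq_mag)

# Example: alias frequency of a 5th order grid on the r = 8.75 cm array
print(alias_frequency(5))  # approx. 3.1 kHz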
Figure 2.19: SHF: Frequency responses of 0° (left) and 90° (right) ear signals at the left ear resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (blue curves). The upper diagrams show the influence of the truncation error for two different directions. The bottom diagrams depict the improvements of the Spherical Head Filter equalization. It can be seen that the global equalization achieves different improvements depending on the direction.

2.3.5.6 Overall Comparison

The previous discussion of the improvement algorithms focused either on truncation or spatial aliasing reduction. Therefore, certain rendering configurations were used to illustrate the influence of the single effects. However, in real life, rendering with an SH order of 5 when 1202 microphone signals are available is not meaningful. Moreover, the listening test will investigate the overall perceived improvement of array renderings. To do this, they will be compared to dummy head measurements. Figure 2.23 and Figure 2.24 compare 3rd, 5th and 7th order array renderings based on array measurements in the SBS to corresponding dummy head BRTFs. Figure 2.23 compares the BRTFs for 0° orientation at the left ear, Figure 2.24 the 90° BRTFs (similar comparisons for the LBS, which was also tested in the listening experiment, can be found in the appendix, see Figure A.3 and Figure A.4). The black curve depicts the dummy head BRTF, the dashed grey curve the raw untreated array rendering. Considering the 3rd order rendering at 0°, it can be found that the Tapering and SHF lead to too high a gain of the higher frequency components. MagLS and GEQ match the dummy head frequency response best. For SH orders 5 and 7, it cannot be clearly determined which algorithm generates the best match of the frequency response. In addition to the frequency responses, the spatial alias frequency according to Equation 2.21 is marked. Consistent with the derivations in Section 2.3.1, this frequency is shifted to higher frequencies for higher-order renderings. Furthermore, it can be seen that the deviation between the raw rendering and the dummy head reference decreases for higher-order renderings.

Figure 2.20: Tapering: Frequency responses of 0° (left) and 90° (right) ear signals at the left ear resulting from array renderings based on a 1202 sampling node grid, rendered up to an SH order N of 29 (black curves) and 5 (blue curves). The upper diagrams show the influence of the truncation error for two different directions. The bottom diagrams depict the improvements of the Tapering equalization. Similar to the SHF, Tapering achieves different improvements with respect to the sound source direction.

Figure 2.21: BEMA: 0° BRTFs rendered at SH order N = 5 based on 1202 (black) and 50 (colored) sampling node grids in the CR7. The upper diagrams illustrate the influence of spatial aliasing, the bottom diagrams the BEMA improvement. It can be seen that BEMA has a different impact on the ipsilateral right (red) and contralateral left (blue) ear signals.

Figure 2.22: BEMA: 90° BRTFs rendered at SH order N = 5 based on 1202 (black) and 50 (colored) sampling node grids in the CR7. The upper diagrams illustrate the influence of spatial aliasing, the bottom diagrams the BEMA improvement. Together with Figure 2.21, this shows that BEMA achieves different improvements with respect to the direction.

Figure 2.23: Comparison of 0° BRTFs at the left ear resulting from 3rd, 5th and 7th order binaural array renderings and dummy head (DH) measurements in the SBS. The array renderings were processed with each of the proposed improvement algorithms.
Figure 2.24: Comparison of 90° BRTFs at the contralateral left ear resulting from 3rd, 5th and 7th order binaural array renderings and dummy head (DH) measurements in the SBS. The array renderings were processed with each of the proposed improvement algorithms.

3 Implementation

3.1 Rigid Sphere Impulse Response Measurements

Although the main purpose of microphone array renderings is dynamic real-time auralization, this work focuses on static, impulse response based renderings. The entire development, as well as the preparation of the listening experiment stimuli, is based on renderings of measured array impulse responses. Stade et al. (2012) provide a comprehensive measurement database that is used by a large number of researchers studying spherical microphone array processing.

Stade et al. (2012) captured dummy head room impulse responses and spherical microphone array impulse responses at the same position in four different rooms at the WDR broadcasting studios in Cologne. The dummy head captures were performed using a Neumann KU100 on a purely horizontal grid with an angular resolution of 1°. The array measurements were done using the VariSphere Array (VSA) developed by Bernschütz et al. (2010). It is a fully automated robotic measurement system that allows impulse responses to be captured sequentially for simulating a rigid sphere microphone array. Arbitrary numbers of microphone positions on arbitrary sampling schemes are supported. Stade et al. (2012) captured array responses on 50, 86, 110, and 1202 sampling point Lebedev grids in the Large and Small Broadcasting Studios (LBS, SBS), as well as in the Control Rooms 1 and 7 (CR1, CR7). For all sampling schemes, the microphone capsules were mounted on a rigid sphere body extension with a radius of 8.75 cm. For a more detailed description of the measurement setup, such as the specifications of the microphone capsules, the loudspeaker used for excitation, or the exact measurement positions, the reader is referred to Stade et al. (2012). Furthermore, Bernschütz (2013) provides an HRTF dataset measured on a 2702 sampling node Lebedev grid, which is used for all array renderings in this work. This measurement was done using the KU100 dummy head as well.

3.2 Signal Processing

A complete signal processing chain for the binaural auralization of array impulse responses was first implemented by Bernschütz et al. (2011). For this, the MATLAB Sound Field Analysis toolbox (SOFiA) was developed. It allows binaural signals to be rendered for any direction based on array impulse responses measured on arbitrary sampling schemes. Besides a number of additional valuable tools, e.g. for simulating or plotting sound field data, all necessary signal processing steps introduced in Section 2.1 were implemented. Hohnerlein and Ahrens (2017) ported the SOFiA toolbox to Python. All signal processing presented in this work is based on this Python port. A basic overview of the signal flow of the binaural reproduction of array impulse responses is shown in Figure 3.1.
The microphone impulse responses measured on an $M_{sg}^{\mathrm{array}}$ sampling grid are transformed to the frequency domain first, and then, using the DSFT (Eq. 2.19), to the SH domain with an SH order of N. According to the array setup, basically involving the choice of an open or rigid sphere configuration as well as the array radius, the N + 1 radial filters are calculated in the frequency domain. Likewise, the $M_{sg}^{\mathrm{HRIR}}$ sampling node HRIR dataset is transformed to the SH domain with the same SH order N. The reproduction of the binaural signals is done for certain directions. In the case of a dynamic auralization using the SSR, 360 BRTFs on a horizontal grid, denoted as the decomposition grid, are calculated. This calculation is realized using Equation 2.18 by weighting the sound field SH coefficients with the HRTF SH coefficients in the SH domain. A final IFFT yields the binaural impulse responses used for convolution in the SSR.

Figure 3.1: Basic signal flow of binaural renderings of microphone array signals. Array impulse responses measured on an $M_{sg}^{\mathrm{array}}$ sampling point grid, as well as the $M_{sg}^{\mathrm{HRIR}}$ HRIR dataset, are transformed to the frequency domain first and then to the SH domain by means of the DSFT up to an SH order N. With respect to the array configuration, N + 1 radial filters are calculated in the frequency domain. Using Equation 2.18, the binaural signals are calculated for the $M_{sg}^{\mathrm{DG}}$ directions defined by the decomposition grid. A final IFFT yields the time domain BRIRs.

The existing sound field analysis Python toolbox was extended during the work of this master's thesis. BEMA was directly ported from the SOFiA toolbox. The tapering algorithm was implemented according to Hold et al. (2019) (https://github.com/chris-hld/spaudiopy/tree/ICASSP2019). To realize the MagLS algorithm, the authors (Zaunschirm et al., 2018) provided their MATLAB code. This made it possible to prepare the KU100 HRIR set for renderings at arbitrary SH orders. The Spherical Head Filters, as well as the overall spectral equalization filters, were realized as linear-phase FIR filters generated with the firwin function available in the Python SciPy library. Furthermore, the matrix regularization approach was implemented; however, it turned out that at this stage of development it yields no perceptually satisfying results.
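The chain of Figure 3.1 can be summarized for one ear in a few NumPy lines. The following is a schematic sketch, not the toolbox implementation itself: the SH matrices, radial filter spectra, and HRTF SH coefficients are assumed to be precomputed, and the conjugation and normalization conventions of Equations 2.18 and 2.19 are glossed over.

import numpy as np

def render_brir_one_ear(array_irs, Y_array_pinv, radial_inv, H_nm_ear,
                        n_fft=4096):
    """Schematic version of the signal flow in Figure 3.1 for one ear.

    array_irs    : (M_array, taps) measured array impulse responses
    Y_array_pinv : ((N+1)**2, M_array) pseudo-inverse SH matrix of the
                   array grid, implementing the DSFT (Eq. 2.19)
    radial_inv   : (n_fft//2 + 1, (N+1)**2) inverse radial filter
                   spectra, expanded from the N + 1 orders to all modes
    H_nm_ear     : (n_fft//2 + 1, (N+1)**2) HRTF SH coefficients of one
                   ear for the desired look direction
    """
    P = np.fft.rfft(array_irs, n_fft, axis=-1)   # time -> frequency domain
    P_nm = Y_array_pinv @ P                      # frequency -> SH domain
    ear = np.sum(radial_inv.T * P_nm * H_nm_ear.T, axis=0)  # Eq. 2.18 weighting
    return np.fft.irfft(ear, n_fft)              # frequency -> time (BRIR)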
3.3 Auralization

The auralization of the array data based BRIRs, as well as of the measured dummy head BRIRs, is done with the SoundScape Renderer (SSR). It allows the real-time convolution of arbitrary HRIRs or BRIRs, defined on a purely horizontal grid with 1° resolution, with arbitrary input audio signals. The dry audio signal to be auralized can be played back with any audio player and routed into the SSR via JACK audio (http://jackaudio.org/). To take the head orientation of the listener into account, the SSR supports a variety of head-tracking systems. This enables the presentation of a dynamic binaural synthesis where the entire sound scene remains static while the head is moved. The SSR offers the possibility to load multiple BRIR datasets into the working memory and switch between them dynamically during the playback of the dry audio signal. Furthermore, the SSR can handle commands sent via TCP, which allows the SSR to be controlled remotely by arbitrary applications. The listening experiment introduced in the next section is based on a Python application with a QT5 graphical user interface that controls the SSR via a TCP connection.

Figure 3.2: The graphical user interface of the SoundScape Renderer in the Binaural Room Synthesis (BRS) mode, showing an example scene consisting of four sound sources. Every sound source represents one BRIR set loaded into the working memory. Every BRIR set can be unmuted dynamically, which enables the real-time rendering of a dynamic binaural synthesis.
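In its simplest form, such a remote control is only a few lines of Python. The sketch below is illustrative only: port 4711 and the zero-terminated XML message format are assumed defaults, and the exact command string should be checked against the SSR documentation for the version in use.

import socket

def ssr_send(message, host="localhost", port=4711):
    """Send one command to a running SSR instance over TCP.

    The port number and the zero-terminated XML message format are
    assumptions; verify them against the SSR documentation.
    """
    with socket.create_connection((host, port)) as sock:
        sock.sendall(message.encode("utf-8") + b"\0")

# Illustrative command: unmute the source that holds the BRIR set of
# the next stimulus (source ids are assigned by the loaded scene).
ssr_send('<request><source id="2" mute="false"/></request>')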
4 Perceptual Evaluation

4.1 Introduction

Ideally, spherical microphone array recordings and subsequent binaural reproductions should realize auralizations that are perceptually similar to those achieved with dummy head measurements in the same scenario. Bernschütz (2016) and Ahrens and Andersson (2019) conducted listening experiments comparing headphone-based binaural syntheses using dummy head BRIR measurements to headphone-based auralizations based on microphone array impulse response measurements with different numbers of microphones. Both the dummy head and the array impulse response measurements were performed under the same conditions. Bernschütz (2016) found that for SH rendering orders above 8, most of the perceptual differences decrease significantly. Ahrens and Andersson (2019) confirmed this finding in their follow-up study. Rendering orders up to 8 yield audible differences compared to the dummy head auralizations. In particular, remarkable differences arise for lateral source directions. The latter can be explained by the truncation error. Further, the investigators showed that spatial aliasing has a significant influence on the auralization quality.

The present investigation can be seen as a continuation of these studies: Section 2.3.4 introduced a number of truncation error and spatial aliasing error mitigation approaches that were developed over the last few years. Section 2.3.5 discussed the most promising algorithms, namely MagLS, Tapering, SHF and GEQ, and illustrated their influence on the binaural renderings. This work presents a comparison of the perceivable improvements that can be achieved with this selection of methods by means of a listening experiment. The influences of the spatial aliasing and truncation impairments will not be treated separately; instead, the overall quality improvement of array renderings compared to corresponding dummy head measurements will be tested.

4.2 Experimental Design

4.2.1 Stimuli

To compare the algorithms in a listening experiment, different BRIR renderings were presented to the listeners using a dynamic binaural synthesis. Each BRIR was rendered based on array impulse responses measured in the WDR broadcasting studios, as described in Section 3. Ahrens and Andersson (2019) showed that audible differences arise up to an SH rendering order of 8. As this study aims at investigating undersampled array constellations, renderings of orders 3, 5 and 7 are compared. The rendering orders 3 and 5 are based on the 50 sampling node Lebedev grid measurement, the 7th order rendering on the 86 sampling node Lebedev grid. To study the dependency on the environment, the binaural renderings were done for array impulse responses in the LBS (approx. RT60 = 1.8 s at 1 kHz) and the SBS (approx. RT60 = 1.1 s at 1 kHz). Each of these rendering configurations was processed with each improvement algorithm. A dynamic binaural synthesis was presented in which the participants were free to rotate their heads and listen in all directions. To simulate an initial lateral head orientation, which allows the direction-dependent performance of the algorithms to be studied, each BRIR was rotated clockwise by 90°. This is of high interest in particular for the algorithms reducing the truncation error. To ensure consistency with the study of Ahrens and Andersson (2019), the radial filters were soft-limited to 0 dB for all renderings (the soft-limited radial filters are depicted in the appendix, Figure A.1).

The use of linear-phase filters results in BRIRs that are time-delayed in relation to the BRIRs measured with the dummy head. To provide continuous playback while switching the stimuli during the listening experiments, all delays were compensated. This was achieved by shifting the BRIRs according to their peak, measured with a simple onset detection provided by the AKtools MATLAB toolbox (Brinkmann and Weinzierl, 2017; www.ak.tu-berlin.de/aktools).

As the test signal, acoustically dry drum recordings were presented. Drum recordings have a wide spectrum and typically contain strong transients, which makes drums a rather critical test signal. Furthermore, Ahrens and Andersson (2019) used the same test signal, which supports the consistency of the experimental results.
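The delay compensation described above can be sketched as follows; the threshold-based onset detector is a simple stand-in for the AKtools onset detection actually used.

import numpy as np

def align_brirs(brirs, threshold_db=-20.0):
    """Time-align a set of BRIRs via a simple onset detection: the onset
    is the first sample within `threshold_db` of the absolute peak.

    brirs: list of (2, taps) arrays; all BRIRs are shifted so that their
    earliest channel onset lands at the same sample index. np.roll wraps
    around, which is unproblematic as long as the shifted-out region is
    silence.
    """
    def onset(ir):
        env = np.abs(ir)
        thresh = env.max() * 10 ** (threshold_db / 20)
        return int(np.argmax(env >= thresh))   # index of first True

    onsets = [min(onset(ir[0]), onset(ir[1])) for ir in brirs]
    target = min(onsets)
    return [np.roll(ir, target - o, axis=-1) for ir, o in zip(brirs, onsets)]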
4.2.2 Experimental Paradigm

To compare the different algorithms, a listening test based on the multiple stimulus with hidden reference and anchor (MUSHRA) test paradigm was used. The MUSHRA methodology is a multi-stimulus test design proposed by the International Telecommunication Union (ITU), see ITU-R BS.1534-3 (2015). It was developed for evaluating the perceptual quality of audio systems and codecs. It allows comparing up to 12 different systems, or different levels of audio systems, presented at the same time. For this, the reference stimulus, labeled as such, as well as multiple further stimuli are presented per trial. One of the stimuli is the hidden reference, another one a hidden anchor stimulus. All stimuli are compared to the reference stimulus in terms of perceived quality. The anchor is a stimulus that should be perceived as the worst by the listener and is used for screening the ratings. Each rating is done with a continuously movable slider ranging from 0 to 100, which allows very small differences to be registered. The sliders are divided into five evenly spaced segments labeled 'Excellent' (100-80), 'Good' (80-60), 'Fair' (60-40), 'Poor' (40-20), and 'Bad' (20-0).

In contrast to the MUSHRA design introduced by the ITU, the aim was to evaluate the overall perceived difference to the reference stimulus instead of the intermediate quality. Therefore, a slightly adapted test design was used in which the slider labels were replaced with 'No difference' (100), 'Small difference' (75), 'Moderate difference' (50), 'Significant difference' (25), and 'Huge difference' (0). Furthermore, the placements of the labels were adjusted such that 'No' and 'Huge' difference represent the extreme ratings 100 and 0. Figure 4.1 depicts the PyQT based experimental graphical user interface developed for the listening test. All stimuli described above had to be compared to the corresponding dummy head auralization in terms of overall perceived differences.

For the anchor stimulus, a 3 kHz low-pass filtered diotic signal based on the associated dummy head BRIR was generated. Hence, for each trial, BRIRs processed with the algorithms MagLS, BEMA, Tapering, SHF, and GEQ, as well as the untreated BRIR rendering (raw), the hidden dummy head reference, and the anchor stimulus were presented. These conditions were rendered for two rooms, three SH orders, and two directions. Table 4.1 summarizes these 12 trials with 8 conditions each.

Table 4.1: Overview of the stimuli tested in the final listening experiment. All of the conditions were presented per trial, whereas the reference stimulus (DH Ref), the anchor stimulus, and the raw rendering are listed as algorithms. Each of these sets of conditions was rendered for 3 SH orders, 2 directions, and 2 rooms; in total 12 trials with 8 conditions each.

Algorithms   SH orders (grid)   Directions   Rooms
BEMA         3 (50)             0°           LBS
MagLS        5 (50)             90°          SBS
SHF          7 (86)
Tapering
GEQ
Raw
DH Ref
Anchor

4.2.3 Setup

The experiment was conducted in a quiet, acoustically damped audio laboratory at Chalmers University. The participants were presented a dynamic binaural synthesis using AKG K702 headphones with a Lake People G109 headphone amplifier at a playback level of 66 dB(A). The head orientation was tracked with a Polhemus Patriot. The rendering of the binaural VAE was done with the SSR in BRS mode. A Python application with a QT5 graphical user interface, which controls the SSR via TCP commands, was developed for the listening experiment. The output signals of the SSR were routed to an Antelope Audio Orion 32-channel DA converter via a 48 kHz JACK audio server with 512 samples per buffer. To compensate for the binaural chain of the AKG headphones and the KU100 dummy head, a headphone equalization according to Stade et al. (2012) was performed. The entire rendering and execution of the listening experiment was done on an iMac Pro 1.1.

Figure 4.1: PyQT5 implementation of the graphical user interface for the listening test. 8 stimuli, one of them the hidden reference and one the hidden anchor, are compared to the reference stimulus. Each slider allows rating the perceived difference to the dummy head reference.

4.2.4 Procedure

12 participants, 3 of them female, aged between 22 and 50, took part in the experiment. Most of them were master's students or staff at the Division of Applied Acoustics of Chalmers University. 9 participants had experience with binaural synthesis and 8 reported having participated in a listening experiment before. The subjects sat in front of a computer screen with a keyboard and mouse. Their task was to rate the similarity of each stimulus compared to the reference, according to the stimuli preparation discussed above. Each stimulus, as well as the reference, could be listened to as often as desired. The participants were allowed and strongly encouraged to move their heads during the presentation of the stimuli. At the beginning of each experiment, the subjects had to rate four training stimuli covering the entire similarity range of the stimuli presented in the following test. It took the participants about 30 minutes to rate all 12 trials.

4.3 Results

According to the recommendation of ITU-R BS.1534-3 (2015), all anchor and reference ratings were post-screened before applying statistical analysis. All anchor ratings were below 30 and most of the reference ratings were higher than 80.
Only two reference ratings of 50 and 49 were conspicuous, which, however, is a relatively small portion of the 96 ratings per participant in total. There were no further noticeable abnormalities in these ratings; therefore, none of the participants was excluded from the statistics.

A first overview of the results is presented as boxplots in Figure 4.2, illustrating the ratings for the SBS and LBS at 0° and 90° separately. Each of the four figures shows a boxplot for each algorithm and the SH orders 3 (red), 5 (black), and 7 (blue). The boxes display the 25th and 75th percentiles and the corresponding median value as a black line. Furthermore, outliers are indicated as grey circles, and the minimum and maximum ratings not identified as outliers by the upper and lower whiskers. The boxplots refer to median values, which are presented in Table A.1 in the appendix. Additionally, the table shows a number of further descriptive statistical values.

The boxplots confirm that for most conditions the anchor and reference stimuli were identified as such by the listeners: the anchor median value over all conditions is 1 and the reference rating median value is 100. Furthermore, the boxplots for the untreated renderings illustrate that for all four room-direction combinations, higher-order renderings yielded higher similarity ratings. The 5th order rendering consequently achieved higher ratings than the 3rd order rendering; however, the 7th order renderings in the SBS achieved lower ratings than the rendering at SH order 5. Apart from this, it is remarkable that all improvement algorithms except BEMA achieved higher ratings than the raw renderings. Hence, all other algorithms achieved perceptual improvements.

For a more detailed analysis, repeated measures ANOVAs were performed. To test the data requirements for the ANOVA, a Lilliefors test for normality was applied. It failed to reject the null hypothesis in 5 of 72 conditions at a significance level of .05. However, parametric tests such as ANOVAs are generally robust to violations of the normal distribution assumption. For the further analysis, Greenhouse-Geisser corrected p-values are considered (the associated ε values for the correction of the degrees of freedom of the F-distribution are reported as well).

A four-way repeated measures ANOVA with the within-subject factors algorithm (BEMA, MagLS, Tapering, SHF, GEQ and raw), order N (3, 5, 7), room (SBS, LBS), and direction (0°, 90°) was performed. Even though the denotation is rather inappropriate, the raw rendering is one factor step of the within-subject factor algorithm. The ANOVA results are presented in Table 4.2. In addition, a number of nested repeated measures ANOVAs were performed, whose results are presented in the appendix. For each of the six algorithms, a three-way ANOVA with the factors order (3, 5, 7), room (SBS, LBS), and direction (0°, 90°) was applied, as well as a four-way ANOVA with the subset of MagLS, Tapering, SHF, and GEQ for the factor algorithm and the factors order (3, 5, 7), room (SBS, LBS), and direction (0°, 90°).
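Such a four-way repeated measures ANOVA can be reproduced, for example, with statsmodels. The sketch below assumes a hypothetical long-format table ratings.csv with one row per rating; note that statsmodels' AnovaRM reports uncorrected p-values only, so the Greenhouse-Geisser correction used in Table 4.2 has to be applied separately.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# `ratings.csv` is a hypothetical long-format export of the results:
# one row per rating, with the reference and anchor conditions excluded,
# leaving exactly one rating per subject and factor combination.
df = pd.read_csv("ratings.csv")

res = AnovaRM(
    df,
    depvar="rating",
    subject="subject",
    within=["algorithm", "order", "direction", "room"],
).fit()
print(res)  # uncorrected p-values; apply the GG correction separately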
Since ANOVAs refer to mean values, Figure 4.3 was generated to support the ANOVA results. It depicts mean plots for each algorithm and the orders 3 (red), 5 (black), and 7 (blue) separately. Each mean value was calculated by averaging the ratings for both rooms, both directions, and all participants. Furthermore, 95% within-subject confidence intervals as proposed by Loftus (2002) and Jarmasz and Hollands (2009), based on the algorithm main effect, are shown.

Figure 4.2: Boxplots illustrating the ratings for each room and direction separately. Each figure shows the boxplots for each algorithm and the SH orders 3 (red), 5 (black) and 7 (blue). Each box indicates the 25th and 75th percentiles and the median value as a black line. The outliers are marked as grey circles and the minimum and maximum ratings not identified as outliers by the whiskers.

The figure confirms the observations taken from the boxplots: for the raw rendering, higher SH orders achieved better ratings than lower-order renderings, and, except for BEMA, all algorithms led to an improvement compared to the raw auralization. More clearly than the boxplots, the mean plots suggest that the perceptual quality of the array rendering does not scale linearly with the SH order. Considering just the raw condition shows that the mean ratings for the 5th and 7th order renderings are closer to each other than the 5th and 3rd order ratings.

Similar to the boxplots, the mean plots indicate that the 7th order renderings involving the Tapering algorithm were rated worse than the 5th order Tapering renderings. Additionally, it can be seen that the average ratings of the algorithms SHF, GEQ, Tapering and MagLS lie in the same range. This leads to the assumption that these four algorithms achieve similar auralization improvements. Their average ratings for all SH rendering orders lie between 69.65 and 84.46 (GEQ: 74.44 (N = 3), 80.52 (N = 5), 81.02 (N = 7); SHF: 69.65 (N = 3), 73.48 (N = 5), 78.27 (N = 7); MagLS: 72.4 (N = 3), 78.67 (N = 5), 79.29 (N = 7); Tapering: 73.75 (N = 3), 84.46 (N = 5), 77.73 (N = 7)).

Table 4.2: Results of the four-way repeated measures ANOVA with the within-subject factors algorithm (BEMA, MagLS, Tapering, SHF, GEQ, Raw), order (3, 5, 7), direction (0°, 90°), and room (SBS, LBS).

Effect                                    df    F        εGG    η²p    p        pGG
Algorithm                                 5     97.964   .572   .899   <.001*   <.001*
Order                                     2     27.681   .808   .716   <.001*   <.001*
Direction                                 1     1.014    1      .084   .335     .335
Room                                      1     .462     1      .040   .511     .511
Algorithm × Order                         10    4.989    .514   .312   <.001*   <.001*
Algorithm × Direction                     5     3.302    .703   .231   .011*    .024*
Order × Direction                         2     1.811    .876   .141   .187     .193
Algorithm × Room                          5     1.267    .683   .103   .291     .300
Order × Room                              2     .224     .677   .020   .801     .715
Direction × Room                          1     .016     1      .001   .902     .902
Algorithm × Order × Direction             10    2.329    .518   .175   .016     .052
Algorithm × Order × Room                  10    1.817    .481   .142   .066     .128
Algorithm × Direction × Room              5     1.544    .575   .123   .191     .223
Order × Direction × Room                  2     .372     .917   .033   .693     .676
Algorithm × Order × Direction × Room      10    1.589    .424   .126   .119     .190

εGG: Greenhouse-Geisser epsilons. pGG: Greenhouse-Geisser corrected p-values. Statistical significance at the 5% level is indicated by asterisks.

The following analysis refers to main effects and first-order interactions only, and neglects the three- and four-way interaction effects. For the four-way ANOVA involving all algorithms, it was found that the algorithm and order main effects, as well as the first-order interaction effects algorithm × order and algorithm × direction, were significant.
These significances will be examined successively in the following.

Algorithm and Order Dependency

The significance of the main effects order (F(2, 22) = 27.681, p < .001, η²p = .716, ε = .808) and algorithm (F(5, 55) = 97.964, p < .001, η²p = .899, ε = .572) matches the findings observed in Figure 4.2 and Figure 4.3. Higher SH orders lead to different, and mostly higher, similarity ratings. Hence, the auralization quality strongly depends on the rendering order and the algorithm. The first-order interaction effect algorithm × order (F(10, 110) = 4.989, p < .001, η²p = .312, ε = .514) shows that not only does the SH order influence the quality of the untreated rendering, but the algorithms also perform differently with respect to the SH order. The single-algorithm ANOVAs support this: in every ANOVA, the order was indicated as a significant effect.

Figure 4.3: Mean values for each algorithm and SH order separately. Every mean value was calculated by averaging all room and direction ratings for each participant. The 95% within-subject confidence intervals for the main effect algorithm were calculated according to Loftus (2002) and Jarmasz and Hollands (2009). Red colors indicate order 3 renderings, black colors order 5, and blue order 7 renderings. The highest mean values for orders 3 and 7 were achieved by the Global Equalization Filter (74.44, 81.02), for order 5 by the Tapering approach (84.46).

To validate the observation that the algorithms SHF, GEQ, Tapering and MagLS achieved similar improvements, the four-way repeated measures ANOVA was applied to just these algorithms. The factor algorithm was no longer indicated as significant (p = .124). Thus, in general, all algorithms except BEMA achieved similar perceptual improvements.

Directional Dependency

The four-way ma