Generating Personalized HRTF Using Scanned Mesh from iPhone FaceID

WENKANG LIU
Division of Applied Acoustics
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2023
www.chalmers.se

Master's thesis 2023

© WENKANG LIU, 2023

Supervisor: Sergejs Dombrovskis, China Euro Vehicle Technology AB
Examiner: Jens Ahrens, Chalmers University of Technology

Cover: 3D scanning of Kemar, shown alongside the HRTF plot calculated from Kemar's 3D mesh

Department of Architecture and Civil Engineering
Division of Applied Acoustics
Chalmers University of Technology
Gothenburg, Sweden 2023

Generating Personalized Head-Related Transfer Function (HRTF) using Scanned Mesh from iPhone FaceID
Wenkang Liu
Division of Applied Acoustics, Chalmers University of Technology

Abstract

In recent years, the advancements in virtual reality (VR) and augmented reality (AR) technologies have been impressive. Binaural audio rendering plays a vital role in these technologies and is used in applications such as gaming, video conferencing, and hearing aids. Providing a high-quality immersive experience in a virtual environment relies heavily on the quality of the spatial audio. The head-related transfer function (HRTF) describes how sound is filtered by the head, torso, and ears as it travels from the sound source to the listener's eardrum. To achieve spatial audio that better matches auditory perception, researchers have proposed several HRTF personalization methods, including measurement methods, database matching methods, modeling simulation methods, and anthropometric parameter regression methods. This thesis proposes a new modeling-simulation workflow for personalized HRTFs that consists of three parts. First, the participant's face and torso are scanned in 3D using the iPhone Face ID component. Second, the scanned mesh is optimized and cleaned using MeshLab and Blender.
Finally, the personalized HRTF is generated using Mesh2HRTF and COMSOL. The effectiveness of the personalized HRTF is evaluated by comparing the simulated HRTF with the measured HRTF. Moreover, a listening test built around an adjustable equalizer-based headphone-loudspeaker comparison is designed to evaluate the performance of the generated personalized HRTF. The results demonstrate that the HRTF generated from the Face ID scan mesh is highly comparable to the measured HRTF and produces predictable outcomes in the listening test. This method shows promise as a low-cost alternative for customizing HRTFs.

Keywords: spatial audio, auditory perception, psychoacoustics, head-related transfer functions (HRTF), 3D scanning, mesh optimization, boundary element method (BEM), sound quality

Acknowledgements

I would like to sincerely thank all those who contributed to the successful completion of this project. First and foremost, I would like to thank my supervisors, Jens A. and Sergejs D., for their invaluable guidance and support throughout the research process. Their insights and feedback were instrumental in helping us achieve our goals. I would also like to thank the faculty and staff of the Division of Applied Acoustics for providing me with support and resources. In addition, I would like to thank my classmates and friends who provided helpful discussions and feedback during the research process. Finally, I would like to thank all the participants involved in the data collection process. Their contributions were critical to the success of this project. We hope that our research will have a meaningful impact in the field of HRTF and audio signal processing.

Wenkang Liu, Gothenburg, Dec. 2023

Contents

1 Introduction
  1.1 Motivation
    1.1.1 Classical HRTF Measurement Method
    1.1.2 Numerical HRTF Simulation
  1.2 Outline
2 Individual HRTF modeling
  2.1 Theory
    2.1.1 Head-related Transfer Function
    2.1.2 Burton-Miller Boundary Element Method (BM-BEM)
    2.1.3 Personalized HRTF via Mesh2HRTF and COMSOL
  2.2 Methods
    2.2.1 More Things Need to Know
  2.3 Acquisition of 3D Meshes
    2.3.1 Selection of Software and Hardware
    2.3.2 Preparation and Strategy for Scanning
    2.3.3 Scanning Process
    2.3.4 Cleaning and Merging of Meshes
  2.4 Simulation of Individual HRTF in Mesh2HRTF
    2.4.1 Pre-processing in Mesh2HRTF
    2.4.2 NumCalc Simulation
    2.4.3 HRTF SOFA File Generation
  2.5 Simulation of Individual HRTF in COMSOL
  2.6 Results
    2.6.1 COMSOL Simulation Results
    2.6.2 Mesh2HRTF Simulation Results
    2.6.3 Comparison of Simulation Results
  2.7 Discussion
3 Listening Test of Individual HRTF Performance
  3.1 Theory
    3.1.1 Two-device Test
    3.1.2 Digital Equalization
      3.1.2.1 Shelving Filters
      3.1.2.2 Peak Filter
  3.2 Method
    3.2.1 Set up
    3.2.2 GUI Design in MATLAB
    3.2.3 Listening Test Protocol
      3.2.3.1 Participants' HRTF Generation
      3.2.3.2 Participants
      3.2.3.3 Stimuli
      3.2.3.4 Procedure
      3.2.3.5 Questionnaire
  3.3 Results
    3.3.1 Graphic Representation of the Results
    3.3.2 Comparison with Headphone Transfer Function
  3.4 Discussion
4 Conclusion
  4.1 Review
  4.2 Discussion of Contributions
  4.3 Future Work
Bibliography
A Why is iPhone XR

1 Introduction

Many theoretical studies suggest that people perceive the objective world roughly 60% through vision, 30% through hearing, and the remaining 10% through touch and smell. Besides vision, sound is therefore the main means of obtaining information about the world around us, and it is very important in daily life. The human hearing system plays an important role in understanding spoken language and following dialogue on television. To avoid danger, the auditory system can pick up warning sounds and allow us to react in time.
In contrast to the visual system, the auditory system "listens" in all directions: sounds can be heard from behind, above, or below. Furthermore, because of the physical nature of sound, the auditory system can determine the basic characteristics and status of objects that are hidden behind obstacles. The human auditory system is very powerful. Using only two ears, it can distinguish multi-dimensional information such as the vertical direction, the horizontal direction, the distance of a sound source, and properties of the environment. Research on the auditory system is one of the most active fields in audio signal processing.

In recent years, spatial audio technology has attracted the attention of research institutes and manufacturers worldwide. How to realize spatial audio with high fidelity is one of the hot topics in the field of multimedia. One way to realistically reproduce the spatial position of a sound source is binaural playback based on the head-related transfer function (HRTF). The HRTF technique treats the transmission of sound waves to the ear as a filter, considering only the free field (i.e., no reflecting object other than the listener). More specifically, a spatially localized signal is generated by convolving the original audio signal with the HRTF, so that a realistic spatial sound is reproduced at the two ears.

1.1 Motivation

The human auditory system is a complex system that is vital to our daily lives. Sound waves are collected by the outer ear and ear canal and transmitted via the tympanic membrane and ossicular chain to the inner ear, where they are amplified and processed.
However, the sound that reaches the eardrum is not the same for all listeners, because it is affected by the listener's head, torso, and ears (Møller, 1992). The brain analyzes these sound signals and calibrates our hearing with the help of vision. The interaural level difference (ILD) and interaural time difference (ITD) help the brain discriminate the direction of sound in the horizontal plane. However, these cues are not sufficient for vertical localization. The complex diffraction, resonance, and reflection of sound at the ear produce subtle differences in the frequency spectrum that help the brain perceive vertical differences. In this case, the spectral and temporal properties of sound provide spatial cues that are hidden within fixed spectral cues. Therefore, using HRTF rendering to produce realistic spatial audio effects can greatly improve auditory immersion in virtual environments (Blauert, 1997). The HRTF describes the reflections from the head, shoulders, and neck and their effects on the sound reaching the eardrum. Since sounds from different locations have different frequency characteristics after passing through the HRTF, humans can distinguish between sounds from different locations.

In the structural modeling approach, the parts of the head, shoulders, and ears that influence the HRTF are modeled separately. These modules are then combined to obtain a personalized HRTF. The method can address each part of the system individually, and it is easy to realize in real time. Brown et al. (1998) proposed a structural model for the synthesis of binaural sound. The input is a mono signal, which is passed through head-shadow and shoulder-echo models. The signals from both models are then combined, and spatial audio is output through a pinna model. The model is based on sound wave propagation and diffraction and considers the influence of the human body on sound waves.
Modeling the shoulders, head, and ears of a person separately has a clear physical meaning. According to the characteristics of the listener, the parameters of the model can be adjusted to obtain a personalized HRTF. The structural model is easy to implement on a DSP (digital signal processor). Geronazzo et al. (2010) developed a personalized approach to the pinna-related transfer function (PRTF). Their approach only takes into account the effect of ear reflections on sound waves. The PRTF is divided into resonant and notch components. Later, they presented a personalized HRTF approach based on auricle parameters and a structural model of the auricle. Through the use of anthropometric parameters, the HRTF can be customized. Compared with the non-personalized structural model, the personalized HRTF is more effective, and the computational cost is low, so it can be realized in real time. Geronazzo et al. (2013) developed a Mixed Structural Model for HRTF. Every module in the model can be chosen from a synthesis module, a measurement module, or a database. The candidate combinations of modules are evaluated, and the optimum combination is chosen as the HRTF. The mixed structural model is flexible and has better localization performance, but it requires a lot of computation.

HRTF personalization algorithms based on principal component analysis (PCA) generally analyze an HRTF database and select the relatively few principal components (PCs) with high correlation for further processing, thereby reducing the dimensionality. Finally, a high-dimensional reconstruction is carried out to obtain the personalized HRTF. Kistler et al. (1992) measured the HRTFs of 10 individuals at 265 source positions and then analyzed these 5,300 (2 × 10 × 265) HRTFs with PCA. The results indicated that about 90% of the variance in the original HRTF data set could be captured by the first five principal components.
The performance of HRTFs reconstructed from the five components is similar to that of the measured HRTFs. However, when the number of PCs is reduced, the quality of the reconstructed HRTF is correspondingly reduced. Later, Shin et al. (2008) used PCA for the personalization of HRTFs. Their approach first extracted the pinna-related impulse response (PRIR) from each individual's temporal HRIRs in the median vertical plane (at 0° horizontal angle) in the HRTF database and analyzed the ear response. The first five principal components are then selected, the listener adjusts the weights of these five components using a graphical interface, and the PRIR is re-synthesized with the adjusted weights. The experimental results indicate that HRIRs obtained with the proposed method have smaller localization errors and a lower confusion rate than non-personalized HRIRs. Hwang et al. (2008) proposed an approach similar to Shin's, likewise performing PCA on the vertical-plane data of an HRTF database. However, they extended the number of principal components to 12, and the difference between the reconstructed HRIRs and the original HRIRs was below 4.8%. The tester only has to adjust the first three critical principal component weights (PCWs), which saves adjustment time compared to tuning 5 PCWs.

The HRTF individualization approach based on database matching maps the individual to an entry in a database, whose HRTF is used as the personalized result. The shortcoming of this method is that it needs a large and representative database to achieve good results. Apart from the structural model and the PCA-based personalized approach, another HRTF individualization approach is based on anthropometric parameters. This approach relies on the assumption that if some anthropometric parameters of two people are similar, their HRTFs are similar as well. Zotkin et al. (2002) proposed a personalized HRTF scheme based on anthropometric parameter matching.
Seven ear measurement parameters were chosen, and an HRTF database was searched for the individual best matching these 7 parameters. The HRTF of that individual is then used as the personalized HRTF. Experimental results indicate that the HRTF of the proposed method is superior to a generic HRTF in terms of localization precision. Subsequently, Zotkin et al. (2003) proposed a head-and-torso model to compensate for the low-frequency loss, which further improved the experimental results. Algazi et al. (2007) proposed a new approach to modeling the ear based on anthropometric parameters. The pinna-related transfer function (PRTF) was broken down into a few small components, which were then described by low-order filters. Based on the anthropometric parameters of the human body, they established the relationship between the body parameters and the filter coefficients, and finally the personalized PRTF was obtained. The experimental results indicate that only a few parameters are needed to obtain the filter coefficients, and the PRTF can be approximated well. Iida et al. (2014) estimated the notch center frequencies in the PRTF from measured parameters of a person's ear and then searched the HRTF database for the HRTF closest to those notch center frequencies; the best-matching HRTF is deemed the personalized one. The localization performance of this method is similar to that of the individually measured HRTF. Meshram et al. (2014) proposed an image-based modeling approach for personalized HRTFs. First, they take pictures of the head, shoulders, and other parts of the body with a camera. They then use advanced imaging techniques to estimate a 3D model of the head, and the acoustic equations are solved by simulation to obtain the personalized HRTF. The HRTF obtained by this method is superior to the HRTF of KEMAR, but the computational load and calculation time are large. Torres et al.
(2015) used Active Shape Models (ASM) to obtain the parameter characteristics of the subject through computer vision and then selected suitable parameters to search for the HRTF with the nearest parameters in an HRTF library.

Various methods have been proposed to personalize HRTFs, but they all have their limitations. In this thesis, a new method to personalize HRTFs is tested using 3D mesh scanning with an iPhone. The performance of the method will be evaluated by comparative analysis with existing techniques and by conducting auditory experiments. By developing and testing an accurate and cost-effective method for personalizing HRTFs, this study can contribute to improving auditory immersion in virtual environments and suggest directions for further research.

1.1.1 Classical HRTF Measurement Method

HRTF measurements are usually performed in an anechoic chamber so that the measured HRTF does not record information about a specific room that should not be present. In the early stages of HRTF research, impulse responses for different source locations were recorded by microphones placed inside the ear. The basic principles and methods of HRTF measurement today are the same as in the early days. However, the early HRTF measurement process was more complex, and the results were worse than with the digital measurement techniques used today. The drawbacks of the analog measurement methods used in the early studies were mainly due to the fact that the hardware and software available were not sufficient to support such fine measurements.

Figure 1.1: Dummy head HRTF measurements in anechoic chamber, RWTH Aachen University

The team from RWTH Aachen University published HRTF measurement data for a dummy head in 2017. The article describes how HRTF measurements can be carried out using classical methods. Compared to earlier measurement methods, the authors used a more miniaturized device and a tighter measurement process.
The measurement goal of the authors' team was to achieve reliable individual HRTF measurements in an anechoic chamber in as short a time as possible, with minimal impact on the measurement itself. A new loudspeaker array was used in the experiment, and this new design allowed the device to be significantly reduced in size. As can be seen in Figure 1.1, the setup is a circular arc of loudspeakers placed along the zenith direction. The dummy head stands on a turntable and rotates about the center point of the head, which coincides with the center point of the arc. During the test, each loudspeaker in turn emits an excitation signal, and the microphones built into the ear canals of the dummy head record these sounds. After each recording of the loudspeaker array, the turntable changed the horizontal angle of the dummy head until all horizontal angles were measured. Some post-processing was performed on the collected data: the measured impulse responses were cropped to the same length, the reference transfer function was regularized, and so on. Although this method has greatly simplified the traditional HRTF measurement process, there are still many inconveniences in practice. If a real person participates in the test, the subject has to maintain a stable sitting position for a long period of time; otherwise the microphones in the ear canals will be displaced, and changes in posture will prevent the sound from being reflected and diffracted correctly. In addition, the required test conditions, such as an anechoic chamber, a microphone array, and a back-end control system, are demanding.

1.1.2 Numerical HRTF Simulation

In recent years, numerical calculation methods have emerged as an alternative approach to obtaining Head-Related Transfer Functions (HRTFs). This technique involves modeling the linear transformation of sound before it reaches the listener's ear canal, including spatial cues.
The Boundary Element Method (BEM) is the most commonly used numerical method for HRTF calculation. It uses the Helmholtz equation to describe sound waves in a domain and transforms it into a boundary integral equation. This approach is based on the assumption that only the surface features of the ear, head, and shoulders are relevant; propagation through other body parts is ignored (Katz, 2001a). Additionally, human skin has been shown to be acoustically rigid, while hair is not.

Accurate three-dimensional geometry of the pinna, head, and torso is required for personalized HRTF calculation, and the accuracy of the calculation depends largely on the accuracy of the 3D geometrical measurements, especially in the high-frequency range. However, acquiring such measurements accurately also incurs considerable costs. To reduce the cost of personalization, several approximate acquisition methods have been developed, such as physiology-based personalized methods and subjective experiments based on a small number of measurements. Nonetheless, Zhong and Xie (2012) have pointed out that the accuracy of such HRTFs, especially in the high-frequency range, still needs to be improved, and a significant gap remains.

As artificial intelligence continues to advance in the field of acoustics, AI-based methods have shown great potential in improving the efficiency of obtaining personalized HRTFs. Gebru et al. [10] designed an HRTF prediction system based on deep learning; the input parameters of this system can be measured without a professional listening room, which reduces the cost, and good results were obtained.

1.2 Outline

This report is divided into two main parts. The first part, Chapter 2, explains the process of generating a personalized HRTF, from scanning Kemar to exporting the result in the SOFA format.
Furthermore, this chapter also presents a comparative analysis of the results obtained from the proposed method with those of other conventional software models. The second part, Chapter 3, covers a basic listening test designed to evaluate the effectiveness of the HRTFs simulated from mesh scans of the test participants. This chapter delves into the design of the test software, outlines the experimental procedure, and presents the results of the listening test.

2 Individual HRTF modeling

This chapter introduces the main focus of this study: the simulation workflow for personalized head-related transfer functions (HRTFs) based on individual head models. The chapter is divided into two parts. The first part describes the methodology for obtaining the head model using the Heges 3D software and an iPhone XR, including pre-processing steps such as cleaning, repairing, and simplifying the scanned mesh. The second part compares different simulation methods and presents the post-processing results.

It is worth noting that RWTH Aachen University has conducted professional high-resolution scanning and practical HRTF tests on the same Kemar model, enabling a comprehensive comparison between the Kemar model scanned with an iPhone and with professional 3D equipment. Those measured results can serve as the most accurate reference for comparison. The conclusion of this chapter provides a basis for the auditory tests in the next chapter and suggests corrections. The present study therefore aims to explore the simulation method for personalized HRTFs based on individual head models and to provide valuable insights and references for future auditory research.

2.1 Theory

In 1974, Jens Blauert first proposed the concept of the head-related transfer function.
He pointed out that when the head is fixed and stationary, the path of the sound waves emitted by a sound source to the ears, via scattering and reflection from the head, auricle, torso, etc., can be regarded as a linear time-invariant (LTI) filter. Its characteristics can be fully described by the frequency-domain transfer function of the filter. This filtering process is represented by the head-related transfer function, and its corresponding time-domain form is called the head-related impulse response (HRIR). Specifically, the HRTF describes the filtering effect of the head, auricle, and torso on sound received in the listener's ear canal from an acoustic point source at a specific location under free-field conditions. It is defined as follows:

\[ \mathrm{HRTF}_L(r, \theta, \varphi, \omega, a) = \frac{P_L(r, \theta, \varphi, \omega, a)}{P_0(r, \omega)} \tag{2.1} \]

\[ \mathrm{HRTF}_R(r, \theta, \varphi, \omega, a) = \frac{P_R(r, \theta, \varphi, \omega, a)}{P_0(r, \omega)} \tag{2.2} \]

Here, P_L and P_R are the sound pressures produced by the source at the left and right ears, P_0 is the sound pressure that the source would produce at the position of the center of the interaural axis with the listener absent, r is the distance from the source to the center of the head, θ is the horizontal azimuth angle of the source, φ is the elevation angle of the source, ω is the angular frequency of the sound wave, and a is a morphological parameter of the human body.

Usually, there are two main ways to obtain the HRTF: through experimental measurements in an anechoic chamber, or through theoretical calculations. The measurement-based methods are mainly divided into two types: one exploits linear time invariance directly, and the other uses deconvolution. For a linear time-invariant system, consider an input signal that is a unit impulse δ(t).
The output h(t) of the system is then the impulse response of the linear time-invariant system, which fully characterizes its transfer function:

\[ x(t) * h(t) = y(t) \tag{2.3} \]

\[ \delta(t) = \begin{cases} 1, & t = 0 \\ 0, & t \neq 0 \end{cases} \tag{2.4} \]

Deconvolution refers to feeding an arbitrary input signal x(t) into a system with impulse response h(t) to obtain an output signal y(t). When x(t) is known and y(t) is measured experimentally, the system function h(t) can be calculated using the following formulas:

\[ x(t) * h(t) = y(t) \tag{2.5} \]

\[ h(t) = \mathrm{IFFT}\{\mathrm{FFT}(y(t)) / \mathrm{FFT}(x(t))\} \tag{2.6} \]

The CIPIC database was obtained through the linear-time-invariance approach, using Golay code pairs as the excitation; the autocorrelation function of this signal is a strict unit impulse δ(t). Databases such as Listen HRTF are measured and calculated through deconvolution methods. The essence of obtaining the HRTF through theoretical calculation is to physically solve the scattering and diffraction of sound by the head, ears, and other body parts.

2.1.1 Head-related Transfer Function

The head-related transfer function (HRTF) is an acoustic transfer function that describes the transmission from a point source in a free field to a specified location in the listener's ear canal. It plays an important role in creating an immersive virtual acoustic environment (VAE) for headphone or loudspeaker playback. The HRTF is highly personalized and depends on the direction and, in the near field, also on the distance of the source (near-field HRTF). The head-related impulse response (HRIR) is the time-domain representation of the HRTF. All acoustic information relevant for localizing real sound sources is contained in the HRTF, i.e., the ITD and ILD as well as the monaural spectral cues. As each person's anatomy is different, the HRTF is unique for each individual. VAEs created using a non-personalized HRTF may provide a poor listening experience, such as reduced accuracy of sound image localization and a confusing sense of distance.
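As a concrete illustration of Eq. (2.6), the NumPy sketch below estimates an impulse response from a known excitation x(t) and a measured output y(t) by frequency-domain deconvolution. The function name and toy signals are illustrative only; a real HRIR measurement would additionally regularize near-zero spectral bins before dividing.

```python
import numpy as np

def estimate_impulse_response(x, y, n=None):
    """Deconvolution per Eq. (2.6): h = IFFT(FFT(y) / FFT(x)).

    n is the FFT length; it must cover the full linear
    convolution so that circular wrap-around effects vanish."""
    n = len(y) if n is None else n
    X = np.fft.rfft(x, n)
    Y = np.fft.rfft(y, n)
    return np.fft.irfft(Y / X, n)

# Toy check: convolve a broadband excitation with a short
# "HRIR", then recover it from input and output alone.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)               # excitation signal
h_true = np.array([0.0, 1.0, 0.5, -0.25])   # toy impulse response
y = np.convolve(x, h_true)                  # simulated ear signal
h_est = estimate_impulse_response(x, y)     # recovers h_true up to round-off
```

In practice, the sweep or Golay excitation is chosen so that |FFT(x)| stays well away from zero in the band of interest, which keeps the division in Eq. (2.6) numerically benign.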
For far-field VAEs, it is usually sufficient to adjust the sound pressure with the distance of the source according to the inverse-square law. In the near field, however, the HRTF varies significantly with distance, and a separate HRTF is then needed for an accurate description. In practice, early individualized HRTFs were obtained from acoustic measurements. This was done by placing a microphone in the subject's ear canal so that the microphone recorded sweep signals arriving from different directions. Eventually, the signals from all the different vertical and horizontal angles were collected and, together with the known input signal, used to compute the frequency response functions. Measuring a high-density HRTF data set is time-consuming, especially for real subjects; it often means sitting motionless in an anechoic chamber for several hours. Interpolating or extrapolating sparse HRTF data sets over distance or direction to obtain high-density data sets effectively reduces the number of measurement points, but still requires a large number of measurements.

2.1.2 Burton-Miller Boundary Element Method (BM-BEM)

The Burton-Miller Boundary Element Method (BM-BEM) is a numerical approach employed in computational acoustics to solve problems concerning wave propagation and scattering. This variant of the Boundary Element Method (BEM) was introduced by Burton and Miller in 1971 and is specifically advantageous for exterior acoustic problems, such as scattering from objects in a free field or radiation from a vibrating surface. BM-BEM transforms the governing equation of the acoustic problem into an integral equation on the boundary of the domain, which is then discretized to obtain a linear system of equations that can be solved numerically. The numerical solution provides the values of the acoustic pressure and/or velocity at all points in the domain.
Compared to traditional BEM, BM-BEM offers an important advantage: the standard boundary integral equation for exterior problems fails to have a unique solution at the fictitious eigenfrequencies of the associated interior problem, whereas the Burton-Miller formulation remains uniquely solvable there. To achieve this, the Burton-Miller equation combines the Helmholtz integral equation with its normal-derivative equation; the combined equation can be solved uniquely over the full frequency band in the exterior region. It also remains numerically stable even for the highly oscillatory kernels that can be challenging to handle in traditional BEM.

Mesh2HRTF uses a 3-dimensional Burton-Miller collocation BEM implementation coupled with the Multilevel Fast Multipole Method (ML-FMM) and provides add-ons for existing cross-platform applications for pre-processing of geometric data and visualisation of results.

2.1.3 Personalized HRTF via Mesh2HRTF and COMSOL

For simulations using BEM in different software packages, appropriate scan meshes are required; however, the processing workflow for the BEM calculation may differ. For Mesh2HRTF simulations, an ideal 3D mesh should have a high-resolution ear shape and a relatively low-resolution head and torso to reduce computation time and improve accuracy.

Mesh2HRTF workflow requirements:
1. A computer with at least 16 GB RAM.
2. An accurate 3D mesh of the individual ear and head shape.
3. Mesh correction and simplification performed in Blender, which is the final step in mesh processing.
4. Simulation in "NumCalc" of Mesh2HRTF under the Python environment.

In addition, some free or open-source software is needed for cleaning up 3D meshes and listening to the generated SOFA files. Sergejs D., the senior system engineer from CEVT and the supervisor of this project, has created a starter guide to Mesh2HRTF.
For more detailed steps, please refer to the citation [4].
COMSOL is a widely used simulation software that can also simulate the acoustic properties of an individual's head and torso.
COMSOL workflow requirements:
1. A computer with at least 16 GB RAM.
2. An accurate 3D mesh of the individual ear and head shape.
3. Simulation in the Acoustics Module of COMSOL Multiphysics.
The detailed software operation steps (excluding hardware) are demonstrated in an article titled "Head and Torso HRTF Computation" on the Application Gallery page of COMSOL [5].

2.2 Methods

Mesh2HRTF is an open-source project available on GitHub that provides a user-friendly package for the numerical computation of head-related transfer functions (HRTFs) for researchers and enthusiasts in the field of binaural spatial audio. The software reads the 3D human body mesh, calculates the corresponding sound fields, and produces HRTFs using NumCalc. To accommodate multiple computational platforms, Mesh2HRTF is primarily a command-line tool focused on the numerical core, which includes the 3D Burton-Miller BEM and the Multilevel Fast Multipole Method (ML-FMM) implementation. It also offers add-ons for existing cross-platform applications to pre-process geometric data and visualize results. For the Pressure Acoustics, Boundary Elements interface in the COMSOL Acoustics Module, no more specific boundary element method is documented.
Before establishing a 3D grid, the original HRTF data are pre-processed as follows:
(1) Perform minimum-phase processing on the original HRIR data in the database to remove delay information;
(2) Transform the delay-removed HRIR data obtained in (1) into HRTFs through the Fast Fourier Transform (FFT);
(3) Calculate the logarithmic-domain form of the HRTFs, the log-HRTFs;
(4) Perform a mean-removal operation on each log-domain HRTF.

2.2.1 More Things to Know

SOFA file: The spatially oriented format for acoustics (SOFA) aims at representing spatial data in a general way, allowing one to store not only HRTFs but also more complex data, e.g., directional room impulse responses (DRIRs) measured with a multichannel microphone array excited by a loudspeaker array. In order to simplify the adoption of SOFA for various applications, example implementations of the format specifications are provided together with a collection of exemplary data sets converted to SOFA. In this project, the personalised HRTF obtained from the Mesh2HRTF simulation is recorded in a file in SOFA format.

2.3 Acquisition of 3D Meshes

Research by the Mesh2HRTF developers suggests that an ideal 3D mesh should have approximately 40,000 elements with edge lengths between 0.5 mm and 10 mm [3]. COMSOL requires a 3D mesh of the head and torso. Typically, the mesh around the ear is detailed, while the head and torso areas are simplified to speed up calculations without sacrificing accuracy. Mesh2HRTF and COMSOL have similar requirements in this regard [5]. The goal of scanning is therefore to obtain a total mesh that includes a high-resolution ear and a low-resolution head and torso: the ear's contour is scanned carefully at the highest resolution, while the head and torso are scanned relatively quickly at moderate resolution. In this study, the Heges 3D v1.6 software and a standard iPhone XR running iOS 15 were used to perform all scans.
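The four pre-processing steps listed in Section 2.2 can be sketched in Python with numpy. This is a hedged illustration, not the database's actual code: the minimum-phase reconstruction uses the standard real-cepstrum method, and the mean removal in step (4) is assumed to be taken across directions at each frequency.

```python
import numpy as np

def minimum_phase(h, nfft=None):
    """Step (1): minimum-phase reconstruction via the real cepstrum,
    which removes the pure delay from an HRIR while keeping |H(f)|."""
    n = len(h)
    nfft = nfft or 8 * n
    mag = np.maximum(np.abs(np.fft.fft(h, nfft)), 1e-12)  # avoid log(0)
    cep = np.fft.ifft(np.log(mag)).real                   # real cepstrum
    w = np.zeros(nfft)                                    # causal folding window
    w[0] = 1.0
    w[1:(nfft + 1) // 2] = 2.0
    if nfft % 2 == 0:
        w[nfft // 2] = 1.0
    h_min = np.fft.ifft(np.exp(np.fft.fft(w * cep))).real
    return h_min[:n]

def preprocess_hrirs(hrirs):
    """Steps (1)-(4) for an array of HRIRs shaped (directions, samples)."""
    hrirs_mp = np.array([minimum_phase(h) for h in hrirs])    # (1) remove delay
    hrtfs = np.fft.rfft(hrirs_mp, axis=-1)                    # (2) FFT to HRTFs
    log_hrtfs = 20 * np.log10(np.abs(hrtfs) + 1e-12)          # (3) log domain (dB)
    return log_hrtfs - log_hrtfs.mean(axis=0, keepdims=True)  # (4) mean removal
```

For a delayed impulse, `minimum_phase` returns the impulse moved to time zero while the magnitude spectrum is unchanged, which is exactly the "remove delay information" operation of step (1).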
2.3.1 Selection of Software and Hardware

Various methods can be employed to obtain a precise 3D mesh of the individual ear and head shape. One approach is to use professional 3D scanners, including the portable laser scanners commonly used for traditional HRTF simulation. Alternatively, mobile devices with laser-scanning or structured-light-scanning capabilities can also perform 3D scanning, with flagship smartphones in the Android and iOS ecosystems offering such features [1]. While laser scanning can be used for larger objects, structured-light scanning, such as the Face ID component on the iPhone, is considered to have sufficient resolution (0.5 mm) for the boundary element method required for customized HRTFs [2].
Before scanning, some software and hardware settings can be adjusted. The Heges app supports sharing the user interface to another iOS device, which is very helpful for operators. This is particularly important when scanning the concha, as some imaging angles cannot be observed well from a selfie posture. It is also important to set the size and coordinate axes of the mesh in advance, as some mesh-processing software cannot easily modify the size. Additionally, using the finest resolution when scanning the head and torso may cause the device to crash due to insufficient cache or memory. Lowering the resolution to 1 mm or 2 mm has proven to be a reasonable and effective measure. Although the mesh will be further down-sampled to even lower resolutions during processing, starting with a lower resolution that meets the actual requirements (approximately 2 mm to 5 mm) will result in a smoother surface when merging with the high-resolution ear meshes. Cleaning up rough surfaces can be very time-consuming and laborious.

2.3.2 Preparation and Strategy for Scanning

To optimize the conditions for 3D scanning, several steps can be taken.
First, it is important to expose the ear and skin surface as much as possible by covering or removing the hair. Hair distorts the scanned geometry, and thereby aspects of the head-related transfer function (HRTF) such as the interaural time difference (ITD). It is recommended to use a tight-fitting swim cap or wig cap to compress and tidy the hair, and to shave any beard, to minimize the amount of 3D data that must be cleaned of remaining hair. This improves the accuracy of the head boundaries. It is also important to avoid wearing additional items such as glasses and earrings, since they reflect light and have small details that cannot be accurately scanned. Reflective make-up should also be avoided, as matte surfaces are preferred by most 3D scanners. Clean skin is required to ensure accurate scanning. It should be noted that the actual simulation calculations do not require such high resolution, but a slightly higher resolution than the simulation needs gives better fault tolerance in the subsequent mesh processing.
The scanning strategy should begin with scanning the left and right ears in detail, followed by the face, and finally the neck and shoulders. Since it is common to obtain inadequate meshes (discontinuous surfaces or spiky features), each part should be scanned multiple times, with at least two qualified meshes obtained for each part. The scanning time is approximately 30 minutes for those familiar with the process, which is a significant advantage over traditional methods using professional laser scanners. The targets of the scanning are listed in Table 2.1:

Body part   Resolution (mm)   Scanning times   Scanning object
Ear         0.5               Multiple         Kemar
Head        1                 Multiple         Kemar
Torso       2                 Single           Kemar

Table 2.1: Desired mesh quality for different body parts

2.3.3 Scanning Process

Using an iPhone or even a professional scanner to scan the ear can pose challenges.
The scanning method is based on the principle that Face ID performs best when the object is about 30-50 cm away. Therefore, capturing an image of a person in selfie form is the ideal choice. This ensures that the person being scanned is at the best distance for resolution, while allowing the operator to keep an appropriate distance and avoid being captured in the scan. Starting from the back of the head can be a good option, because any accumulated error in this area is less noticeable.
To help the 3D scanner locate the ear accurately, it is recommended to add a geometric reference to the scanning scene. The reference object should be simple and have a regular shape, which helps the scanning software locate it and facilitates future quality checks. It should also be large enough to be viewed from multiple angles to avoid losing tracking. In this study, a frame made of a toy building kit was chosen as the reference object, which can be seen in Figure 2.1. If the entire scanning process includes the reference object, it may not be possible to achieve excellent mesh cleaning at the connection between the head and the reference object (although this seems to have little effect on the simulation results). Therefore, scanning without a reference object is also considered feasible.

Figure 2.1: Scanning of Kemar in this thesis

As can be seen from the figure above, the operator is not recorded by the scanner, while Kemar is clearly recorded. To better simulate real-world scenarios, Kemar was placed on another tester's lap to mimic the tiny unintentional vibrations of the human body. In practice, having an additional person between the scanner operator and Kemar made it helpful to plan the scanning route in advance. Also, due to the increased distance, it became somewhat difficult to scan the entire head in one go.
As a result, many scans were taken before the desired head scan was obtained. Without a backup, many defects like the one shown in Figure 2.2 are likely to be overlooked.

Figure 2.2: Bad scan of left ear (hole appears)

During the scanning process, it is also necessary to check and compare against the actual shape in a timely manner. In addition to obvious scanning errors such as meshes containing holes and wrinkles, the resulting mesh structure is often distorted when scanning the ear shape in the direction of the arrow shown in Figure 2.3.

Figure 2.3: Kemar dummy difficult-to-scan area

A long scan of the area in Figure 2.3 is needed to ensure that the ear crease is sufficiently deep and narrow. This is a three-part scanning process, with the head scan performed first, followed by detailed scans of the right and left ears and finally the torso. Attempting to capture all the details in one scan is challenging due to the high demand it places on the phone's processing and storage capacity. Additionally, it would result in a low fault tolerance and costly re-scans. It is important to note that conducting multiple scans will improve the chances of a successful scan in the later stages. It is also necessary to clear the storage space on the phone in advance. A single scan of Kemar's torso for this project can be as large as 700 MB, which means that a full scan process can take up to 10 GB of storage.

Figure 2.4: Original scan of Kemar's body and left ear

Some of the results of the scan are shown in Figure 2.4. As can be seen, the scan results often contain many isolated surfaces, as well as some uneven surfaces and even holes. Therefore, further mesh cleaning is very necessary.

2.3.4 Cleaning and Merging of Meshes

After the initial scan, the high-resolution ear mesh and the relatively low-resolution head and torso mesh are loaded into specific software for cleaning and merging.
The software used in this project includes Blender, Meshmixer, and MeshLab. These are not the only available tools, but free and open-source software was prioritized to make the experiment more universal and valuable.
First, some basic automatic cleaning is necessary. For simulation, the 3D mesh must be an airtight fluid shell without isolated or overlapping geometry. A scanned mesh will always contain sharp protrusions and disconnected free bodies, which cause simulation errors, so basic automatic cleaning is necessary before cutting and merging meshes.
It should be noted that surfaces that are difficult to completely remove or paint over, such as hair and eyebrows, should be down-sampled in advance before the mesh is cleaned. This prevents mesh collapse during painting, or the formation of sharp corners between meshes, which is a common problem when dealing with dense meshes. The cleaning operations below use MeshLab 2021.10, Blender 3.0.0 and Meshmixer 3.5.474, as in this project:
Step 1. Delete unnecessary geometry such as reference objects: use 'Edit/Select Faces and Vertices inside polyline area' to select the objects that need to be kept. Then press 'I' to invert the selection and press 'DELETE' to delete. Note that the selection under this method is perspective-based, which may cause some unexpected consequences: when selecting isolated surfaces near the ears, some useful surfaces of the neck or back of the head may also be selected and deleted due to perspective. So be extra careful.
Step 2. Delete all isolated surfaces: use 'Edit/Select Connected Components in a region' and drag it over the main mesh for selection. Then press 'I' to invert the selection and press 'DELETE' to delete.
Step 3. Save and 'Reload all layers'. Save the preliminarily modified mesh.
Step 4.
Use 'Filters/Remeshing, Simplification and Reconstruction/Simplification: Quadric Edge Collapse Decimation' to simplify the mesh.
Step 5. Use 'Filters/Remeshing, Simplification and Reconstruction/Surface Reconstruction: Screened Poisson' to further down-sample the head and torso. Set 'Reconstruction Depth = 12'. This operation reduces the accuracy of the mesh, so it must not be performed on the ears. The other purpose of this step is to close the mesh below the shoulders; as can be seen in Figure 2.4, the torso mesh is not closed. The picture on the right shows the placement of the microphone during the actual HRTF measurement.
Step 6. Pay special attention to the mesh at the entrance of the ear canal.

Figure 2.5: Ideal external ear canal surface and ear canal in reality [20]

Sometimes many sharp geometries or holes form in the ear canal, because the structure deep inside cannot be detected. At this point, an automatic repair algorithm is needed to fill them in. This is a tricky detail, because the evaluation points for the Mesh2HRTF calculation lie at the entrance of the outer ear canal (which is realistic, because a microphone can almost never be placed at the eardrum). Ideally, 'filling the ear canal entrance' should be the default. In this experiment, no appropriate processing was found in MeshLab, so 'Smooth' in Blender's 'Sculpt Mode' was used to smooth the geometry of this part. The ideal external auditory canal surface is shown in Figure 2.5.
Please note that for the head and torso, the priority of down-sampling can be increased appropriately. In practice, the torso and head files are too large, causing the software to lag or even crash when working with such meshes. Simplifying these meshes will significantly improve the efficiency and success of the process.
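A quick programmatic sanity check for the airtightness requirement: in a closed triangle mesh, every edge is shared by exactly two faces. The helper below is a hypothetical illustration in plain Python; it is not part of the MeshLab/Blender workflow described above.

```python
from collections import Counter

def is_watertight(faces):
    """Return True if every edge of a triangle mesh is shared by exactly
    two faces - a necessary condition for an airtight, BEM-ready shell.
    `faces` is an iterable of (i, j, k) vertex-index triples."""
    edge_count = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            # Sort so that (u, v) and (v, u) count as the same edge.
            edge_count[tuple(sorted((u, v)))] += 1
    return all(n == 2 for n in edge_count.values())
```

A closed tetrahedron passes this check; removing any face leaves boundary edges that belong to only one triangle, so the check fails, mirroring the holes visible in Figure 2.4.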
Saving frequently is also a forced habit, as MeshLab is very prone to freezing or crashing when performing some automated algorithms (especially noise reduction). If the work is not saved in time, it has to be started all over again.
The next step is mesh merging. The clipped meshes of the left and right ears, and the head mesh with the ear parts subtracted, are used. The merging of the meshes can be done in a variety of ways, with different software offering many options, and this stage of the process varies from person to person. Finally, it is necessary to check that the merged mesh joints are smooth. The merging process in this project is as follows:
Step 7. Down-sample the mesh. Use 'Filters/Remeshing, Simplification and Reconstruction/Surface Reconstruction: Screened Poisson' to down-sample the torso, which effectively reduces the size of the file. In this project, the number of faces for the torso and head is kept to no more than 15,000.

Figure 2.6: Cropped left ear (left) and torso (right) meshes

Step 8. Cut the head and ear meshes to the size shown in Figure 2.6. Check that all meshes contain partially overlapping surfaces; cutting away too much surface will cause the meshes to be misaligned.
Step 9. Merge the ears and torso using the 'Point-based gluing' method in the 'Align' tool. As can be seen from Figure 2.7, the additional ear mesh greatly assists the positioning of the merge.

Figure 2.7: Merging of meshes using Point-based gluing

Step 10. Save the merged mesh and reload all layers.
Step 11. Use 'Surface Reconstruction: Screened Poisson' to create a new mesh around the scanned data. Set 'Pre-Clean' to 'Yes', set 'Merge all visible layers' to 'Yes' and adjust 'Reconstruction Depth = 12'.
Step 12. Use 'Smooth' in 'Sculpt Mode' under Blender to smooth the joints of the merged meshes.
'Radius' and 'Strength' in the Smooth function can be adjusted to suit your needs. The settings for this project are 'Radius = 45 px' and 'Strength = 0.4'. As can be seen in Figure 2.8, the smoothed mesh is much flatter and more consistent with the actual Kemar surface in the area where the ear meets the head.

Figure 2.8: Merged Kemar mesh (left) and Kemar mesh after smoothing (right)

Step 13. Use 'Make Solid' in Meshmixer to ensure that the overall mesh is airtight. This step matters because the simulation stage requires airtightness; otherwise the solver will report that the computation does not converge.
The above process is how the mesh is prepared, and the resulting mesh can be used in the Mesh2HRTF and COMSOL simulations. Save most of the intermediate mesh files and keep notes on them, as this greatly facilitates later modifications required by the simulation phase. For the overall meshing process, the time consumed depends on the quality of the scanned mesh. In practice, a poor mesh can take up to a day to fix, whereas a good mesh can be fully processed in just one hour.

2.4 Simulation of Individual HRTF in Mesh2HRTF

This section describes the procedure for placing the prepared meshes into the different workflows.

2.4.1 Pre-processing in Mesh2HRTF

The pre-processing phase completes all the parameter settings in Mesh2HRTF; the rest of the simulation process either completes automatically or stops with an error. The HRTFs for the left and right ear are generated separately, and these HRTFs are then combined to obtain the final personalised HRTF. Timon et al. suggest gradually reducing the resolution of the mesh as the distance from the ear increases, to reduce the amount of computation. This approach is further optimised by considering the curvature of the geometry.
The resulting graded meshes allow for faster HRTF simulations with equal or better accuracy than previous work. Mesh2HRTF includes a component called 'hrtf mesh grading', which is used to optimise the 3D model for the Mesh2HRTF simulation. To obtain maximum efficiency, the mesh used to simulate the left ear contains fine details only on the left side, while the right side is significantly simplified, and vice versa.
The pre-processing in Mesh2HRTF includes the optimisation of the mesh and the setting of other parameters.
Step 1. The optimisation of the mesh is first carried out in Blender by importing the full-resolution mesh into the '3d Model uniform.blend' example Blender file. This example file will be used twice, this time to determine the spatial position of the 3D model: the centre of the head is placed at the origin, the face points in the positive X-direction, and the left and right ear canals are crossed by the Y-axis.
Step 2. The 3D model processed in Step 1 is placed into the 'hrtf mesh grading Windows Exe' folder and the tool is run. The output consists of two meshes, '3Dmesh graded left.ply' and '3Dmesh graded right.ply'. The result of the '3D mesh graded left' run is shown in Figure 2.9. It can be seen that the left ear of the optimised model is complete, while the otherwise dense mesh of the right ear has been extensively simplified. The other file gives the opposite result.

Figure 2.9: Comparison of left (left) and right ears (right) after optimisation of the mesh

Step 3. The two exported meshes are imported again into the '3d Model uniform.blend' file, where different material properties are set for the surfaces. There are three material properties: 'Skin', 'Left ear' and 'Right ear'. 'Skin' carries the properties of human skin, while the 'Left ear' and 'Right ear' materials represent the properties of the blocked-ear-canal microphone.
'Skin' is assigned to the vast majority of the surface, while 'Left ear' and 'Right ear' are each assigned to a single, most representative triangle. In this project, the 'Left ear' and 'Right ear' materials were assigned to the triangle shown in Figure 2.10, where the Y-axis enters the ear.

Figure 2.10: Default vibrating element

Step 4. Use the Python console in Blender to export the final mesh. The console commands can be found in Sergejs' beginner's tutorial, and the code for this project is shown in Appendix B.
Some other settings:
Step 5. Adjust the parameters of 'EvaluationGrids'. The HRTF data set is assumed to lie on a sphere, with different points on the surface of the sphere corresponding to the amplitude-frequency characteristics of sound travelling from that direction to the ear canal. Each point contains a set of impulse responses for both ears. In Mesh2Input's EvaluationGrids, the location and density of the points sampled on this sphere can be set. In this paper, two Evaluation Grids are set up: one following the ARI HRTF database, and one optimised according to the Kemar HRTF measurements provided by RWTH Aachen University. The second Evaluation Grid contains points at the same positions as the measurements. The ARI data cover the full azimuthal space (0° to 360°) and elevation angles from -30° to +80°; the resolution of the frontal space in the horizontal plane is 2.5°, giving 1550 points in total. The customised data set covers the full azimuthal space (0° to 360°) and elevation angles from -90° to +90°; the resolution of the frontal space in the horizontal plane is 1°, and 6220 points were selected on the surface.
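To make the notion of an evaluation grid concrete, the sketch below generates sphere points from a uniform azimuth/elevation raster. The function and its parameters are hypothetical, and the uniform raster is a simplification: the actual ARI grid uses a direction-dependent azimuth resolution (finer in the frontal region), which is how it arrives at 1550 points rather than a full raster.

```python
import numpy as np

def make_evaluation_grid(az_step=2.5, el_min=-30.0, el_max=80.0,
                         el_step=5.0, radius=1.2):
    """Cartesian (x, y, z) points on a sphere of `radius` metres for every
    combination of azimuth (deg, 0-360) and elevation (deg) in the raster."""
    az = np.radians(np.arange(0.0, 360.0, az_step))
    el = np.radians(np.arange(el_min, el_max + el_step, el_step))
    A, E = np.meshgrid(az, el)
    x = radius * np.cos(E) * np.cos(A)   # +X is the frontal direction
    y = radius * np.cos(E) * np.sin(A)
    z = radius * np.sin(E)
    return np.column_stack([x.ravel(), y.ravel(), z.ravel()])
```

Every generated point lies at the chosen radius, matching the assumption that the HRTF data set lives on a single sphere around the head centre.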
Figure 2.11: Schematic diagram of the spatial distribution of HRTF collection points under the ARI standard

2.4.2 NumCalc Simulation

After the Blender project has been exported, two project folders (for the left and right HRTF side) will be found, containing the "info.txt" file and other files and folders. Move these folders to the mesh2hrtf-tools folder and run 'NumCalcManager.py'. The result of the operation is a personalised HRTF SOFA file for the left and the right ear, respectively. In this project, this operation usually takes between 8 and 10 hours, depending on the previously set frequency range of the HRTF and the density of the Evaluation Grids.

2.4.3 HRTF SOFA File Generation

After the simulations in the previous section, the HRTFs for the left and right ears have been generated. The only step left is to merge them to obtain the final Kemar HRTF. This step is easy to complete: Sergejs provides the 'finalize hrtf simulation.py' script in the beginner's tutorial, which automatically combines the two HRTFs into the final SOFA file. In addition, some plots of the HRTF are also produced, as shown in Figure 2.12. This is a plot of the HRTF amplitude-frequency characteristics at ear level, which helps to check the performance more visually.

Figure 2.12: Kemar Mesh2HRTF results

2.5 Simulation of Individual HRTF in COMSOL

In COMSOL Multiphysics, the boundary element interface uses the boundary element method (BEM) to model acoustic problems via pressure acoustics. This interface is quite accurate for HRTF analysis, as the HRTF model represents a purely radiative problem in a free field. The simulation also requires no additional parameter settings or special computing resources. Please note, however, that COMSOL's HRTF simulation has some limitations.
For example, in the frequency band above 8000 Hz, the BEM module reports that the simulation results do not converge. Therefore, the COMSOL simulation data are used in this paper only as a reference for performance testing of the 3D meshes and for comparing different simulation software; the calculation results are not used in further experiments.

2.6 Results

All the meshes that were used for testing in the data comparison are listed in Table 2.2. The model from Aachen was downloaded from the ITA HRTF-database and was scanned with a high-resolution laser scanner. The original Aachen model included the entire surface of Kemar, but in this project, to maintain consistency between the models, the part of the Aachen model below the shoulders was cut off, as shown in the right picture of Figure 2.13. All low-resolution models and the high-resolution model based on iPhone scans have the same outward appearance; the only difference is the number of mesh elements.

Source of model                    Scanning device                Mesh quality
Model from Aachen                  Structured-light 3D scanner    70k surfaces
High-resolution model              iPhone XR                      900k surfaces
Low-resolution model (COMSOL)      iPhone XR                      20k surfaces
Low-resolution model (Mesh2HRTF)   iPhone XR                      15k surfaces

Table 2.2: The model list for all simulations used in this Chapter

If the full potential of the iPhone's performance is unleashed, the highest-resolution model can reach 900,000 triangles, which is higher than the resolution of professional scanning software. As described earlier in the mesh-processing method, these iPhone-based meshes were cropped and stitched in post-processing. As can be seen in Figure 2.13, the surface scanned with a cell phone still has some unevenness, while the surface from a professional scanner is much smoother. This means that, in practical operation, there is always a potential risk at the joints of the mesh. In COMSOL and Mesh2HRTF, the low-resolution meshes are handled differently.
As mentioned earlier, the Mesh2HRTF component includes an optimization mechanism that simplifies the head and shoulder meshes to a greater extent while retaining the ear meshes. In contrast, COMSOL's low-resolution mesh is globally simplified in MeshLab, which reduces all surfaces uniformly to a single resolution.

Figure 2.13: The different models used in comparison: model from Aachen (left) and high-resolution model (right)

2.6.1 COMSOL Simulation Results

In this project, the upper frequency limit in the COMSOL simulations was set to 8250 Hz for the low-resolution model. When this limit is exceeded, COMSOL reports that the calculation does not converge. However, the frequency limit increases to 8700 Hz and 8950 Hz when using the high-resolution model and the Aachen model, respectively. This suggests that the quality of the model has a significant impact on the performance of the acoustic boundary element module in COMSOL.

Figure 2.14: The low-frequency-range radiation pattern for different models in COMSOL

Figure 2.14 illustrates the HRTF values calculated for the various models in the horizontal plane (xy-plane). The HRTF has been normalized to 0 dB in the frontal direction (polar angle θ = 0). The figure consists of four subplots showing the HRTF performance calculated for each model at different frequencies. From the presented plots it can be inferred that the differences between the HRTF values of the models below 4 kHz are relatively small, usually within 5 dB. This error is similar to the computational results presented in the COMSOL example. However, the outcomes are less satisfactory for the higher frequency ranges, which are not addressed in the example. Figure 2.15 illustrates the radiation patterns of the three models at 6 kHz and 8 kHz.
When compared at 6 kHz, the amplitude distributions of the models appear similar in most directions, but begin to exhibit errors of up to 10 dB in some locations. Notably, this difference becomes noticeable at the position directly opposite the weaker side of the ear (around 270 degrees). At 8 kHz, the errors start to become significant: in the vicinity of the 270-degree direction, there is no similarity in the amplitude characteristics, and the errors are generally above 20 dB, while for other positions that were expected to perform better, the errors are around 10 dB. Furthermore, at 8 kHz, the high-resolution model exhibits greater consistency with the results of the Aachen scanning model.

Figure 2.15: The higher-frequency-range radiation pattern for different models in COMSOL

Taking the model obtained from the laser scanner used by RWTH Aachen as the ideal model, the simulation results and the amplitude distribution of the low-resolution model obtained from the iPhone scan are similar to those of the ideal model up to 6 kHz. However, at 8 kHz, the amplitude distribution of the low-resolution model is dissimilar to that of the ideal model. The high-resolution model obtained from the iPhone scan, on the other hand, is still similar to the amplitude-frequency distribution of the ideal model at 8 kHz, and its error at the same frequency is lower than that of the low-resolution model. The errors between all models increase with frequency.
The above simulations demonstrate that, in BEM simulations using COMSOL, a mesh generated from an iPhone scan can replicate the personalized HRTF simulations described in the example file. It was found that approximately 8000 Hz is the highest frequency that can be computed with COMSOL. This implies that personalized HRTFs obtained from COMSOL may be difficult to apply in auditory experiments due to the limited frequency range.
From the simulation results, higher-resolution models maintain more accurate results at relatively higher frequencies.

2.6.2 Mesh2HRTF Simulation Results

For the Mesh2HRTF results, a special MATLAB plugin, the SOFA API for Matlab and Octave version 1.1.3, is required to process the SOFA files. This plugin loads the SOFA files for analysis and renders audio filtered with the HRTFs. The module is also used in the listening test later on. The sources, resolutions and scanning devices of all HRTFs in this comparison are shown in Table 2.3. All HRTFs used for comparison were taken in the horizontal plane with respect to the ear-nose plane.

HRTF source         Approach      Resolution   3D mesh source
ITA HRTF-database   Measurement   -            -
Simulation          Mesh2HRTF     -            Laser scanner
Simulation          Mesh2HRTF     High         iPhone
Simulation          Mesh2HRTF     Low          iPhone

Table 2.3: Acquisition source, quality and scanning method of the HRTFs used

Figure 2.16 shows the left-ear HRTF results of the different models simulated at 30-16000 Hz. It can be observed that they are generally consistent from a global perspective, and their amplitude-frequency distributions are very similar. For frequencies below 1 kHz, the HRTF does not introduce any significant changes to the sound at any angle. However, around 2 kHz, some reflections can be observed at around 50 degrees, which can be attributed to the shoulders. Around 8 kHz, higher sound pressure levels are received at 0-180 degrees, while many dips are observed at 180-360 degrees, indicating that sound from specific directions at these frequencies reaches the ear canal at a significantly lower level. Around 10 kHz, differences between the simulated and measured values can be noticed: the amplitude-frequency distributions of all simulated values are somewhat fragmented. This could be due to insufficient smoothness of the fine surface transitions in the mesh preparation.
Individual HRTF modeling Figure 2.16: Simulated HRTF results from different models by Mesh2hrtf The Error Plotting is a graphical representation obtained by subtracting the sim- ulated HRTF from the measured HRTF. Specific points on the plot are labeled to facilitate observation. The results show that, in the frequency range up to 2 kHz, the differences between the simulated and measured values are negligible for all models. However, between 4 kHz and 6 kHz, the simulated values for all models are consis- tently lower than the measured values. Differences between the simulated and measured values become increasingly evident above 6 kHz. This can be attributed to several factors. Firstly, misalignment along the horizontal axis, possibly due to initial differences in the horizontal plane of the Kemar mannequin or displacement of surfaces during the scanning process, results in significant vertical errors in the error plotting. Secondly, errors in the scanning grid are amplified at high frequencies, which is difficult to avoid.It is worth mention- ing that, in order to simulate the actual scanning process, the iPhone was placed on the subject’s lap to introduce some jitter, which undoubtedly adds some uncertainty to the mesh. This can be observed from the error plotting, where the HRTF generated based on the high-resolution model has smaller errors compared to the other two models in the frequency range up to 4 kHz. In the frequency range of 4 kHz to 8 kHz, the HRTF generated based on the professional laser scanner has more correlated results. However, in the frequency range above 10 kHz, all simulation results have significant differences compared to the measured values. 28 2. Individual HRTF modeling Figure 2.17: Error plotting for different models 2.6.3 Comparison of Simulation Results For the comparison between COMSOL and HRTF, both were compared on the HRTF horizontal plane, and the radiation pattern was used in this section. 
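The error plotting used in the preceding comparison is essentially a per-direction, per-frequency level difference between two magnitude spectra. A minimal sketch of that computation (the array shapes and values here are made-up placeholders, not data from the thesis):

```python
import numpy as np

def hrtf_error_db(measured, simulated, eps=1e-12):
    """Per-direction, per-frequency error between two HRTF magnitude
    spectra, expressed in dB (measured minus simulated)."""
    measured = np.asarray(measured, dtype=float)
    simulated = np.asarray(simulated, dtype=float)
    return 20.0 * np.log10((measured + eps) / (simulated + eps))

# Toy example: 4 directions x 3 frequency bins of linear magnitudes.
measured = np.array([[1.0, 2.0, 0.5],
                     [1.0, 1.0, 1.0],
                     [2.0, 0.5, 1.0],
                     [0.5, 1.0, 2.0]])
simulated = measured / 2.0          # simulated spectrum 6 dB too low everywhere
err = hrtf_error_db(measured, simulated)
print(np.round(err, 2))             # ~6.02 dB at every point
```

Plotting `err` over direction and frequency reproduces the kind of error map shown in Figure 2.17.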
Figure 2.18: The radiation pattern for different simulation tools

Figure 2.18 shows that, for the different simulation tools, the simulated values are very similar across all frequency ranges. Any minor differences are likely due to the different optimization algorithms used. Within the frequency range up to 6 kHz, the simulation results match the scanning results very well, with an average error not exceeding 5 dB. However, at 8 kHz, the simulated values are about 15 dB higher than the measured values. This is a very interesting result, and the potential reason for the discrepancy could be uncertainties in the actual measurement process, such as microphone performance or the environmental conditions of the measurement room.

2.7 Discussion

This chapter outlines the process of preparing HRTFs based on iPhone scanning, which can be divided into three main parts: obtaining the 3D mesh, cleaning and optimizing the 3D mesh, and comparing the simulated HRTF results. However, this project did not conduct in-depth research on the algorithmic aspects of HRTFs.

For 3D mesh acquisition, the Face ID structured-light components are used together with the Heges 3D software to obtain high-precision ear meshes and relatively coarse torso meshes. The scanned meshes are saved in .stl format for further processing.

Regarding the cleanup and optimization of the mesh, all operations are performed in the open-source software described in the previous section. These steps include mesh cleanup, simplification of the head and torso meshes, merging of the ear and torso meshes, and finally overall mesh correction and error correction. There are no strictly standardized steps here; performing these functions step by step yields more desirable results.
For the simulated HRTF, a comparison with the measured values shows that the simulation results of Mesh2hrtf and COMSOL are in high agreement with the measured values up to 8 kHz, with an average error of less than or approximately 5 dB. Above 8 kHz, they diverge more: almost all simulation results above 8 kHz show an error of about 15 dB compared to the measured results. It is worth mentioning that COMSOL cannot consistently obtain results above 8 kHz, whereas the same mesh can be simulated in Mesh2hrtf at more than 16 kHz.

The following issues were identified during the process:

1. The impact of hair is particularly severe and is unavoidable when scanning real people. In this project, wigs and swimming caps were used to cover the hair. Although the hair on the top of the head can be effectively covered, surfaces that cannot be covered, such as the temples and neck, still produce unsatisfactory results. Especially for people with long hair, wearing a cap cannot accurately restore the shape of the head. In addition, although the effect of hair on ear scanning is not obvious, merging the meshes requires facial meshes around the ears, where the impact of hair is almost impossible to avoid in actual scanning. So far, more effort is still needed for 3D scanning of people with long hair.

2. Frequent 3D operations often lead to the overall mesh not being closed. Usually, this problem occurs after merging the meshes, and errors arise automatically when loading the mesh for simulation and processing. Unfortunately, MeshLab cannot solve this problem well, but the 3D Builder software in the Windows environment can repair it automatically. Generally, a mesh repaired by 3D Builder is sufficient for the next preprocessing step.
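A quick way to detect the "not closed" problem described above is the standard watertightness test: in a closed triangle mesh, every undirected edge must be shared by exactly two faces. A minimal sketch of that check (this is only the diagnostic, not the repair algorithm used by 3D Builder):

```python
from collections import Counter

def is_watertight(faces):
    """A triangle mesh is closed (watertight) iff every undirected edge
    is shared by exactly two faces."""
    edges = Counter()
    for a, b, c in faces:
        for u, v in ((a, b), (b, c), (c, a)):
            edges[frozenset((u, v))] += 1          # undirected edge count
    return all(count == 2 for count in edges.values())

# A tetrahedron (4 triangles) is closed; removing one face opens it.
tetra = [(0, 1, 2), (0, 3, 1), (1, 3, 2), (2, 3, 0)]
print(is_watertight(tetra))        # True
print(is_watertight(tetra[:3]))    # False
```

Running this check after each merge step would flag a broken mesh before it reaches the BEM solver.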
However, note that an automatically repaired surface often means the mesh is less accurate in those areas, which usually appear where the torso and ear meshes are merged. Subsequent research may help solve this problem.

The 3D scanning part is now a relatively mature workflow; however, ideal HRTF preparation should unify the cleaning and merging of meshes in one software or application, and more effort is undoubtedly needed to achieve fast and easy preparation of qualified meshes.

3 Listening Test of Individual HRTF Performance

Listening tests are a widely recognized and reliable method for evaluating the acoustic performance of sound systems. In contrast to the technical measurements described in the previous section, experiential testing through listening can provide a more comprehensive and nuanced assessment of performance, including subjective factors such as perception and preference. In this context, the present chapter reports on a listening test involving a cohort of six participants, conducted to evaluate the efficacy of an individualized HRTF processing approach. The personalized HRTFs were acquired using the same methodology as in the previous chapter.

3.1 Theory

This section describes the theory used in the listening tests.

3.1.1 Two-device Test

The two-device test is a commonly employed methodology in audio quality evaluation and comparison studies. It involves the simultaneous use of two audio devices to assess and compare their performance, serving as a controlled experiment to measure various aspects of audio reproduction and to gauge any perceived differences between the devices being tested. In a typical two-device test, a specific audio source is played in parallel on both devices under investigation.
Listeners are presented with the audio output and are tasked with evaluating and differentiating sound quality, clarity, or other relevant perceptual attributes. These attributes may include, but are not limited to, frequency response, stereo imaging, dynamic range, distortion, tonal balance, and spatial characteristics. The evaluation process often involves subjective rating methodologies, such as pairwise comparison, ranking, or rating scales, to collect listener preferences or judgments. Listeners may provide ratings or rankings based on their perception of audio quality, their preference for one device over the other, or the perceived differences in audio reproduction.

3.1.2 Digital Equalization

Digital equalization refers to the process of modifying the frequency response of an audio signal using digital signal processing techniques. The goal is to correct or enhance the frequency balance of an audio signal, typically by adjusting the amplitude of specific frequency bands. This can be done using various types of digital filters, such as parametric, graphic, or shelving filters, which alter the amplitude response of the signal over a specified range of frequencies.

A parametric equalizer is a type of digital equalizer that allows the user to adjust the frequency response of an audio signal by manipulating the parameters of one or more digital filters. Unlike graphic equalizers, which provide fixed frequency bands with fixed levels of gain or attenuation, parametric equalizers offer more precise and flexible control over the frequency response. A typical parametric equalizer consists of one or more filters, each of which can be adjusted to target a specific frequency range, or band.
Each filter is defined by several parameters, including center frequency, bandwidth, and gain. The center frequency determines the center point of the band affected by the filter, the bandwidth controls the range of frequencies affected, and the gain determines the amount of boost or cut applied to the selected frequency range.

Parametric equalizers are commonly used in professional audio production, live sound reinforcement, and home audio systems to correct frequency imbalances or to enhance the tonal quality of an audio signal. They offer a high degree of precision and flexibility, allowing users to tailor the frequency response to the specific requirements of a given sound system or recording.

3.1.2.1 Shelving Filters

The shelving filter is one of the equalizer filter types used in this listening test. It is commonly used to adjust the frequency response of an audio signal by boosting or attenuating specific frequency ranges. Shelving filters gradually increase or decrease the amplitude of frequencies above or below a specific cutoff point, known as the shelf frequency. The filter's slope determines the rate at which the amplitude changes beyond the shelf frequency. There are two types of shelving filters: a high-shelf filter boosts or attenuates frequencies above the shelf frequency, while a low-shelf filter boosts or attenuates frequencies below it.

3.1.2.2 Peak Filter

The peak filter increases or decreases the amplitude of a narrow range of frequencies around a specific center frequency. The width of the affected frequency range is determined by the Q factor, a measure of the filter's bandwidth. Peak filters are useful for adjusting the tonal balance of an audio signal by selectively boosting or cutting specific frequency ranges.
They are often used to remove resonances or other frequency-specific issues in recordings, or to enhance certain aspects of a sound, such as the "punch" of a kick drum or the "presence" of a vocal.

3.2 Method

This section describes the design and implementation of the listening test.

3.2.1 Set-up

To validate whether the Kemar results from the previous chapter could also be applied to spatial audio reproduction based on personalized HRTFs generated from iPhone scans of human ears, a two-device-test-based auditory experiment was conducted. The same stimuli were convolved with the HRIRs corresponding to specific speaker positions, e.g. directly in front and directly to the left (0 and 90 degrees), and played back using Sennheiser HD 560S headphones, a Focusrite Scarlett 4i4 audio interface, a Lake People G109 headphone amplifier, and Genelec 8030C speakers. The sound EQ and comparison functions were implemented in MATLAB, as detailed in a later section.

Equipment        Device model            Application
Laptop           MacBook Pro             Control of all systems
Headphones       Sennheiser HD 560S      HRTF-rendered audio
Loudspeaker      Genelec 8030C           Initial audio
Audio interface  Focusrite Scarlett 4i4  4-channel audio generation
Amplifier        Lake People G109        Headphone volume control
Cables           6.5 mm audio cable      Device connections

Table 3.1: List of equipment used in the listening test

In this experiment, a 15-inch MacBook Pro was used as the control device, and participants performed the two-device-test EQ operation on it. Participants were instructed to keep their head facing the MacBook Pro at all times, while the horizontal direction of the speaker was adjusted as needed. It is worth mentioning that, in the auditory test, the pitch angle of the speaker's geometric center relative to the participant's ear canal was always 0 degrees, and the distance was always 1.2 meters.
Figure 3.1: Schematic of experimental set-up

The experiment was conducted in an acoustically treated listening laboratory, and participants were allowed to adjust the playback volume to a comfortable level during the experiment. This means that different participants may have chosen slightly different playback volumes based on personal preference, and these volumes may not be consistent with the level of the initial HRIRs.

3.2.2 GUI Design in MATLAB

The Graphical User Interface (GUI) is built on real-time audio processing in MATLAB, utilizing plugins for ultra-low-latency real-time audio processing. This module serves as the basic architecture for all GUI designs. In addition, the SOFA API plugin, already used in the earlier Mesh2HRTF data analysis, is utilized to read HRIR data sets from SOFA files. In the experiment, it was necessary to align the actual positions of the loudspeakers with specific positions of the impulse response of the HRTF. Therefore, position parameters, including horizontal and vertical angle information, need to be reflected in the design.

As shown in Figure 3.2, the software operates as follows. First, the mono audio is copied, producing three identical output channels. For Channel 1 and Channel 2, the signals are given spatial cues of the same spatial position, i.e. convolved with the HRIR (the time-domain representation of the HRTF) at that position. This produces a personalized HRTF-rendered audio output. The output of the third channel is connected directly to the Genelec 8030C, which serves as the reference loudspeaker of the two-device test; the audio of this channel is not rendered. Next, a module consisting of four second-order IIR filters is added to Channels 1 and 2, allowing listeners to adjust the parameters of these filters in real time to achieve the equalization (EQ) goal.
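The three-channel signal flow just described can be sketched offline as follows (a simplified Python sketch, not the MATLAB plugin code; the two-tap HRIRs are made-up placeholders):

```python
import numpy as np

def render(mono, hrir_left, hrir_right):
    """Sketch of the three-channel flow: Channels 1 and 2 carry the mono
    signal convolved with the left/right HRIR at the chosen position;
    Channel 3 is the unrendered reference for the loudspeaker."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    n = max(len(left), len(mono))
    out = np.zeros((3, n))
    out[0, :len(left)] = left       # Channel 1: headphone left
    out[1, :len(right)] = right     # Channel 2: headphone right
    out[2, :len(mono)] = mono       # Channel 3: reference loudspeaker
    return out

# Hypothetical 2-tap HRIRs applied to an impulse.
mono = np.array([1.0, 0.0, 0.0])
out = render(mono, np.array([0.5, 0.25]), np.array([0.4, 0.2]))
```

With an impulse as input, Channels 1 and 2 simply reproduce the HRIRs, which is a useful sanity check for the rendering path.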
Participants adjust this EQ module to obtain a headphone output consistent with the reference sound source. All EQ filter parameters are saved and used to analyze the performance of the personalized HRTF.

Figure 3.2: GUI flow chart design

MainFunction is the part of the code that reads audio data, applies the equalization filters, and plays back the audio through the output device. The audio data is read in blocks, and a filter is generated before each block is processed. If the audio player is on, the function reads the filter coefficients from the GUI and applies the filters to the audio data. The filter coefficients are generated based on the user's input. The audio is then convolved with the filters, and the resulting audio is played back through the output device on Channel 1 and Channel 2. Finally, the function checks whether the end of the audio file has been reached or the audio player has been turned off, and releases the audio device writer and the audio file reader if necessary.

Another component of the code is the app file, which defines the GUI. The GUI automatically loads the desired audio file, sets the specific HRIRs, applies the four equalization filters to the audio data, and plays back the resulting audio through the audio output device. As shown in Figure 3.3, the GUI includes switches and options for loading, playing, and pausing audio files on the left, a switch for changing the output hardware, sliders and options for adjusting the filter coefficients in the middle, and options for saving all current filter coefficients on the right.

Figure 3.3: GUI in MATLAB including a simple equalizer with four filters

It also features a real-time display of the audio data being processed; the current frequency of the EQ operation and the gain of the audio signal are shown graphically in the upper right corner of the interface.
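The block-wise filtering with persistent state that MainFunction performs can be sketched like this (a simplified Python sketch of a single peaking biquad using the widely used Audio EQ Cookbook coefficient formulas; the parameter values are hypothetical, and the real GUI runs four such filters per channel):

```python
import numpy as np

def peaking_coeffs(fs, f0, gain_db, q):
    """Second-order peaking-EQ coefficients (Audio EQ Cookbook form).
    f0, gain_db and q correspond to the three slider parameters."""
    A = 10 ** (gain_db / 40.0)
    w0 = 2 * np.pi * f0 / fs
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * A, -2 * np.cos(w0), 1 - alpha * A])
    a = np.array([1 + alpha / A, -2 * np.cos(w0), 1 - alpha / A])
    return b / a[0], a / a[0]

class Biquad:
    """Direct-form-II-transposed biquad whose state survives across
    blocks, as a real-time block-processing loop requires."""
    def __init__(self, b, a):
        self.b, self.a = b, a
        self.z1 = self.z2 = 0.0

    def process(self, x):
        y = np.empty_like(x)
        b0, b1, b2 = self.b
        _, a1, a2 = self.a
        for n, xn in enumerate(x):
            yn = b0 * xn + self.z1
            self.z1 = b1 * xn - a1 * yn + self.z2
            self.z2 = b2 * xn - a2 * yn
            y[n] = yn
        return y

# A +6 dB boost at 1 kHz, Q = 1, applied block by block (hypothetical values).
fs = 48000
filt = Biquad(*peaking_coeffs(fs, 1000.0, 6.0, 1.0))
signal = np.ones(1024)
out = np.concatenate([filt.process(signal[i:i + 256]) for i in range(0, 1024, 256)])
```

Because the peaking filter has unity gain at DC, a constant input settles back to the same constant, regardless of how the signal is split into blocks, which confirms that the filter state is carried over correctly.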
In addition, the GUI automatically stores the filter parameters, so that the center frequency, gain, and Q factor of a filter are automatically shown on the sliders after switching to another filter. If the test lasts longer than the audio, the audio repeats automatically.

3.2.3 Listening Test Protocol

3.2.3.1 Participants' HRTF Generation

Prior to the auditory testing, 3D meshes of the participants' ears, head, and upper torso were scanned, cleaned, merged, and simulated using Mesh2hrtf to calculate personalized HRTFs, which were saved in .sofa format.

Sampling frequency  Database  Number of positions
48000               ARI       1550
44100               ARI       1550
48000               Default   1850
44100               Default   1850

Table 3.2: List of individual HRTF formats

Due to the varying physical features of the participants, the degree of mesh cleaning and modification differed. The main factor causing this difference is the presence of hair on the skin surface. In the earlier Kemar comparison, the mesh cleaning could be clearly verified because Kemar has no hair. For participants with long hair, however, it was difficult to accurately generate the head contour, and the resulting mesh may be negatively affected in several respects, such as during surface optimization, and lacks repeatability. This may affect the accuracy and repeatability of the personalized HRTFs to some extent. In general, personalized HRTFs of participants with short hair are considered to have higher accuracy.

3.2.3.2 Participants

Five participants (four males and one female) were recruited for the study. The average age of the participants was 33 years. All participants stated that they had normal hearing, and they were students or staff members of Chalmers University of Technology.
Among them, three participants were considered experienced listeners and participated in the subjective listening test, while the remaining two lacked tuning expertise but provided subjective evaluations of the sound quality.

3.2.3.3 Stimuli

For this test, a pop-rock music piece was chosen as the stimulus. The piece is approximately 5 minutes long, featuring various instrumental sounds and vocals that cover the entire frequency spectrum. It also includes numerous repeated melodies, which facilitate tuning for the listener. The music loops automatically upon completion of playback.

3.2.3.4 Procedure

First, the purpose of the study and the two-device test were briefly introduced to the participants, followed by detailed instruction on how to switch between the devices and perform the equalization task. Participants were given the opportunity to practice using the devices and to perform the equalization task on a set of predetermined stimuli. The stimuli used for training were played repeatedly, allowing participants to become comfortable with the test.

Figure 3.4: Listening test implementation: in a room with thick curtains and carpets

The experimenter provided guidance to ensure that the participants understood the equalization task and were able to operate the devices correctly. Once a participant demonstrated a clear understanding of the task and could operate the devices effectively, they were considered ready to begin the actual two-device test. After completing the equalization task and saving the filter coefficients, participants provided feedback on the performance of the two devices and expressed their subjective preferences. Considering the difficulty of the equalization task for non-expert listeners, participants were allowed to abandon their equalization results and only record their subjective impressions.
3.2.3.5 Questionnaire

The following table shows the listening test questionnaire. Its goal is to find out whether the listener could notice a difference in localization between the different audio processing methods, i.e. sound rendered with the individually generated HRTF and with a non-individual HRTF.

Question       Processing method     Evaluation
Sound quality  individual HRTF
Sound quality  non-individual HRTF
Localization   individual HRTF
Localization   non-individual HRTF

Table 3.3: Questionnaire of the listening test

For the evaluation, participants used "good" or "bad" for sound quality and "accurate" or "inaccurate" for localization. In general subjective listening tests, there is often a graded scoring system describing a gradient of evaluations, e.g. excellent, very good, good, fair, bad, very bad, not at all. However, due to the lack of sufficient reference standards and clear calibration in this experiment, especially for localization, concise and accurate results could not be obtained with such graded evaluations, so they were not used.

Another point worth mentioning is that interviews were conducted with each participant after the test. More details based on the information provided by the participants are given later.

3.3 Results

3.3.1 Graphic Representation of the Results

Participants adjusted the filters provided by the GUI for the different relative positions of the headphone-rendered and loudspeaker sources. Figure 3.5 shows the results of their EQ.

Figure 3.5: EQ results when the audio is placed in different positions. The top picture shows the audio in the front and the bottom picture the audio on the left side.

The EQ experiment conducted in this study focused on the frontal (0 degrees) and left (90 degrees) directions. The participants were asked to equalize the sound in these two directions for approximately half an hour.
The distance between the speakers and the participants was approximately 1.5 meters. As shown in the figure, the participants used a computer to adjust the sound. The sound was repeatedly switched back and forth between the devices, and the participants adjusted the center frequency, gain, and bandwidth of the four filters to achieve the most consistent sound.

Figure 3.6: Overall EQ results including data in all directions (the black bold line is the average of all EQs)

The questionnaire results are shown in Table 3.4. 80% of the participants found that the individual HRTF gave better sound quality than the non-individual one, while all of them found that the individual HRTF provided better localization.

Property       Processing method     Preference
Sound quality  individual HRTF       80%
Sound quality  non-individual HRTF   20%
Localization   individual HRTF       100%
Localization   non-individual HRTF   0%

Table 3.4: Results of the questionnaire

Issues identified in the post-test interview session:

1. Loudness matching is crucial in listening tests. If the loudness differs, the whole listening experience is very different.

2. The EQ only targets two individuals, as EQ is usually tailored to experienced listeners. Adding a head tracker may introduce high latency in the EQ (in fact, the EQ already has some latency through the SOFA API and MATLAB). Achieving the ideal EQ may require help from others.

3. It is difficult for candidates without prior listening experience in this field to evaluate the performance of applying HRTFs to multi-channel audio, as the experiments did not include a speaker array as a control. (Most people find that adding HRTFs to monaural audio is less intuitive than multi-channel audio, as multi-channel seems to add detail; but if the left and right HRTFs are simply applied to two-channel audio, the sound image drifts, which is much worse than direct two-channel audio.)
EQ cannot solve this problem directly; a more professional approach is to record the sound and then combine the spatial information with the HRTFs (i.e. use BRIRs). From a psychoacoustic perspective, people are theoretically less sensitive to non-low-frequency room information and automatically compensate for subtle amplitude changes to identify a sound source. However, after multiple layers of filtering (rendering), amplitude changes can lead to unpleasant results.

3.3.2 Comparison with the Headphone Transfer Function

Figure 3.7: Transfer function of the Sennheiser HD 560S

The experiment did not use headphone calibration, so the correction curve produced by the listeners also contains information about the headphone transfer function. A comparison of the two curves shows that the EQ curve can approximately compensate for the differences due to the headphone frequency response. This suggests that the individual HRTF is reliable for sound reproduction in terms of both localization and sound quality.

3.4 Discussion

In the listening experiment, the two groups were given different subjective listening tests. Inexperienced listeners were asked to complete questionnaires and interviews, while experienced listeners additionally completed tuning tests. The results showed that the personalized HRTF-rendered sources produced by this method were preferred over the non-personalized HRTF-rendered sources in terms of both sound quality and localization. In the EQ test, the integrated EQ curve can be approximated as a compensation for the frequency response of the headphones themselves, which means that the individual HRTFs produced are highly usable.

In this study, several issues in the generation of personalized HRTFs need to be addressed. One major challenge is the impact of hair on the accuracy of HRTF generation.
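The comparison in Section 3.3.2 amounts to adding the listeners' EQ curve and the headphone response in dB and checking that the residual is nearly flat, since cascaded gains multiply and therefore add in dB. A toy sketch with entirely made-up numbers:

```python
import numpy as np

# Made-up example values: headphone deviation from flat (dB) at a few
# frequencies, and an EQ curve that roughly mirrors it, as the listeners'
# settings did in the experiment.
freqs_hz = np.array([125, 250, 500, 1000, 2000, 4000, 8000])
headphone_db = np.array([1.5, 0.5, -0.5, 0.0, 2.0, -3.0, -1.0])
eq_db = np.array([-1.3, -0.6, 0.4, 0.1, -1.8, 2.9, 1.2])

# Cascaded gains multiply, so responses expressed in dB simply add.
residual_db = headphone_db + eq_db
print(np.round(residual_db, 1))    # all residuals within about 0.2 dB
```

A small residual across the band is what supports the conclusion that the listeners' EQ mainly compensated the headphone response rather than the HRTF itself.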
Covering the hair with a wig or swim cap is not sufficient, as the hair on the temples and neck cannot be covered, leading to unsatisfactory results. Individuals with long hair are also challenging to scan, since wearing a cap cannot restore the shape of the head accurately. Moreover, merging the mesh requires facial meshes around the ear, where the impact of hair is almost impossible to avoid during actual scanning. Therefore, further efforts are needed to achieve accurate HRTF generation for individuals with long hair.

4 Conclusion

This chapter summarises the findings from the preceding chapters of this thesis.

4.1 Review

In this paper, a new individual HRTF processing method based on the 3D mesh scanned by Face ID on an iPhone 11 is presented. The results show that the individual HRTF generated by this method requires a lighter preparation process than traditional personalized HRTF methods, and participants gave it high ratings in the listening test.

Despite the challenges faced in the HRTF generation process, this study demonstrates a relatively mature 3D scanning workflow in Chapter 2. However, there is a need to uniformly clean and merge meshes in one software or application to enable fast and easy collation of qualified meshes for ideal HRTF preparation. In general, further efforts are needed to refine the HRTF generation process and improve the accuracy of the results.

A simple listening test was conducted in Chapter 3 for the generated individual HRTF. Although the cost of HRTF acquisition is much reduced compared to the traditional approach, the production process of the personalized HRTF proposed in this paper is still difficult to test on a large scale in a population. Therefore, this listening test was conducted with only five participants. From the results, all participants were satisfied with the personalized HRTF produced.
4.2 Discussion of Contributions

Despite the challenges faced during HRTF generation, this study demonstrates a relatively mature workflow for 3D scanning. However, the cleaning and merging of meshes should be unified in one software or application to enable quick and easy collation of qualified meshes for the desired HRTF preparation. Overall, further efforts are needed to refine the HRTF generation process and improve the accuracy of the results.

Chapter 2 describes the workflow of the new method for HRTF preparation and the comparison of its results with other methods. 3D meshes generated using Face ID require more cleaning and optimization steps than 3D meshes generated by high-performance laser scanners. However, the individual HRTFs simulated from meshes acquired by the different means are relatively similar, within 6 dB SPL for the frequency bands that mainly affect localization.

Chapter 3 describes a listening test of the individual HRTF based on the new method of Chapter 2. This listening test is based on the two-device test and focuses on comparing the sound quality and localization of the individual HRTF with those of a non-individual HRTF. The results show that most of the participants have a higher preference for the generated individual HRTF.

4.3 Future Work

A number of open-source spatial audio production tools are available. For example, the Spatial Audio Real-Time Applications (SPARTA) suite, developed and open-sourced by the Aalto University Acoustics Lab, can process a variety of Ambisonics-based effects and render them to loudspeakers or headphones as needed. These tools can all be used with headphone equipment for head tracking, and SPARTA and BST can load personalized HRIR or BRIR files for binaural rendering. Using these tools allows for more extensive personalized HRTF testing and exploration.
In addition, human visual perception of space is equally important. As Salmon [?] notes, numerous studies on 3D audio-visual interaction have shown that visual factors play a clear guiding role in sound source localization, distance perception, and sound externalization, and also influence the perception of the spatial character of the acoustic environment. In practice, therefore, an approximate restoration of the spatial sound field is often achieved through visual factors; for example, synchronization of sound and picture and correspondence of content further enhance sound immersion. The quality of spatial audio perception could thus be further improved with the help of head-up display devices.

At the end of this paper, some problems concerning the low-dimensional representation of HRTFs and personalized HRTF modeling are pointed out, and some improvements have been made. However, because of the limited time and experimental conditions, the work is still insufficient. In studies of artificial-neural-network-based personalized HRTF modeling, the main body shapes are usually measured directly with traditional measuring methods. This approach is affected by the measuring instruments and the operating procedure, and suffers from low precision and poor stability. To obtain more accurate data on human morphology, improve the accuracy of HRTF prediction, and achieve better localization performance, it is necessary to attempt 3D scanning with a depth camera.

Bibliography

[1] Brown C P, Duda R O.
A structural model for binaural sound synthesis[J]. IEEE Transactions on Speech & Audio Processing, 1998, 6(5):476-488.
[2] Geronazzo M, Spagnol S, Avanzini F. Estimation and modeling of pinna-related transfer functions[C]// Digital Audio Effects (DAFx-10), Graz, Austria, 2010:431–438.
[3] Geronazzo M, Spagnol S, Avanzini F. A head-related transfer function model for real-time customized 3-D sound rendering[C]// Signal Image Technol. and Internet-Based Syst. (SITIS '11), Dijon, France, 2013:174–179.
[4] Geronazzo M, Spagnol S, Avanzini F. Mixed structural modeling of head-related transfer functions for customized binaural audio delivery[C]// International Conference on Digital Signal Processing, Fira, Greece. IEEE, 2013:1-8.
[5] Kistler D J, Wightman F L. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction[J]. Journal of the Acoustical Society of America, 1992, 91(3):1637-47.
[6] Shin K H, Park Y. Enhanced vertical perception through head-related impulse response customization based on pinna response tuning in the median plane[J]. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2008, 91(1):345-356.
[7] Hwang S, Park Y, Park Y. Modeling and customization of head-related transfer functions using principal component analysis[C]// International Conference on Control, Automation and Systems (ICCAS 2008), Seoul, Korea, 2008:227-231.
[8] Hugeng, Wahab W, Gunawan D. Effective preprocessing in modeling head-related impulse responses based on principal components analysis[J]. Signal Processing, 2010, 4(4):201-212.
[9] Zotkin D N, Duraiswami R, Davis L S, et al. Virtual audio system customization using visual matching of ear parameters[J]. 2002, 3(3):1003-1006.
[10] Zotkin D N, Hwang J, Duraiswami R, et al. HRTF personalization using anthropo