Generating Character Animation for the
Apex Game Engine using Neural Networks
Implementing immersive character animation in an industry-
proven game engine by applying machine learning techniques

Master’s thesis in Computer science and engineering
for a degree at MPIDE - Interaction Design and Technologies, MSc.

JOHN SEGERSTEDT

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2021


Master’s thesis 2021

© JOHN SEGERSTEDT, 2021.

mentor:
Andreas Tillema, Avalanche Studios Group

initial supervisor:
Marco Fratarcangeli, Department of Computer Science and Engineering

substitute supervisor:
Palle Dahlstedt, Department of Computer Science and Engineering

examiner:
Staffan Björk, Department of Computer Science and Engineering.

Master’s Thesis 2021
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2021


Abstract
Machine learning, here in the form of neural networks that learn to map inputs to
outputs, has advanced considerably in recent years. It has been shown to
produce generalizable solutions within multiple fields of research and has
been deployed in real-world commercial products. One research area in
which regular scientific achievements are made is game development, and specifically
character animation. However, even though there has been much work on applying
machine learning techniques to character animation, few efforts have been made
to apply them in real-world game engines. This thesis project aimed to research the
applicability of one such piece of previous work within the proprietary Apex game
engine. The final results included an in-engine solution, producing character
animation purely from a predictive phase-functioned neural network. Additionally,
several different network configurations were evaluated to compare the impact of
using, for example, a deeper network or a network that had been trained for a longer
period of time, in an attempt to investigate potential improvements to the original
model. These alterations were shown to have negligible positive impacts on the final
results. Also, an additional network configuration was used to investigate the
applicability of this approach on an industry-used skeleton, producing promising
but imperfect results.

Keywords: machine learning, phase-functioned neural network, locomotive character
animation, Avalanche Studios Group, Apex, thesis


Acknowledgements
First and foremost, the author wants to show their appreciation for Andreas "Andy"
Tillema for being an incredibly supportive mentor who has, especially during the
later stages of this project, been a pillar of support. Without Andy's guidance, this
project would not have been possible.

Secondly, the two supervisors Marco Fratarcangeli and Palle Dahlstedt deserve
thanks for their support and feedback; Marco for initial project scope and Palle for
aiding in the process of turning a programming implementation into an actual thesis.

Thirdly, there have been multiple Avalanche employees that have offered their aid to
this project. Some of these that deserve praise include Robert "Robban" Petterson,
for helping with the .bvh retargeting, and Preeth Punnatjanath, for helping with
.bvh to 3D model skinning. The latter was able to provide aid on short notice,
even while tackling other deadlines.

John Segerstedt, Gothenburg, June 2021



Contents

1 Introduction
1.1 Background
1.2 Research Problem
1.3 Research Question
1.4 Scope

2 Theory
2.1 Artificial Neural Networks
2.1.1 Network Layers
2.1.2 The McCulloch-Pitts Neuron
2.1.3 Supervised Learning of a Neural Network
2.1.4 Underfitting and Overfitting
2.1.5 Gradient Descent
2.1.6 Adam Optimizer
2.2 Related Work
2.2.1 Phase-Functioned Neural Networks for Character Control (2017)
2.2.2 Mode-Adaptive Neural Networks for Quadruped Motion Control (2018)
2.2.3 Neural State Machine for Character-Scene Interactions (2019)
2.2.4 Local Motion Phases for Learning Multi-Contact Character Movements (2020)
2.2.5 Learned Motion Matching (2020)
2.3 File Types & Software Libraries
2.3.1 The .bvh filetype
2.3.2 Theano
2.3.3 Eigen

3 Methodology
3.1 To Answer the Research Questions
3.1.1 Researching Responsivity
3.1.2 Researching Accuracy
3.1.3 Researching Architecture
3.1.4 Simple Difference Significance Evaluation
3.2 The Phase-Functioned Neural Network
3.2.1 Network Structure
3.2.2 The Input Vector
3.2.3 The Output Vector
3.2.4 The Phase Function
3.3 The Full PFNN Pipeline
3.3.1 Generate Patches
3.3.2 Generate Database
3.3.3 Network Training
3.3.4 Neural Network

4 Process
4.1 The Runtime Package
4.1.1 Using the Runtime Package
4.1.2 ProceduralAnimations
4.1.3 PFNN
4.1.4 Character
4.1.5 Trajectory
4.1.6 Settings
4.1.7 HelperFunctions
4.1.8 ErrorCalculator
4.1.9 Waypoint

5 Result
5.1 Responsivity Results
5.2 Responsivity Visualizations
5.3 Accuracy Results
5.4 Accuracy Visualizations
5.5 Training Process
5.6 Pipeline Overview
5.7 Skinning Visualization

6 Discussion
6.1 Discussing the Research Questions
6.1.1 Discussing Responsivity
6.1.2 Discussing Accuracy
6.1.3 Discussing Architecture
6.2 Ethical Considerations
6.3 Takeaways
6.3.1 Integration Contextualization
6.3.2 Integration Placement
6.3.3 Implementation Expertise
6.3.4 Equipment Suitability
6.3.5 Neural Network Rigidity
6.4 Future Work
6.4.1 Runtime skeleton retargeting
6.4.2 Consistent world-axis orientations
6.4.3 Full pipeline integration

7 Conclusions
7.1 Summary
7.2 Final Words

Bibliography

List of Figures

List of Tables

A Appendix
A.1 .bvh Interpolator
A.2 Filenames
A.3 AVA skeleton joint names
A.4 Responsivity Data
A.5 Accuracy Data - Means
A.6 Accuracy Data - Standard Deviations
A.7 Training Mean-Squared Error


1
Introduction

1.1 Background
The domain of machine learning techniques applied within video game development
has been greatly expanded during the last few years, with multiple AAA-level
publishers funding departments dedicated to machine learning. Amongst others, these
include Ubisoft La Forge funded by Ubisoft Entertainment [1], and SEED funded
by Electronic Arts Inc [2]. Additionally, there have been great achievements within
academia on this topic, such as the research done at the University of Edinburgh by
Sebastian Starke, Daniel Holden, and others [3].

Within the specific sub-domain of generated character animation, considerable
achievements in research have been made by some of the aforementioned entities, much
of which was published as recently as 2020, see Section 2.2.

One of the reasons behind the new additions of machine learning within character
animation is the need for automation in managing potentially hundreds of thousands
of animation clips; as the demand for variation and fidelity, and also for more adaptive
and life-like animations, has increased, there has been an exponential growth in
the number of animations within newer game titles. This great escalation of the
problem space can be exemplified by the expectation that animations
adapt to external factors, such as uneven terrain. Otherwise, the lack of more
context-specific animations may lead to the players’ sense of immersion being broken.
By using scalable and context-free machine learning techniques to generate more
environmentally feasible animations, players may be kept more immersed in the
gameplay experience without developers having to manually link a seemingly endless
number of animation clips and states.

As a result of the video game industry’s immense size, there is a myriad of
stakeholders in potential revolutionary and commercially viable innovations using the
newly emerging techniques within machine learning; amongst others, those within
research, development, publishing, and consumption of video games.

Within the context of this thesis, the video game development studio Avalanche
Studios Group [4] is a direct stakeholder to the outcome of this project as a result
of their direct collaboration with the project.


1.2 Research Problem
The aim of this master’s thesis is to investigate the possibility of generating
realistic character animations using a predictive neural network trained on previously
captured animation data. This is to be achieved utilizing phase-functioned neural
networks (PFNN), based on previous research by Holden et al. [5].

Additionally, this master’s thesis aims to contribute to the research field by developing
an implementation within the Apex Game Engine [6], in contrast to previous
research. This proprietary game engine will be provided by the Avalanche Studios
Group [4] for use within this thesis.

1.3 Research Question
Main research question:

• How can the applicability of a Phase-Functioned Neural Network approach for
generating real-time locomotive character animation in modern game engines
be further improved?

To answer the wicked problem that is the main research question, this thesis aims
to investigate the following subsidiary questions:

• Responsivity - How much computational time is required for procedural
single-character locomotive animations, in an industry-proven game engine, on
consumer-grade hardware?

• Accuracy - How accurate are the generated locomotive character animations,
in an industry-proven game engine, to the original animation data?

• Architecture - How can the phase-functioned neural network architecture
presented in Holden et al. be improved?

Additionally, analysis and comparisons will be made between the quantitative results
during the responsivity and accuracy research of the following networks, see Section
3.1.3:

• Holden - Default

• Holden - Extra Trained

• Holden - Extra Layer

• Avalanche

Also, a summary of some of the most important learnings produced by this thesis
project will be written, see Section 6.3.


1.4 Scope
network type
Because of its simplicity, which is further discussed in Section 2.2, a phase-functioned
neural network was selected as the network architecture for this project.

network architecture
To answer the subsidiary research question regarding network architecture, and as
a clear delimitation of the scope of this thesis project, comparisons will be made
solely between the network archetypes listed in Section 1.3.

previous research
Since the phase-functioned neural network architecture was developed by Holden et
al. at the University of Edinburgh [5], the published network pipeline and
accompanying motion data will be used as a basis for this project. This previous research is
based on a left-handed world-axis orientation, see Section 3.3.4. The motion capture
data is of the .bvh filetype, see Section 2.3.1.

gait styles
To limit the space of character animations, only standard bipedal locomotive
character animations, such as walking, jogging, crouching, and strafing, are to be
considered. Similarly, interactions with advanced terrain and environments, such as
balancing on elevated narrow beams and dynamic crouching beneath low ceilings,
will not be considered. As such, these movement styles will be omitted from the
responsivity evaluation, see Section 1.3. However, motion capture files associated
with these movements will still be part of the data set for the evaluation process,
see Section 1.3.

game engine
This thesis will evaluate the feasibility of procedural animations only within the
Apex game engine, provided by Avalanche Studios Group. This engine uses a right-
handed world-axis orientation, see Section 3.3.4.

confidentiality
As a result of the Apex game engine being proprietary software solely used in-house,
a certain level of discretion is required by Avalanche Studios Group. This includes,
but is not limited to, potential omission of engine-specific details from the final
report.

phase function computation
Out of the three methods of computing the phase variable, as presented in Holden
et al. [5], only ‘constant’ is to be considered during this thesis project. This is done
in an effort to reduce the number of permutations of network settings.

terrain and inclination
The original demonstration application produced by Holden et al. [5] uses a static
heightmap for terrain height sampling. As this is not the case for the Apex game
engine, the phase-functioned neural network integration implemented as part of
this project will assume a fully flat terrain during runtime.


hardware
The entire evaluation process, and the network training, will be performed on a
single set of hardware specifications:

• CPU - Intel i7-8700k @ 3.70GHz

• GPU - NVIDIA GTX 1060 6GB

• RAM - 16.0GB



2
Theory

2.1 Artificial Neural Networks
This section provides an introduction to artificial neural networks, the machine
learning model used for mapping input features to output targets by updating network
weights and biases.

The concept of an artificial neural network is based on the human brain: given a
sensory input, neurons connected in a graph network fire signals across synapses
as their internal energy levels surpass specific thresholds.

A neural network can be trained to model arbitrary input-output relationships using
either:

• Supervised learning - Compares network outputs to a ground truth; for
example, used for image and speech recognition.

• Unsupervised learning - Attempts to minimize a given error measure, as no
specific ground truths are given; for example, used for clustering and
classification.

• Reinforcement learning - Traverses the solution space by being provided
intermediary encouragement and punishment given specific state spaces; for
example, used for self-driving vehicles.

For this thesis project, and for the rest of this theory segment, supervised learning
is the considered context.

2.1.1 Network Layers
A neural network can be modelled as a feed-forward directed acyclic graph with
multiple connected layers, see Figure 2.1.

When a neural network is given sensory input, this data is fed into the input
layer. Then, by evaluating the output of each layer given that of its predecessor,
see Section 2.1.2, the resulting evaluation of the network is produced in the output
layer. A network can have any number of hidden layers between the input and the
output layers, and any number of nodes in each layer.
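The layer-by-layer evaluation described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in this project: the function names, the example weights, and the 2-3-1 layout are all hypothetical, and each node is assumed to follow the weighted-sum-minus-threshold rule introduced in Section 2.1.2.

```python
import math


def sigmoid(b):
    """Logistic activation, squashing any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-b))


def layer_forward(inputs, weights, thresholds, activation):
    """Evaluate one fully connected layer.

    weights[i][j] is the weight from input j to node i of this layer.
    """
    return [
        activation(sum(w * s for w, s in zip(row, inputs)) - theta)
        for row, theta in zip(weights, thresholds)
    ]


def network_forward(inputs, layers, activation=sigmoid):
    """Feed the input layer values through every following layer in order."""
    signal = list(inputs)
    for weights, thresholds in layers:
        signal = layer_forward(signal, weights, thresholds, activation)
    return signal


# Hypothetical 2-3-1 network: two inputs, three hidden nodes, one output node.
layers = [
    ([[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]], [0.0, 0.1, -0.1]),  # hidden layer
    ([[0.7, -0.5, 0.2]], [0.05]),                                # output layer
]
output = network_forward([1.0, 2.0], layers)
print(output)  # a list holding the single output node value
```

Note that the number of nodes may differ per layer; only the length of each weight row has to match the size of the preceding layer.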


Figure 2.1: Simplified fully connected neural network model

Within the research field of machine learning, there are different network
architectures consisting of other types of network layers than the fully connected layer
type shown in Figure 2.1. Some other network types and examples of their usage are:

• Recurrent Neural Networks - By allowing for cyclical node connections,
information is able to be passed and remembered between iterations. Used in text
recognition and translation, amongst other fields.

• Convolutional Neural Networks - Through using feature convolving kernels
that sequentially read subsets of the input, pattern recognition can be
performed independent of the location of that pattern within the input data.
Used in image recognition, amongst other fields.

2.1.2 The McCulloch-Pitts Neuron
Named after its inventors, the smallest component of a neural network is the single
McCulloch-Pitts neuron [7]. Such a neuron, see Figure 2.2, produces an output given
an internal threshold and the weighted inputs of other neurons.

Figure 2.2: Simplified McCulloch-Pitts neuron model

Consider Equation 2.1; to evaluate the output signal of a McCulloch-Pitts neuron,
one first considers the local field $b_i^{(L+1)}$ and inputs it into an activation
function $g$. Popular activation functions include the ReLU, $g(b_i) = \max(0, b_i)$,
and the Sigmoid, $g(b_i) = \frac{1}{1 + e^{-b_i}}$, functions [8].

$$S_i^{(L+1)} = g\left(b_i^{(L+1)}\right) = g\Big(\sum_j w_{ij}^{(L+1)} S_j^{(L)} - \theta_i\Big) \tag{2.1}$$

where:

• $g$ is the activation function of the neuron

• $w_{ij}$ is the weight scalar from neuron $j$ to neuron $i$

• $S_j^{(L)}$ is the output of neuron $j$ in the previous layer $L$

• $\theta_i$ is the bias/threshold of neuron $i$
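As a small worked instance of Equation 2.1, the sketch below evaluates a single neuron with both activation functions mentioned above. The weights, inputs, and threshold are made-up illustration values, and the helper names are hypothetical.

```python
import math


def relu(b):
    """ReLU activation: max(0, b)."""
    return max(0.0, b)


def sigmoid(b):
    """Sigmoid activation: 1 / (1 + e^(-b))."""
    return 1.0 / (1.0 + math.exp(-b))


def local_field(weights, inputs, threshold):
    """Weighted sum of the incoming signals minus the neuron's threshold."""
    return sum(w * s for w, s in zip(weights, inputs)) - threshold


def neuron_output(weights, inputs, threshold, g):
    """Equation 2.1: apply the activation function g to the local field."""
    return g(local_field(weights, inputs, threshold))


# Illustration values only (not taken from any trained network).
weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 0.5]
theta = 0.25

b = local_field(weights, inputs, theta)  # 0.5 - 2.0 + 1.0 - 0.25 = -0.75
print(b)
print(neuron_output(weights, inputs, theta, relu))     # 0.0, as b is negative
print(neuron_output(weights, inputs, theta, sigmoid))  # roughly 0.32
```

With a negative local field the ReLU neuron stays silent, whereas the Sigmoid neuron still emits a small positive signal.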

2.1.3 Supervised Learning of a Neural Network
By adjusting the weights and biases, a network can map any given set of input
features X to any specific output Y . These are often referred to as pairs of input
and output vectors.

To achieve this, a training process such as the following is performed:

Simplified training algorithm of a neural network

1. Split the data into two sets: training and validation

2. Initialize the network with random weights and biases

3. Use backpropagation to train the network using the training data according
to the algorithm below (= an ‘epoch’)

4. Evaluate the accuracy of the network by performing complete prediction of all
data points in the validation set, see Section 2.1.5

5. If the validation accuracy is increasing, according to some heuristic, go to step
3. Otherwise, terminate training (= ‘early stopping’), see Section 2.1.4
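The five steps above can be sketched as a generic training loop. This is a simplified illustration under stated assumptions: the callables `init_model`, `train_epoch`, and `validation_error` are hypothetical placeholders, the stopping heuristic is simply "stop as soon as the validation error no longer improves", and a single weight fitted to y = 2x stands in for a real network.

```python
import random


def split_data(pairs, validation_fraction=0.2, seed=0):
    """Step 1: shuffle the pairs and split them into training and validation."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def train(pairs, init_model, train_epoch, validation_error, max_epochs=100):
    training, validation = split_data(pairs)
    best_model = init_model()                              # step 2
    best_error = validation_error(best_model, validation)
    for _ in range(max_epochs):
        candidate = train_epoch(best_model, training)      # step 3: one epoch
        error = validation_error(candidate, validation)    # step 4
        if error >= best_error:                            # step 5
            break                                          # early stopping
        best_model, best_error = candidate, error
    return best_model, best_error


# Toy stand-in for a network: a single weight w fitted to y = 2x.
def init_model():
    return random.Random(1).uniform(-1.0, 1.0)


def train_epoch(w, data, eta=0.05):
    for x, y in data:
        w += eta * (y - w * x) * x  # move w to reduce the squared error
    return w


def validation_error(w, data):
    return sum((y - w * x) ** 2 for x, y in data) / len(data)


xs = [i / 10.0 for i in range(-10, 11)]
pairs = [(x, 2.0 * x) for x in xs]
model, error = train(pairs, init_model, train_epoch, validation_error)
print(model, error)  # w close to 2 with a very small validation error
```

The validation pairs are only ever passed to `validation_error`, never to `train_epoch`, which mirrors the separation of the two data sets in step 1.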

Backpropagation is the technique of using the chain rule [9] to compute the update
for weights backwards through the network, using the computed error in the output
layer.

This is done by performing the following algorithm:

Backpropagation algorithm for a neural network

1. Forward propagate the input through the network

2. Calculate the output error using the difference to the ground truth

3. Propagate the errors back through the network

4. Update the weights using the backpropagated errors

5. Update the biases using the backpropagated error

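equivalent to:

1. $S_i^{(L)} \leftarrow g\big(\sum_j w_{ij}^{(L)} S_j^{(L-1)} - \theta_i^{(L)}\big)$, for all neurons $i$ in layers $L$

2. $\delta_k^{(O)} \leftarrow g'\big(b_k^{(O)}\big)\big(y_k - S_k^{(O)}\big)$, for all neurons $k$ in output layer $O$

3. $\delta_j^{(L)} \leftarrow \sum_i \delta_i^{(L+1)} w_{ij}^{(L+1)} g'\big(b_j^{(L)}\big)$, for all neurons $j$ in non-output layers $L$

4. $w_{ij}^{(L)} \leftarrow w_{ij}^{(L)} + \eta \delta_i^{(L)} S_j^{(L-1)}$, for all neurons $i$ in layers $L$

5. $\theta_i^{(L)} \leftarrow \theta_i^{(L)} - \eta \delta_i^{(L)}$, for all neurons $i$ in layers $L$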

where y_k is the ground-truth value for output neuron k, and η ∈ (0, 1) is the
learning rate of the network, which may decay during the training process. The
learning rate, and other parameters such as the number of epochs or the batch size,
are collectively referred to as hyperparameters.
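To make the five update steps concrete, the following sketch applies them to a tiny network with one hidden layer. It is an illustrative assumption-laden example rather than the training code of this project: the Sigmoid is used as activation function, w[i][j] is read as the weight from neuron j to neuron i (as defined for Equation 2.1), and the logical OR data set, network size, and learning rate are arbitrary choices.

```python
import math
import random


def g(b):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-b))


def g_prime(b):
    """Derivative of the Sigmoid with respect to the local field."""
    s = g(b)
    return s * (1.0 - s)


def init_layer(n_out, n_in, rng):
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """Step 1: return the outputs S and the local fields b of every layer."""
    outputs, fields = [list(x)], []
    for weights, thresholds in layers:
        b = [sum(w * s for w, s in zip(row, outputs[-1])) - theta
             for row, theta in zip(weights, thresholds)]
        fields.append(b)
        outputs.append([g(v) for v in b])
    return outputs, fields


def backprop_step(x, y, layers, eta=0.5):
    outputs, fields = forward(x, layers)                      # step 1
    deltas = [[g_prime(b) * (t - s)                           # step 2
               for b, t, s in zip(fields[-1], y, outputs[-1])]]
    for l in range(len(layers) - 1, 0, -1):                   # step 3
        w_next, d_next = layers[l][0], deltas[0]
        deltas.insert(0, [
            g_prime(fields[l - 1][j])
            * sum(d_next[i] * w_next[i][j] for i in range(len(d_next)))
            for j in range(len(fields[l - 1]))
        ])
    for (weights, thresholds), delta, s_prev in zip(layers, deltas, outputs):
        for i, d in enumerate(delta):
            for j, s in enumerate(s_prev):
                weights[i][j] += eta * d * s                  # step 4
            thresholds[i] -= eta * d                          # step 5


def mean_squared_error(data, layers):
    return sum(
        sum((t - s) ** 2 for t, s in zip(y, forward(x, layers)[0][-1]))
        for x, y in data
    ) / len(data)


rng = random.Random(0)
layers = [init_layer(2, 2, rng), init_layer(1, 2, rng)]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]  # logical OR
error_before = mean_squared_error(data, layers)
for _ in range(2000):  # 2000 epochs over the four input-output pairs
    for x, y in data:
        backprop_step(x, y, layers)
error_after = mean_squared_error(data, layers)
print(error_before, error_after)
```

Here the weights are updated after every single input-output pair; Section 2.1.5 discusses updating after batches instead.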

2.1.4 Underfitting and Overfitting
When training a neural network, or when performing other types of regression model
fitting, the choice of model complexity may give rise to issues as a result of a
complexity level too low or too high.

Consider Figure 2.3, which shows the same data points fitted by three different
regression models. The optimal model for these types of problems is that which can
most accurately represent the data distribution, and therefore can most precisely
predict future data points belonging to the same data set. In the figure, the left
model suffers from underfitting, whereas the right model suffers from overfitting.

Figure 2.3: Simplified example of under/overfitting in 2D regression
Left: Underfitting, model fails to accurately represent data.
Middle: Optimal fit, model mimics the sampled distribution.
Right: Overfitting, model fails to generalize observations

In the example shown in Figure 2.3, the polynomial degree of a regression
model was shown to be directly related to any potential underfitting or overfitting.
However, when it comes to neural network training, the complexity of the model
is already decided upon prior to the training session. As such, the analogous
parameter for neural network training is instead the number of training iterations.


Figure 2.4: Simplified example of early stopping
As the prediction error on the training data decreases during the training process,
the prediction error on the validation data initially decreases as well. After some
time, the validation error starts to increase as generalizability is lost.

Consider Figure 2.4; here a neural network is trained repeatedly using a training
data set of input and output vector pairs, see Section 2.1.3. A natural expectation of
such a training process is that the prediction error on the training data set steadily
declines as the training, often counted in the number of epochs, proceeds. However,
to prevent the potential loss of generalization of the network, a separate validation
data set is used. The network is at no point allowed to learn from, or update
its network weights and biases in response to, being shown the validation data set.
Instead, this data set is solely used to estimate the prediction qualities of the network
given unseen data.

In other words, the calculated error of the network predictions on the validation
data set is considered to be an inverse measure of the generalizability of the network:
the lower the validation error, the better the network generalizes. As such, the ideal
performance of the network is reached when the validation error reaches a
minimum, at which point the training process should terminate. This is visualized in
Figure 2.4, where halting training before the validation error starts to increase leads
to potential underfitting, and halting after this minimum leads to overfitting.
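As a small illustration of locating this minimum, the sketch below scans a recorded validation-error curve and reports both the best epoch and the epoch at which training would be halted. The `patience` parameter (how many non-improving epochs to tolerate, useful when the curve is noisy) and the error values are assumptions made for this example only.

```python
def early_stopping(validation_errors, patience=2):
    """Return (best_epoch, stop_epoch) for a recorded validation curve.

    Training is halted once the validation error has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(validation_errors) - 1


# Made-up curve shaped like Figure 2.4: decreasing first, then increasing.
validation_curve = [1.00, 0.80, 0.60, 0.55, 0.60, 0.70, 0.85]
best, stopped = early_stopping(validation_curve)
print(best, stopped)  # 3 5: the minimum is at epoch 3, training halts at epoch 5
```

The network weights saved at the best epoch, rather than those at the stopping epoch, would then be kept.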

2.1.5 Gradient Descent
The backpropagation algorithm presented in Section 2.1.3 attempts to minimize
the prediction error by updating each weight and bias according to the derivative
of the error function with respect to that variable.

Conceptually, one can visualize this process by using Figure 2.5, which shows how the
prediction error of a neural network is directly dependent on the values of the weights
and biases. Then, as a training process includes initializing random weights and
biases, consider a marked spot at a random starting point on the line. From there
on, through gradient descent during the network training process, that marked spot
will move downhill as the network weights and biases are updated.


Figure 2.5: Simplified example of prediction error depending on weights/biases
This model is heavily simplified. In actuality, a more realistic representation has
one dimension for each weight value and for each bias value.

However, by using true gradient descent on an unknown error landscape, one
might get stuck in a local minimum. This can be visualized in Figure 2.5 if one were
to initialize a network configuration close to the noted local minimum. To avoid
this serious issue, one introduces even more randomness to the network training
procedure. As such, if the network solution is allowed to occasionally move uphill,
against the descent direction, a weights and biases configuration
may escape a local minimum.
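The dependence on the starting point can be illustrated with a one-dimensional sketch. The error function below is a hypothetical stand-in for the curve in Figure 2.5, chosen only because it has one local and one global minimum; the learning rate and step count are likewise arbitrary.

```python
def error(w):
    """Hypothetical 1D error landscape with two minima."""
    return w ** 4 - 3.0 * w ** 2 + w


def error_gradient(w):
    """Derivative of the error function with respect to w."""
    return 4.0 * w ** 3 - 6.0 * w + 1.0


def gradient_descent(w, eta=0.01, steps=2000):
    """True gradient descent: always step against the gradient."""
    for _ in range(steps):
        w -= eta * error_gradient(w)
    return w


w_right = gradient_descent(2.0)   # settles in the local minimum near w = 1.13
w_left = gradient_descent(-2.0)   # settles in the global minimum near w = -1.30
print(w_right, error(w_right))
print(w_left, error(w_left))
```

Both runs follow exactly the same rule, yet the run started at w = 2.0 ends with a clearly higher error because it never leaves the basin it started in.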

This introduction of further randomness in the network training process can be
done by updating the weights and biases after only having seen a randomly chosen
subset, called a batch, of the training data set. This is called batch training and
the technique of adding further randomness to the model training is referred to as
stochastic gradient descent.
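A minimal sketch of batch training is given below, assuming a toy model with one weight and one threshold, y = w * x - θ, fitted to made-up data generated from y = 3x + 1. The batch size, learning rate, and number of epochs are arbitrary illustration values.

```python
import random


def sgd(data, batch_size=4, eta=0.1, epochs=200, seed=0):
    """Stochastic gradient descent on the model y = w * x - theta."""
    rng = random.Random(seed)
    w, theta = 0.0, 0.0
    for _ in range(epochs):
        shuffled = list(data)
        rng.shuffle(shuffled)  # a new random batch composition every epoch
        for start in range(0, len(shuffled), batch_size):
            batch = shuffled[start:start + batch_size]
            # Residuals of the current model on this batch only.
            residuals = [(y - (w * x - theta), x) for x, y in batch]
            # Update after having seen just the batch, not the full data set.
            w += eta * sum(r * x for r, x in residuals) / len(batch)
            theta -= eta * sum(r for r, _ in residuals) / len(batch)
    return w, theta


xs = [i / 10.0 for i in range(-10, 11)]
data = [(x, 3.0 * x + 1.0) for x in xs]
w, theta = sgd(data)
print(w, theta)  # w approaches 3 and theta approaches -1
```

Because each update is computed from a different random subset, individual steps do not follow the gradient of the full error function exactly, which is the source of the added randomness.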

2.1.6 Adam Optimizer
The Adam optimization algorithm [10] [11] is an extension to stochastic gradient
descent, see Section 2.1.5, first presented in 2015, aiming to increase efficiency during
neural network training.

The most impactful difference between the Adam optimization algorithm and standard
stochastic gradient descent is that where the latter uses a single learning rate for
all trainable parameters, the former adapts an individual step size for each parameter
based on running averages of that parameter's gradient and squared gradient. The
exponential decay rates of these running averages are controlled through the
hyperparameters beta1 and beta2.
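The per-parameter behaviour can be sketched as follows, using the update rule and the commonly quoted default values for beta1 and beta2 from the original Adam publication [10]. The function name, the base step size `alpha`, and the two-parameter quadratic error function are assumptions made for this illustration, not part of the training pipeline used in this project.

```python
import math


def adam_minimize(grad, params, steps=2000, alpha=0.05,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a function given its gradient, using the Adam update rule."""
    params = list(params)
    m = [0.0] * len(params)  # running average of the gradient
    v = [0.0] * len(params)  # running average of the squared gradient
    for t in range(1, steps + 1):
        gradient = grad(params)
        for i, g_i in enumerate(gradient):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_i
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_i * g_i
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction
            v_hat = v[i] / (1.0 - beta2 ** t)
            # Each parameter gets its own effective step size.
            params[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return params


# Hypothetical error function E = (w1 - 1)^2 + 10 * (w2 + 2)^2.
def grad(p):
    return [2.0 * (p[0] - 1.0), 20.0 * (p[1] + 2.0)]


w1, w2 = adam_minimize(grad, [0.0, 0.0])
print(w1, w2)  # close to the minimum at (1, -2)
```

Although the gradient along w2 is ten times steeper than along w1, dividing by the square root of the squared-gradient average gives both parameters steps of comparable size.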


2.2 Related Work
Recently, advancements in machine learning using deep neural networks have been
made in a number of fields:

Since the 2015 edition of ImageNet [12], the annual image recognition algorithm
competition, competitors have been able to produce machine learning solutions that
outperform humans in classifying photographs of objects [13].

In 2020, the AlphaFold agent developed by the DeepMind team, funded by Google,
was able to make accurate predictions of protein shapes based on their sequences of
amino acids, potentially being able to "accelerate research in every field of biology"
[14].

The tech giant Google is using machine learning techniques for a multitude of their
online services, such as: busyness metrics for public areas, individual passage
indexing on webpages, song/music identification, breaking news detection, and language
translation [15] [16].

Within gaming specifically, there have been multiple recent breakthroughs:

In 2016, the machine learning agent AlphaGo, produced by the DeepMind team,
defeated Lee Sedol, winner of 18 world titles, in the board game of Go [17]. The
achievement led to AlphaGo officially becoming the first ever computer agent to be
awarded the highest ranking certification within the sport, 9 dan [17]; a large step
forward from the narrow matches between IBM’s Deep Blue and Garry Kasparov
in the much less complex game of Chess in 1997.

In a similar vein, after AlphaGo’s triumph in Go, computer games have become a
new focus for deep learning agents. One of these is AlphaStar, also developed by
DeepMind, which in 2019 became the first AI to achieve the highest ranking within
a widely popular esport title without any game restrictions, in the computer game
StarCraft II [18].

Another such agent is OpenAI Five, developed by the OpenAI team and funded
amongst others by Elon Musk, which also in 2019 achieved both expert-level
performance and human-AI cooperation in the computer game Dota 2 [19].

However, world class competition is not the only use for neural network agents.
Multiple games have officially launched with either fully, or partially, machine learned
agents, such as the A.I. that players can play against in the game Planetary
Annihilation [20]. Another example, where machine learning is only partially used, is the
threat response, the fight-or-flight reaction, of the A.I. agent in the game Supreme
Commander 2 [21]. Additionally, similar agents have been developed to the benefit
of the game designers and developers, for reasons such as gameplay balancing and
strategy win predictions [22].

Other uses of machine learning techniques to increase productivity, generalizability,
or efficiency of the game development process include the field of procedurally
generated game content. Areas where machine learning techniques have been
applied in such research include: level design, text and narration, music and sound
effects, model textures, and character animations [23].

Some other specific examples of machine learning applied during game development
can be found in the research done at the machine learning research division
Ubisoft La Forge [1], who have conducted research on, amongst others: 3D character
navigation [24], motion in-betweening for character animations [25], data-driven
physics simulations [26], automatic code bug detection [27], and motion capture
data denoising [28].

As this report is on the topic of neural network generated locomotive character an-
imations, the rest of this section is dedicated to presenting different advancements
within this field and to, where relevant, relate their respective strengths and weak-
nesses in contrast to that of phase-functioned neural networks.

The following research is presented chronologically.

2.2.1 Phase-Functioned Neural Networks for Character Control (2017)

In this paper, Holden et al., at the University of Edinburgh, introduce a single network
architecture for generating locomotive character animation over rough terrain
using a phase function [5].

At each frame during runtime, this neural network takes as input the current pose
of the character, any potential user input, and information of specific sample points
along the ground ahead and behind the character, to produce the next character
pose. This character pose consists of information regarding each joint within the
character model, such as their position and orientation.

To accurately model the cyclical nature of the human walk, a function that produces
a phase variable is introduced. This variable models the transitions between the contact
of each of the two feet of the character with the ground. This ensures a perpetual
forward animation of the character, stepping with each foot sequentially.

For a more in-depth presentation of the phase-functioned neural network used in
this report, see Section 3.2.

2.2.2 Mode-Adaptive Neural Networks for Quadruped Motion Control (2018)

This paper, authored by researchers at the University of Edinburgh, introduces
a dual neural network system for generating locomotive character animation for
quadrupeds [29].

The first of these two networks is the motion prediction network, which is highly
similar to the neural network used in a phase-functioned neural network, see Section
3.2. The second is a gating network that outputs blending coefficients, which are used
as inputs to the motion prediction network, similar to the phase variable of a
phase-functioned neural network, see Section 3.2.

By controlling the motion weights of the motion network using another generative
neural network, one trades manually labeling the phase of the motion training data
for the requirement of training this separate gating network. This, however, is
necessary for quadrupeds as the cyclical leg movement of these is heavily dependent
on gait styles and cannot be modelled with a single phase variable [29].

Result-wise, the mode-adaptive neural network approach achieves more realistic
quadruped motion on flat terrain than a phase-functioned neural network
[30]. However, direct comparisons of memory footprint or computational complexity,
for either flat or rough terrain, were omitted from the original paper.

2.2.3 Neural State Machine for Character-Scene Interactions (2019)

Compared to other research presented here, this paper, also from the University
of Edinburgh, specifically focuses on the interaction between a character and scene
objects, such as opening doors, sitting on chairs, and lifting and carrying boxes [31].

For the training data, specific motion capture clips of the supported object
interactions were recorded and a set of control points was manually labeled, such as
the armrests of a chair being interacted with. A data augmentation scheme was then
used to generate a 16GB training set from the initial data.

At runtime, the state machine can transition and blend between animation modes
such as walk, sit, open, carry, etc, triggered by user input. Additionally, this system
is fed not solely the pose of the character and specific control points along the
trajectory of the character, but also the geometry of the nearby surroundings through
voxelization.

Contrary to the neural state machine which was built for this specific purpose, a
phase-functioned neural network produces unsatisfactory and jittery results when at-
tempting to produce character animation of scene object interactions [32]. However,
for strictly locomotive tasks, a phase-functioned neural network produced compara-
ble results in areas such as foot sliding and response time [31].

Additionally, the neural state machine presented in this research includes two
conjoined neural networks, similar to the system structure presented in Section
2.2.2, one of which includes a phase variable similar to that of a phase-functioned
neural network. Also, the comparatively massive training data set leads to a
considerably longer training time than that of a phase-functioned neural network [31].

2.2.4 Local Motion Phases for Learning Multi-Contact Character Movements (2020)

This paper, authored by researchers at Electronic Arts and the University of
Edinburgh, presents the concept of local motion phases and shows successful
applications of this concept both within new fields of animation and within the
context of previous research from the university [33].

The local motion phase is conceptually similar to the phase variable in a
phase-functioned neural network, see Holden et al. [5]. However, rather than modelling
the locomotive movement of an entire character with a single phase variable, local
motion phases are automatically calculated and maintained on a per-bone basis. This
allows for more realistic animations during highly detailed movements than those that
can be generated by a phase-functioned neural network [34].

At its core, the system presented in this paper has a similar dual network structure
to those mentioned in Section 2.2.2 and Section 2.2.3. However, in addition to
the inclusion of the local motion phases, this paper introduces an autoencoder for
the user input. This generative control model encodes and decodes user input at
runtime, pretrained on the motion capture data.

This approach has been shown to produce high quality animations for tasks such
as lifting boxes similar to those presented in Section 2.2.3, playing basketball, and
for quadruped movement similar to those presented in Section 2.2.2. However, as
this approach builds upon multiple advanced concepts, its final network structure is
substantially more intricate than that of the phase-functioned neural network.

2.2.5 Learned Motion Matching (2020)
Traditional motion matching [35] consists of a system that regularly fetches the most
appropriate pre-processed animation from an animation database, given a set of
pose features, for a specific character. Such a database consists of highly structured
and non-overlapping directional movement, so as to minimize the number of recorded
animation sequences. This is because the memory footprint of a motion matching system
scales linearly with the size of the database, as the latter must be kept fully in
memory during runtime.

Learned motion matching [36], however, is a technique presented by Ubisoft La Forge
[1] that introduces a neural network approach to motion matching that removes
direct dependency on an animation database. Performance-wise, although a learned
motion matching system is able to produce indistinguishable results to that of a
traditional motion matching system with a substantially smaller memory footprint,
it does so requiring considerably longer computational time [36].

Compared to using a phase-functioned neural network, a learned motion matching
network uses slightly less memory and significantly less computational time at
runtime, while requiring a substantially shorter training period [36] and being able
to produce similar or better results [37].

Although a learned motion matching system does not require phase-labeling, it
instead demands to be trained on a meticulously constructed database of
encompassing motion capture data. Additionally, compared to a phase-functioned neural
network, a learned motion matching system requires three distinct networks, each with
their own features, targets, and error functions.

2.3 File Types & Software Libraries
This section aims to present relevant file types and software libraries used within
this project, and to provide a brief introduction on how to use them.

2.3.1 The .bvh filetype
The .bvh, Biovision Hierarchy, filetype can be used to store motion capture data.
These are human-readable text files, containing both a structural definition for the
motion capture joint skeleton and the per frame joint data.

HIERARCHY
ROOT Hips
{
    OFFSET 0.000000 0.000000 0.000000
    CHANNELS 6 Xposition Yposition Zposition Zrotation Yrotation Xrotation
    JOINT LeftLeg
    {
        OFFSET 1.000000 -1.000000 1.000000
        CHANNELS 3 Zrotation Yrotation Xrotation
        End Site
        {
            OFFSET 0.000000 0.000000 0.000000
        }
    }
    JOINT RightLeg
    {
        OFFSET -1.000000 -1.000000 1.000000
        CHANNELS 3 Zrotation Yrotation Xrotation
        End Site
        {
            OFFSET 0.000000 0.000000 0.000000
        }
    }
}
MOTION
Frames: 3
Frame Time: 0.008333
2.801100 17.851100 -0.421913 -0.943466 0.030603 6.685755 -1.889587 17.864721 4.969343 -14.847416 -7.065584 -13.249440 1.025640
2.800815 17.848850 -0.421355 -0.932144 0.000126 6.685797 -1.904051 17.868369 4.931144 -14.870445 -7.095567 -13.284463 1.082333
2.800560 17.846750 -0.420267 -0.919127 -0.037791 6.682931 -1.919381 17.879059 4.896097 -14.894115 -7.119117 -13.317921 1.137596

Figure 2.6: Simplified .bvh file example

Consider the simple .bvh example in Figure 2.6.

Firstly, a ‘HIERARCHY’ of skeleton nodes is defined by name and parent offset.
Additionally, each joint has a number of ‘CHANNELS’ associated with it. In
this example, a skeleton of three joints is defined: a parent ‘Hips’ joint with the two
child joints ‘LeftLeg’ and ‘RightLeg’.

Lastly, a .bvh file features a ‘MOTION’ section where the per frame motion captured
data is listed. In order, each floating point value here corresponds to one of the
‘CHANNELS’ specified in the previous section. Each new row of data corresponds to
a new frame. Joint orientations are stored in degrees, within the interval (−180, 180].

Software, such as Blender, can import .bvh files and render the motion applied to
the skeleton, as defined within the .bvh.
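
To make the relation between the ‘CHANNELS’ declarations and the rows of the ‘MOTION’ section concrete, the following is a minimal Python sketch of a reader for this file layout. It is not part of the project code: the function name ‘parse_bvh’, the sample string, and its numeric values are hypothetical, and the sketch assumes that each frame occupies exactly one row.

```python
# Hypothetical two-joint sample; the values are illustrative only.
SAMPLE_BVH = """\
HIERARCHY
ROOT Hips
{
    OFFSET 0.0 0.0 0.0
    CHANNELS 6 Xposition Yposition Zposition Zrotation Yrotation Xrotation
    JOINT LeftLeg
    {
        OFFSET 1.0 -1.0 1.0
        CHANNELS 3 Zrotation Yrotation Xrotation
        End Site
        {
            OFFSET 0.0 0.0 0.0
        }
    }
}
MOTION
Frames: 2
Frame Time: 0.008333
2.8 17.8 -0.4 -0.9 0.0 6.6 -1.8 17.8 4.9
2.8 17.8 -0.4 -0.9 0.1 6.6 -1.9 17.9 4.8
"""


def parse_bvh(text):
    """Parse a .bvh string into (channel names, frame time, list of frames)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    channels = []      # "<joint>.<channel>" names, in file order
    joint = None       # joint that the next CHANNELS line belongs to
    motion_at = None   # index of the MOTION keyword
    for index, line in enumerate(lines):
        tokens = line.split()
        if tokens[0] in ("ROOT", "JOINT"):
            joint = tokens[1]
        elif tokens[0] == "CHANNELS":
            count = int(tokens[1])
            channels.extend(f"{joint}.{name}" for name in tokens[2:2 + count])
        elif tokens[0] == "MOTION":
            motion_at = index
            break
    if motion_at is None:
        raise ValueError("missing MOTION section")
    frame_count = int(lines[motion_at + 1].split(":")[1])
    frame_time = float(lines[motion_at + 2].split(":")[1])
    frames = []
    for row in lines[motion_at + 3:motion_at + 3 + frame_count]:
        values = [float(token) for token in row.split()]
        if len(values) != len(channels):
            raise ValueError("row length does not match the channel count")
        # Pair every value with the channel it belongs to.
        frames.append(dict(zip(channels, values)))
    return channels, frame_time, frames


channels, frame_time, frames = parse_bvh(SAMPLE_BVH)
print(len(channels), frame_time, frames[1]["LeftLeg.Zrotation"])
```

Storing each frame as a dictionary keyed by joint and channel name keeps the sketch readable; a production reader would more likely keep the values in a flat numeric array.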

2.3.2 Theano
Theano [38] is a Python library built on top of NumPy [39] for efficient
multi-dimensional array computations on the GPU. Theano was initially released in 2007
and further development on the project was shut down in 2017 [40].

Theano is used by creating ‘theano.function’ objects that specify both the input/output
parameters and the actual operation to perform. For a simple example of Theano
code, see Figure 2.7. However, Theano can be used for much more complex
computations, such as neural network training, by passing the error function and
learning rate update to ‘theano.function’.

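import theano
from theano import tensor

# Symbolic double-precision scalar inputs
a = tensor.dscalar()
b = tensor.dscalar()

# Symbolic expression; nothing is computed yet
c = a + b
# Compile the expression graph into a callable function
f = theano.function([a, b], c)

print(f(0.5, 1.5))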

Figure 2.7: Simplified Theano code snippet for addition on the GPU
The code snippet is expected to print ‘2.0’ to the console after performing an addition
of the scalars 0.5 and 1.5 on the GPU.

2.3.3 Eigen
Eigen [41] is a C++98 library used to perform high-speed matrix and array oper-
ations. Eigen was first released in 2006 and has since been used in the creation of
other software libraries, such as TensorFlow [41] [42].

In Eigen, one can define both statically-sized and dynamically-sized matrices and
arrays. However, arithmetic operations that mix matrices and arrays are not allowed,
requiring users to convert objects explicitly between the two types, using
‘.matrix()’ and ‘.array()’.

For a simple example of how to use Eigen arrays to represent a single-layered neural
network, see Figure 2.8.

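#include <Eigen/Dense>

Eigen::ArrayXXf W0;  // weights, two-dimensional (outputs x inputs)
Eigen::ArrayXf b0;   // biases, one per output
Eigen::ArrayXf Y;    // network output

void PerformNetworkPrediction(const Eigen::ArrayXf& X) {
    // The product requires matrix types; the bias is added coefficient-wise
    Y = (W0.matrix() * X.matrix()).array() + b0;
}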

Figure 2.8: Single-layered network implementation using Eigen
Note: this example requires variable initialization before use of the
‘PerformNetworkPrediction()’ function.
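
As an illustration of what the expression in Figure 2.8 computes, the same single-layer operation can be written out with plain Python lists. This is only a sketch for clarifying the shapes involved (a two-dimensional weight table, a one-dimensional input, and a one-dimensional bias); the function name ‘single_layer’ and the numeric values are hypothetical.

```python
def single_layer(W0, b0, X):
    """Return Y = W0 * X + b0, where each row of W0 produces one output value."""
    if len(W0) != len(b0) or any(len(row) != len(X) for row in W0):
        raise ValueError("shape mismatch between W0, b0 and X")
    # Matrix-vector product row by row, then coefficient-wise bias addition.
    return [sum(w * x for w, x in zip(row, X)) + b for row, b in zip(W0, b0)]


W0 = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # three outputs, two inputs
b0 = [0.5, 0.5, 0.5]
print(single_layer(W0, b0, [1.0, 1.0]))
```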

3
Methodology

3.1 To Answer the Research Questions
After the neural network solution has been fully integrated into the Apex game
engine, its viability for generating locomotive character animations will be quanti-
tatively evaluated in the following two ways.

3.1.1 Researching Responsivity
Responsivity will be measured in computational time required, per frame, during
runtime for the network related code.

This will be tested for locomotive character animations around a static, obstacle-free
track course, so as to ensure deterministic user input. This track course will be
defined using waypoints, see Section 4.1.9, which both act as positional checkpoints
along the course and dictate what movement style the character is expected
to produce while traversing the environment toward the waypoint.

The track course will be defined using the following waypoints, see Table 3.1:

Index  Pos (m)     Gait    Speed  StrafeDir.
0      (-30, 40)   Walk    2.5    -
1      (-50, 0)    Walk    2.5    (-0.5, -0.5)
2      (-25, -25)  Jog     10.5   -
3      (25, -40)   Jog     10.5   -
4      (10, -10)   Crouch  2      -
5      (25, 50)    Walk    2.5    (0, -1)

Table 3.1: Track course details
Pos = the position of the waypoint in meters in world space
Gait = gait type, see Holden et al. [5]
Speed = goal root speed
StrafeDir. = normalized character facing direction vector when strafing

The final responsivity result will include the mean and standard deviation of
frametime on a per-lap basis over 19 laps. The in-engine representation of the
track course can be seen in Section 5.2.
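
As a sketch of how such per-lap statistics could be derived from recorded frame times, consider the following Python example. The function name ‘summarize_laps’ and the sample values are hypothetical, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption.

```python
import statistics


def summarize_laps(frame_times_per_lap):
    """Return one (mean, sample standard deviation) pair per lap."""
    return [
        (statistics.mean(lap), statistics.stdev(lap))
        for lap in frame_times_per_lap
    ]


# Hypothetical frame times in milliseconds for two short laps.
laps_ms = [[1.0, 2.0, 3.0], [2.0, 2.0, 2.0]]
for lap_index, (mean_ms, sd_ms) in enumerate(summarize_laps(laps_ms)):
    print(lap_index, mean_ms, sd_ms)
```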

Responsivity was chosen for evaluation because high responsiveness is a requirement
for both immersive and interactive non-passive experiences, such as computer games,
and because it can be evaluated quantitatively.
Additionally, low responsiveness may influence both player enjoyment and
performance when playing computer games, interrupting a possible state of flow [43].

3.1.2 Researching Accuracy
The accuracy of the integrated network solution will be measured by comparison
between the predicted output pose and the corresponding ground truth for the
entirety of the training data.

To allow for this comparison, pairs of input and output vectors will be generated
in the same way as during the database generation process, see Section 3.3.2, and will
be evaluated as part of the final runtime package, see Section 4.1.8.

There are multiple different error definitions used within regression, such as mean-
squared error (MSE), root mean-square error (RMSE), and mean-absolute error
(MAE) [44].

The error definition to be used as part of the accuracy evaluation in this report
will be mean-absolute error. This was chosen because, firstly, mean-absolute error is
more forgiving of outliers, which may be expected in this particular data set, and
secondly, mean-squared error is already used as part of the training process. Using a
different error calculation for the evaluation process than for training may be useful
for testing generalization and allows for comparisons between the two errors.

The error will be presented in both mean and standard deviation on a per-file basis,
evaluated through a per-frame calculation according to the following mean-absolute
error formula:

$$\frac{1}{|j|} \sum_{j} \frac{|t_j - p_j|}{t_j} \tag{3.1}$$

where $|j|$ is the total number of frames in this file, $t_j$ is the three-dimensional
position of joint $j$ as defined in the motion capture database, and $p_j$ is the
three-dimensional network-predicted output position of joint $j$. The values taken
from the motion capture database, including $t_j$, are referred to as the ground truth.

As this definition includes the ground truth term in the denominator, an error
evaluated using the formula can be interpreted as the relative prediction error. In
other words, an error of 0.01 equals an average joint position error of 1%.
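
The following Python sketch shows one possible reading of this error measure. Equation 3.1 does not state how a three-dimensional difference is divided by a three-dimensional position, so the sketch assumes that both are reduced to Euclidean lengths first; this interpretation, the function names, and the sample values are assumptions and not taken from the implementation. A ground truth joint located exactly at the origin would cause a division by zero and would need separate handling.

```python
import math
import statistics


def relative_frame_error(truth_frame, predicted_frame):
    """Mean relative position error over all joints of a single frame."""
    ratios = [
        math.dist(truth, predicted) / math.hypot(*truth)
        for truth, predicted in zip(truth_frame, predicted_frame)
    ]
    return statistics.mean(ratios)


def file_error(truth_frames, predicted_frames):
    """Mean and (population) standard deviation of the per-frame errors of a file."""
    per_frame = [
        relative_frame_error(truth, predicted)
        for truth, predicted in zip(truth_frames, predicted_frames)
    ]
    return statistics.mean(per_frame), statistics.pstdev(per_frame)


# One frame with two joints: the first joint is off by 10%, the second is exact.
truth = [[(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]]
predicted = [[(1.1, 0.0, 0.0), (0.0, 2.0, 0.0)]]
print(file_error(truth, predicted))
```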

Additionally, as a result of restrictive system memory, the maximum number of
frames considered in each motion capture data file is that which equals at most
500’000 discrete joint positions. In other words, for a skeleton with 191 joints, only
the first 2’617 unique frames will be considered.
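
The frame limit follows from an integer division of the joint position budget by the number of joints in the skeleton, as the following small Python sketch shows. The function name ‘max_frames’ is hypothetical.

```python
def max_frames(joint_count, max_joint_positions=500_000):
    """Largest number of whole frames that fits within the joint position budget."""
    return max_joint_positions // joint_count


print(max_frames(191))  # skeleton with 191 joints
print(max_frames(31))   # skeleton with 31 joints
```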

3.1.3 Researching Architecture
To answer the subsidiary research question regarding architecture optimality,
comparisons will be made between the results of the different network configurations,
as presented in Section 1.3.

This analysis is to be done by comparison of the evaluation results, as presented in
Section 3.1.1 and Section 3.1.2, between the following neural network configurations:

• Holden - Default (HOLDEN) - The default network solution as presented in
Holden et al. [5]: a phase-functioned neural network with a single hidden
network layer of 512 nodes, trained for 2’000 epochs, see Section 3.3.3.

• Holden - Extra Layer (HOLDEN-XL) - The default network solution as pre-
sented in Holden et al. [5] but using two hidden network layers of 512 nodes
each.

• Holden - Extra Trained (HOLDEN-XT) - The default network solution as
presented in Holden et al. [5] but trained for 4’000 epochs.

• Avalanche (AVA) - The default network solution as presented in Holden et al.
[5] but heavily altered to accommodate an in-house skeleton.

The reasoning behind these specific choices of network configurations was that,
firstly, there must be a control case network that mimics the original implementation.
Then, a deeper network, or a network that is trained for a longer period
of time, could be used for conceptually straightforward comparisons. Additionally,
evaluating a network with a longer training process makes it possible to investigate
whether the original implementation by Holden et al. suffers from overfitting,
see Section 2.1.4, given that the original implementation uses neither validation data
nor early stopping.

the holden configurations
The hyperparameters that will be used for the network configurations are directly
based on the work by Holden et al. [5], see Section 3.2.

Additionally, the 31-joint .bvh skeleton, see Section 2.3.1, used for these
configurations is that of the original .bvh files made public by Holden et al. [5],
see Figure 3.1.

The HOLDEN and HOLDEN-XT networks will both have an input layer width of
342, a hidden layer width of 512, and an output layer width of 311. The extra
hidden layer present in the HOLDEN-XL network configuration will also have 512
nodes.

Figure 3.1: Visualization of the .bvh skeleton used by the Holden configurations
This is the same skeleton as presented by Holden et al. [5].

the avalanche configuration
The altered AVA network will generally use the same network hyperparameters as
the Holden configurations.

However, it will use a different skeleton, one that is used in a live Avalanche product.
This skeleton is visualized in Figure 3.2.

Figure 3.2: Visualization of the .bvh skeleton used by the AVA configuration
The ‘extra’ joints that appear outside the character body are used for tasks such as
deformation, object interactions, and player camera locations. Note the many
extra joints in the character head and hands compared to Figure 3.1. For the
complete list of joint names in this skeleton, see Appendix A.3.

To accommodate the AVA skeleton having 191 joints, rather than the 31 of the
Holden configurations, the Avalanche network will have an input layer width of 1’302,
the same hidden layer width of 512, and an output layer width of 1’751.

Also, since this configuration aims to use an in-house Avalanche skeleton, the .bvh
files made public by Holden et al. [5] will need to be re-generated. This retargeting
step will be done by professionals employed at Avalanche.

Additionally, since the network is trained for 60Hz predictions, whereas the
.bvh files were retargeted to the in-house standard of 30Hz, these .bvh files will
need to be interpolated, see Figure A.1 and Figure A.2 in Appendix A.1.

As a consequence of the greater number of joints, the network training database
used for the AVA configuration will only include every fourth motion capture frame,
as the training database otherwise does not fit in the system memory of the machine
used as part of this thesis, see Section 1.4. To reemphasize: the AVA
configuration will be trained on a fourth of the number of motion capture frames
used for the Holden configurations. However, each frame in the AVA training
database will contain data for more than six times as many joints as in the Holden
training database. This issue could have been resolved by rewriting the
network training logic such that parts of the training database are loaded into, and
unloaded from, memory during training. This would allow the network to train
on the entire data set, including all movement frames, even though the database
would be too large to fit in system memory at once. However, this procedure would
require an extensive rewrite of the original implementation by Holden et al., and
would potentially drastically increase the time required during the training process,
as loading and unloading such large chunks of memory is slow.

Also, the subset of joints in the Avalanche skeleton that are most equivalent to
those of the Holden skeleton will be referred to as the Avalanche Masked skeleton,
AVA-M, see Section 3.3.3.

3.1.4 Simple Difference Significance Evaluation
To evaluate the statistical significance of the difference between two data sets, a and
b, the following heuristic will be used:

$$\frac{2\,|\mathrm{mean}(a) - \mathrm{mean}(b)|}{\mathrm{sd}(a) + \mathrm{sd}(b)} \tag{3.2}$$

This is equivalent to expressing the difference between the means of the two data
sets in units of the average of their respective standard deviations.

The absolute difference is used here for the same reasons that mean-absolute error
is used for the accuracy evaluation, see Section 3.1.2.
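
A minimal Python sketch of this heuristic is given below. The function name ‘difference_in_sd_units’ and the sample values are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import statistics


def difference_in_sd_units(a, b):
    """Absolute difference of the means, in units of the average standard deviation."""
    spread = statistics.stdev(a) + statistics.stdev(b)
    return 2.0 * abs(statistics.mean(a) - statistics.mean(b)) / spread


# The means differ by 3 and both standard deviations equal 1.
print(difference_in_sd_units([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
```

Under this reading, a value of 1 means that the two means lie one average standard deviation apart.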

3.2 The Phase-Functioned Neural Network
A phase-functioned neural network, as presented by Holden et al. [5], is a neural
network with weights generated by a cyclic phase variable produced by a phase
function.

This section aims to describe the functional components of a phase-functioned neural
network within the specific context of this project, as presented in Holden et al. [45].

3.2.1 Network Structure
The network architecture used in Holden et al. [5] is a neural network with the
following structure, where each network node uses a trainable bias:

• H0 - Input layer of 342 nodes, see Section 3.2.2.

• H1 - Fully-connected hidden layer of 512 nodes.

• H2 - Output layer of 311 nodes, ELU [8] activation function, see Section 3.2.3.

3.2.2 The Input Vector
The input vector $x_i$, at frame $i$, is a concatenation of, among others, sample points
on the terrain along the traversed and expected path of the animated character, see
Figure 3.3, and the current joint positions and velocities of the character.

$$x_i = \{t^p_i,\; t^d_i,\; t^h_i,\; t^g_i,\; j^p_{i-1},\; j^v_{i-1}\} \in \mathbb{R}^n \tag{3.3}$$

where:

• $t^p_i \in \mathbb{R}^{2t}$, the x, y positions of the sample points in character local space

• $t^d_i \in \mathbb{R}^{2t}$, the x, y trajectories of the sample points in character local space

• $t^h_i \in \mathbb{R}^{3t}$, the heights of each sample point and additional sub-sample points

• $t^g_i \in \mathbb{R}^{5t}$, a vector containing the gait of the character along the sample points

• $j^p_{i-1} \in \mathbb{R}^{3j}$, the positions of all $j$ character joints in the previous frame $i-1$

• $j^v_{i-1} \in \mathbb{R}^{3j}$, the velocities of all $j$ character joints in the previous frame $i-1$

where:

$t$ is the number of sample points centered around, and including the one at the feet of,
the character. This value was set to 12 in Holden et al. [5], equaling five sample
points ahead of, and six sample points behind, the character.

$j$ is the number of joints within the character model. This value was set to 31 in
Holden et al. [5].

Figure 3.3: Subset of PFNN input vector visualized
a: sample point positions - $t^p_i \in \mathbb{R}^{2t}$
b: sample point trajectories - $t^d_i \in \mathbb{R}^{2t}$
c: (sub-)sample point heights - $t^h_i \in \mathbb{R}^{3t}$
source: Holden et al. [46].

3.2.3 The Output Vector
Similarly, the output vector yi, at frame i, is a concatenation of both predicted
future states, the next pose of the character, and an update of certain metadata.

$$y_i = \{t^p_{i+1},\; t^d_{i+1},\; j^p_i,\; j^v_i,\; j^a_i,\; r^x_i,\; r^z_i,\; r^a_i,\; \dot{p}_i,\; c_i\} \in \mathbb{R}^m \tag{3.4}$$

where:

• $t^p_{i+1} \in \mathbb{R}^{2t}$, the predicted x, y positions of the sample points in character local
space of the next frame $i+1$

• $t^d_{i+1} \in \mathbb{R}^{2t}$, the predicted x, y trajectories of the sample points in character
local space of the next frame $i+1$

• $j^p_i \in \mathbb{R}^{3j}$, the generated position of all $j$ character joints

• $j^v_i \in \mathbb{R}^{3j}$, the generated velocities of all $j$ character joints

• $j^a_i \in \mathbb{R}^{3j}$, the generated angles of all $j$ character joints

• $r^x_i \in \mathbb{R}$, local character velocity in the relative x direction

• $r^z_i \in \mathbb{R}$, local character velocity in the relative z direction

• $r^a_i \in \mathbb{R}$, local character angular velocity around the world up vector

• $\dot{p}_i \in \mathbb{R}$, phase variable update delta

• $c_i \in \mathbb{R}^4$, binary contact labels of heel and toe joints with the ground

3.2.4 The Phase Function
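The phase function blends between four sets of network weights,
$\alpha_{k_0}, \alpha_{k_1}, \alpha_{k_2}, \alpha_{k_3}$,
using cubic Catmull-Rom interpolation [47]. As such, the number of network weights
that need to be stored in memory at runtime is four times that of a single network
configuration.

The phase function $\Theta$ is evaluated:

$$
\begin{aligned}
\Theta(p;\, \alpha_{k_0}, \alpha_{k_1}, \alpha_{k_2}, \alpha_{k_3}) ={}& \alpha_{k_1} \\
&+ w\left(\tfrac{1}{2}\alpha_{k_2} - \tfrac{1}{2}\alpha_{k_0}\right) \\
&+ w^2\left(\alpha_{k_0} - \tfrac{5}{2}\alpha_{k_1} + 2\alpha_{k_2} - \tfrac{1}{2}\alpha_{k_3}\right) \\
&+ w^3\left(\tfrac{3}{2}\alpha_{k_1} - \tfrac{3}{2}\alpha_{k_2} + \tfrac{1}{2}\alpha_{k_3} - \tfrac{1}{2}\alpha_{k_0}\right)
\end{aligned}
\tag{3.5}
$$

where:

$$w = \frac{4p}{2\pi} \pmod{1} \tag{3.6}$$

$$k_n = \left\lfloor \frac{4p}{2\pi} \right\rfloor + n - 1 \pmod{4} \tag{3.7}$$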

Within this project, the phase function will be evaluated during runtime. An al-
ternative approach would be to precompute the function and store its results in
memory. This would reduce the computational load at runtime but increase the
memory footprint [5].
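
To illustrate the runtime evaluation of Equations 3.5 to 3.7, the following is a minimal Python sketch in which each of the four weight sets is represented as a flat list of floats. The function name ‘phase_function’ and the sample weights are hypothetical; the actual implementation operates on Eigen matrices rather than Python lists.

```python
import math


def phase_function(p, alpha):
    """Blend the four weight sets in alpha for a phase p given in radians."""
    scaled = 4.0 * p / (2.0 * math.pi)
    w = scaled % 1.0                      # Equation 3.6
    base = math.floor(scaled)
    # Equation 3.7: indices of the four control points, wrapping around.
    k0, k1, k2, k3 = ((base + n - 1) % 4 for n in range(4))
    blended = []
    for a0, a1, a2, a3 in zip(alpha[k0], alpha[k1], alpha[k2], alpha[k3]):
        # Equation 3.5: cubic Catmull-Rom interpolation, applied per weight.
        blended.append(
            a1
            + w * (0.5 * a2 - 0.5 * a0)
            + w ** 2 * (a0 - 2.5 * a1 + 2.0 * a2 - 0.5 * a3)
            + w ** 3 * (1.5 * a1 - 1.5 * a2 + 0.5 * a3 - 0.5 * a0)
        )
    return blended


# Four hypothetical weight sets containing a single weight each.
alpha = [[0.0], [1.0], [2.0], [3.0]]
print(phase_function(0.0, alpha))          # phase at the first control point
print(phase_function(math.pi / 2, alpha))  # phase at the second control point
```

At phases that are whole multiples of a quarter cycle, $w$ is zero and the blend reduces to one of the stored weight sets, which is the interpolating property of the Catmull-Rom spline.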

3.3 The Full PFNN Pipeline

Figure 3.4: The full PFNN pipeline
‘data’ = offline storage
‘script’ = runnable files
‘memory’ = temporary, runtime

This section aims to provide an overview of the full phase-functioned neural network
pipeline, as designed by Holden et al. [5] and as presented in Figure 3.4. The final
integrated version of this model is presented in Section 4.1.

3.3.1 Generate Patches
To allow for the generation of locomotive character animations that adhere to the
roughness of the topography, the training data used later must include different
types of terrain. A solution to this is to fit heightmaps to the separately recorded
motion capture data, firstly producing intermediate patches of terrain.

3.3.2 Generate Database
During this step, each input and output vector pair, see Section 3.2.2 and Section
3.2.3, is produced and stored. Each vector pair is created on a per-frame basis using
motion captured data, see Section 2.3.1, and associated labels, such as the phase
and gait variables. Additionally, for each motion capture clip, the ten most suitable
heightmaps are fitted to the foot-to-ground contacts of the character.

3.3.3 Network Training
Training will be performed using the Theano-based implementation by Holden et al. [5]
(Theano [38] is a Python library for multi-dimensional array computations on the GPU,
see Section 2.3.2) together with an Adam optimizer, see Section 2.1.6. The result of
this step will be the final trained network weights. The default hyperparameters for
the training will be:

• batchsize = 32

• learning rate = 0.0001

• beta1 = 0.9

• beta2 = 0.999

• epochs = 2000

• error function = mean-squared error

For the order of the motion capture data files, see Table A.1 in Appendix A.2.
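
To make the roles of the learning rate, beta1, and beta2 listed above concrete, the following Python sketch performs textbook Adam updates on a flat list of parameters. It is not the training code of Holden et al.: the function name ‘adam_step’, the epsilon value of 1e-8, and the toy objective are assumptions made for illustration.

```python
import math

# Hyperparameter values as listed above; the dictionary itself is hypothetical.
HYPERPARAMETERS = {"learning_rate": 0.0001, "beta1": 0.9, "beta2": 0.999}


def adam_step(params, grads, m, v, step, learning_rate, beta1, beta2, eps=1e-8):
    """Perform one Adam update; step counts from 1. Returns new (params, m, v)."""
    new_params, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(params, grads, m, v):
        m_i = beta1 * m_i + (1.0 - beta1) * g        # first moment estimate
        v_i = beta2 * v_i + (1.0 - beta2) * g * g    # second moment estimate
        m_hat = m_i / (1.0 - beta1 ** step)          # bias corrections
        v_hat = v_i / (1.0 - beta2 ** step)
        new_params.append(p - learning_rate * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(m_i)
        new_v.append(v_i)
    return new_params, new_m, new_v


# Toy objective (x - 3)^2 with gradient 2 * (x - 3), starting from x = 0.
params, m, v = [0.0], [0.0], [0.0]
for step in range(1, 4):
    grads = [2.0 * (params[0] - 3.0)]
    params, m, v = adam_step(params, grads, m, v, step, **HYPERPARAMETERS)
print(params)
```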

During the training process, the translation and orientation, within the input vector,
of the joints not on the following list, or not equivalent to these in the case of the
Avalanche configuration, will be set to ≈ 0, as is done in Holden et al. [5]:

• Hips

• LeftUpLeg

• LeftLeg

• LeftFoot

• LeftToeBase

• RightUpLeg

• RightLeg

• RightFoot

• RightToeBase

• Spine

• Spine1

• Neck1

• Head

• LeftArm

• LeftForeArm

• LeftHand

• RightArm

• RightForeArm

• RightHand

Additionally, this training process will not make use of the early-stopping technique,
see Figure 2.4.

3.3.4 Neural Network
This step includes the entire package necessary for runtime pose prediction.
During initialization, all necessary trained network weights will be read and loaded
into memory. Then, each frame, a prediction request is passed to the package,
providing a character pose in the current frame and expecting an updated character pose
as return value. In addition to the character pose, other metadata is fed to the
network for prediction, such as sample points of the topography and user input, see
Section 3.2.2.

The bulk of the integration work will be in this step. However, the overall
package structure will be based on the demonstration codebase made public by
Holden et al. [5], with the neural network model defined using Eigen arrays and
matrices, see Section 2.3.3.

Additionally, as mentioned in Section 1.4, the motion capture data, and therefore
the trained neural network, use a left-handed world-axis, whereas the Apex engine
uses a right-handed world-axis, see Figure 3.5.

Figure 3.5: Visual representation of left/right-handed world-axis orientations
Left: Left-handed world-axis (green)

Right: Right-handed world-axis (purple)

For this reason, the runtime neural network package must be altered such that
it can convert between the world-axis orientations. The character pose, living in
a right-handed world-axis, is to be converted to the left-handed world-axis of the
neural network. Then, the updated character pose output by the neural network must
be converted back into the right-handed world-axis before being applied to the
character skeleton.
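
The following Python sketch shows one common way of converting between the two orientations. It assumes that the two conventions differ only by a mirrored z-axis; which axis actually differs between the motion capture data and the Apex engine is not stated here, so the ‘FLIP’ constant and the function names are hypothetical. Under this assumption, a position is converted by negating its z-component, and a rotation matrix $R$ is converted as $SRS$ with $S = \mathrm{diag}(1, 1, -1)$.

```python
# Assumption: the two conventions differ by a mirrored z-axis.
FLIP = (1.0, 1.0, -1.0)


def convert_position(position):
    """Mirror a 3D position; applying the function twice restores the input."""
    return tuple(component * sign for component, sign in zip(position, FLIP))


def convert_rotation(matrix):
    """Convert a 3x3 rotation matrix R into S * R * S, where S = diag(FLIP)."""
    return tuple(
        tuple(FLIP[row] * matrix[row][col] * FLIP[col] for col in range(3))
        for row in range(3)
    )


def rotate(matrix, position):
    """Apply a 3x3 rotation matrix to a 3D position."""
    return tuple(sum(m * c for m, c in zip(row, position)) for row in matrix)


# A rotation of 90 degrees around the y-axis, and an arbitrary position.
rotation = ((0.0, 0.0, 1.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0))
position = (1.0, 2.0, 3.0)

# Rotating and then converting matches converting and then rotating.
print(convert_position(rotate(rotation, position)))
print(rotate(convert_rotation(rotation), convert_position(position)))
```

Because the mirror is its own inverse, the same two functions serve both directions of the conversion.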


4
Process

4.1 The Runtime Package
This section aims to present the runtime package implemented for the phase-functioned
neural network solution, originally based on the demonstration software made public
by Holden et al. [5]. An overview of this package is presented in Figure 4.1.

Figure 4.1: The Procedural Animations runtime package
The neural network solution is accessible either from the ProceduralAnimations
class, or indirectly through the ErrorCalculator class.

4.1.1 Using the Runtime Package
The runtime package, see Section 4.1, aims to have a low level of coupling,
such that other programmers need neither interact with, nor understand, the deeper
machinations of the package.

As such, to use the runtime package, a programmer only needs to do
two things: initialize the ProceduralAnimations class and call ‘GenerateNextPose(...)’
when wanting to use the network for predictions, see Section 4.1.2.

During initialization, the ProceduralAnimations constructor takes three optional
parameters, see Figure 4.2:

• new_world_transform - A 4×4 matrix for character scaling/rotation/translation.

• new_setting - A Setting enum, see Section 4.1.6, for network configurations.

• new_waypoints_ptr - A pointer to a vector of Waypoints, see Section 4.1.9.

CProceduralAnimations(
    AosMatrix4 new_world_transform = AosMatrix4(0.0f),
    CSettings::SETTING new_setting = CSettings::HOLDEN,
    std::vector<CProceduralAnimationsWaypoint*>* new_waypoints_ptr = nullptr);

Figure 4.2: Procedural Animations constructor

Then, during the constructor execution, the objects that the ProceduralAnimations
class owns are initialized.

During runtime prediction, only the ‘GenerateNextPose(...)’ function is required, see
Figure 4.3. This function takes two parameters: a pointer to the current character
pose, and a pointer to the translational character-in-world offset. The
‘GenerateNextPose(...)’ function then updates the two input parameters in place, given
the outputs of the neural network.

void GenerateNextPose(CPose* pose, AosVector3* translation_offset);

Figure 4.3: Procedural Animations per frame prediction
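
The intended usage can be sketched as follows. All types below are hypothetical stand-ins that only mirror the signatures shown in Figure 4.2 and Figure 4.3, and the body of ‘GenerateNextPose(...)’ is a stub rather than the network prediction.

```cpp
#include <vector>

// Hypothetical stand-ins for engine types, so that the sketch compiles on its own.
struct AosVector3 { float x = 0.0f, y = 0.0f, z = 0.0f; };
struct AosMatrix4 {
    float m[16];
    explicit AosMatrix4(float value = 0.0f) { for (float& e : m) e = value; }
};
struct CPose { std::vector<AosVector3> joint_positions; };
struct CProceduralAnimationsWaypoint {};
struct CSettings { enum SETTING { HOLDEN, AVA }; };

class CProceduralAnimations {
public:
    explicit CProceduralAnimations(
        AosMatrix4 new_world_transform = AosMatrix4(0.0f),
        CSettings::SETTING new_setting = CSettings::HOLDEN,
        std::vector<CProceduralAnimationsWaypoint*>* new_waypoints_ptr = nullptr) {
        (void)new_world_transform; (void)new_setting; (void)new_waypoints_ptr;
    }

    // Stub body: the real function runs the network. Here every joint and the
    // world offset simply move 0.01 units along Z so that the effect is visible.
    void GenerateNextPose(CPose* pose, AosVector3* translation_offset) {
        for (AosVector3& joint : pose->joint_positions) joint.z += 0.01f;
        translation_offset->z += 0.01f;
    }
};

// Caller side: construct once, then call GenerateNextPose once per frame.
inline AosVector3 SimulateFrames(int frame_count) {
    CProceduralAnimations animations;  // all constructor parameters are optional
    CPose pose;
    pose.joint_positions.resize(31);   // 31 joints, as in the Holden skeleton
    AosVector3 offset;
    for (int frame = 0; frame < frame_count; ++frame) {
        animations.GenerateNextPose(&pose, &offset);  // both arguments are updated in place
    }
    return offset;
}
```

The caller never touches the network, character, or trajectory objects directly, which is the low level of coupling that the package aims for.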

4.1.2 ProceduralAnimations
This class owns the pointers to the PFNN, Character, Trajectory, and Settings
representations. As the ErrorCalculator class is intended only for evaluation purposes,
the ProceduralAnimations class is the default way to access the phase-functioned
neural network solution.

Inside the ‘GenerateNextPose(...)’ function, see Figure 4.3, the flow of sub-function
calls is organized as follows:

1. Prepare - Stores the input pose information in the Character object.

2. Input - Evaluates the Waypoint information and sets the Trajectory state.

3. Insert - Inserts the Character and Trajectory states into the input vector.

4. Predict - Runs the network prediction, setting the output vector.

5. Output - Stores the relevant output vector information in the return pose.

6. Update - Updates the Character and Trajectory states using the output vector.

The time required to perform these six steps is recorded each frame for use in
evaluating the system's responsivity, see Section 3.1.2.
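
The six-step flow and the per-frame timing can be sketched as below. This is an illustration under assumptions: the class name `FramePipeline` is hypothetical, every step is a placeholder that only logs its name, and the use of `std::chrono::steady_clock` is a choice made for this sketch rather than a description of the actual timing code.

```cpp
#include <chrono>
#include <string>
#include <vector>

// Hypothetical skeleton of the per-frame flow. Every step is a placeholder that
// only logs its name; the real steps operate on the pose and network vectors.
class FramePipeline {
public:
    void GenerateNextPose() {
        const auto start = std::chrono::steady_clock::now();
        Prepare();
        Input();
        Insert();
        Predict();
        Output();
        Update();
        const auto stop = std::chrono::steady_clock::now();
        // Store the elapsed time of this frame in milliseconds.
        frame_times_ms_.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    const std::vector<double>& FrameTimesMs() const { return frame_times_ms_; }
    const std::vector<std::string>& StepLog() const { return step_log_; }

private:
    void Prepare() { step_log_.push_back("Prepare"); }
    void Input()   { step_log_.push_back("Input"); }
    void Insert()  { step_log_.push_back("Insert"); }
    void Predict() { step_log_.push_back("Predict"); }
    void Output()  { step_log_.push_back("Output"); }
    void Update()  { step_log_.push_back("Update"); }

    std::vector<double> frame_times_ms_;
    std::vector<std::string> step_log_;
};
```

A monotonic clock is used in the sketch so that the recorded frame times cannot become negative if the system clock is adjusted during a run.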

Additionally, this class has debug rendering functionality for visually rendering
network parameters, such as the joint skeleton, the sample points, character
velocities, etc., in the engine.

4.1.3 PFNN
This struct holds the memory representation of the neural network and is responsible
for the network prediction.

When initialized, the PFNN struct loads the network weights and biases from stored
.bin files into in-memory Eigen matrices, see Section 2.3.3. The .bin directory and
network configuration are fetched from the Settings object. Additionally, the PFNN
struct is the only part of the runtime package that depends on the Eigen library.

During the prediction step, the PFNN struct performs the matrix multiplications
necessary to propagate the input vector state, and then standardizes the result
before storing it in the output vector data structure.
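
The prediction step can be illustrated with the simplified sketch below. It is not the PFNN implementation: it uses plain `std::vector` containers instead of Eigen, a single hidden layer, and no phase-dependent blending of the weights. The ELU activation follows Holden et al. [5]; the mean and standard deviation scaling of inputs and outputs is an assumption about what the standardization involves, and all names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vector = std::vector<float>;
using Matrix = std::vector<Vector>;  // row-major: matrix[row][column]

// Computes W * x + b for a dense layer.
inline Vector Affine(const Matrix& W, const Vector& b, const Vector& x) {
    Vector y(b);
    for (std::size_t r = 0; r < W.size(); ++r)
        for (std::size_t c = 0; c < x.size(); ++c)
            y[r] += W[r][c] * x[c];
    return y;
}

// Exponential linear unit, applied element-wise.
inline Vector Elu(Vector v) {
    for (float& e : v)
        if (e < 0.0f) e = std::exp(e) - 1.0f;
    return v;
}

// A one-hidden-layer network with hypothetical member names.
struct TinyNetwork {
    Matrix W0, W1;
    Vector b0, b1;
    Vector x_mean, x_std, y_mean, y_std;

    Vector Predict(Vector x) const {
        // Normalize the input with the stored statistics (assumed scheme).
        for (std::size_t i = 0; i < x.size(); ++i) x[i] = (x[i] - x_mean[i]) / x_std[i];
        Vector y = Affine(W1, b1, Elu(Affine(W0, b0, x)));
        // Rescale the raw output back to the original units (assumed scheme).
        for (std::size_t i = 0; i < y.size(); ++i) y[i] = y[i] * y_std[i] + y_mean[i];
        return y;
    }
};
```

In the sketch, the cost of one prediction is dominated by the two nested loops in `Affine`, which is why the layer widths directly affect the frame time.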

4.1.4 Character
The Character struct stores the positions and translations, in model space, of all
character joints in the current frame. Additionally, the same information is stored
for the few previous frames to allow for output blending when setting the return
pose values.

4.1.5 Trajectory
Similar to the Character struct, the Trajectory struct holds all information regarding
the sample points along the ground, see Figure 3.3, such as positions and velocities.
These values are also stored between multiple frames to allow for output blending.

4.1.6 Settings
The Settings class is used to manage easy switching between the different network
configurations, see Section 3.1.3, each of which is represented as an enum value
passed to the constructor.

To allow for a low level of coupling and for extensibility, in the sense that
additional network configurations can be added with minimal changes to the code
base, the Settings class holds all data that may be affected by the choice of network
configuration. In other words, to add another network configuration, one only needs
to add support for it in the Settings class.

For example, all paths to the network .bin files are defined in the Settings class.
This means that when a PFNN object initializes, it simply calls something similar to
‘settings->GetWeightsPath()’, without needing any logic, e.g. switch cases, that
requires knowledge of the network configuration enum or of how that configuration
would affect this class. This is shown in Figure 4.4.

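#include <string>

// Simplified illustration of how Settings hides configuration-specific data.
class Settings {
public:
    enum CONFIG { HOLDEN, AVA };

    explicit Settings(CONFIG new_config) {
        // All configuration-dependent values are chosen here, and only here.
        switch (new_config) {
        case HOLDEN:
            path = "/holden_weights/";
            break;
        case AVA:
            path = "/avalanche_weights/";
            break;
        }
    }

    // Callers ask for the path without knowing which configuration is active.
    std::string GetPath() const {
        return path;
    }

private:
    std::string path;
};

Figure 4.4: Simplified example of Settings implementation
(Disclaimer: simplified example, not the actual implementation.)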
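
The consumer side of this design can be sketched as follows. The function `WeightsFile` and the file naming scheme are hypothetical and only illustrate that the caller never inspects the configuration enum; the `Settings` class is repeated in a compact form so that the sketch is self-contained.

```cpp
#include <string>

// Compact, hypothetical version of the Settings class from Figure 4.4.
class Settings {
public:
    enum CONFIG { HOLDEN, AVA };

    explicit Settings(CONFIG config) {
        switch (config) {
        case HOLDEN: path_ = "/holden_weights/"; break;
        case AVA:    path_ = "/avalanche_weights/"; break;
        }
    }

    const std::string& GetPath() const { return path_; }

private:
    std::string path_;
};

// Consumer code: builds a file name from the path without any switch on CONFIG.
// The "W<layer>.bin" naming is made up for this illustration.
inline std::string WeightsFile(const Settings& settings, int layer) {
    return settings.GetPath() + "W" + std::to_string(layer) + ".bin";
}
```

Adding a further configuration would then only require a new enum value and a new case in the `Settings` constructor, while `WeightsFile` stays unchanged.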

4.1.7 HelperFunctions
This is a simple, fully static class that holds functions such as debug printouts and
definitions for specific matrix operations.

4.1.8 ErrorCalculator
When evaluating the network, rather than creating an instance of the
ProceduralAnimations class, one initializes an ErrorCalculator instead. This object
acts as a wrapper around a ProceduralAnimations instance and, rather than depending
on an input pose, uses stored input and output vector pairs, see Section 3.2.2 and
Section 3.2.3.

This class is therefore responsible for calculating the evaluative results required
to answer the research question regarding accuracy, see Section 1.3 and
Section 3.1.2.

This evaluation process can either be run immediately on initialization, or on a
per-frame basis to allow for visualization of the network prediction compared to the
ground truth. This is controlled with the ‘run_offline’ flag.

Since the ErrorCalculator constructs an internal ProceduralAnimations instance, it
also requires the same input parameters; both in the constructor, see Figure 4.5,
and on the per frame prediction, see Figure 4.6.

CProceduralAnimations(
AosMatrix4 new_world_transform = AosMatrix4(0.0f),
CSettings::SETTING new_setting = CSettings::HOLDEN,
std::vector<CProceduralAnimationsWaypoint*>* new_waypoints_ptr = nullptr,
bool run_offline = false);

Figure 4.5: Error Calculator constructor

float CalculateError(CPose* pose);

Figure 4.6: Error Calculator per frame prediction

4.1.9 Waypoint
Each Waypoint instance is a simple data structure, representing one checkpoint along
the obstacle course that the characters traverse as part of the responsivity
research, see Table 3.1 in Section 3.1.1.

In addition to its inherent world translation, each Waypoint holds information
representing the goal movement style that a character aims to perform when reaching
it. This includes the gait style (walking, jogging, crouching, etc.), but also the
movement speed and facing direction. The aim is to deterministically simulate user
input during the evaluation process.

The ProceduralAnimations instance keeps track of the current Waypoint index, and
increments that number upon reaching the next checkpoint.
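
A possible shape of this bookkeeping is sketched below. The field layout, the `Gait` values, the reach radius, and the wrap-around to a new lap are all assumptions made for illustration; only the idea of a current index that is incremented upon reaching a checkpoint comes from the description above.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

enum class Gait { Walk, Jog, Crouch };  // hypothetical subset of gait styles

// Hypothetical layout of one checkpoint on the course.
struct Waypoint {
    float x, z;            // position on the ground plane
    Gait gait;             // movement style to perform
    float speed;           // target movement speed
    float facing_radians;  // target facing direction
};

class WaypointTracker {
public:
    // Precondition: `waypoints` is not empty.
    WaypointTracker(std::vector<Waypoint> waypoints, float reach_radius)
        : waypoints_(std::move(waypoints)), reach_radius_(reach_radius) {}

    // Advances to the next Waypoint once the character is within reach of the
    // current one, wrapping around to start a new lap. Returns the current target.
    const Waypoint& Update(float character_x, float character_z) {
        const Waypoint& target = waypoints_[index_];
        const float distance =
            std::hypot(target.x - character_x, target.z - character_z);
        if (distance <= reach_radius_) index_ = (index_ + 1) % waypoints_.size();
        return waypoints_[index_];
    }

    std::size_t CurrentIndex() const { return index_; }

private:
    std::vector<Waypoint> waypoints_;
    float reach_radius_;
    std::size_t index_ = 0;
};
```

Because the Waypoints fully determine the requested gait, speed, and direction, two runs over the same course request the same sequence of movements, which is what makes the simulated input deterministic.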


5
Result

5.1 Responsivity Results
All responsivity data, which is used to produce the figures and tables presented in
this section, is available in Appendix A.4. For more information regarding the track
course used, see Section 3.1.1.

The responsivity results presented in Figure 5.1 show the average computation time
per frame, in milliseconds, for each lap around the course. The same data is
presented as a box plot in Figure 5.2, and summarized in Table 5.1.

Figure 5.1: Line chart of responsivity results

Figure 5.2: Boxplot of responsivity results

Statistic   HOLDEN   HOLDEN-XL   HOLDEN-XT   AVA
Mean        0.376    0.499       0.382       1.55
SD          0.0173   0.0161      0.0167      0.0430

Table 5.1: Mean and standard deviation results of responsivity evaluation,
in milliseconds per frame. Values are rounded to three significant digits.

By combining the visual results of the line chart in Figure 5.1 and the box plot in
Figure 5.2, one can conclude that the AVA network configuration requires
considerably more computational time than the others: a mean of 1.55 ms per frame,
compared to between 0.376 ms and 0.499 ms for the Holden configurations. A potential
root cause of this is the great increase in the number of joints for that network,
see Section 6.1.1 for further discussion on this topic.

For the Holden configurations, the results of HOLDEN and HOLDEN-XT have almost
perfect overlap in both Figure 5.1 and Figure 5.2. As such, one can conclude
that these two network configurations have practically equivalent responsivity.
This is not too surprising as, in theory, training a network for longer with
otherwise the same hyperparameters should only result in a different set of network
weights. Two otherwise equivalent networks with different weights should still
be evaluated at runtime at the same speed.

Finally, for the HOLDEN-XL configuration, it is not as visually clear whether it
evaluates at a considerably different speed at runtime than the other Holden
configurations. For this, the similarity metric defined in Section 3.1.4 can be used.
This metric evaluates the absolute difference between two mean results, standardized
by the average standard deviation of the two data series. In other words, the metric
evaluates how many standard deviations apart the two means are.

\[
\frac{2\,|\mathrm{mean}(a) - \mathrm{mean}(b)|}{\mathrm{sd}(a) + \mathrm{sd}(b)} \tag{5.1}
\]

• HOLDEN to HOLDEN-XT: $\frac{2\,|0.382 - 0.376|}{0.0173 + 0.0167} \approx 0.35$

• HOLDEN-XL to HOLDEN-XT: $\frac{2\,|0.499 - 0.382|}{0.0161 + 0.0167} \approx 7.1$

• AVA to HOLDEN-XL: $\frac{2\,|1.55 - 0.499|}{0.0430 + 0.0161} \approx 36$
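
The metric of Equation 5.1 is simple enough to express as a single function. The sketch below uses the hypothetical name `StdDeviationsApart` and can be used to check the values listed above against the rounded entries of Table 5.1.

```cpp
#include <cmath>

// Number of (average) standard deviations separating two means, as in Equation 5.1:
// 2 * |mean_a - mean_b| / (sd_a + sd_b).
inline double StdDeviationsApart(double mean_a, double sd_a,
                                 double mean_b, double sd_b) {
    return 2.0 * std::fabs(mean_a - mean_b) / (sd_a + sd_b);
}
```

For example, `StdDeviationsApart(0.376, 0.0173, 0.382, 0.0167)` evaluates the HOLDEN to HOLDEN-XT comparison. Since the inputs are rounded to three significant digits, the last digit of a result may differ slightly from one computed on the unrounded data.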

These calculations, together with the visualizations in Figure 5.1 and Figure 5.2,
indicate the relative significance of the differences between the responsivity
results. Even though the number of standard deviations between the HOLDEN-XL and
HOLDEN-XT results is much smaller than the number separating them from the AVA
results, one can still argue that there is a noticeable difference in the
computational time required for the HOLDEN-XL configuration. This difference
could be explained by the fact that adding another layer to a neural network
strictly increases the number of computations, and therefore the time, required for
evaluation at runtime. For further discussion on this topic, see Section 6.1.1.

5.2 Responsivity Visualizations
Video recordings of these visualizations are available here [48].

Figure 5.3: Some frames from the HOLDEN responsivity evaluation
Top left: jogging, Top right: crouching
Bottom left: strafing backwards, Bottom Right: walking
White globes are Waypoints, see Section 4.1.9.
The golden Waypoint is the next positional target of the network.

Figure 5.4: Some frames from the AVA responsivity evaluation
Top left: jogging, Top right: crouching
Bottom left: strafing backwards, Bottom Right: walking
White globes are Waypoints, see Section 4.1.9.
The golden Waypoint is the next positional target of the network.

In Figure 5.3 and Figure 5.4, one can see examples of the skeleton joint position
outputs the HOLDEN and AVA network configurations produced during their re-
spective responsivity evaluations.

For the AVA configuration, certain errors occurred, potentially as a result of the
network not being trained on a sufficient number of data frames, see Section 6.1.3
and Section 6.4.1 for further discussion. For example, notice how poorly the
produced joint skeleton appears to be crouching in the crouching panel of Figure
5.4. Additionally, the AVA configuration failed to adapt to tight turns, making the
output skeleton overshoot the target, see Figure 5.5.

Figure 5.5: Directional overshoot during the AVA responsivity evaluation
The goal of the network is to move the skeletal character towards the golden
Waypoint. However, the AVA configuration fails to sufficiently turn the character
towards this goal before the character has passed it.

5.3 Accuracy Results
All accuracy data, which is used to produce the figures and tables presented in this
section, is available in Appendix A.5 and Appendix A.6.

The accuracy results presented in Figure 5.6 show the average error per motion
capture data file for each of the four network configurations. The error calculation
is defined as presented in Section 3.1.2:

\[
\frac{1}{|j|} \sum_{j} \frac{|t_j - p_j|}{t_j} \tag{5.2}
\]

Where $|j|$ is the total number of frames in this file, $t_j$ is the three-dimensional
position of joint $j$ as defined in the motion capture database, and $p_j$ is the
three-dimensional network predicted output position of joint $j$. The values taken
from the motion capture database, including $t_j$, are referred to as the ground truth.
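
The error definition can be sketched in code as below. This is an illustration under assumptions: the absolute value and the denominator are both interpreted as Euclidean norms of three-dimensional positions, every ground truth position is assumed to be non-zero, and the names `Vec3`, `Norm`, and `MeanRelativeError` are hypothetical.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec3 = std::array<double, 3>;

// Euclidean length of a three-dimensional vector.
inline double Norm(const Vec3& v) {
    return std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
}

// Mean of |t - p| / |t| over all samples.
// ASSUMPTIONS: both vectors have the same length, and no ground truth
// position lies exactly at the origin.
inline double MeanRelativeError(const std::vector<Vec3>& truth,
                                const std::vector<Vec3>& predicted) {
    double sum = 0.0;
    for (std::size_t i = 0; i < truth.size(); ++i) {
        const Vec3 diff{truth[i][0] - predicted[i][0],
                        truth[i][1] - predicted[i][1],
                        truth[i][2] - predicted[i][2]};
        sum += Norm(diff) / Norm(truth[i]);
    }
    return truth.empty() ? 0.0 : sum / static_cast<double>(truth.size());
}
```

With this interpretation, a prediction that is 1 unit off at a distance of 100 units contributes the same error, 0.01, as a prediction that is 0.01 units off at a distance of 1 unit.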

Additionally, the fifth data series ‘AVA-M’ shown in this figure represents the results
of the AVA network limited to the subset of network outputs equivalent to those
joints present in the original motion capture data made public by Holden et al. [45],
see Figure 3.1 and Section 3.1.3.

Similarly, Figure 5.7 presents the standard deviations, a measurement of spread in
the data distribution, of the per motion capture file network outputs for each of the
network configurations. A smaller standard deviation equates to little difference
between data points within a data series, whereas a higher standard deviation
equates to more fluctuating data points.

The same accuracy data is presented as a box plot in Figure 5.8, and summarized
in Table 5.2.

Figure 5.6: Line chart of mean results of accuracy evaluation
The error is defined as mean absolute error compared to the training data.
For full definition of error, see Section 3.1.2.
For indexing of motion capture files, see Appendix A.2.

What can be seen in Figure 5.6 is that the results of the three Holden configurations
overlap greatly throughout the training data set. The blue diamond HOLDEN data
series is almost perfectly obscured by the green triangle HOLDEN-XT series.

The two AVA data series follow highly similar curvatures to each other, although
vertically translated. In other words, if the AVA-M data series were shifted
downwards in the chart, there would be almost constant visual overlap between it and
the AVA data series. Visually, however, there is almost no similarity between the
curvatures of the Holden and AVA configurations.

Throughout the entirety of Figure 5.6, the AVA data series produce a lower error
than the Holden configurations. This can be seen from how the AVA and AVA-M data
series are consistently below the other three.

The motion capture files on which all network configurations performed the worst,
the data files indexed 72-75, were the files containing movement interacting with
more advanced terrain and environments, such as balancing on elevated narrow
beams and dynamic crouching beneath low ceilings.

Figure 5.7: Line chart of standard deviation results of accuracy evaluation
The error is defined as mean absolute error compared to the training data.
For full definition of error, see Section 3.1.2. For indexing of motion capture files,
see Appendix A.2.

Similar to the means presented in Figure 5.6, the standard deviations shown in
Figure 5.7 have almost perfect overlap for the three Holden configurations.
Once again, the blue diamond HOLDEN data series is almost perfectly obscured by
the green triangle HOLDEN-XT series.

However, the curvature of the AVA-M data series appears to be a midpoint between
that of the Holden configurations and that of the AVA data series. Visually, the
AVA-M series has local maxima and minima similar to both of the aforementioned
series. Additionally, the values of the AVA-M series are positionally closer to those
of the Holden configurations than to those of the AVA data series. In other words,
data points along the turquoise circle AVA-M line are further away from those of the
purple crossed AVA line than from those of the other three data series.

Figure 5.8: Boxplot of accuracy results
The error is defined as mean absolute error compared to the training data.
For full definition of error, see Section 3.1.2.

As a reminder: an error of 0.01 equates to an average prediction error of 1%, see
Section 3.1.2. As a concrete example, the predicted three-dimensional joint positions
that the HOLDEN network configuration produced had, on average, a translational
error of ≈ 4.4%.

Statistic   HOLDEN    HOLDEN-XL   HOLDEN-XT   AVA       AVA-M
Mean        0.0439    0.0448      0.0446      0.0148    0.0258
SD          0.00954   0.00980     0.00981     0.00283   0.00483

Table 5.2: Mean and standard deviation results of accuracy evaluation
Values are rounded to three significant digits.

For all three Holden configurations, the results have almost perfect overlap in both
Figure 5.6 and Figure 5.7. As such, one can conclude that these three network
configurations have practically equivalent accuracy. This is rather interesting, as
the HOLDEN-XL and HOLDEN-XT configurations each have a specific advantage, in the
form of extra network depth and extra training time respectively, compared to the
default HOLDEN configuration. These results suggest that there is no benefit to
these specific network design alterations.

For the AVA configuration, by combining the visual results of the line chart in Figure
5.6 and the box plot in Figure 5.8, one can conclude that there is a significantly lower
error in the AVA prediction than that of the Holden configurations.

Lastly, the AVA-M data series appears to share some similarity with both the Holden
and AVA data series. To measure this similarity, one may use the difference
metric used in Section 5.1 and originally presented in Section 3.1.1, which calculates
the number of standard deviations between the means. This produces the following
results:

• HOLDEN to HOLDEN-XT: $\frac{2\,|0.0439 - 0.0446|}{0.00954 + 0.00981} \approx 0.072$

• HOLDEN to HOLDEN-XL: $\frac{2\,|0.0439 - 0.0448|}{0.00954 + 0.00980} \approx 0.093$

• HOLDEN-XT to HOLDEN-XL: $\frac{2\,|0.0446 - 0.0448|}{0.00981 + 0.00980} \approx 0.020$

• AVA-M to HOLDEN: $\frac{2\,|0.0258 - 0.0439|}{0.00483 + 0.00954} \approx 2.5$

• AVA to AVA-M: $\frac{2\,|0.0148 - 0.0258|}{0.00283 + 0.00483} \approx 2.9$

These differences can be summarized as the three Holden network configurations
producing practically equivalent results, all within a tenth of a standard deviation
of each other and with HOLDEN and HOLDEN-XT overlapping especially closely in the
figures, and the AVA-M results being slightly closer to those of the Holden
configurations than to those of the AVA configuration.

5.4 Accuracy Visualizations
Video recordings of these visualizations are available here [48].

Figure 5.9: Some frames from the HOLDEN accuracy evaluation
Gray: Ground truth joint positions.
Blue: HOLDEN joint positions.

Figure 5.10: Some frames from the AVA accuracy evaluation
Gray: Ground truth joint positions.
Magenta: AVA joint positions.

In Figure 5.9 and Figure 5.10, one can see examples of the skeleton joint position
outputs that the HOLDEN and AVA network configurations produced during their
respective accuracy evaluations.

5.5 Training Process
During the training process, the prediction mean-squared error of the full output
vector was recorded after each epoch. Note that this differs from the error used in
Section 5.3, both in the error definition and in that the entire output vector is
used rather than just the predicted skeleton joint positions. This data is available
in full in Appendix A.7, and presented as a line chart in Figure 5.11.

Figure 5.11: Mean-squared error of entire output vector during training
HOLDEN-XT is hidden during the first half of its training process as its design, and
therefore results, is entirely equivalent to that of the HOLDEN configuration.

Figure 5.11 shows similarity through overlap between the HOLDEN and HOLDEN-XL
configurations throughout their training period. Additionally, during the extra
training period of the HOLDEN-XT configuration, which continues from the HOLDEN
data series, the mean-squared error remains relatively unchanged. This further
emphasizes the similarity in accuracy argued in Section 5.3.

As a reminder, the green triangle HOLDEN-XT configuration was trained for twice the
number of epochs of the other network configurations, which results in an error
curve that is twice as long.

For the AVA configuration, one can visibly determine a larger mean-squared error
during the entirety of its training process compared to the Holden configurations.
This can be seen from the purple crossed AVA data series lying clearly above the
others in Figure 5.11. Additionally, the mean-squared error of the AVA configuration
appears visibly more irregular between epochs during the training process than that
of the Holden configurations.

5.6 Pipeline Overview
This section aims to present the computation time, see Table 5.3, and the data size,
see Table 5.4, for each step of the full phase-functioned neural network pipeline,
presented in Section 3.3.

Step                HOLDEN    HOLDEN-XL   HOLDEN-XT   AVA
Generate Patches    28 min    -           -           -
Generate Database   90 min    -           -           55 min
Network Training    43 h      47 h        91 h        31 h
Neural Network      0.38 ms   0.50 ms     0.38 ms     1.5 ms

Table 5.3: Computational time required throughout the pipeline
Entries marked ‘-’ share the HOLDEN results.

In summary, for the HOLDEN-XL configuration, Table 5.3 shows that there is little
difference in training time when adding a new network layer to the Holden network
(47 h compared to 43 h).

The much longer training time of the HOLDEN-XT configuration was expected, as it
was trained for twice the number of epochs, see Section 3.1.3.

Additionally, the table shows that the database generation, and to some extent
the network training, is considerably quicker for the AVA configuration. This means
that even though the AVA configuration had six times the number of skeleton joints,
the fact that it only had a fourth of the frames compared to the Holden
configurations, see Section 3.1.3, resulted in it being trained considerably faster.

Data                   HOLDEN    HOLDEN-XL   HOLDEN-XT   AVA
Height Fields          134 MB    -           -           -
Terrain Patches        606 MB    -           -           -
Motion Capture Files   848 MB*   -           -           7.62 GB*
Phase Labels           7.04 MB   -           -           -
Other Labels           56.9 MB   -           -           -
Training Database      7.12 GB   -           -           10.4 GB
Network Weights        122 MB    176 MB      122 MB      374 MB

Table 5.4: Size of different data throughout the pipeline
Entries marked ‘-’ share the HOLDEN results
*: Motion Capture Files are in 120Hz.

In summary, Table 5.4 shows the considerable increase in memory size from the Holden
network configurations to the AVA network configuration. As mentioned in Section
3.1.3, do note that the AVA database only contains a quarter as many frame data
points as the Holden configurations.

Additionally, as discussed previously, given that the HOLDEN-XT configuration
differs from the default HOLDEN configuration solely through training time, it is
expected that the network weights produced by the two have the same memory size.

In contrast, given that the HOLDEN-XL configuration has more network nodes than
the HOLDEN configuration, it is expected that the former has more weights, and
therefore requires more memory.

5.7 Skinning Visualization
Video recordings of these visualizations are available here [48].

These were the in-engine results of the network orientational outputs after switching
the X- and Z-rotations, and inverting the Y- and switched X-rotations.

Figure 5.12: HOLDEN positional and orientational output skinned
Left: The skinned HOLDEN character model in default stance.
Middle: HOLDEN output skinned.
Right: HOLDEN output skinned with visible skeleton.

Figure 5.13: AVA positional and orientational output skinned
Left: The skinned AVA character model in default stance.
Middle: AVA output skinned.
Right: AVA output skinned with visible skeleton.

In Figure 5.12 and Figure 5.13, one can see examples of the skeleton joint translation
and orientation outputs that the HOLDEN and AVA network configurations produced
during their respective responsivity evaluations, skinned to 3D character models.

The HOLDEN skinning, at a quick glance, appears correct in general. Occasionally,
certain specific joints experience single-frame orientation errors. For example,
the head and torso sometimes rotate over 180 degrees around an axis in a single
frame. As of writing, this error is still being investigated.

Similarly, the skinned AVA results also produce certain erroneous orientations,
although much more frequently and for more than two joints. In the middle panel
of Figure 5.13, one can see the torso joint being rotated over 180 degrees. As of
writing, this error is still being investigated. Additionally, the original skeleton
file, to which the motion capture data was retargeted, was lost and replaced with
a new skeleton file for the runtime process. This new skeleton is equivalent
to the old one, except for the facial structure. As such, the facial contortions shown
in Figure 5.13 are to be expected.

However, even though the character model occasionally appears incorrect, the
underlying skeleton still moves correctly. This shows that the positional outputs of
the neural networks are correct, but that this is not always the case for the
orientational outputs. This is presumably a result of the network being trained in
another set of world-axis orientations, see Section 3.3.4, than that of the Apex
engine. The positional outputs of the neural networks are manually converted at
runtime to match those of the engine, hence the apparently correct skeletal output.
This inconsistency in world-axis orientations, however, does not affect the results of
the responsivity or accuracy evaluations. This is further discussed in Section 6.4.2.


6
Discussion

6.1 Discussing the Research Questions
This section aims to, through discussion, answer the research questions as presented
in Section 1.3.

6.1.1 Discussing Responsivity
The research question regarding responsivity asked how much computational time is
required for procedural single-character locomotive animations, see Section 1.3. This
is answered through testing an engine-integrated solution using the four different
network configurations.

The responsivity results presented in Section 5.1 reveal a considerable
computational difference between the Avalanche and the Holden network configurations.

This significant distinction in frame time could be explained by the difference in
the number of skeleton joints. As mentioned in Section 3.1.3, the Holden and
Avalanche skeletons have 31 and 191 joints, respectively. Since all skeleton joints
are fed into the neural network during the runtime prediction process, the width of
the networks, and in turn the number of computations each frame, depends on the
number of joints.

Since the AVA network required a size of 1’302x512x1’751, whereas the two shallower
Holden configurations required only a size of 342x512x311, the number of
computations required each frame to propagate the character pose through the network
is greatly increased in the former configuration.

A naive way of counting computations in a simple feed-forward network like the one
used in this project, naive since it assumes sequential computing, is to sum the
number of inter-layer network connections and the number of biases, like so:

\[
\sum_{l} n_l \,(1 + n_{l+1}) \tag{6.1}
\]

Where $l$ is the index of each non-output network layer, $l + 1$ is the index of the
network layer following layer $l$, and $n_l$ is the number of nodes (width) in
layer $l$.

The number of computations per network configuration using the above formula can
be seen in Table 6.1.

               HOLDEN    HOLDEN-XL   HOLDEN-XT   AVA
computations   335’501   598’157     335’501     1’566’701

Table 6.1: The number of computations required per network configuration
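
The counting can be reproduced with the sketch below, in which the function names are hypothetical. Evaluating Equation 6.1 literally for the layer widths 342x512x311 gives 335’190, which is lower than the tabulated value by exactly the output width of 311. The values of Table 6.1 are reproduced when one additional term per output node is counted, which is what the second function does. The layer widths 342x512x512x311 for HOLDEN-XL are an assumption based on its description as having one additional hidden layer.

```cpp
#include <cstddef>
#include <vector>

// Equation 6.1 as written: sum over all non-output layers of n_l * (1 + n_{l+1}).
inline long long CountEquation61(const std::vector<long long>& widths) {
    long long total = 0;
    for (std::size_t l = 0; l + 1 < widths.size(); ++l) {
        total += widths[l] * (1 + widths[l + 1]);
    }
    return total;
}

// Variant that also counts one term per output node. This equals the number of
// inter-layer connections plus the total number of nodes in all layers.
inline long long CountWithOutputNodes(const std::vector<long long>& widths) {
    return widths.empty() ? 0 : CountEquation61(widths) + widths.back();
}
```

Either variant supports the same qualitative argument, since the connection terms dominate the count for networks of this size.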

These data points can then be plotted against the measured frame times to display
the approximately linear relationship between the number of computations and the
frame time, see Figure 6.1.

Figure 6.1: Linear regression of computational time over network calculations
Beware; this is a gross simplification simply to show a possible relation between
network size and computational load at runtime.
Regression line: $y \approx 9.6 \cdot 10^{-7}\,x + 0.018$
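
A regression line of this kind can be obtained with an ordinary least-squares fit, sketched below with the hypothetical names `Line` and `FitLine`. It is an assumption of this sketch that the fit uses the four pairs of computation counts from Table 6.1 and mean frame times from Table 5.1; with those rounded inputs the result is a slope of roughly 9.65 · 10^-7 ms per computation and an intercept of roughly 0.0175 ms, close to the line stated in the caption.

```cpp
#include <cstddef>
#include <vector>

struct Line {
    double slope;
    double intercept;
};

// Ordinary least-squares fit of y = slope * x + intercept.
// Precondition: x and y have the same length and x holds at least two distinct values.
inline Line FitLine(const std::vector<double>& x, const std::vector<double>& y) {
    const double n = static_cast<double>(x.size());
    double mean_x = 0.0, mean_y = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        mean_x += x[i];
        mean_y += y[i];
    }
    mean_x /= n;
    mean_y /= n;

    double sxx = 0.0, sxy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sxx += (x[i] - mean_x) * (x[i] - mean_x);
        sxy += (x[i] - mean_x) * (y[i] - mean_y);
    }
    const double slope = sxy / sxx;
    return {slope, mean_y - slope * mean_x};
}
```

With only four data points, two of which share the same x value, such a fit can indicate a trend but cannot establish that the relationship is linear.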

6.1.2 Discussing Accuracy
The second subsidiary research question asked how accurate the generated locomo-
tive character animations are to the original animation data, see Section 1.3. This
is answered here through relative comparisons between the network configurations.

Difference between the Holdens
The three Holden network configurations closely resemble each other, with equivalent
results throughout the error evaluation process. This can be seen through the
consistent overlap between the data shown in Figure 5.6, Figure 5.7, and Figure 5.8,
and through a near-zero divergence metric, as presented in Section 3.1.1.
Additionally, the mean-squared-error evaluation during the training process also
shows no significant difference between the Holden configurations, see Figure 5.11.

Through these statistics, one can conclude that there was insignificant gain in either
doubling the number of hidden layers in the neural network, as was the case for the
HOLDEN-XL configuration, or doubling the duration of the training process, as was
the case for the HOLDEN-XT configuration.

Additionally, the lack of improvement in results between the default HOLDEN and
the long trained HOLDEN-XT configuration is evidence that the original imple-
mentation by Holden et al. does not suffer from underfitting, see Section 2.1.4 and
Section 3.1.3.

AVA versus AVA-M
Additionally, the AVA and AVA-M data series closely follow the same curvature in
Figure 5.6, although translated vertically. This curvature is distinctly different
from that of the Holden configurations.

Given that the AVA data series has a consistently lower error than the AVA-M
series, one can deduce that the masked joints, those not present in AVA-M, give rise
to a consistently lower error in comparison.

The mean absolute error definition used in this project includes a normalizing
denominator, see Equation 3.1, so that the error is relative to the ground truth
positional values. For example, an error of 0.01 equates to a predicted joint
position that is 1% off the ground truth, see Section 3.1.2. However, this means
that joints further away, i.e. joints at positions with high positional values, need
a larger absolute error to produce the same evaluated error as a joint closer to
the axis origin.

As a concrete example, consider a joint $a$ with a ground truth position at
$a_t = 100$. For this joint to contribute an error of 0.01, the predicted position
would need to be at $a_p = 101$. However, if we consider a different joint $b$ with a
ground truth position at $b_t = 1$, the predicted position would need only to be
$b_p = 1.01$ to contribute the same error.

Of the masked AVA joints, a considerable number are highly concentrated within the
character head and face, see Section 3.1.3. As these joints are further away from
the axis origin, the discrepancy in error between AVA and AVA-M may be a result of
the inherently relative design of the mean absolute error definition used in this
project. As such, AVA-M might be more suitable for comparisons with the Holden
configurations.

AVA-M versus the Holdens
Also, even though the mean-absolute prediction error at runtime of the AVA
configuration was significantly lower than that of the Holden configurations, see
Figure A.5, the mean-squared prediction error during the training process of the
former was considerably higher than that of the latter, see Figure 5.11.

This difference could be a result of the fact that the runtime evaluation only
considered the three-dimensional skeleton joint positions, whereas the training
process evaluation considered the entire output vector. This means that it is
possible that the AVA network configuration is comparatively much better at joint
position prediction than at predicting the other output features, such as joint
orientations and velocities, and sample point positions and trajectories, see
Section 3.2.3.

Another potential reason behind this difference is the fact that the
two evaluation processes used different error definitions: mean-absolute error and
mean-squared error. The fact that the absolute error was lower than the squared
error could indicate that the data, in this case the accuracy of the AVA
network, had many extreme outliers. This is because the mean-squared error definition
squares the per-data-point error, meaning that errors smaller than one are reduced
and errors larger than one are amplified, compared to the mean-absolute
error definition.
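
The following sketch illustrates this property on synthetic numbers that are not taken from the evaluation. When all per-point errors are below one, the mean-squared error ends up below the mean-absolute error. A single large outlier is enough to reverse that relation.

```python
def mean_absolute_error(errors):
    """Mean of the absolute per-point errors."""
    return sum(abs(e) for e in errors) / len(errors)

def mean_squared_error(errors):
    """Mean of the squared per-point errors."""
    return sum(e * e for e in errors) / len(errors)

# Synthetic per-point errors, all smaller than one.
small = [0.5, 0.5, 0.5, 0.5]
# Synthetic per-point errors, mostly tiny but with one extreme outlier.
with_outlier = [0.1, 0.1, 0.1, 4.0]

for name, errors in (("small", small), ("with_outlier", with_outlier)):
    print(name, mean_absolute_error(errors), mean_squared_error(errors))
```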

However, a potential source of the difference in error between the AVA and Holden
configurations was the choice of evaluation frames per motion capture file, see Section
3.1.2. As there was insufficient system memory to keep every joint position of
every frame in memory, a decision was made to only consider the first 500’000
three-dimensional joint positions in each data file. This, in combination with the fact that the
AVA configuration only considered every fourth frame and that the AVA skeleton
had more than six times as many joints, see Section 3.1.3, means that if the
maximum number of joint positions was reached, the two configurations
would probably have considered different blocks of frames within the motion capture data.

In other words, if the Holden configurations reached the memory limit of 500’000
joint positions, they would have considered the first 500’000 / 31 ≈ 16’000 frames.
On the other hand, if the AVA configuration reached the memory limit of 500’000
joint positions, it would instead only have considered the first
(500’000 · 4) / 191 ≈ 10’000 frames. If the
different network configurations were evaluated on different sets of motion frames,
then that could make for an unfair comparison.
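
The arithmetic above can be checked with a short sketch. The joint counts 31 and 191 and the stride of four are taken from the fractions in the text. The function name `frames_spanned` and the use of integer division are assumptions made for illustration.

```python
def frames_spanned(position_limit: int, joints: int, frame_stride: int = 1) -> int:
    """Number of original frames covered before the joint position limit is hit.

    Each sampled frame stores one position per joint, and only every
    `frame_stride`-th frame of the motion capture data is sampled.
    """
    sampled_frames = position_limit // joints
    return sampled_frames * frame_stride

# Holden skeleton: 31 joints, every frame considered.
holden_frames = frames_spanned(500_000, joints=31)
# AVA skeleton: 191 joints, every fourth frame considered.
ava_frames = frames_spanned(500_000, joints=191, frame_stride=4)

print(holden_frames, ava_frames)
```

Under these assumptions the Holden configurations would cover roughly half again as many frames of a long data file as the AVA configuration.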

6.1.3 Discussing Architecture
The final subsidiary research question asked how the phase-functioned neural
network architecture presented in Holden et al. could be improved, see Section 1.3.

An initial hypothesis may be that allowing a neural network to train for a longer
period of time, or making the network deeper, increases its predictive
accuracy. However, when comparing the accuracy results of the
Holden configurations, see Section 6.1.2, one can conclude that this is not the case, at
least not for this particular model. This was shown by the insignificant difference
in accuracy between the three network configurations: default (HOLDEN), double
the hidden layers (HOLDEN-XL), and double the training time (HOLDEN-XT).

Moreover, not only did the two altered Holden configurations lack any strictly
positive sides, they also had strictly negative ones. For example,
given that the HOLDEN-XL configuration required an additional layer of network
weights and biases, it was shown to require both longer computational time, see
Figure 5.1 and Table 5.3, and more memory, see Table 5.4, at runtime. For the
HOLDEN-XT configuration, the only clear downside compared to the default
HOLDEN network was the increase in training time, see Table 5.4. As such, the
altered Holden configurations have been shown to perform only equally to, or worse
than, the default HOLDEN configuration.
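
The memory cost of an additional hidden layer can be sketched by counting the weights and biases of a plain fully connected network. The layer widths below are hypothetical and are not the sizes used in this project. The sketch also ignores that a phase-functioned network stores several such weight sets, so it only indicates the direction of the effect.

```python
def parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

# Hypothetical widths: input, hidden layers, output.
base = parameter_count([300, 512, 512, 300])
deeper = parameter_count([300, 512, 512, 512, 300])

# The extra hidden layer adds one weight matrix and one bias vector.
extra = deeper - base
print(base, deeper, extra)
```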

The fourth network configuration, AVA, which in one respect may be used to evaluate
the generalizability of the phase-functioned neural network approach, had its own
fair share of issues. In the responsivity evaluation, the AVA network was shown to
be significantly slower to evaluate at runtime than all other network
configurations. This, however, may be quite unsurprising, as it required many more
computations for each prediction, see Figure 6.1, as a result of it being based on
a character skeleton with more than six times as many joints as the skeleton
used for the other network c